Deep Reinforcement Learning and Control
- Given a model of the world
- Markov Process
- Returns \(G_t\)
- Value Functions are Expected Return
- Computing the value function of a Markov Reward Process
- Solving MDPs
- We can find optimal policy by maximizing over value functions
- Solving Linear Equations
- MDP Policy Evaluation, Iterative Algorithm
- MDP Control
- MDP Policy Iteration (PI)
- Policy Improvement
- Monotonic Improvement in Policy
- Bellman Backup Operators
- Value Iteration (VI)
- Policy Iteration as Bellman Operations
- Contraction Operator
- Will Value Iteration Converge?
- Proof: Bellman Backup is a Contraction on \(V\) for \(\gamma < 1\)
- MDP: Computing Optimal Policy and Optimal Value
- Summary
- Model-Free Policy Evaluation: Policy Evaluation Without Knowing How the World Works
- Model Free Control
- Maximization Bias
- Value Function Approximation
- Value-Based Deep Reinforcement Learning
- Introduction to Policy Search
- Stochastic policies
- Policy optimization
- Appendix
- References
Given a model of the world
After this section, the reader should be able to solve the problem of MDP control in both the infinite horizon and finite horizon settings.
- Markov Processes
- MRPs
- MDPs
- Evaluation and Control in MDPs
Markov Process or Markov Chain \((S,P)\)
- Memoryless random process
- Sequence of random states with Markov property
- no rewards, no actions
Markov Reward Process is a Markov Chain + reward
\(R\) is a reward function: \(R(s_t=s)=E[r_t\mid s_t=s]\)
We assume stationary rewards: \(r_i=r_j\) whenever \(s_i=s_j\), for all \(i,j=0,1,\ldots\) In the case of stochastic rewards, we require that the cumulative distribution functions (cdf) of the rewards conditioned on the current state be time independent: \(F(r_i\mid s_i=s)=F(r_j\mid s_j=s)\) for all \(s \in S\) and all \(i,j=0,1,\ldots\), where \(F(r_i\mid s_i=s)\) denotes the cdf of \(r_i\) conditioned on the state \(s_i=s\).
With the two equations above, we have the following result about the expected rewards: \(R(s)=E[r_i\mid s_i=s]\) for all \(i=0,1,\ldots\)
Horizon \(H\)
- Number of time steps in each episode
- Can be infinite
- Otherwise, it is called a finite Markov reward process
Markov Process
Two assumptions:
- Finite state space
- Stationary transition probabilities
\(P\) is a row-stochastic matrix. 1 is an eigenvalue of any row-stochastic matrix, and any eigenvalue of a row-stochastic matrix has absolute value at most 1.
The max-norm or infinity-norm of a vector \(x \in R^n\) is denoted by \(\Vert x\Vert_\infty\) and defined as \(\Vert x\Vert_\infty=\max_i |x_i|\). For any matrix \(A\in R^{m\times n}\), define the following quantity (called the induced infinity norm of the matrix): \(\Vert \boldsymbol{A}\Vert_\infty=\sup_{x \in R^n, x \neq 0} \frac{\Vert \boldsymbol{A}x\Vert_\infty}{\Vert x\Vert_\infty}\)
- \(\Vert \boldsymbol{A}\Vert_\infty\) satisfies all the properties of a norm
- \[\Vert \boldsymbol{A}\Vert_\infty = \max_{i=1,...,m}\left(\sum_{j=1}^n|\boldsymbol{A}_{ij}|\right)\]
- Conclude that if \(A\) is row-stochastic, then \(\Vert \boldsymbol{A}\Vert_\infty=1\)
- For every \(x\in R^n\), \(\Vert \boldsymbol{A}x\Vert_\infty \leq \Vert \boldsymbol{A}\Vert_\infty\Vert x\Vert_\infty\)
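As a quick numerical check of these properties (the 3-state transition matrix below is a toy example chosen for illustration, not from the text):

```python
import numpy as np

# Hypothetical 3-state row-stochastic transition matrix (each row sums to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.0, 0.4, 0.6]])

# The induced infinity norm equals the maximum absolute row sum,
# which is 1 for any row-stochastic matrix.
inf_norm = np.max(np.sum(np.abs(P), axis=1))
print(inf_norm)  # 1.0

# Spot-check the bound ||Px||_inf <= ||P||_inf * ||x||_inf on a random vector.
x = np.random.randn(3)
assert np.max(np.abs(P @ x)) <= inf_norm * np.max(np.abs(x)) + 1e-12
```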
Returns \(G_t\)
General form: \(G_t=r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+\dots+\gamma^{H-1}r_{t+H-1}\)
\(G_t=R_{t+1}+R_{t+2}+\dots +R_{T}\) for episodic tasks, where \(T\) is the final step at which a terminal state is reached, ending an episode.
\(G_t=R_{t+1}+\gamma R_{t+2}+\dots =\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\) for continuing tasks with a temporal discounting, where \(\gamma\) is the discount factor.
Discount factor
Consider a finite horizon Markov reward process with bounded rewards. Specifically, assume that \(\exists M \in (0,\infty)\) such that \(|r_i| \leq M\) for all \(i\) and across all episodes (realizations). Then the return \(G_t\) for any episode is bounded: for \(\gamma < 1\), \(C(M,\gamma,t,H)=\frac{M(1-\gamma^{H-t})}{1-\gamma}\) is a bound, i.e. \(|G_t| \leq C\) for any episode. (Hint: geometric progression.)
Now consider an infinite horizon Markov reward process with bounded rewards and \(\gamma <1\). The return \(G_t\) for any episode is still well defined and bounded: consider the partial sums \(S_N=\sum_{i=t}^N \gamma^{i-t}r_i\) for \(N\geq t\); the sequence \(\{S_N\}_{N\geq t}\) is a Cauchy sequence and hence converges.
Value Functions are Expected Return
\(V_{t}(s)=E[G_t\mid S_t=s]\) is called state-value function.
When the horizon \(H\) is infinite, \(V_{t}(s)=E[G_t\mid S_t=s]\) together with the stationarity assumptions on the rewards and transition probabilities implies that \(V_i(s)=V_j(s)\) for all \(i,j=0,1,\ldots\), and thus in this case we define \(V(s)=V_0(s)\)
Computing the value function of a Markov Reward Process
- Simulation
- Analytic solution
- Iterative solution
Monte Carlo simulation
procedure Monte Carlo Evaluation(M, s, t, N)
- Set \(i=0\), \(G_t=0\)
- While \(i \neq N\) do
  - Generate an episode, starting from state \(s\) and time \(t\)
  - Using the generated episode, calculate the return \(g \leftarrow \sum_{j=t}^{H-1}\gamma^{j-t}r_j\)
  - \(G_t \leftarrow G_t+g\)
  - \(i \leftarrow i + 1\)
- \(V_t(s) \leftarrow G_t/N\)
- Return \(V_t(s)\)
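A minimal Python sketch of this procedure for a finite-horizon MRP. The representation assumed here (a transition matrix `P` and a reward vector `R` indexed by state) is for illustration only:

```python
import numpy as np

def mc_evaluate_mrp(P, R, gamma, s, H, N, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of V(s) for a finite-horizon MRP (toy sketch).

    P: (n, n) row-stochastic transition matrix, R: (n,) reward vector.
    """
    total = 0.0
    for _ in range(N):
        state, g = s, 0.0
        for k in range(H):                      # one episode of length H
            g += (gamma ** k) * R[state]        # discounted reward collected in the current state
            state = rng.choice(len(R), p=P[state])
        total += g                              # accumulate the episode return
    return total / N                            # average return approximates V(s)
```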
Analytic solution
This method works only for an infinite horizon MRP with \(\gamma < 1\).
The MRP value function satisfies \(V(s)=R(s)+\gamma \sum_{s' \in S}P(s'\mid s)V(s')\).
For a finite state MRP (\(|S|<\infty\)), we can express \(V(s)\) using a matrix equation \(\begin{bmatrix}V(s_1)\\ \vdots \\ V(s_N)\end{bmatrix}=\begin{bmatrix}R(s_1)\\ \vdots \\ R(s_N)\end{bmatrix}+\gamma \begin{bmatrix}P(s_1\mid s_1) & \dots & P(s_N\mid s_1)\\ P(s_1\mid s_2) & \dots & P(s_N\mid s_2)\\ \vdots & \ddots & \vdots \\ P(s_1\mid s_N) & \dots & P(s_N\mid s_N)\end{bmatrix}\begin{bmatrix}V(s_1)\\ \vdots \\ V(s_N)\end{bmatrix}\) \begin{equation} V=R+\gamma PV \end{equation}
\begin{equation} V-\gamma PV=R\newline (I-\gamma P) V=R\newline V=(I-\gamma P)^{-1}R \end{equation}
Solving directly requires taking a matrix inverse, \(\sim O(N^3)\). Note that \((I-\gamma P)\) is invertible for \(\gamma < 1\), since every eigenvalue of \(P\) has absolute value at most 1.
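A minimal sketch of the analytic solution, assuming `P` and `R` are given as a NumPy matrix and vector:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve V = R + gamma * P * V exactly (valid for gamma < 1)."""
    n = len(R)
    # Prefer np.linalg.solve over forming an explicit inverse, for numerical stability.
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```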
Iterative Algorithm for Computing Value of a MRP
- Initialize \(V_0(s)=0\) for all \(s\)
- For \(k=1\) until convergence
- For all \(s \in S\)
- \[V_k(s)=R(s)+\gamma \sum_{s' \in S}P(s'\mid s)V_{k-1}(s')\]
Computational complexity: \(O(|S|^2)\) for each iteration (\(|S|=N\))
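A corresponding sketch of the iterative computation, which for \(\gamma<1\) should converge to the analytic solution above:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iterate V_k = R + gamma * P * V_{k-1} until convergence."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V            # one O(|S|^2) sweep over all states
        if np.max(np.abs(V_new - V)) < tol:  # stop when the update barely changes V
            return V_new
        V = V_new
```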
Solving MDPs
Markov Decision Process is Markov Reward Process + actions.
MDP is a tuple: \((S,A,P,R,\gamma)\)
Dynamic programming assumes full knowledge of the MDP. It is used for planning in an MDP.
- Prediction: Given an MDP (\(S,A,T,r,\gamma\)) and a policy \(\pi(a\mid s)=P[A_t=a\mid S_t=s]\), find the state and action value functions.
- Optimal Control: Given an MDP(\(S,A,T,r,\gamma\)), find the optimal policy.
We can find optimal policy by maximizing over value functions

If we know \(q_*(s,a)\), we do not need the dynamics to find the optimal policy: \[\pi_{*}(a|s)= \begin{cases} 1,& \text{if } a=\arg\max_{a\in A}q_{*}(s,a)\\ 0,& \text{otherwise} \end{cases}\]

If we know \(v_*(s)\), we need the dynamics to do a one-step lookahead: \[\pi_{*}(a|s)= \begin{cases} 1,& \text{if } a=\arg\max_{a\in A}\left(\sum_{s',r}p(s',r|s,a)(r+\gamma v_*(s'))\right)\\ 0,& \text{otherwise} \end{cases}\]
Solving Linear Equations

The return decomposes recursively: \[G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots \\ =R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3} + \gamma^2 R_{t+4}+\dots) \\ =R_{t+1}+\gamma G_{t+1}\]

By conditioning on a state and taking expectations: \(v_{\pi}(s)=E_\pi[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_t=s]\\ =\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')]\)

By conditioning on a state and action and taking expectations: \(q_{\pi}(s, a)=E_\pi[R_{t+1}+\gamma q_{\pi}(S_{t+1}, A_{t+1})\mid S_t=s, A_t=a]\\ =\sum_{s',r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a')]\)
Bellman Expectation Equations
\begin{equation} v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')] \end{equation} is a set of linear equations, one for each state. The state value function for \(\pi\) is its unique solution.
\begin{equation} q_{\pi}(s, a)=\sum_{s',r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a')] \end{equation} is a set of linear equations, one for each state-action pair. The state-action value function for \(\pi\) is its unique solution.
Bellman Optimality Equations
\(v_{*}(s)=\max_{a\in A}(\sum_{s',r}p(s',r \mid s,a)(r+\gamma v_{*}(s')))\)
\(v^*\) is the unique solution of this system of nonlinear equations
\[q_{*}(s, a)=E[R_{t+1}+\gamma max_{a' \in A}q_{*}(S_{t+1},a')\mid S_t=s, A_t=a]\\ = \sum_{s',r}p(s',r|s,a)[r+\gamma \max_{a'}q_{*}(s',a')]\]\(q^*\) is the unique solution of this system of nonlinear equations
\[v_{\pi}(s)=\sum_{a\in A}\pi(a\mid s)q_{\pi}(s,a) \\ v_{*}(s)=\max_a q_{*}(s,a) \\ q_{*}(s,a)=\sum_{s',r}p(s',r|s,a)[r+\gamma v_{*}(s')]\]

MDP + \(\pi(a\mid s)\) = Markov Reward Process
An MDP with a fixed policy \(\pi\) is an MRP \((S,R^\pi, P^\pi,\gamma)\) where
\(R^\pi(s)=\sum_{a\in A}\pi(a\mid s)R(s,a)\) \(P^\pi(s'\mid s)=\sum_{a\in A}\pi(a\mid s)P(s'\mid s,a)\)
This implies we can use the same techniques to evaluate the value of a policy for an MDP as we used to compute the value of an MRP, by defining an MRP with \(R^\pi\) and \(P^\pi\).
MDP Policy Evaluation, Iterative Algorithm
- Initialize \(V_0(s)=0\) for all \(s\)
- For \(k=1\) until convergence (e.g. \(\Vert V_k^{\pi} - V_{k-1}^\pi\Vert_\infty < \epsilon\))
- For all \(s \in S\)
- \[V_k^{\pi}(s)=\sum_{a}\pi(a\mid s)[r(s,a)+\gamma \sum_{s' \in S}p(s'\mid s,a)V_{k-1}^\pi(s')]\]
This is a Bellman backup for a particular policy.
If the policy is deterministic, the above update simplifies to
\(V_k^{\pi}(s)=r(s,\pi(s))+\gamma \sum_{s' \in S}p(s'\mid s,\pi(s))V_{k-1}^\pi(s')\)
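As a rough sketch of this update (assuming, for illustration, a deterministic policy stored as an array `pi`, a transition tensor `P[s, a, s']`, and a reward table `R[s, a]`; these conventions are not from the text):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative evaluation of a deterministic policy pi (array mapping state -> action)."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # One Bellman backup for the fixed policy pi, sweeping over all states.
        V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```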
MDP Control
Compute the optimal policy \(\pi^*(s)=arg\max_{\pi}V^{\pi}(s)\)
There exists a unique optimal value function.
Optimal policy for a MDP in an infinite horizon problem(agent acts forever) is
- deterministic
- stationary (does not depend on time step)
- not necessarily unique
The number of deterministic policies is \(|A|^{|S|}\). Policy iteration is generally more efficient than enumeration.
MDP Policy Iteration (PI)
- Set \(i=0\)
- Initialize \(\pi_0(s)\) randomly for all states s
- While \(i == 0\) or \(\Vert \pi_i - \pi_{i-1} \Vert_1>0\) (the L1-norm measures whether the policy changed for any state):
- \(V^{\pi_i} \leftarrow \text{MDP } V\) function policy evaluation of \(\pi_i\)
- \(\pi_{i+1} \leftarrow\) Policy improvement
- i=i+1
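A minimal Python sketch of this loop, under the same illustrative conventions as above (a transition tensor `P[s, a, s']` and a reward table `R[s, a]`, names chosen for the example):

```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    """Sketch of policy iteration for a tabular MDP."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: iterate the Bellman backup for the current policy.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # Policy improvement: act greedily with respect to Q^{pi_i}.
        Q = R + gamma * P @ V                       # shape (|S|, |A|)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):              # policy unchanged -> stop
            return pi, V
        pi = pi_new
```

The improvement step (acting greedily with respect to \(Q^{\pi_i}\)) is exactly the computation detailed in the next subsection.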
Policy Improvement
We obtain \(V^{\pi_i}\) from policy evaluation.
- Compute the state-action value of the policy \(\pi_i\)
- For \(s\in S\) and \(a\in A\)
- \[Q^{\pi_i}(s,a)=R(s,a)+\gamma \sum_{s'\in S}p(s'\mid s,a)V^{\pi_i}(s')\]
- Compute the new policy \(\pi_{i+1}\): \(\pi_{i+1}(s)=\arg\max_{a}Q^{\pi_i}(s,a), \forall s \in S\)
By definition, \(\max_{a}Q^{\pi_i}(s,a) \geq Q^{\pi_i}(s,\pi_i(s))\)
\[\max_a Q^{\pi_i}(s,a) \geq R(s,\pi_i(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_i(s))V^{\pi_i}(s')=V^{\pi_i}(s)\]This says: if we take the action \(\pi_{i+1}(s)\) for one step and then follow \(\pi_i\) forever, our expected sum of rewards is at least as good as if we had always followed \(\pi_i\). This may look weaker than what we want, since the new proposal is to follow \(\pi_{i+1}\) at every step; the monotonic improvement result below closes that gap.
Monotonic Improvement in Policy
Definition:
\[V^{\pi_1} \ge V^{\pi_2}: V^{\pi_1}(s) \ge V^{\pi_2}(s), \forall s \in S\]

Proposition: \(V^{\pi_{i+1}} \ge V^{\pi_i}\), with strict inequality if \(\pi_i\) is suboptimal, where \(\pi_{i+1}\) is the new policy we get from policy improvement on \(\pi_i\)
Proof:
\begin{equation} V^{\pi_i}(s) \le \max_a Q^{\pi_i}(s,a)\newline =\max_a \left[R(s,a)+\gamma \sum_{s'\in S}p(s'\mid s,a)V^{\pi_i}(s')\right]\newline =R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))V^{\pi_i}(s') \text{ by definition of } \pi_{i+1}\newline \le R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))\max_{a'} Q^{\pi_i}(s',a')\newline = R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))\left(R(s',\pi_{i+1}(s'))+\gamma \sum_{s''\in S}p(s''\mid s',\pi_{i+1}(s'))V^{\pi_i}(s'')\right)\newline \dots\newline = V^{\pi_{i+1}}(s) \end{equation}
If policy does not change, it can never change again.
There is a maximum number of iterations of policy iteration (\(|A|^{|S|}\)).
Bellman Backup Operators
- Applied to a value function
- Returns a new value function
- Improves the value if possible
- \[BV(s)=\max_a [R(s,a)+\gamma \sum_{s'\in S}p(s'\mid s,a)V(s')]\]
- \(BV\) yields a value function over all states s
Value Iteration (VI)
- Set \(k=1\)
- Initialize \(V_0(s)=0\) for all states s
- Loop until convergence (e.g. \(\Vert V_{k+1}-V_{k}\Vert_\infty \le \epsilon\))
- For each state \(s\)
- \[V_{k+1}(s)=\max_a[R(s,a)+\gamma \sum_{s'\in S}P(s'\mid s, a)V_k(s')]\]
- View as Bellman backup on value function
- \[V_{k+1}=BV_k\]
- \[\pi_{k+1}(s)=arg\max_a[R(s,a)+\gamma \sum_{s'\in S}P(s'\mid s, a)V_k(s')]\]
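A minimal sketch of value iteration under the same illustrative conventions as the earlier sketches (transition tensor `P[s, a, s']`, reward table `R[s, a]`):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Sketch of value iteration for a tabular MDP."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V                    # Bellman backup: one-step lookahead for every (s, a)
        V_new = np.max(Q, axis=1)                # V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) <= eps:
            return np.argmax(Q, axis=1), V_new   # greedy policy and value estimate
        V = V_new
```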
Policy Iteration as Bellman Operations
Bellman backup operator \(B^\pi\) for a particular policy is defined as \(B^{\pi} V(s)=R^{\pi} (s)+\gamma \sum_{s' \in S}P^\pi(s'\mid s)V(s')\)
Policy evaluation amounts to computing the fixed point of \(B^{\pi}\)
To do policy evaluation, repeatedly apply operator until \(V\) stops changing \(V^\pi = B^\pi B^\pi \dots B^\pi V\)
To do policy improvement \(\pi_{k+1}(s)= arg\max_a[R(s,a)+\gamma \sum_{s'\in S}P(s'\mid s,a)V^{\pi_k}(s')]\)
Contraction Operator
Let \(O\) be an operator and \(\|x\|\) denote any norm of \(x\)
If, for some \(\alpha < 1\), \(\|OV-OV'\| \le \alpha\|V-V'\|\) for all \(V, V'\), then \(O\) is a contraction operator
Will Value Iteration Converge?
Yes, if the discount factor \(\gamma < 1\), or if the process ends up in a terminal state with probability 1.
Bellman backup is a contraction if discount factor, \(\gamma < 1\).
If we apply it to two different value functions, the distance between them shrinks after applying the Bellman backup to each.
Proof: Bellman Backup is a Contraction on \(V\) for \(\gamma < 1\)
Let \(\Vert V-V'\Vert_\infty=\max_s |V(s)-V'(s)|\) be the infinity norm
\[\Vert BV_k-BV_j\Vert_\infty \le \max_{s,a} \left|\gamma \sum_{s'\in S}P(s'\mid s, a)\big(V_k(s')-V_j(s')\big)\right| \le \gamma \max_{s,a}\sum_{s'\in S}P(s'\mid s, a)\,\Vert V_k-V_j\Vert_\infty =\gamma\,\Vert V_k-V_j\Vert_\infty\](the first inequality uses \(|\max_a f(a)-\max_a g(a)|\le \max_a|f(a)-g(a)|\)). Even if all the inequalities are equalities, this is still a contraction if \(\gamma < 1\).
MDP: Computing Optimal Policy and Optimal Value
Policy iteration computes infinite horizon value of a policy and then improves that policy.
Value iteration maintains the optimal value of starting in a state \(s\) as if there were a finite number of steps \(k\) left in the episode, and iterates to consider longer and longer horizons.
Consider an infinite horizon MDP with \(\gamma < 1\). A policy \(\pi\) in this MDP induces a value function \(V^{\pi}\) (let us refer to this as \(V^\pi_{old}\)). Now suppose we have a new MDP where the only difference is that all rewards have a constant \(c\) added to them. Prove that \(V_{new}^\pi(s)=V_{old}^\pi(s)+\frac{c}{1-\gamma}\)
\begin{equation} V^\pi_{old}(s)=E_\pi \left[\sum_{T=0}^{\infty}\gamma^T r_{t+T}\mid s_t=s\right]\newline V^\pi_{new}(s)=E_\pi \left[\sum_{T=0}^{\infty}\gamma^T(r_{t+T}+c)\mid s_t=s\right]\newline =E_\pi \left[\sum_{T=0}^{\infty}\gamma^T r_{t+T}\mid s_t=s\right]+c\sum_{T=0}^{\infty}\gamma^T\newline =V_{old}^\pi(s)+\frac{c}{1-\gamma} \end{equation}
Summary
Given the dynamics/transition model \(P\) and reward model \(R\), we can use dynamic programming to evaluate how good a policy is.
\(V_k^\pi(s)\) is the exact \(k\)-horizon value of state \(s\) under policy \(\pi\)
\(V_k^\pi(s)\) is an estimate of the infinite horizon value of state \(s\) under policy \(\pi\)
Dynamic programming computes this as \(V_k^\pi(s)=r(s,\pi(s))+\gamma \sum_{s'\in S} P(s'\mid s,\pi(s))V_{k-1}^\pi(s')\), bootstrapping the rest of the expected return with the value estimate \(V_{k-1}\).
If we know the model \(P(s' \mid s,a)\), we can compute the immediate reward and the expectation over next states exactly, and then bootstrap using the current estimate \(V_{k-1}\)
Dynamic Programming:
- \[V_k^\pi(s)\approx E_\pi[r_t+\gamma V_{k-1}(s_{t+1})\mid s_t=s]\]
- Requires model of MDP \(M\)
- Bootstraps future return using value estimate
- Requires the Markov assumption: bootstrapping regardless of history
Model-Free Policy Evaluation: Policy Evaluation Without Knowing How the World Works
- Estimating the expected return of a particular policy when we don't have access to the true MDP model
- Monte Carlo policy evaluation
- Temporal Difference (TD)
Monte Carlo (MC) Policy Evaluation
- \[V^\pi(s)=E_{T\sim \pi}[G_t\mid s_t=s]\]
- If trajectories are all finite, sample set of trajectories & average returns
- Does not require MDP dynamics/rewards
- Does not assume state is Markov
- Can only be applied to episodic MDPs
- Averaging over returns from a complete episode
- Requires each episode to terminate
In general, we get the Monte Carlo estimate of some quantity by observing many iterations of how that quantity is generated either in real life or via simulation and then averaging over the observed quantities. By the law of large numbers, this average converges to the expectation of the quantity.
- Aim: estimate \(V^\pi(s)\) given episodes generated under policy \(\pi\); this is the average of the returns \(G_t\) under policy \(\pi\) starting at state \(s\)
- MC computes empirical mean return
- Often do this in an incremental fashion. After each episode, update estimate of \(V^\pi\)
- Execute a rollout of policy \(\pi\) until termination many times
- Record the returns \(G_t\) that we observe when starting at state s
- Take an average of the values we get for \(G_t\) to estimate \(V^\pi(s)\).
First-Visit Monte Carlo (MC) On Policy Evaluation
Initialize \(N(s)=0, G(s)=0\ \forall s\in S\), where \(N(s)\) counts the number of first visits to state \(s\) and \(G(s)\) accumulates the corresponding returns.
Loop
- Sample episode \(i=s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2}\ldots\)
- Define \(G_{i,t}=r_{i,t}+\gamma r_{i,t+1}+\gamma^2 r_{i,t+2}+\ldots+\gamma^{T_i-1} r_{i,T_i}\) as return from time step \(t\) onwards in \(i\)th episode.
- For each time step t till the end of the episode i
- If this is the first time t that state s is visited in episode i
- Increment counter of total first visits: \(N(s)=N(s)+1\)
- Increment total return\(G(s)=G(s)+G_{i,t}\)
- Update estimate \(V^\pi(s)=G(s)/N(s)\)
Every-Visit Monte Carlo (MC) On Policy Evaluation
Initialize \(N(s)=0,G(s)=0\forall s \in S\)
Loop
- Sample episode \(i=s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2}\ldots,s_{i,T_i}\)
- Define \(G_{i,t}=r_{i,t}+\gamma r_{i,t+1}+\gamma^2 r_{i,t+2}+\ldots+\gamma^{T_i-1} r_{i,T_i}\) as return from time step \(t\) onwards in \(i\)th episode.
- For each time step t till the end of the episode i
- Let \(s\) be the state visited at time step \(t\) in episode \(i\)
- Increment counter of total visits: \(N(s)=N(s)+1\)
- Increment total return\(G(s)=G(s)+G_{i,t}\)
- Update estimate \(V^\pi(s)=G(s)/N(s)\)
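A minimal sketch covering both the first-visit and every-visit variants above, assuming each episode is given as a list of `(state, reward)` pairs (an illustrative data layout, not from the text):

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma, first_visit=True):
    """Sketch of first-visit / every-visit MC evaluation of V^pi."""
    N = defaultdict(int)      # visit counts
    G = defaultdict(float)    # accumulated returns
    for episode in episodes:
        returns, g = [], 0.0
        for state, reward in reversed(episode):     # compute G_t for every t in one backward pass
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g_t in returns:
            if first_visit and state in seen:       # skip repeat visits in first-visit mode
                continue
            seen.add(state)
            N[state] += 1
            G[state] += g_t
    return {s: G[s] / N[s] for s in N}              # V^pi estimate per visited state
```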
Evaluation of the Quality of a Policy Estimation Approach: Bias, Variance and MSE
Consider a statistical model that is parameterized by θ and that determines a probability distribution over observed data P(x|θ)
Consider a statistic \(\hat{\theta}\) that provides an estimate of θ and is a function of observed data x
Definition: the bias of an estimator \(\hat{\theta}\) is: \(Bias_{\theta}(\hat{\theta})=E_{x\mid \theta}[\hat{\theta}(x)]-\theta\)
Definition: the variance of an estimator \(\hat{\theta}\) is: \(Var(\hat{\theta})=E_{x\mid \theta}[(\hat{\theta}(x)-E[\hat{\theta}(x)])^2]\)
Definition: mean squared error (MSE) of an estimator \(\hat{\theta}\) is: \(MSE(\hat{\theta})=Var(\hat{\theta})+Bias_{\theta}(\hat{\theta})^2\)
Let n be the number of data points x used to estimate the parameter θ and call the resulting estimate of θ using that data \(\hat{\theta}_n\)
Then the estimator \(\hat{\theta}_n\) is consistent if, for all \(\epsilon > 0\), \(\lim_{n \rightarrow \infty} Pr(|\hat{\theta}_n-\theta|>\epsilon)=0\)
First-Visit Monte Carlo (MC) On Policy Evaluation
\(V^\pi\) estimator is unbiased and consistent
Every-Visit Monte Carlo (MC) On Policy Evaluation
\(V^\pi\) estimator is biased and consistent, but has better MSE
Temporal Difference Learning
- Combination of Monte Carlo & dynamic programming methods
- Model-free
- Can be used in episodic or infinite-horizon non-episodic settings
- Immediately updates the estimate of \(V\) after each \((s, a, r, s')\) tuple
Temporal Difference [TD(0)] Learning
\(V^\pi(s_t) \leftarrow V^\pi(s_t)+\alpha[r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)]\)
TD error:\(\delta_t=r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)\)
Algorithm
Input: \(\alpha\)
Initialize \(V^\pi(s) = 0, \forall s \in S\)
Loop
- Sample tuple \((s_t, a_t, r_t, s_{t+1})\)
- \[V^\pi(s_t) \leftarrow V^\pi(s_t)+\alpha[r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)]\]
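A minimal sketch of this loop, assuming experience arrives as a stream of `(s, a, r, s')` tuples:

```python
from collections import defaultdict

def td0_evaluation(experience, alpha, gamma):
    """Sketch of TD(0) policy evaluation from a stream of (s, a, r, s') tuples."""
    V = defaultdict(float)                           # V^pi(s) = 0 initially
    for s, a, r, s_next in experience:
        td_error = r + gamma * V[s_next] - V[s]      # delta_t
        V[s] += alpha * td_error                     # V(s) <- V(s) + alpha * delta_t
    return V
```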
Model Free Control
Maximization Bias
Value Function Approximation
So far we have represented value function by a lookup table where each state has a corresponding entry, \(V(s)\), or each state-action pair has an entry, \(Q(s, a)\). However, this approach might not generalize well to problems with very large state and/or action spaces, or in other cases we might prefer quickly learning approximations over converged values of each state. A popular approach to address this problem is via function approximation:
\[v_\pi(s)\approxeq \hat{v}(s,\boldsymbol{w}) \text{ or } q_\pi(s,a)\approxeq \hat{q}(s,a,\boldsymbol{w})\]Here \(\boldsymbol{w}\) is usually referred to as the parameter or weights of our function approximator. We list a few popular choices for function approximators:
- Linear combinations of features
- Neural networks
- Decision trees
- Nearest neighbors
- Fourier / wavelet bases
Linear Feature Representations
In linear function representations, we use a feature vector to represent a state: \(x(s)=[x_1(s),x_2(s),\ldots,x_n(s)]\)
We then approximate our value functions using a linear combination of features: \(\hat{v}(s,\boldsymbol{w})=x(s)^T\boldsymbol{w}=\sum_{j=1}^n w_j x_j(s)\)
We define the (quadratic) objective (also known as the loss) function to be: \(J(\boldsymbol{w})=E_\pi[(v_\pi(s)-\hat{v}(s,\boldsymbol{w}))^2]\)
Gradient descent
A common technique to minimize the above objective function is called Gradient Descent.
Mathematically, this can be summarized as:
compute the gradient: \(\nabla_\boldsymbol{w}J(\boldsymbol{w})=[\frac{\partial J(\boldsymbol{w})}{\partial w_1},\frac{\partial J(\boldsymbol{w})}{\partial w_2}, \ldots, \frac{\partial J(\boldsymbol{w})}{\partial w_n}]\)
compute an update step using gradient descent: \(\Delta \boldsymbol{w} = -\frac{1}{2} \alpha \nabla_\boldsymbol{w}J(\boldsymbol{w})\)
take a step towards the local minimum \(\boldsymbol{w} = \boldsymbol{w} + \Delta \boldsymbol{w}\)
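For the linear approximator above, this gradient has a simple closed form. Substituting \(\hat{v}(s,\boldsymbol{w})=x(s)^T\boldsymbol{w}\) into \(J(\boldsymbol{w})\) gives

\begin{equation} \nabla_\boldsymbol{w}J(\boldsymbol{w})=-2E_\pi[(v_\pi(s)-\hat{v}(s,\boldsymbol{w}))\,x(s)]\newline \Delta \boldsymbol{w} = -\frac{1}{2}\alpha \nabla_\boldsymbol{w}J(\boldsymbol{w})=\alpha E_\pi[(v_\pi(s)-\hat{v}(s,\boldsymbol{w}))\,x(s)] \end{equation}

A stochastic update replaces the expectation with a single sampled state, \(\Delta \boldsymbol{w} = \alpha (v_\pi(s)-\hat{v}(s,\boldsymbol{w}))x(s)\); since \(v_\pi(s)\) is unknown, in practice it is replaced by a target such as the Monte Carlo return or the TD target, as in the following sections.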
Stochastic gradient descent
In practice, Gradient Descent is not considered a sample efficient optimizer and stochastic gradient descent (SGD) is used more often. Although the original SGD algorithm referred to updating the weights using a single sample, due to the convenience of vectorization, people often refer to gradient descent on a minibatch of samples as SGD as well. In (minibatch) SGD, we sample a minibatch of past experiences, compute our objective function on that minibatch, and update our parameters using gradient descent on the minibatch.
Monte Carlo with linear VFA
- Initialize \(\boldsymbol{w}=0\), \(k=1\)
- Loop
  - Sample the \(k\)-th episode \((s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, \ldots, s_{k,L_k})\) given \(\pi\)
  - For \(t=1,\ldots,L_k\) do
    - If this is the first visit to \(s_t\) in episode \(k\) then
      - Compute the return \(G_{k,t}=\sum_{j=t}^{L_k}\gamma^{j-t}r_{k,j}\)
      - \(\boldsymbol{w} \leftarrow \boldsymbol{w}+\alpha(G_{k,t}-\hat{v}(s_t,\boldsymbol{w}))x(s_t)\)
  - \(k=k+1\)
Temporal Difference (TD(0)) with linear VFA
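Combining the TD(0) update above with the linear approximator \(\hat{v}(s,\boldsymbol{w})=x(s)^T\boldsymbol{w}\), a minimal statement of the TD(0) update with linear VFA is

\[\boldsymbol{w} \leftarrow \boldsymbol{w}+\alpha\big(r+\gamma\, x(s')^T\boldsymbol{w}-x(s)^T\boldsymbol{w}\big)x(s)\]

where the TD target \(r+\gamma x(s')^T\boldsymbol{w}\) plays the role of \(v_\pi(s)\) in the gradient update of the previous section (a semi-gradient update, since the target itself depends on \(\boldsymbol{w}\)).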
Convergence Guarantees for Linear VFA for Policy Evaluation
| Algorithm | Tabular | Linear VFA | Nonlinear VFA |
|---|---|---|---|
| Monte-Carlo Control | Yes | (Yes) | No |
| SARSA | Yes | (Yes) | No |
| Q-learning | Yes | No | No |

(Yes) means the result chatters around the near-optimal value function.
Control using VFA
Neural Networks
Although linear VFAs often work well given the right set of features, it can be difficult to handcraft such a feature set. Neural networks provide a much richer function approximation class that can work directly from states without requiring an explicit specification of features.
Value-Based Deep Reinforcement Learning
- Deep Q-Network
- Double DQN
- Dueling DQN
All three neural architectures are able to learn successful policies directly from high-dimensional inputs, e.g. preprocessed pixels from video games, using end-to-end reinforcement learning, and they all achieve a level of performance comparable to a professional human games tester across a set of 49 games on Atari 2600.
Convolutional Neural Networks (CNNs) are used in these architectures for feature extraction from pixel inputs. Understanding how CNNs extract features can help in understanding how DQN works.
Generalization: Deep Q-Network (DQN)
The performance of linear function approximators depends heavily on the quality of the features. In general, handcrafting an appropriate set of features can be difficult and time-consuming. To scale up to decision making in very large domains (e.g. huge state spaces) and to enable automatic feature extraction, deep neural networks (DNNs) are used as function approximators.
DQN Architecture
Preprocessing Raw Pixels
The raw Atari 2600 frames are of size (210 × 160 × 3), where the last dimension corresponds to the RGB channels. The preprocessing step adopted in (Mnih et al., 2015) aims at reducing the input dimensionality and dealing with some artifacts of the game emulator. We summarize the preprocessing as follows:
- Single frame encoding: to encode a single frame, take the maximum value for each pixel color over the frame being encoded and the previous frame. In other words, we return a pixel-wise max-pooling of the 2 consecutive raw pixel frames.
- Dimensionality reduction: extract the Y channel, also known as luminance, from the encoded RGB frame and rescale it to (84 × 84 × 1).
The above preprocessing is applied to the 4 most recent raw RGB frames, and the encoded frames are stacked together to produce the input (of shape (84 × 84 × 4)) to the Q-network. Stacking the recent frames as the game state is also a way to make the game environment (almost) Markovian.
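A rough sketch of this preprocessing (not the exact DeepMind implementation; the luminance weights and the `cv2.resize` call are illustrative choices):

```python
import numpy as np
import cv2  # assumed available for resizing; any image library would do

def preprocess(frame, prev_frame):
    """Sketch of the frame encoding described above."""
    # 1. Pixel-wise max over two consecutive raw RGB frames (removes flickering sprites).
    encoded = np.maximum(frame, prev_frame).astype(np.float32)
    # 2. Extract luminance (Y channel) from RGB and rescale to 84x84.
    luminance = 0.299 * encoded[..., 0] + 0.587 * encoded[..., 1] + 0.114 * encoded[..., 2]
    return cv2.resize(luminance, (84, 84), interpolation=cv2.INTER_AREA)

def stack_frames(last_four_processed):
    """Stack the 4 most recent processed frames into the (84, 84, 4) network input."""
    return np.stack(last_four_processed, axis=-1)
```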
Training Algorithm for DQN
Reducing Bias: Double Deep Q-Network (DDQN)
Decoupling Value and Advantage: Dueling Network
Introduction to Policy Search
Stochastic policies
Policy optimization
Policy objective functions
Once we have defined a policy \(\pi_\theta(a\mid s)\), we need to be able to measure how well it performs in order to optimize it. In an episodic environment, a natural measure is the start value of the policy, which is the expected value of the start state:
\[J_1(\theta)=V^{\pi_\theta}(s_1)=E_{\pi_\theta}[v_1]\]In continuing environments we can use the average value of the policy, where \(d^{\pi_\theta}(s)\) is the stationary distribution of \(\pi_\theta\):
\[J_{avV}(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s)\]or alternatively we can use the average reward per time-step: \(J_{avR}(\theta)=\sum_s d^{\pi_\theta}(s)\sum_a \pi_\theta(a\mid s)R(s,a)\)
Gradient-free optimization methods
With an objective function defined, we can treat our policy-based reinforcement learning problem as an optimization problem. We focus on gradient-based methods, because they have recently been the most common choice for policy-based RL. However, it is worth considering some gradient-free optimization methods, including the following:
- Hill climbing
- Simplex / amoeba / Nelder-Mead
- Genetic algorithms
- Cross-Entropy method (CEM)
- Covariance Matrix Adaptation (CMA)
- Evolution strategies
These methods have the advantage over gradient-based methods in that they do not have to compute a gradient of the objective function. This allows the policy parameterization to be non-differentiable, and these methods are also often easy to parallelize. Gradient-free methods are often a useful baseline to try, and sometimes they can work embarrassingly well.
However, these methods are usually not very sample efficient because they ignore the temporal structure of the rewards: updates only take into account the total reward over the entire episode and do not break it up into the rewards received at each state in the trajectory.
Policy gradient
Let us define \(V(s_0, \theta)=V^{\pi_\theta}(s_0)\) to be the objective function we wish to maximize over \(\theta\). Policy gradient methods search for a local maximum of \(V(s_0, \theta)\) by ascending its gradient with respect to the policy parameters \(\theta\)
\[\Delta \theta = \alpha \nabla_\theta V(s_0, \theta)\]where \(\alpha\) is a step-size parameter and \(\nabla_\theta V(\theta)\) is the policy gradient.
\[\nabla_\theta V(\theta)=\begin{bmatrix}\frac{\partial V(\theta)}{\partial \theta_1}\\ \vdots \\ \frac{\partial V(\theta)}{\partial \theta_n}\end{bmatrix}\]

Compute Gradients by Finite Differences
For each dimension \(k \in [1,n]\)
- Estimate kth partial derivative of objective function w.r.t. \(\theta\)
- By perturbing \(\theta\) by small amount \(\epsilon\) in kth dimension
\(\frac{\partial V(s_0, \theta)}{\partial \theta_k} \approxeq \frac{V(s_0,\theta+\epsilon u_k)-V(s_0,\theta)}{\epsilon}\) where \(u_k\) is a unit vector with 1 in kth component, 0 elsewhere.
- Use n evaluations to compute policy gradient in n dimensions
- Simple, noisy, inefficient, but sometimes effective
- Works for arbitrary policies, even if the policy is not differentiable
- Uses a number of evaluations that scales linearly with the policy parameter dimension (not the state dimension)
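A minimal sketch of this estimator, where `evaluate_policy` is a hypothetical function that returns an estimate of \(V(s_0,\theta)\), e.g. by averaging returns over a few rollouts:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient by finite differences (theta is a NumPy array)."""
    grad = np.zeros_like(theta)
    base = evaluate_policy(theta)                         # baseline evaluation at theta
    for k in range(len(theta)):
        perturbed = theta.copy()
        perturbed[k] += eps                               # perturb only the k-th dimension
        grad[k] = (evaluate_policy(perturbed) - base) / eps
    return grad                                           # n evaluations for n dimensions (plus the baseline)
```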
Finite difference policy gradient:
- Is guaranteed to converge to a local optimum
- Is not guaranteed to converge to a global optimum
- Does not rely on the Markov assumption
Computing the gradient analytically
- Assume policy \(\pi_\theta\) is differentiable whenever it is non-zero
- Assume we can calculate gradient \(\nabla_\theta \pi_\theta(s,a)\) analytically
- Many choices of differentiable policy classes including:
- Softmax
- Gaussian
- Neural networks
Appendix
Stochastic Matrices
A stochastic matrix is a square matrix of non-negative real numbers, each lying in \([0,1]\), that lists the transition probabilities of a finite Markov chain. It is also called a probability matrix, probability transition matrix, transition matrix, substitution matrix, or Markov matrix. These square matrices can take multiple forms, depending on whether the rows or the columns sum to 1.
- A right stochastic matrix has each row summing to 1.
- A left stochastic matrix has each column summing to 1.
- A doubly stochastic matrix has both rows and columns summing to 1.
In the context of a Markov process, the transition probability matrix \(P\) has size \(|S|\times|S|\), with \((i,j)\) entry \(P_{ij}=P(j\mid i)\), where \(i,j\) refer to the states of \(S\) ordered arbitrarily. The matrix \(P\) is a non-negative row-stochastic matrix.
Let A be a stochastic matrix. Then:
- 1 is an eigenvalue of A.
- If \(\lambda\) is a (real or complex) eigenvalue of A, then \(|\lambda| \leq 1\)
Proof:
\[A^T\begin{bmatrix}1\\1\\ \vdots\\1\end{bmatrix}=\begin{bmatrix}1\\1\\ \vdots\\1\end{bmatrix}\](the proof is written for a column-stochastic \(A\); the row-stochastic case follows because \(A\) and \(A^T\) have the same eigenvalues). Therefore, 1 is an eigenvalue of \(A^T\). But \(A\) and \(A^T\) have the same characteristic polynomial: \(det(A-\lambda I_n)=det((A-\lambda I_n)^T)=det(A^T-\lambda I_n)\)
Therefore, 1 is an eigenvalue of A.
Now let \(\lambda\) be any eigenvalue of A, so it is also an eigenvalue of \(A^T\). Let \(x=(x_1,x_2,...,x_n)\) be an eigenvector of \(A^T\) with eigenvalue \(\lambda\). So \(A^T x = \lambda x\). The jth entry of this vector equation is \(\lambda x_j=\sum_{i=1}^n a_{ij}x_i\).
Choose \(x_j\) with the largest absolute value, so \(|x_i| \leq |x_j|\) for all i. Then \(|\lambda x_j|=|\sum_{i=1}^n a_{ij}x_i|\leq \sum_{i=1}^n a_{ij} \times |x_i|\leq \sum_{i=1}^n a_{ij} \times |x_j| = 1 \times |x_j|\)
where the last equality holds because \(\sum_{i=1}^n a_{ij} =1\). This implies \(\mid \lambda \mid \leq 1\).
Cauchy sequence
A sequence is called a Cauchy sequence if the terms of the sequence eventually all become arbitrarily close to one another. That is, given \(\epsilon > 0\) there exists \(N\) such that if \(m, n > N\) then \(|a_m- a_n| < \epsilon\).
- Any Cauchy sequence is bounded.
- Any convergent sequence is a Cauchy sequence.
- A Real Cauchy sequence is convergent.
Policy Gradient Theory
Let \(\theta\) denote the vector of policy parameters and \(\rho\) the performance of the corresponding policy. Then, in the policy gradient approach, the policy parameters are updated approximately proportionally to the gradient: \(\Delta \theta \approx \alpha \frac{\partial \rho}{\partial \theta}\)
where \(\alpha\) is a positive-definite step size.
With function approximation, two ways of formulating the agent's objective are useful. One is the average reward formulation, in which policies are ranked according to their long-term expected reward per step, \(\rho(\pi)\): \(\rho(\pi)=\lim_{n \rightarrow \infty} \frac{1}{n}E\{r_1 + r_2 + \cdots + r_n \mid \pi\} = \sum_{s}d^{\pi}(s)\sum_a \pi(s,a)R_s^a\)
where \(d^{\pi}(s)=\lim_{t\rightarrow \infty}Pr\{s_t=s \mid s_0,\pi\}\) is the stationary distribution of states under \(\pi\), which we assume exists and is independent of \(s_0\) for all policies. In the average reward formulation, the value of a state-action pair given a policy is defined as \(Q^{\pi}(s,a)=\sum_{t=1}^{\infty}E\{r_t-\rho(\pi) \mid s_0=s,a_0=a,\pi\}\)
The second formulation is the one in which there is a designated start state \(s_0\), and we care only about the long-term reward obtained from it: \(\rho(\pi)=E\{\sum_{t=1}^\infty \gamma^{t-1}r_t \mid s_0,\pi\}\)
\[Q^{\pi}(s,a)=E\{\sum_{k=1}^\infty \gamma^{k-1}r_{t+k} \mid s_t=s,a_t=a,\pi\}\]where \(\gamma \in [0,1]\) is a discount rate. In this formulation, we define \(d^\pi(s)\) as a discounted weighting of states encountered starting at \(s_0\) and then following \(\pi\): \(d^\pi(s)=\sum_{t=0}^{\infty} \gamma^t Pr\{s_t=s\mid s_0,\pi\}\).
References
- Mnih et al. Human-level control through deep reinforcement learning. Nature, Feb 2015.
Score Function
A score function is the derivative of the log of a parameterized probability/likelihood. For example, let \(p(s;\theta)\) be the probability of state s under parameter \(\theta\), then the score function would be \(\nabla_\theta \log{p(s;\theta)}\)
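As an illustrative example (using action features \(x(s,a)\) analogous to the state features above; this particular parameterization is not from the original text), for a softmax policy \(\pi_\theta(a\mid s)=\frac{\exp(\theta^T x(s,a))}{\sum_b \exp(\theta^T x(s,b))}\) over a finite action set, the score function is

\[\nabla_\theta \log \pi_\theta(a\mid s)=x(s,a)-\sum_b \pi_\theta(b\mid s)\,x(s,b)\]

i.e. the feature vector of the chosen action minus the feature vector averaged under the policy.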