RL Chapter 4: Dynamic Programming
Solving finite MDPs. Notes from Sutton & Barto.
Yash Bonde . 2019-08-25 . 8 min read
Reinforcement Learning · Notes

These are just my notes on the book Reinforcement Learning: An Introduction; all credit for the book goes to the authors and other contributors. My solutions to the exercises are on this page, and additional concise notes for this chapter are here. There is also DeepMind's YouTube lecture and slides, plus some other slides.

We do not know what the rules of the game are; all we are allowed to do is watch the playing. Of course, if we are allowed to watch long enough, we may catch on to a few of the rules. - Richard Feynman

Introduction

Dynamic Programming (DP) means optimising a "program" (which we call a policy) over a temporal problem. It is a method for solving complex problems by breaking them down into smaller, simpler subproblems. DP has the following requirements:

  • Optimal substructure: the principle of optimality applies, so an optimal solution can be composed from optimal solutions to subproblems
  • Overlapping subproblems: subproblems recur many times, so their solutions can be cached and reused

MDPs satisfy both properties and can therefore be solved with DP. The value function already tells you the value of each state, so it acts like a cache.
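As a side note, the caching of sub-solutions here is the classic memoisation trick. A minimal non-RL sketch (hypothetical example, using Python's `functools.lru_cache` as the cache):

```python
from functools import lru_cache

# Toy (non-RL) illustration of the two requirements: fib(n) has optimal
# substructure (it is built from fib(n-1) and fib(n-2)) and overlapping
# subproblems (the same fib(k) recurs many times), so caching the
# sub-solutions, much as a value function caches state values, makes
# the exponential recursion run in linear time.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

result = fib(30)   # 832040; without the cache this takes millions of calls
```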

Dynamic programming is used in many other places as well: scheduling algorithms, sequence alignment, shortest paths, graph algorithms, bioinformatics (lattice models), etc. Dynamic programming assumes full knowledge of the MDP. It is used for two different tasks:

  • Prediction: compute the value function $v_\pi$ for a given policy
  • Control: compute the optimal value function $v_*$ and the optimal policy $\pi_*$

4.1 Policy Evaluation (Prediction)

Although DP ideas can be applied to problems with continuous state and action spaces, exact solutions are possible only in special cases. Recall from Chapter 3 that the value of a state is given by:

v_\pi(s) = \mathbb{E}_\pi \left[ G_t | S_t = s \right]

= \mathbb{E}_\pi \left[R_{t+1} + \gamma G_{t+1} | S_t = s\right]

= \mathbb{E}_\pi \left[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s\right]

= \sum_a \pi (a|s) \sum_{s',r} p(s',r|s, a) \left[r + \gamma v_\pi(s')\right] \tag{1}

Consider a sequence of approximate value functions $v_0, v_1, v_2, \ldots$, each mapping $S^+$ to $\mathbb{R}$. The initial approximation $v_0$ is chosen arbitrarily (except that any terminal state must be given value $0$), and each successive approximation is obtained by using the Bellman equation for $v_\pi$, equation $(1)$, as an update:

v_{k+1}(s) = \sum_a \pi (a|s) \sum_{s',r} p(s',r|s, a) \left[r + \gamma v_k(s')\right] \tag{2}

Indeed, it can be shown that in general $v_k$ converges to $v_\pi$ as $k \rightarrow \infty$ under the same conditions that guarantee the existence of $v_\pi$. This algorithm is called iterative policy evaluation. All updates in DP algorithms are called expected updates because they are based on an expectation over all possible next states rather than on a sample next state. We think of the updates as being done in a sweep through the state space, in place, rather than with the two-array method.
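The sweep in equation $(2)$ can be sketched as follows, under an assumed tabular representation (hypothetical toy MDP; `P[s, a, s']` holds transition probabilities and `R[s, a]` expected immediate rewards):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] transition
# probabilities, R[s, a] expected immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def policy_evaluation(pi, P, R, gamma, theta=1e-8):
    """Iterative policy evaluation: sweep eq. (2) until the largest
    change in v across a sweep falls below theta."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)           # arbitrary v_0 (terminal states stay 0)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma sum_{s'} p(s'|s,a) v_k(s') ]
            new_v = sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ v)
                        for a in range(n_actions))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v             # in-place sweep, not the two-array method
        if delta < theta:
            return v

pi_uniform = np.full((2, 2), 0.5)    # equiprobable random policy
v = policy_evaluation(pi_uniform, P, R, gamma)
```

The in-place update (writing `v[s]` immediately rather than keeping a second array) typically converges at least as fast as the two-array version.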

4.2 Policy Improvement

We know how good it is to follow the current policy $\pi$ from $s$, i.e. $v_\pi(s)$, but would it be better or worse to change to a new policy? One way to answer this is to consider selecting an action $a$ in $s$ and thereafter following the existing policy $\pi$:

q_\pi(s,a) = \sum_{s',r}p(s',r|s,a) \left[r + \gamma v_\pi(s') \right]

If this is greater than the value $v_\pi(s)$, we can expect that taking action $a$ every time $s$ is encountered would be a better policy. That this is true is a special case of the policy improvement theorem. Let $\pi$ and $\pi'$ be any pair of deterministic policies such that $\forall s \in S$:

v_\pi(s) \leq q_\pi(s, \pi'(s)) \tag{1}

= \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = \pi'(s)\right]

= \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s\right]

\leq \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) | S_t = s\right]

= \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) | S_t = s\right]

\vdots

\leq \mathbb{E}_{\pi'}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots | S_t = s\right]

= v_{\pi'}(s)

and therefore

v_\pi(s) \leq v_{\pi'}(s) \tag{2}

That is, if there is strict inequality in $(1)$ at some state, then there is strict inequality in $(2)$ at that state. Thus if $q_\pi(s, a) > v_\pi(s)$ for $a = \pi'(s)$, the changed policy is indeed better than $\pi$. This suggests selecting, at each state, the action with the best $q_\pi(s,a)$, i.e. the greedy policy:

\pi'(s) = \arg\max_a q_\pi(s,a) = \arg\max_a \sum_{s',r}p(s',r|s,a)\left[r + \gamma v_\pi(s')\right] \tag{3}

By construction, the greedy policy meets the condition of the policy improvement theorem $(1)$, so we know it is as good as, or better than, the original policy. Suppose the new greedy policy $\pi'$ is as good as, but not better than, the old policy, so that $v_{\pi'} = v_\pi$; then by $(3)$ it follows that $\forall s \in S$:

v_{\pi'}(s) = \max_a \sum_{s',r}p(s',r|s,a)\left[r + \gamma v_{\pi'}(s')\right]

But this is the same as the Bellman optimality equation, and therefore $v_{\pi'}$ must be $v_*$, and both $\pi$ and $\pi'$ must be optimal policies. This gives a simple way to improve a policy: first, evaluate the policy $\pi$ to obtain $v_\pi$; second, improve $\pi$ by acting greedily w.r.t. $v_\pi$.
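The greedy improvement step of equation $(3)$ can be sketched as follows, under the same assumed tabular representation as before (`P[s, a, s']` transition probabilities, `R[s, a]` expected rewards; the toy MDP is hypothetical):

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] transition probabilities,
# R[s, a] expected immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def greedy_policy(v):
    # q_pi(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    q = R + gamma * np.einsum('saj,j->sa', P, v)
    return q.argmax(axis=1)            # eq. (3): pi'(s) = argmax_a q(s, a)

pi_prime = greedy_policy(np.zeros(2))  # with v = 0, q = R, so pi' = [0, 1]
```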

4.3 Policy Iteration

The idea of policy iteration is simple: if any policy $\pi$ has been improved using $v_\pi$ to yield a better policy $\pi'$, we can then compute $v_{\pi'}$ and improve it again to yield an even better $\pi''$. We thus obtain a sequence of monotonically improving policies and value functions:

\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*

where $E$ denotes evaluation and $I$ denotes improvement.

Example 4.2: Jack's Car Rental. This is a detailed problem where we must decide how many cars to transfer between two locations each night, and at what cost, given that Jack earns a profit for renting out cars. It breaks down as follows: the state is the number of cars at each location (at most 20 per location); the actions are moving up to 5 cars overnight; the reward is $10 per rented car and -$2 per car moved overnight; and the transitions follow Poisson distributions for rental requests and returns.

Policy Iteration Algorithm


Can we truncate policy evaluation and stop at some point? One option is to look at the Bellman error and decide how much refinement we want: if an update changes the values by less than some threshold, we stop ($\epsilon$-convergence). Another is to simply stop after $k$ sweeps of iterative policy evaluation. The extreme case is to stop after $k = 1$ sweep, and this becomes value iteration.
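Putting the E and I steps together, policy iteration can be sketched as follows (same assumed tabular representation; the toy MDP is hypothetical):

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] transition probabilities,
# R[s, a] expected immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma, theta=1e-8):
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)        # arbitrary initial deterministic policy
    while True:
        # E step: iterative policy evaluation of the current pi
        v = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                new_v = R[s, pi[s]] + gamma * P[s, pi[s]] @ v
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < theta:
                break
        # I step: greedy improvement w.r.t. v
        q = R + gamma * np.einsum('saj,j->sa', P, v)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):        # policy stable => optimal
            return pi, v
        pi = new_pi

pi_star, v_star = policy_iteration(P, R, gamma)
```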

4.4 Value Iteration

If policy evaluation is done iteratively, convergence to $v_\pi$ occurs exactly only in the limit. Should we wait for exact convergence, or can we stop early? In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped after just one sweep (one update of each state):

v_{k+1}(s) = \max_a \sum_{s',r}p(s',r|s,a) \left[r + \gamma v_k(s') \right]

Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement.

Value Iteration Algorithm

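A minimal sketch of value iteration under the same assumed tabular representation (hypothetical toy MDP; `P[s, a, s']` transition probabilities, `R[s, a]` expected rewards):

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] transition probabilities,
# R[s, a] expected immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, theta=1e-8):
    v = np.zeros(R.shape[0])
    while True:
        # v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]
        q = R + gamma * np.einsum('saj,j->sa', P, v)
        new_v = q.max(axis=1)
        if np.max(np.abs(new_v - v)) < theta:
            # read off the greedy policy from the final q values
            return new_v, q.argmax(axis=1)
        v = new_v

v_star, pi_star = value_iteration(P, R, gamma)
```

Note how this is just iterative policy evaluation with the sum over actions replaced by a max: the evaluation and improvement steps are fused into a single sweep.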

4.5 Asynchronous Dynamic Programming

DP by default is not very fast, and several modifications can be made to speed it up: in-place updates, prioritised sweeping, and real-time DP. I am not going into detail on each; they are easily understood from the book, the lecture, and the slides.

4.6 Generalised Policy Iteration

The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function. Thus both processes stabilize only when a policy has been found that is greedy with respect to its own evaluation function. This implies that the Bellman optimality equation holds, and thus that the policy and value function are optimal. In either case, the two processes together achieve the overall goal of optimality even though neither is attempting to achieve it directly.

Generalised Policy Iteration


Extra: How the video's nomenclature relates to the book's

The equation for value iteration is given in the book as the first form below, and in the lectures as the second:

v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s, a) \left[r + \gamma v_k(s')\right]

v_{k+1}(s) = \max_a \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a v_k(s')\right)

We can derive the second form from the first as follows:

v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s, a) \left[r + \gamma v_k(s')\right]

= \max_a \left(\sum_{s',r} p(s',r|s, a)\, r + \gamma \sum_{s',r} p(s',r|s, a)\, v_k(s') \right)

In the first sum, $r$ does not depend on $s'$, so summing over $s'$ marginalises the joint distribution: $\sum_{s',r} p(s',r|s,a)\, r = \sum_r p(r|s,a)\, r$. Likewise, in the second sum, $v_k(s')$ does not depend on $r$, so $\sum_{s',r} p(s',r|s,a)\, v_k(s') = \sum_{s'} p(s'|s,a)\, v_k(s')$. Thus

= \max_a \left(\sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s') \right)

Defining $R_s^a = \sum_r p(r|s,a)\, r$ (the expected immediate reward) and $P_{ss'}^a = p(s'|s,a)$, we get the lecture's form:

v_{k+1}(s) = \max_a \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a v_k(s')\right)
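As a quick numerical sanity check (all names and dimensions hypothetical), the two forms agree on a random MDP with joint dynamics $p(s', r | s, a)$, since each sum simply marginalises the joint distribution:

```python
import numpy as np

# Random hypothetical MDP with joint dynamics p[s, a, s', r].
rng = np.random.default_rng(0)
n_s, n_a, n_r = 3, 2, 4
rewards = rng.normal(size=n_r)              # the possible reward values r
p = rng.random((n_s, n_a, n_s, n_r))
p /= p.sum(axis=(2, 3), keepdims=True)      # normalise to a distribution per (s, a)
v = rng.normal(size=n_s)                    # arbitrary v_k
gamma = 0.9

# Book form: max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
book = (np.einsum('iajr,r->ia', p, rewards)
        + gamma * np.einsum('iajr,j->ia', p, v)).max(axis=1)

# Lecture form: R_s^a = sum_r p(r|s,a) r and P_ss'^a = sum_r p(s',r|s,a)
R_sa = np.einsum('iajr,r->ia', p, rewards)
P_ss = p.sum(axis=3)
lecture = (R_sa + gamma * np.einsum('iaj,j->ia', P_ss, v)).max(axis=1)

assert np.allclose(book, lecture)
```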