RL Chapter 5 Exercises
Some solutions might be off. Notes from Sutton & Barto.
Yash Bonde . 2020-04-11 . 3 min read
Reinforcement Learning · Notes

NOTE: This part requires some basic understanding of calculus.

These are just my solutions for the book Reinforcement Learning: An Introduction; all the credit for the book goes to the authors and other contributors. Complete notes can be found here. If there are any problems with the solutions, or you have some ideas, ping me at bonde.yash97@gmail.com.

Blackjack approximate state value 5.1 - 5.2

5.1 Estimate Values

The value function jumps up for the last two rows at the rear because the player's sum there is in the 20-21 range, making the value of those states very high. The drop along the whole leftmost row occurs because the dealer is showing an ace, which can be used as either 1 or 11, so the player's confidence of victory is low. The frontmost values are higher in the upper diagrams because having a usable ace at the start is very beneficial.

5.2 Every-visit MC

With every-visit MC each episode contributes more return samples to the average, so training could converge in fewer episodes. In blackjack, however, the same state can never recur within a single episode (every hit changes the player's sum or the usable-ace flag), so every visit is also a first visit and EV-MC and FV-MC give identical estimates in this game.

5.4 Improvements to MC-ES pseudocode

We need to replace the $Returns(s,a)$ lists and their averaging in the pseudocode with an incremental average, keeping only a visit count $N(S_t, A_t)$:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)}\big(G - Q(S_t,A_t)\big)$$

Also note that the given pseudocode is first-visit MC.
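The replacement can be sketched as a small helper (a sketch, not the book's pseudocode; the dict-based `Q` and `N` and the state/action names are illustrative):

```python
def mc_es_update(Q, N, state, action, G):
    # Incremental average: no Returns(s, a) list is kept, only a
    # visit counter N(s, a) and the running mean Q(s, a).
    sa = (state, action)
    N[sa] = N.get(sa, 0) + 1
    q = Q.get(sa, 0.0)
    Q[sa] = q + (G - q) / N[sa]

Q, N = {}, {}
for G in [4.0, 2.0, 3.0]:          # three returns observed for one pair
    mc_es_update(Q, N, "s0", "hit", G)
print(Q[("s0", "hit")])            # running mean of 4, 2, 3 -> 3.0
```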

5.5 EV-MC and FV-MC estimated value

Looking at the first-visit algorithm, the estimate is simply the return from the first visit, $V_{FV} = 10$, and breaking out a quick script for every-visit we get $V_{EV} = (10 + 9 + \dots + 1)/10 = 5.5$.

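The quick script, assuming the standard setup of Exercise 5.5 (one nonterminal state, +1 reward per transition, $\gamma = 1$, and an episode that happens to last 10 steps):

```python
# The observed episode: ten +1 rewards, then termination. The agent is
# in the single nonterminal state at every step, so it is visited 10 times.
rewards = [1] * 10

# Return from time t (gamma = 1) is the number of remaining rewards.
returns = [sum(rewards[t:]) for t in range(len(rewards))]  # [10, 9, ..., 1]

first_visit_estimate = returns[0]                   # only the first visit counts
every_visit_estimate = sum(returns) / len(returns)  # average over all visits
print(first_visit_estimate, every_visit_estimate)   # 10 5.5
```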

5.6 $Q(s,a)$ Value

$$V(s) \doteq \frac{\sum_{t\in J(s)}\rho_{t:T(t)-1}G_t}{\sum_{t\in J(s)}\rho_{t:T(t)-1}}$$

The only change is that instead of indexing by state visits we index by state-action visits, i.e. $J(s) \rightarrow J(s,a)$. Since the first action $A_t = a$ is given rather than sampled from the policy, the importance ratio starts one step later, at $t+1$:

$$Q(s,a) \doteq \frac{\sum_{t\in J(s,a)}\rho_{t+1:T(t)-1}G_t}{\sum_{t\in J(s,a)}\rho_{t+1:T(t)-1}}$$

5.7 MSE increases

In the plot for Example 5.4 the error decreases overall, but the weighted-sampling curve first rises slightly and then decreases; the question is why this happens.

5.8 Why the jump?

In the plot for Example 5.4 the weighted-sampling error first increases and then decreases. This happens because early on the estimate is built from only a few episodes whose importance ratios $\rho$ can be large, so the estimate is strongly biased toward the behaviour policy's returns. As more episodes are averaged in, this bias shrinks and the error falls.

5.9 Change algorithm for incremental implementation

The simple change is to remove the $Returns(s)$ list, keep a visit count $N(s)$, and update $V(s)$ incrementally:

$$V(s) \leftarrow V(s) + \frac{1}{N(s)}\big(G_t - V(s)\big)$$

5.10 Derive weight update rule

First note that $C_n$ represents the cumulative weight so far, updated as $C_{n+1} = C_n + W_{n+1}$, so $C_n = \sum_{k=1}^{n} W_k$.

$$V_n = \frac{\sum_{k=1}^{n-1}W_kG_k}{\sum_{k=1}^{n-1}W_k}$$

$$V_{n+1} = \frac{\sum_{k=1}^{n}W_kG_k}{\sum_{k=1}^{n}W_k}$$

$$= \frac{1}{W_n + \sum_{k=1}^{n-1}W_k}\Big(W_nG_n + \sum_{k=1}^{n-1}W_kG_k\Big)$$

$$= \frac{W_nG_n}{C_n} + \frac{\sum_{k=1}^{n-1}W_kG_k}{C_n}$$

$$= \frac{W_nG_n}{C_n} + \frac{\sum_{k=1}^{n-1}W_k}{C_n}V_n$$

$$= \frac{W_nG_n}{C_n} + \frac{C_n - W_n}{C_n}V_n$$

$$V_{n+1} = V_n + \frac{W_n}{C_n}\big(G_n - V_n\big)$$
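The derived rule can be checked numerically against the batch weighted average (a sketch with made-up weights and returns):

```python
# Check that V_{n+1} = V_n + (W_n / C_n)(G_n - V_n), with cumulative
# weight C_n = C_{n-1} + W_n, reproduces the batch weighted average.
weights = [0.5, 2.0, 1.0, 4.0]   # illustrative importance weights W_n
returns = [1.0, 0.0, 3.0, 2.0]   # illustrative returns G_n

V, C = 0.0, 0.0
for W, G in zip(weights, returns):
    C += W                        # C_n = C_{n-1} + W_n
    V += (W / C) * (G - V)        # incremental weighted mean

batch = sum(w * g for w, g in zip(weights, returns)) / sum(weights)
print(abs(V - batch) < 1e-12)     # True: both give the same weighted average
```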

5.11 Why off-policy MC control removes $\pi(a|s)$

Since $W$ is only updated on steps where $A_t = \pi(S_t)$, and $\pi$ is deterministic, we can safely take $\pi(A_t|S_t) = 1$, so the update simplifies to $W \leftarrow W\frac{1}{b(A_t|S_t)}$.
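The simplification can be sketched as follows (an illustrative helper, not the book's pseudocode; `b_prob` stands for $b(A_t|S_t)$ and the action names are made up):

```python
def update_weight(W, a_t, pi_action, b_prob):
    # If the taken action differs from the deterministic pi's choice,
    # pi(A_t|S_t) = 0: the rest of the episode contributes nothing,
    # so signal the caller to exit the inner loop.
    if a_t != pi_action:
        return None
    # Otherwise pi(A_t|S_t) = 1, and only 1/b(A_t|S_t) remains.
    return W / b_prob

print(update_weight(1.0, "hit", "hit", 0.5))    # 2.0
print(update_weight(2.0, "stick", "hit", 0.5))  # None -> break out of loop
```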