RL Chapter 2 Additional Topics
Bandit Gradient Algorithm as Stochastic Gradient Ascent. Barto Sutton.
Yash Bonde . 2019-05-19 . 3 min read
Reinforcement Learning, Notes

NOTE: This part requires some basic understanding of calculus.

These are just my notes on the book Reinforcement Learning: An Introduction; all the credit for the book goes to the authors and other contributors. Complete notes can be found here.

Notes for the gradient bandit algorithm are given here. We are going to show that the gradient bandit algorithm can be represented as stochastic gradient ascent (SGA). Note that $\mathbb{E}[R_t] = \sum_x \pi_t(x)\,q_*(x)$. The gradient bandit algorithm is given as

$$H_{t+1}(a) = H_t(a) + \alpha(R_t - \bar{R}_t)(\mathbb{1}_{a = A_t} - \pi_t(a)) \tag{1}$$
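
As a concrete sketch, update $(1)$ can be applied to all preferences at once; the arm count and step size below are illustrative choices, not from the book:

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())   # shift for numerical stability
    return e / e.sum()

def gradient_bandit_update(H, a_t, r_t, r_bar, alpha=0.1):
    """Apply update (1) to every preference H_t(a) at once."""
    pi = softmax(H)
    one_hot = np.zeros_like(H)
    one_hot[a_t] = 1.0
    return H + alpha * (r_t - r_bar) * (one_hot - pi)

H = np.zeros(3)   # three arms, uniform initial preferences
H = gradient_bandit_update(H, a_t=0, r_t=1.0, r_bar=0.0)
print(H)          # the rewarded arm's preference rises, the others fall
```

Note that the update moves probability mass toward the chosen arm only when the reward beats the baseline $\bar{R}_t$.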

where $\mathbb{1}_{a = A_t}$ is the indicator function, whose value is $1$ if $a = A_t$ and $0$ otherwise. The SGA update can be given as (note the $+$ instead of the $-$ used in SGD)

$$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}$$

The problem here is that we do not know the true value $q_*(x)$ of each action. Assuming the update can be written as SGA, we substitute the expression for $\mathbb{E}[R_t]$:

$$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \frac{\partial}{\partial H_t(a)}\left[\sum_x \pi_t(x)\,q_*(x)\right]$$

$$= \sum_x q_*(x)\frac{\partial \pi_t(x)}{\partial H_t(a)}$$

Now, since $\pi_t$ is a probability distribution (PD), there are two things to notice before going to the next step.

  • $\sum_x\frac{\partial \pi_t(x)}{\partial H_t(a)} = 0$: the individual probabilities may change, but the changes must cancel out. Another way to look at it is that $\sum_x\pi_t(x) = 1$, thus $\partial \sum_x\pi_t(x) = \sum_x\partial\pi_t(x) = 0$.
  • Because of this, we can subtract any baseline $B_t$ that is independent of $H_t(a)$ without changing the gradient.
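
Both points can be checked numerically; here is a small sketch using finite differences on a softmax policy (the preference and value numbers are made up):

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

H = np.array([0.5, -1.2, 2.0])   # arbitrary preferences
a, eps = 1, 1e-6

# finite-difference derivative of each pi_t(x) with respect to H_t(a)
Hp, Hm = H.copy(), H.copy()
Hp[a] += eps
Hm[a] -= eps
dpi_dHa = (softmax(Hp) - softmax(Hm)) / (2 * eps)

# point 1: the derivatives sum to (numerically) zero
print(dpi_dHa.sum())

# point 2: subtracting a baseline B_t leaves the gradient unchanged
q = np.array([1.0, 2.0, 3.0])    # made-up action values
B = 10.0
grad_no_baseline   = (q * dpi_dHa).sum()
grad_with_baseline = ((q - B) * dpi_dHa).sum()
print(grad_no_baseline - grad_with_baseline)   # ~0
```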

So we can modify the equation as follows:

$$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_x (q_*(x) - B_t)\frac{\partial \pi_t(x)}{\partial H_t(a)}$$

Multiplying and dividing by the policy $\pi_t(x)$:

$$= \sum_x \frac{\pi_t(x)}{\pi_t(x)}\,(q_*(x) - B_t)\frac{\partial \pi_t(x)}{\partial H_t(a)}$$

$$= \sum_x \pi_t(x) \left[\frac{q_*(x) - B_t}{\pi_t(x)} \times \frac{\partial \pi_t(x)}{\partial H_t(a)}\right]$$

Since the bracketed term is weighted by the probability distribution $\pi_t(x)$, the sum can be written as an expectation over the action $A_t$ sampled from $\pi_t$:

$$= \mathbb{E} \left[\frac{q_*(A_t) - B_t}{\pi_t(A_t)} \times \frac{\partial \pi_t(A_t)}{\partial H_t(a)}\right] \tag{2}$$
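
As a sanity check, the expectation in $(2)$ can be estimated by sampling actions from $\pi_t$ and compared against the exact sum. This sketch uses finite differences for $\frac{\partial \pi_t}{\partial H_t(a)}$ and takes $B_t = 0$; the preference and value numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

H = np.array([0.1, 0.6, -0.4])        # arbitrary preferences
q_star = np.array([1.0, 0.5, 2.0])    # hypothetical true action values
a, eps = 0, 1e-6

# dpi_t(x)/dH_t(a) by central finite differences
Hp, Hm = H.copy(), H.copy()
Hp[a] += eps
Hm[a] -= eps
dpi = (softmax(Hp) - softmax(Hm)) / (2 * eps)

pi = softmax(H)
exact = (q_star * dpi).sum()          # the sum form of the gradient

# Monte Carlo estimate of (2): sample A_t ~ pi_t, average the bracket
# (baseline B_t taken as 0 for simplicity)
samples = rng.choice(len(H), size=200_000, p=pi)
mc = (q_star[samples] / pi[samples] * dpi[samples]).mean()

print(exact, mc)                      # agree up to sampling noise
```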

Now we turn to the derivative $\frac{\partial \pi_t(x)}{\partial H_t(a)}$, which is the derivative of a quotient $\frac{f(x)}{g(x)}$. By the quotient rule:

$$\frac{\partial}{\partial x}\left( \frac{f(x)}{g(x)} \right) = \frac{\frac{\partial f(x)}{\partial x}\,g(x) - f(x)\,\frac{\partial g(x)}{\partial x}}{g(x)^2}$$

Substituting the softmax policy $\pi_t(x) = \mathrm{softmax}(H_t(x))$ into the derivative gives

$$\frac{\partial}{\partial H_t(a)}\,\pi_t(x) = \frac{\partial}{\partial H_t(a)} \left( \frac{e^{H_t(x)}}{\sum_{y=1}^{k} e^{H_t(y)}} \right)$$

Applying the quotient rule gives

$$\frac{\partial}{\partial H_t(a)}\,\pi_t(x) = \frac{\mathbb{1}_{a=x}\,e^{H_t(x)}\sum_{y=1}^{k} e^{H_t(y)} - e^{H_t(x)}\,e^{H_t(a)}}{\left(\sum_{y=1}^{k} e^{H_t(y)}\right)^2}$$

$$= \frac{\mathbb{1}_{a=x}\,e^{H_t(x)}}{\sum_{y=1}^{k} e^{H_t(y)}} - \frac{e^{H_t(x)}\,e^{H_t(a)}}{\left(\sum_{y=1}^{k} e^{H_t(y)}\right)^2}$$

$$= \pi_t(x) \left(\mathbb{1}_{a = x} - \pi_t(a) \right) \tag{3}$$
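
The closed form $(3)$ can be checked against finite differences; a small sketch with an arbitrary preference vector:

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

H = np.array([0.3, -0.7, 1.5, 0.0])   # arbitrary preferences
pi = softmax(H)
k = len(H)

# closed form (3): dpi_t(x)/dH_t(a) = pi_t(x) * (1{a = x} - pi_t(a))
jac_closed = pi[:, None] * (np.eye(k) - pi[None, :])

# finite-difference Jacobian for comparison
eps = 1e-6
jac_fd = np.empty((k, k))
for a in range(k):
    Hp, Hm = H.copy(), H.copy()
    Hp[a] += eps
    Hm[a] -= eps
    jac_fd[:, a] = (softmax(Hp) - softmax(Hm)) / (2 * eps)

print(np.abs(jac_closed - jac_fd).max())   # close to zero: the forms agree
```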

Substituting $(3)$ into $(2)$ (with $x = A_t$) gives

$$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \mathbb{E} \left[\frac{q_*(A_t) - B_t}{\pi_t(A_t)} \times \pi_t(A_t) \left(\mathbb{1}_{a = A_t} - \pi_t(a) \right) \right]$$

$$= \mathbb{E}\left[(q_*(A_t) - B_t)(\mathbb{1}_{a = A_t} - \pi_t(a))\right]$$

Choosing the baseline $B_t = \bar{R}_t$ and substituting the sampled reward $R_t$ for $q_*(A_t)$, which is permitted because $\mathbb{E}[R_t \mid A_t] = q_*(A_t)$:

$$= \mathbb{E}\left[(R_t - \bar{R}_t)(\mathbb{1}_{a = A_t} - \pi_t(a))\right] \tag{4}$$

Any expectation can be approximated by sampling: taking a single sample of the quantity inside the expectation in $(4)$ and stepping by $\alpha$ in its direction yields exactly the incremental update $(1)$. Thus we have shown that the gradient bandit algorithm can be written as SGA, and it gives the same results in expectation.
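
To see this equivalence in practice, the sketch below (a small made-up testbed, not from the book) averages many sampled gradient-bandit steps $(1)$ and compares the average with the exact SGA step. For a clean comparison the baseline is taken as the exact expected reward rather than a running average:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

# a small made-up 3-armed testbed
q_star = np.array([0.2, 1.0, -0.5])   # hypothetical true arm values
H = np.array([0.4, -0.1, 0.3])        # arbitrary current preferences
alpha = 0.1
k = len(H)
pi = softmax(H)
r_bar = pi @ q_star                   # baseline: expected reward under pi_t

# exact SGA step: alpha * dE[R_t]/dH_t(a), using the softmax gradient (3)
jac = pi[:, None] * (np.eye(k) - pi[None, :])   # jac[x, a] = dpi(x)/dH(a)
exact_step = alpha * (q_star @ jac)

# average of many sampled gradient-bandit updates (1), with noisy rewards
n = 100_000
a_t = rng.choice(k, size=n, p=pi)
r_t = q_star[a_t] + rng.normal(size=n)          # Gaussian reward noise
one_hot = np.eye(k)[a_t]                        # n x k indicator rows
avg_step = (alpha * (r_t - r_bar)[:, None] * (one_hot - pi)).mean(axis=0)

print(exact_step)
print(avg_step)   # matches the exact step up to sampling noise
```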