NOTE: This part requires some basic understanding of calculus.
These are just my notes of the book Reinforcement Learning: An Introduction, all the credit for book goes to the authors and other contributors. Complete notes can be found here.
Notes for gradient bandit algorithm are given here. We are going to show here that the gradient bandit algorithm can be represented in form of stochastic gradient ascent (SGA). Note E[Rt]=∑iπt(a)q∗(a). The gradient bandit algorithm is given as
H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar{R_t})(\dot{1}_{a = A_t} - \pi_t(a)) \tag{1}
where 1˙a=At is an operator whose the value is 1 if a=At else 0. The SGA can be given as (note + instead of − for SGD)
Ht+1=Ht+α∂Ht(a)∂E[Rt]
The problem here is that we do now know the value of optimal action q∗(a). Assuming it can be written as a SGA, we replace the value of E[Rt]
∂Ht(a)∂E[Rt]=∂Ht(a)∂[∑iπt(a)q∗(a)]
=∑iq∗(a)∂Ht(a)∂πt(a)
Now since πt(a) is a probability distribution (PD) there are two things to notice before going to the next step.
- ∑i∂Ht(a)∂πt(a)=0 as there might be changes in the probability distribution but the changes remain the same. Other way to look at it is that ∑iπt(a)=1 thus ∂∑iπt(a)=∑i∂πt(a)=0
- We can add another baseline Bt which is independent of Ht(a)
So we can modify the equation as follows:
∂Ht(a)∂E[Rt]=∑i(q∗(a)−Bt)∂Ht(a)∂πt(a)
Multiplying and dividing by the policy πt(x))
=∑iπt(x)πt(x)(q∗(a)−Bt)∂Ht(a)∂πt(a)
=∑iπt(x)[πt(x)(q∗(a)−Bt)×∂Ht(a)∂πt(a)]
Now this can be represented as the expectation function since there is a PD multiplied to it.
= \mathbb{E} \left[\frac{(q_*(a) - B_t)}{\pi_t(a)} \times \frac{\partial \pi_t(a)}{\partial H_t(a)}\right] \tag{2}
Now we shift to solving the differential ∂Ht(a)∂πt(a) which is a differential of type ∂g(x)∂f(x). This can be expanded as follows:
∂x∂(g(x)f(x))=g(x)2∂x∂f(x)g(x)−f(x)∂x∂g(x)
Putting the value of πt(a)=softmax(Ht(a)) in the equation gives
∂Ht(a)∂πt(a)=∂Ht(a)∂(∑y=1keHt(y)eHt(x))
This when reduced gives the result
∂Ht(a)∂πt(a)=(∑y=1keHt(y))21˙a=xeHt(x)∑y=1keHt(y)−eHt(y)eHt(x)
=∑y=1keHt(y)1˙a=xeHt(x)−(∑y=1keHt(y))2eHt(y)eHt(x)
= \pi_t(x) \left(\dot{1}_{a = x} - \pi_t(a) \right) \tag{3}
Putting the (3) in (2) gives
∂Ht(a)∂E[Rt]=E[πt(a)(q∗(a)−Bt)×πt(x)(1˙a=x−πt(a))]
=E[(q∗(a)−Bt)(1˙a=x−πt(a))]
= \mathbb{E}\left[(R_t - \bar{R_t})(\dot{1}_{a = x} - \pi_t(a))\right] \tag{4}
Any expectation function can be re-written in incremental updates and (4) is written as in (1). Thus we have shown that the gradient bandit algorithm can be written as SGA and it gives the same results.