These are just my notes on the book Reinforcement Learning: An Introduction; all the credit for the book goes to the authors and other contributors. My solutions to the exercises are on this page.
We do not know what the rules of the game are; all we are allowed to do is to watch the playing. Of course, if we are allowed to watch long enough we may eventually catch on to a few of the rules. - Richard Feynman
2.1 K-armed Bandit Problem
Assume that you are repeatedly faced with a choice among $k$ options; after each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. Your objective is to maximise the expected total reward over some time period. The value of an arbitrary action $a$ is given by

$$q_*(a) \doteq \mathbb{E}\left[R_t \mid A_t = a\right]$$

We denote the estimated value of action $a$ at timestep $t$ as $Q_t(a)$, which we would like to make close to $q_*(a)$.
2.2 Action Value Methods
Since we need a running estimate of each action's value, we calculate estimates as we go. One way to estimate $q_*(a)$ is by averaging all the rewards received when that action was taken:

$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

where $\mathbb{1}_{A_i = a}$ is one for all the cases where $A_i = a$ and zero otherwise. Once we have these estimates, one way to act is $A_t \doteq \arg\max_a Q_t(a)$, i.e. take the action with the highest estimated value; this is called the "greedy method". But this kind of selection prevents exploration of the other actions and can get stuck on a suboptimal one. To prevent this we use the $\varepsilon$-greedy method, where we take a random action with probability $\varepsilon$ and the greedy action otherwise.
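As a quick sketch of the above (the reward bookkeeping here is a hypothetical minimal scheme, not the book's pseudocode), $\varepsilon$-greedy selection over sample-average estimates might look like:

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng=random):
    """With probability epsilon explore uniformly at random; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])

# Sample-average estimates: Q_t(a) is the mean of the rewards received for action a.
rewards_by_action = {0: [1.0, 0.0], 1: [2.0, 4.0], 2: []}
q = [sum(r) / len(r) if r else 0.0 for r in rewards_by_action.values()]

action = epsilon_greedy(q, epsilon=0.0)  # pure greedy: picks action 1 (Q = 3.0)
```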
2.4 Incremental Implementation
Now we build upon the previous equation to make it recursive. (Ideally we would like as many recursive, incremental algorithms as possible, since this makes more sense computationally: constant memory and constant per-step computation.)

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$

This gives us a general form of update rule that we will use throughout reinforcement learning:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$

where $\left[\text{Target} - \text{OldEstimate}\right]$ is the error in the estimate, which is reduced by taking a step of size StepSize towards the Target. Note that this might look similar to the gradient descent update, and later we will see how we can use gradients to get proper step sizes. So we can give the bandit algorithm as follows:

Bandit Learning Algorithm
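The algorithm above can be sketched in Python as follows (the Gaussian reward model and the specific arm values are illustrative assumptions, not from the notes):

```python
import random

def simple_bandit(true_values, epsilon, steps, seed=0):
    """Incremental k-armed bandit: Q(A) += (R - Q(A)) / N(A) after each pull."""
    rng = random.Random(seed)
    k = len(true_values)
    Q = [0.0] * k          # action-value estimates
    N = [0] * k            # selection counts
    for _ in range(steps):
        if rng.random() < epsilon:
            A = rng.randrange(k)                    # explore
        else:
            A = max(range(k), key=lambda a: Q[a])   # exploit
        R = true_values[A] + rng.gauss(0, 1)        # noisy reward around q_*(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                   # incremental sample average
    return Q, N

Q, N = simple_bandit([0.2, 1.0, -0.5], epsilon=0.1, steps=2000)
```

After enough steps the estimate for each arm approaches its true value, and the best arm (here arm 1) accumulates most of the pulls.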
2.5 Tracking a Non-Stationary Process
When the probability distribution is static, i.e. it does not change with time, the process is called stationary, but most problems in reinforcement learning have non-stationary distributions. In that case it makes sense to weight recent rewards more heavily, which we get by using a constant step-size $\alpha \in (0, 1]$. We can expand and calculate the values as follows:

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$

Note that we call this a weighted average because the weights sum to one: $(1-\alpha)^n + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} = 1$. Since the weight given to a reward decays exponentially with how long ago it was received, this is called an "exponential recency-weighted average".
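We can verify numerically that the weights in the expansion above form a proper average (a quick sketch):

```python
# Q_{n+1} = (1 - alpha)^n Q_1 + sum_{i=1}^{n} alpha (1 - alpha)^(n - i) R_i
alpha, n = 0.1, 50

# Weight on the initial estimate Q_1, followed by one weight per reward R_1 .. R_n.
weights = [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]

total = sum(weights)   # sums to 1: it really is a weighted average
newest = weights[-1]   # weight on the most recent reward R_n is exactly alpha
oldest = weights[1]    # weight on R_1 has decayed to alpha * (1 - alpha)^(n - 1)
```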
There are two conditions on the step-size $\alpha_n(a)$ required to assure convergence with probability 1 (remember these, as we will need them later on):

$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$$

The first condition guarantees the steps are large enough to eventually overcome any initial conditions or random fluctuations; the second guarantees that the steps eventually become small enough to converge.
2.6 Optimistic Initial Values
The methods discussed so far depend to some extent on the initial action-value estimates $Q_1(a)$; statistically speaking, these methods are biased by their initial estimates. The initial values can be chosen to encourage exploration. Suppose the true values $q_*(a)$ are drawn from a standard normal distribution, but we set $Q_1(a) = +5$ for all $a$. Whichever actions are chosen, the reward is less than the starting estimate, so the learner switches to other actions after being "disappointed" with the rewards it receives. We call this trick Optimistic Initial Values. However, it does not perform well on non-stationary problems, since its drive for exploration is inherently temporary: if the task changes, creating a renewed need for exploration, this method cannot help.
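A small sketch of the effect (the 3-armed testbed and its values are illustrative assumptions): a purely greedy learner with optimistic initial estimates still ends up trying every arm.

```python
import random

def run(q_init, steps=500, seed=1):
    """Greedy agent (epsilon = 0) on a 3-armed bandit with initial estimate q_init."""
    rng = random.Random(seed)
    true_values = [0.1, 0.8, 0.3]
    Q = [q_init] * 3
    N = [0] * 3
    for _ in range(steps):
        A = max(range(3), key=lambda a: Q[a])      # always greedy
        R = true_values[A] + rng.gauss(0, 1)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]
    return N

optimistic = run(q_init=5.0)  # every arm gets tried: early rewards "disappoint"
realistic = run(q_init=0.0)   # may lock onto whichever arm looks good first
```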
2.7 Upper Confidence Bound Action Selection
$\varepsilon$-greedy action selection forces the non-greedy actions to be tried, but it does not discriminate between them. It would be better to select among the non-greedy actions according to their potential for actually being optimal:

$$A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where $N_t(a)$ is the number of times $a$ has been selected prior to time $t$. If $N_t(a) = 0$, then $a$ is considered to be a maximising action. The idea of Upper Confidence Bound (UCB) action selection is that the square-root term is a measure of the uncertainty in the estimate of $a$'s value. Each time $a$ is selected, this uncertainty decreases.
However, it is more difficult to extend UCB beyond bandits to more general RL settings:
- For non-stationary distributions, more complex methods would be needed
- When the state space is huge, it fails
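Back in the bandit setting, UCB selection can be sketched as follows (returning untried arms first is one common way to realise the $N_t(a) = 0$ convention):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried arms are maximising."""
    for a, n in enumerate(N):
        if n == 0:                  # N_t(a) == 0: pick this arm immediately
            return a
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# An arm with a slightly lower estimate but far fewer tries wins on uncertainty:
a = ucb_action(Q=[0.5, 0.45], N=[100, 2], t=103)  # -> 1
```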
2.8 Gradient Bandit Algorithm
Selecting actions based on value estimates is not the only approach: we can instead learn a numerical preference for each action $a$, denoted by $H_t(a)$. We call the probability distribution induced by the preferences the policy and denote it by $\pi_t(a)$, the probability of taking action $a$ at time $t$, defined via a softmax:

$$\pi_t(a) \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

This is useful because the only thing that matters in any situation is an action's preference relative to the other possible actions. The update equation, a form of stochastic gradient ascent, is given as:

$$H_{t+1}(a) \doteq H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right)$$

Here $\mathbb{1}_{a = A_t}$ is an indicator whose value is $1$ if $a = A_t$ and $0$ otherwise, and $\bar{R}_t$ is the average of all the rewards up through and including time $t$. The $\bar{R}_t$ term serves as a baseline: if the reward is higher than the baseline, the probability of taking $A_t$ in the future is increased, and vice versa.
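One step of the preference update can be sketched directly from the equation (the numbers below are made up for illustration):

```python
import math

def softmax(H):
    """pi(a) = exp(H(a)) / sum_b exp(H(b)), shifted by max(H) for stability."""
    m = max(H)
    e = [math.exp(h - m) for h in H]
    s = sum(e)
    return [x / s for x in e]

def gradient_bandit_step(H, pi, A, R, baseline, alpha):
    """H_{t+1}(a) = H_t(a) + alpha * (R - baseline) * (1{a == A} - pi(a))."""
    return [h + alpha * (R - baseline) * ((1.0 if a == A else 0.0) - p)
            for a, (h, p) in enumerate(zip(H, pi))]

H = [0.0, 0.0, 0.0]
pi = softmax(H)                 # uniform: [1/3, 1/3, 1/3]
H = gradient_bandit_step(H, pi, A=1, R=2.0, baseline=0.5, alpha=0.1)
# R above the baseline -> preference for action 1 rises, the others fall
```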
Additional Topic: Bandit Gradient as Stochastic Gradient Ascent
This is an additional topic where we show that the gradient bandit algorithm can be written as an instance of stochastic gradient ascent (SGA); it is covered in detail on another page here.
2.9 Associative Search (Contextual Bandits)
Until now we have dealt with a single k-armed bandit problem, but what if there are several different k-armed bandit problems and at each step you confront one of them chosen at random? The bandit task then changes from step to step, and its true action values change randomly; unless you are given some clue identifying the current task, the methods above do not work well in practice. Suppose instead that each task comes with a distinctive signal (say, the colour of the slot machine); you can then learn a policy associating each signal with the best action for that task. These associative search tasks are like the full RL problem in that they involve learning a policy, but like the k-armed bandit in that each action only affects the immediate reward. If an action were allowed to affect the next situation as well as the reward, it would become a full RL problem.
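A minimal sketch of the associative setting (the colour signals and the per-context table are illustrative assumptions, not from the notes): the agent simply keeps a separate set of estimates for each context it can observe.

```python
import random

def contextual_action(Q, context, epsilon, rng):
    """Epsilon-greedy over the action-value table for the observed context only."""
    q = Q[context]
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

rng = random.Random(0)
k, contexts = 2, ["red", "green"]       # e.g. the slot machine changes its colour
Q = {c: [0.0] * k for c in contexts}    # one estimate table per context

c = rng.choice(contexts)                # a bandit problem is picked at random
a = contextual_action(Q, c, epsilon=0.1, rng=rng)
```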