Multi-armed bandit - Upper Confidence Bound
This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.
In the multi-armed bandit problem we need exploration to find the best action, because the value of each action is uncertain. Our estimate of an action's value changes every time we perform that action and observe the reward. The more often a given action is selected, the more confident we are that its estimated value is accurate. So far, however, we have not taken this rather intuitive observation into account in our calculations. During exploration the actions were selected at random, without considering how close their estimated values were to the best one or how certain those estimates were.
Let's recall how we chose the best action:

$$A_t \doteq \underset{a}{\arg\max}\, Q_t(a)$$
Which corresponds to these lines of code:
argmax = np.argmax(self.rewards)
return argmax
From the choose_action method:
def choose_action(self):
    rand = np.random.uniform(0, 1)
    if rand > self.epsilon:
        # exploit
        argmax = np.argmax(self.rewards)
        return argmax
    else:
        # explore
        return randint(0, self.arms - 1)
Let's use the formula below:

$$A_t \doteq \underset{a}{\arg\max}\left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$$

Where $\ln t$ is the natural logarithm of $t$ (i.e. the logarithm to the base of the constant $e \approx 2.718$), and $N_t(a)$ is the number of times action $a$ has been performed so far. The square root part of the above formula measures the uncertainty in the estimate of the value of action $a$.
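To get a feel for how this term behaves, take $c = 2$ and $t = 100$ (values picked just for illustration): an action selected only once gets a bonus of $2\sqrt{\ln 100 / 1} \approx 4.29$, while an action selected 50 times gets only $2\sqrt{\ln 100 / 50} \approx 0.61$. Rarely tried actions therefore receive a large bonus and are pushed towards being selected, and the bonus shrinks as the action is tried more often.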
The code looks like this:
def choose_action(self):
    rand = np.random.uniform(0, 1)
    if rand > self.epsilon:
        # exploit
        # the +1 in the logarithm and in the denominator avoids log(0)
        # and division by zero before an action has been tried
        ucb = self.rewards + \
            self.c * np.sqrt(np.log(self.t + 1) / (self.action_count + 1))
        return np.argmax(ucb)
    else:
        # explore
        return randint(0, self.arms - 1)
In the constructor I added c and t:
def __init__(self, arms, pulls, epsilon, c=0):
    self.t = 0
    self.c = c
c is a parameter that controls the degree of exploration.
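The snippets above assume that self.t and self.action_count are updated after every pull. To show how the pieces fit together, here is a minimal, self-contained sketch of the whole agent; the update and run methods, the reward model (true values drawn from a normal distribution, observed rewards with unit-variance Gaussian noise) and the sample-average update are my assumptions and may differ from the code in the repository.

import numpy as np
from random import randint


class Bandit:
    def __init__(self, arms, pulls, epsilon, c=0):
        self.arms = arms
        self.pulls = pulls
        self.epsilon = epsilon
        self.c = c
        self.t = 0
        # true (hidden) action values, drawn once per bandit
        self.true_rewards = np.random.randn(arms)
        # estimated action values and per-action selection counts
        self.rewards = np.zeros(arms)
        self.action_count = np.zeros(arms)

    def choose_action(self):
        rand = np.random.uniform(0, 1)
        if rand > self.epsilon:
            # exploit: pick the action with the highest upper confidence bound
            ucb = self.rewards + \
                self.c * np.sqrt(np.log(self.t + 1) / (self.action_count + 1))
            return np.argmax(ucb)
        else:
            # explore: pick a random action
            return randint(0, self.arms - 1)

    def update(self, action, reward):
        # sample-average (incremental mean) update of the estimated value
        self.t += 1
        self.action_count[action] += 1
        self.rewards[action] += (reward - self.rewards[action]) / self.action_count[action]

    def run(self):
        rewards = []
        for _ in range(self.pulls):
            action = self.choose_action()
            # observed reward: true value plus Gaussian noise
            reward = self.true_rewards[action] + np.random.randn()
            self.update(action, reward)
            rewards.append(reward)
        return rewards


bandit = Bandit(arms=10, pulls=1000, epsilon=0.1, c=2)
print(np.mean(bandit.run()))

Note that with epsilon greater than zero this still mixes UCB with random exploration; setting epsilon to 0 gives pure UCB action selection, which is how the method is presented in Sutton & Barto.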
I generated a table with possible values for some action: