
Multi-armed bandit - Upper Confidence Bound


This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.


In the multi-armed bandit problem we need exploration to find the best action, because the value of each action is uncertain. Our estimate of an action's value changes as we perform that action from time to time and observe the reward we receive. The more often a given action is selected, the more certain we are that its estimated value is accurate. So far, however, we have not taken this rather intuitive observation into account in our calculations. During exploration, actions were selected at random, without considering whether their estimates were close to the best one or how certain those estimates were.

Let's recall how we chose the best action:

$$A_t \doteq \underset{a}{\argmax}\, Q_t(a)$$

Which corresponds to these lines of code:

```python
argmax = np.argmax(self.rewards)
return argmax
```

From the choose_action method:

```python
def choose_action(self):
    rand = np.random.uniform(0, 1)

    if rand > self.epsilon:
        # exploit
        argmax = np.argmax(self.rewards)
        return argmax
    else:
        # explore
        return randint(0, self.arms - 1)
```

Let's use the formula below:

$$A_t \doteq \underset{a}{\argmax}\, \Bigg[Q_t(a) + c \sqrt{\frac{\log{t}}{N_t(a)}}\,\Bigg]$$

Where $\log{t}$ is the natural logarithm (i.e. the logarithm to the base of the constant $e$), and $N_t(a)$ is the number of times action $a$ has been selected so far.

The square root term in the above formula measures the uncertainty in the estimate of the value of action $a$.
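A quick worked example (with numbers of my own choosing, not from the book): take $c = 2$ and $t = 100$, so $\log t \approx 4.61$. An action tried only 4 times gets a much larger bonus than one tried 64 times:

$$2\sqrt{\frac{4.61}{4}} \approx 2.15, \qquad 2\sqrt{\frac{4.61}{64}} \approx 0.54$$

The rarely tried action gets a much bigger boost to its score, which is exactly the exploration pressure we want.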

The code looks like this:

```python
def choose_action(self):
    rand = np.random.uniform(0, 1)

    if rand > self.epsilon:
        # exploit: value estimate plus the uncertainty bonus;
        # the + 1 terms avoid log(0) at the start and division by zero
        # for arms that have not been tried yet
        ucb = self.rewards + \
            self.c * np.sqrt(np.log(self.t + 1) / (self.action_count + 1))

        return np.argmax(ucb)
    else:
        # explore
        return randint(0, self.arms - 1)
```

In the constructor I added c and t:

```python
def __init__(self, arms, pulls, epsilon, c=0):
    self.t = 0
    self.c = c
```

c is a parameter that controls the degree of exploration.
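Since the post only shows fragments, here is a minimal, self-contained sketch of how the pieces might fit together. It is not the repository code: the update rule, the hidden arm values drawn from a normal distribution, and the run loop are my assumptions based on the snippets above and the standard testbed from Sutton & Barto.

```python
import numpy as np
from random import randint


# Minimal sketch of an epsilon-greedy bandit with a UCB bonus.
# Not the repository code: the update rule and the testbed are assumptions
# based on the snippets in this post.
class Bandit:
    def __init__(self, arms, pulls, epsilon, c=0):
        self.arms = arms
        self.pulls = pulls
        self.epsilon = epsilon
        self.c = c
        self.t = 0
        self.rewards = np.zeros(arms)        # estimated value of each arm
        self.action_count = np.zeros(arms)   # how many times each arm was pulled
        self.true_values = np.random.randn(arms)  # hidden mean reward of each arm

    def choose_action(self):
        if np.random.uniform(0, 1) > self.epsilon:
            # exploit: estimate plus the UCB uncertainty bonus
            ucb = self.rewards + \
                self.c * np.sqrt(np.log(self.t + 1) / (self.action_count + 1))
            return np.argmax(ucb)
        else:
            # explore: pick a random arm
            return randint(0, self.arms - 1)

    def update(self, action, reward):
        # incremental sample-average update of the chosen arm's estimate
        self.t += 1
        self.action_count[action] += 1
        self.rewards[action] += \
            (reward - self.rewards[action]) / self.action_count[action]

    def run(self):
        for _ in range(self.pulls):
            action = self.choose_action()
            # reward is noisy around the arm's true value
            reward = np.random.randn() + self.true_values[action]
            self.update(action, reward)


if __name__ == '__main__':
    bandit = Bandit(arms=10, pulls=1000, epsilon=0.1, c=2)
    bandit.run()
    print(bandit.rewards)
```

With `c = 0` the bonus vanishes and this falls back to plain epsilon-greedy, which makes it easy to compare the two variants.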

I generated a table with possible values of this term for some action $a$:

| $t$ | $N_t$ | $\log{t}$ | $c\sqrt{\frac{\log{t}}{N_t}}$, $c = 2$ |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0.3010299957 | 1.0973240099 |
| 3 | 1 | 0.4771212547 | 1.3814792864 |
| 4 | 1 | 0.6020599913 | 1.5518504971 |
| 5 | 1 | 0.6989700043 | 1.6720885196 |
| 6 | 2 | 0.7781512504 | 1.2475185372 |
| 7 | 2 | 0.84509804 | 1.3000754132 |
| 8 | 2 | 0.903089987 | 1.3439419534 |
| 9 | 2 | 0.9542425094 | 1.3814792864 |
| 10 | 2 | 1 | 1.4142135624 |
| 11 | 3 | 1.0413926852 | 1.1783563044 |
| 12 | 3 | 1.079181246 | 1.1995450505 |
| 13 | 3 | 1.1139433523 | 1.218711534 |
| 14 | 3 | 1.1461280357 | 1.2361920216 |
| 15 | 3 | 1.1760912591 | 1.2522466525 |
| 16 | 4 | 1.2041199827 | 1.0973240099 |
| 17 | 4 | 1.2304489214 | 1.1092560216 |
| 18 | 4 | 1.2552725051 | 1.1203894435 |
| 19 | 4 | 1.278753601 | 1.13081988 |
| 20 | 4 | 1.3010299957 | 1.1406270186 |
| 21 | 5 | 1.3222192947 | 1.0284821028 |
| 22 | 5 | 1.3424226808 | 1.036309869 |
| 23 | 5 | 1.361727836 | 1.0437347694 |
| 24 | 5 | 1.3802112417 | 1.0507944582 |
| 25 | 5 | 1.3979400087 | 1.0575216343 |
| 26 | 6 | 1.414973348 | 0.9712443386 |
| 27 | 6 | 1.4313637642 | 0.9768533715 |
| 28 | 6 | 1.4471580313 | 0.9822280901 |
| 29 | 6 | 1.4623979979 | 0.9873864485 |
| 30 | 6 | 1.4771212547 | 0.9923444478 |
| 31 | 7 | 1.4913616938 | 0.9231504115 |
| 32 | 7 | 1.5051499783 | 0.9274080558 |
| 33 | 7 | 1.5185139399 | 0.9315161036 |
| 34 | 7 | 1.531478917 | 0.9354842648 |
| 35 | 7 | 1.5440680444 | 0.939321349 |
| 36 | 8 | 1.5563025008 | 0.8821288173 |
| 37 | 8 | 1.5682017241 | 0.885494699 |
| 38 | 8 | 1.5797835966 | 0.8887585714 |
| 39 | 9 | 1.591064607 | 0.8409160632 |
| 40 | 10 | 1.6020599913 | 0.8005148322 |
| 41 | 11 | 1.6127838567 | 0.7658112411 |
| 42 | 12 | 1.6232492904 | 0.7355835077 |
| 43 | 13 | 1.6334684556 | 0.70894688 |
| 44 | 14 | 1.6434526765 | 0.6852429551 |
| 45 | 15 | 1.6532125138 | 0.6639703836 |
| 46 | 16 | 1.6627578317 | 0.6447398374 |
| 47 | 17 | 1.6720978579 | 0.6272438044 |
| 48 | 18 | 1.6812412374 | 0.6112357678 |
| 49 | 19 | 1.69019608 | 0.59651551 |
| 50 | 20 | 1.6989700043 | 0.5829185199 |
| 51 | 21 | 1.7075701761 | 0.5703082168 |
| 52 | 22 | 1.7160033436 | 0.5585701459 |
| 53 | 23 | 1.7242758696 | 0.5476075824 |
| 54 | 24 | 1.7323937598 | 0.5373381555 |
| 55 | 25 | 1.7403626895 | 0.5276912263 |
| 56 | 26 | 1.748188027 | 0.5186058273 |
| 57 | 27 | 1.7558748557 | 0.5100290269 |
| 58 | 28 | 1.7634279936 | 0.501914619 |
| 59 | 29 | 1.7708520116 | 0.4942220654 |

As can be seen from the values in the table, the uncertainty term generally decreases with the passage of time $t$ as the action keeps being selected, but in the steps where the action is not selected it increases slightly, because $t$ grows while $N_t$ stays the same.
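To reproduce this behaviour, here is a small sketch that prints the bonus term for a hypothetical selection schedule. The schedule and the use of np.log (as in the code above) are my choices, so the exact numbers will differ from the table:

```python
import numpy as np

c = 2
N = 0  # how many times this particular arm has been selected so far

# hypothetical schedule: the arm is picked every fifth step
selected_at = {1, 6, 11, 16, 21, 26}

for t in range(1, 31):
    if t in selected_at:
        N += 1
    # uncertainty bonus from the UCB formula
    bonus = c * np.sqrt(np.log(t) / N)
    print(f"t={t:2d}  N={N}  bonus={bonus:.4f}")
```

The printed bonus climbs while the arm is ignored and drops each time it is picked again, mirroring the pattern in the table.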

Summary

This method of selecting actions based on uncertainty is abbreviated as UCB (upper confidence bound). It is a statistical method related to confidence intervals, or at least that's how I understand it. I barely remember anything about statistics; I was never good at it. UCB is quite a good method, but Sutton & Barto warn that it does not work well in non-stationary problems or in problems with a large state space.

The code can be seen here.

Finally, some charts:

[chart: average rewards]

Pretty good: on average, UCB performs better than the version without it in terms of average reward. The jump and drop at the very beginning of the run is interesting.

It's worse when it comes to optimal actions:

[chart: optimal actions]

No wonder, since exploration happens more often.

For a single MAB run:

[chart: one MAB run]