
5 posts tagged with "Multi Armed Bandit"

Posts about multi-armed bandit algorithms.


Multi-armed bandit - Upper Confidence Bound

· 6 min read

This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.


In the multi-armed bandit problem we need exploration to find the best action, because the value of each action is uncertain. Our estimate of an action's value changes every time we select that action and observe the reward it yields. The more often a given action is selected, the more confident we can be that its estimated value is accurate. So far, however, we have not taken this rather intuitive observation into account in our calculations. Exploratory actions were selected at random, without considering whether their estimated values were close to the best one or how certain those estimates were.
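A minimal sketch of how UCB turns that observation into a selection rule, assuming `Q` holds the current value estimates, `N` the per-arm selection counts, `t` the current time step, and `c` the exploration parameter (the variable names are mine, not from the post):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Pick the arm maximizing Q(a) + c * sqrt(ln(t) / N(a)).

    Q : np.ndarray of current value estimates, one per arm
    N : np.ndarray of how many times each arm has been selected
    t : current time step (1-based)
    c : exploration parameter; larger c means more exploration
    """
    # An arm that has never been tried is treated as maximally uncertain.
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    # The bonus term shrinks as an arm is selected more often, so
    # rarely-tried arms keep getting revisited.
    ucb = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(ucb))
```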

Multi-armed bandit - optimistic initial values

· 3 min read

This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.


All the methods I have described so far depend on the initial value estimates $Q_1(a)$. This is especially visible when we run MAB with $\epsilon = 0$, i.e. without exploration, always selecting the seemingly best action (arm). In statistics, such methods are called biased. The bias disappears for methods with a step size $\alpha$ of $\frac{1}{n}$ once each action has been selected at least once. For a constant $\alpha$, the bias never disappears completely; it only decreases over time (with subsequent iterations of the algorithm).
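A short sketch of how this bias can be put to work: if every $Q_1(a)$ starts well above any plausible reward, even a purely greedy policy explores each arm early on, because every observed reward is a "disappointment" that lowers the chosen arm's estimate. The constants here (10 arms, initial value 5.0, $\alpha = 0.1$) are illustrative assumptions, not values from the post:

```python
import numpy as np

n_arms = 10
true_means = np.random.randn(n_arms)   # hidden true action values

# Optimistic start: every estimate begins far above any plausible reward,
# so even a greedy policy (epsilon = 0) is forced to try each arm.
Q = np.full(n_arms, 5.0)
alpha = 0.1  # constant step size, so the initial bias only decays over time

for t in range(1000):
    a = int(np.argmax(Q))                       # greedy selection, no epsilon
    reward = true_means[a] + np.random.randn()  # noisy reward for arm a
    Q[a] += alpha * (reward - Q[a])             # each disappointment lowers Q[a]
```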

Multi-armed bandit - non-stationary version

· 3 min read

This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.


Non-stationary problem

In this post, I will discuss a particular variant of the multi-armed bandit (MAB) problem, in which the reward distribution of each one-armed bandit changes over time. This is the so-called non-stationary version of MAB. Until now, rewards were drawn from a normal distribution with a fixed mean and variance (the mean for each arm was chosen randomly once, in the constructor).
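A minimal sketch of one common way to model this, assuming the true arm values take small random-walk steps each turn and the agent tracks them with a constant step size $\alpha$ (so recent rewards weigh more than old ones); the drift scale 0.01 and the other constants are my assumptions for illustration:

```python
import numpy as np

n_arms = 10
q_true = np.zeros(n_arms)   # true arm values start equal and then drift
Q = np.zeros(n_arms)        # agent's estimates
alpha = 0.1                 # constant step size: exponential recency weighting
epsilon = 0.1

for t in range(10000):
    # Non-stationarity: every true value takes a small random-walk step.
    q_true += 0.01 * np.random.randn(n_arms)

    # Epsilon-greedy selection.
    if np.random.rand() < epsilon:
        a = np.random.randint(n_arms)
    else:
        a = int(np.argmax(Q))

    reward = q_true[a] + np.random.randn()
    Q[a] += alpha * (reward - Q[a])  # tracks the moving target instead of averaging forever
```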

Multi-armed bandit - simple optimization

· 4 min read

This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.


In the last post I discussed the basic version of the multi-armed bandit with the $\epsilon$-greedy strategy. The algorithm as presented has a drawback: it records every reward and recomputes the arithmetic mean of the rewards for a given action each time the best action has to be selected. Not only does it need memory proportional to the number of time steps, but every selection of the best action triggers a lot of unnecessary and fairly time-consuming computation. Imagine having to compute the arithmetic mean of a million rewards. How long would that take? This can be done better.
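The standard fix is the incremental update $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$, which yields exactly the same mean while keeping only two numbers per action. A minimal sketch (the class and method names are mine, chosen for illustration):

```python
class RunningMean:
    """Constant-memory, constant-time estimate of an action's value."""

    def __init__(self):
        self.n = 0    # how many times this action has been selected
        self.q = 0.0  # running mean of its rewards

    def update(self, reward):
        # Q_{n+1} = Q_n + (1/n)(R_n - Q_n): identical to recomputing the
        # full arithmetic mean, but without storing the reward history.
        self.n += 1
        self.q += (reward - self.q) / self.n
```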

Attack of multi-armed bandits

· 7 min read

This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.


The multi-armed bandit problem (or k-armed bandit problem) is one of the reinforcement learning problems. I don't know if it is the simplest one, but it offers a relatively quick introduction to the subject and to the basic concepts.