Multi-armed bandit - simple optimization
This post is part of my struggle with the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. Other posts systematizing my knowledge and presenting the code I wrote can be found under the tag Sutton & Barto and in the repository dloranc/reinforcement-learning-an-introduction.
In the last post I discussed the basic version of the multi-armed bandit with the ε-greedy strategy. The algorithm presented there has a drawback: it records every reward and recomputes the arithmetic mean of the rewards for a given action each time the best action is selected. Not only does it need memory proportional to the number of time steps, but each selection of the best action also triggers a lot of unnecessary and rather time-consuming computation. Imagine having to calculate the arithmetic mean of one million rewards. How long would that take? This can be done better.
Optimization
Let's recall what the current code looks like:
def __init__(self, arms, pulls, epsilon):
    # ...
    self.rewards = [[] for _ in xrange(self.arms)]

def get_means(self):
    means = np.zeros(self.arms)

    for index, action_rewards in zip(range(len(means)), self.rewards):
        if len(action_rewards) > 0:
            means[index] = sum(action_rewards) / len(action_rewards)

    return means

def save_reward(self, action, reward):
    self.rewards[action].append(reward)
You can see that the means are recomputed from scratch every time. One improvement is to keep 'means' as a class field and write the new value of an action to it whenever a new reward arrives. That still leaves the question of how to calculate the arithmetic mean itself, which is expensive in both time and memory if we recompute it over the new reward together with all the old ones.
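To make that intermediate step concrete, here is a minimal sketch (my own illustration, not code from the repository, and the class name is made up): 'means' becomes a field updated in save_reward, but each update still walks over the whole reward history.

import numpy as np

# Sketch of the intermediate idea: cache the estimates, but still derive them
# from the full reward history on every update.
class CachedMeansBandit(object):
    def __init__(self, arms):
        self.rewards = [[] for _ in range(arms)]  # full reward history per arm
        self.means = np.zeros(arms)               # cached estimate per arm

    def save_reward(self, action, reward):
        self.rewards[action].append(reward)
        # Still expensive: the whole history of this arm is summed again,
        # and the stored history keeps growing with every time step.
        self.means[action] = sum(self.rewards[action]) / float(len(self.rewards[action]))

This moves the repeated work out of get_means, but every call to save_reward still costs time proportional to the number of rewards recorded so far, and the memory problem remains.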
OK, time for some math. Let's define the value of an action as $Q_n$ after it has been selected $n - 1$ times:

$$Q_n = \frac{R_1 + R_2 + \dots + R_{n-1}}{n - 1}$$
And let's transform it into a better version:
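The transformation follows the standard incremental-mean derivation from Sutton & Barto:

$$
\begin{aligned}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i
         = \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1} R_i\right) \\
        &= \frac{1}{n}\left(R_n + (n - 1)Q_n\right)
         = Q_n + \frac{1}{n}\left(R_n - Q_n\right)
\end{aligned}
$$

So only the current estimate and the number of pulls have to be stored. As a minimal sketch of how the class could look with this update rule (the field names 'counts' and 'means' are my assumption, not necessarily what the repository uses):

import numpy as np

class Bandit(object):
    def __init__(self, arms, pulls, epsilon):
        self.arms = arms
        self.pulls = pulls
        self.epsilon = epsilon
        self.counts = np.zeros(arms)  # n: how many times each arm was pulled
        self.means = np.zeros(arms)   # Q_n: running estimate for each arm

    def get_means(self):
        # Nothing to recompute - the estimates are always up to date.
        return self.means

    def save_reward(self, action, reward):
        # Incremental update: Q_{n+1} = Q_n + (R_n - Q_n) / n
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]

With this, save_reward does a constant amount of work per reward and get_means no longer has to touch any reward history.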