Multi-armed Bandits in Scala

Setting up the Environment

This post uses Almond in order to run Scala code in a Jupyter notebook. See my previous post to learn how to setup Jupyter, Ammonite and Almond. That post examined using the Scala libraries EvilPlot (including inline plotting in the Jupyter notebook) and Rainier for Bayesian inference in a simple linear model.

The imports required for this post are:

import coursier.MavenRepository

interp.repositories() ++= Seq(MavenRepository(

import $ivy.`com.stripe::rainier-core:0.2.2`
import $ivy.`com.stripe::rainier-plot:0.2.2`
import $ivy.`org.scalanlp::breeze:0.13.2`

import breeze.stats.distributions._
import breeze.linalg._
import almond.interpreter.api._

The full working notebook can be found here.

A Multi-armed Bandit

A multi-armed bandit is an analogy taken from the one-armed bandit slot machines where a lever is pulled and the player has an unknown probability of a prize. A multi-armed bandit is a generalisation, whereby the player is faced with multiple one-armed bandits each of which could have different rewards. The problem is to determine the best bandit to play. One way to determine this is to randomly pull levers to get information on the payout for each bandit. Assuming the probability of payout is constant in time, then after a period of exploration the player will be able to know which bandits pay the most.

Epsilon Greedy Method

One strategy to maximise the expected long-term reward from a bandit is to choose the bandit with the largest long-term reward a fraction of the time and the rest of the time choose a bandit uniformly at random in order to continually explore the space of actions. At each time step, the reward for a given action can be calculated and the long-term reward can be calculated as the function:

\[Q_{t+1}(A) = Q_t(A) + \frac{R_t(A) - Q_t(A)}{N_t(A)}\] where \(A\) is the current action, \(Q_t(A)\) is the long-term reward at time \(t\) for action \(A\), \(R_t(A)\) is the instantaneous reward for action \(A\) at time \(t\) and \(N_t(A)\) is the total number of times action \(A\) has been performed by time step \(t\). Before writing the algorithm for the epsilon greedy algorithm, first we define a few helper functions.

The first is to sample a value uniformly a random from a selection of values.

def sample(selection: Vector[Int]): Rand[Int] = {
    for {
        i <- Multinomial(DenseVector.ones[Double](selection.size))
    } yield selection(i)

If there are multiple actions with the same long-term reward then the next action should be selected randomly from all the actions which maximise the long term reward.

def maxActionWithTies(longTermReward: Vector[Double]): Rand[Int] = {
    val maxReward = longTermReward.max
    val rewards = longTermReward.zipWithIndex.
          filter { case (r, a) => r == maxReward }.map(_._2)
    if (rewards.size > 1) {
    } else {

Then a single step of the epsilon greedy algorithm can be written

case class BanditState(
    reward: Array[Double],
    longTermReward: Vector[Double],
    actions: Map[Int, Int]
def banditStep(
    epsilon: Double,
    reward: Int => Rand[Double],
    selectAction: (Map[Int, Int], Vector[Double]) => Rand[Int])(s: BanditState): Rand[BanditState] = {
    for {
        nextAction <- selectAction(s.actions, s.longTermReward)
        newReward <- reward(nextAction)
        prevCount = s.actions.get(nextAction).get
        nextCount  = prevCount + 1
        newLongTermReward = s.longTermReward(nextAction) + (newReward - s.longTermReward(nextAction)) / nextCount
    } yield BanditState(s.reward :+ newReward, 
                s.longTermReward.updated(nextAction, newLongTermReward),
                s.actions.updated(nextAction, nextCount))

Firstly, we define a BanditState which contains all of the rewards \(R_t(A)\) for each time step, a list of length equal to the number of actions containing the long-term reward for each action. actions represents \(N_t(A)\) using a map from the index of the action to the count of actions. The algorithm proceeds by sampling a uniform random number, if this number is less than the chosen value of epsilon, then a random action is sampled from a Multinomial distribution with equal probabilities, otherwise the algorithm selects the action which currently has the highest long-term reward. The values are updated according to the formula above.

To run this algorithm for a pre-determined number of steps, realise that it is recursive and completely determined by the count and long-term reward at the previous time step. Hence it can be implemented as a Markov chain.

def buildActions(actions: Int): Map[Int, Int] = {
    (0 until actions).map(a => a -> 0).toMap

def epsilonGreedy(
    epsilon: Double, 
    actions: Int, 
    reward: Int => Rand[Double],
    n: Int): BanditState = {
    val initState = BanditState(Array(0.0), Vector.fill(10)(0.0), buildActions(actions))
    MarkovChain(initState)(banditStep(epsilon, reward, selectGreedy(epsilon))).steps.drop(n-1).next

We can assume that the rewards for each of the ten actions is Normally distributed and define a suitable reward function

val qs = Gaussian(0, 1).sample(10)

// The reward is selected from a N(q(A_t), 1)
def r(qa: Seq[Double])(action: Int): Rand[Double] = 
    Gaussian(qa(action), 1)
//  qs: IndexedSeq[Double] = Vector(
//    -1.1319170735731177,
//    0.5392647196381599,
//    0.7127636875526561,
//    0.8765526115252499,
//    -0.9555744042626685,
//    -0.2723645491439034,
//    0.10029206857194808,
//    0.3758538986470721,
//    1.9412629812694995,
//    1.0620845496569054

Then the algorithm can be run for a single multi-armed bandit

val oneBandit = epsilonGreedy(0.5, 10, r(qs), 1000)

The distribution of actions at the end of 1,000 steps with epsilon = 0.5 and number of actions 10 is:

actions distribution

The mean reward at action 8 was highest at approximately 1.95, the epsilon-greedy algorithm prefers to take action 8.

Multiple Multi-armed Bandits

Now to see the behaviour of a typical multi-armed bandit with a constant reward function, calculate the average reward for n = 2,000 10-arm bandits each with 1,000 steps.

Vector.fill(2000)(DenseVector(bandit(0.1, 10, r(qs), 1000).reward)).
  reduce(_ + _).
  map(_ / 2000)

The average reward for each time step can then be plotted, this can be used to evaluate different choices of epsilon.

action_distribution <- read_csv(here::here("notebooks/data/action_distribution.csv"), 
                                col_names = c("action", "count"))
## Parsed with column specification:
## cols(
##   action = col_double(),
##   count = col_double()
## )
action_distribution %>% 
  ggplot(aes(x = action, y = count)) +

Different values of epsilon can be compared and the average reward can be calculated

val data = List(0.0, 0.1, 0.5).map ( eps => averageReward(2000, 1000, eps))
average_reward <- read_csv(here::here("notebooks/data/average_reward.csv"))
## Parsed with column specification:
## cols(
##   epsilon = col_double(),
##   step = col_double(),
##   reward = col_double()
## )
average_reward %>% 
  ggplot(aes(x = step, y = reward, colour = as.factor(epsilon))) +
  geom_line() +
  geom_label(data = filter(average_reward, step == 1000), 
            aes(label = epsilon, x = 950, y = reward), hjust = .5) +
  theme(legend.position = "none") +
  labs(title = "Average reward by step for various values of epsilon")