Online Advertising#

In this project we implement the methods discussed in Stateless Learning Tasks (Multi-armed Bandits) for the online advertising example given there.

Of course, we have to simulate the environment to train the agent. For stateless reinforcement learning tasks the environment, from the agent’s point of view, looks like a fixed number of random number generators, each following a different probability distribution. Here we use the following class to simulate an environment with 10 possible actions (ads to show):

import numpy as np 

rng = np.random.default_rng(0)

class Env:
    ''' Simulated ad environment: 10 ads (actions) with Bernoulli rewards
        (1 = click, 0 = no click). '''

    def __init__(self, stationary=False):

        self.stationary = stationary

        # initial probabilities for reward 1 (click rates), Bernoulli distribution
        self.p = np.array([0.2, 0.3, 0.1, 0.4, 0.5, 0.5, 0.7, 0.9, 0.8, 0.4])

        if not self.stationary:

            # drift in p for simulation of non-stationary environments
            self.delta_p = 0.001 * np.array([1, -1, 1, 0, 1, 1, 0, -1, -1, 1])

    def action(self, a):
        ''' Take action a (0-9) and return the reward (1 = click, 0 = no click). '''

        if not self.stationary:
            # let the click rates drift before sampling the reward
            self.p = np.clip(self.p + self.delta_p, 0.05, 0.95)

        # Bernoulli reward with the current click rate of ad a
        return rng.binomial(1, self.p[a])
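
A quick sanity check of the environment (each call to action returns 0 or 1, drawn from the Bernoulli distribution of the chosen ad):

env = Env(stationary=True)

# show ad 7 ten times and record the clicks (rewards)
print([env.action(7) for _ in range(10)])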

Sample Averaging#

Task: Write a function sample_averaging that implements the sample averaging method for stateless learning tasks with the \(\varepsilon\)-greedy policy. Arguments:

  • an Env object,

  • the value for \(\varepsilon\),

  • the update factor \(\alpha\) (where 0 indicates no constant weighting, i.e. plain sample averaging),

  • a list of initial action values,

  • the length of the episode (number of steps to run).

Returns:

  • a list of the average reward after each step (cumulative reward divided by the number of steps so far).

Create a stationary Env object, run an episode of 1000 steps, and plot the average reward vs. step. Note that the average reward at the last step, multiplied by the number of steps, equals the return obtained in the episode.
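
As a starting point, a minimal sketch of how such a function could look (the argument order matches the list above; the \(\varepsilon\)-greedy choice and the incremental update with step size \(\alpha\) or \(1/n\) are one possible implementation, not the only one):

def sample_averaging(env, eps, alpha, q_init, steps):
    ''' epsilon-greedy action selection with incremental action value updates;
        alpha == 0 means plain sample averaging (step size 1/n). '''

    q = np.array(q_init, dtype=float)    # current action value estimates
    n = np.zeros(len(q))                 # how often each action was chosen
    avg_rewards = []
    total_reward = 0

    for step in range(1, steps + 1):

        # epsilon-greedy: explore with probability eps, else take the greedy action
        if rng.random() < eps:
            a = rng.integers(len(q))
        else:
            a = np.argmax(q)

        r = env.action(a)
        n[a] += 1

        # incremental update of the estimate for the chosen action
        step_size = alpha if alpha > 0 else 1 / n[a]
        q[a] = q[a] + step_size * (r - q[a])

        total_reward += r
        avg_rewards.append(total_reward / step)

    return avg_rewards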

Solution:

# your solution

Stationary Problem with ε-Greedy Policy#

Task: Run episodes for \(\varepsilon\in\{0,0.01,0.1,0.5,1\}\) with 5000 steps. For each \(\varepsilon\), run 100 episodes and plot the mean of the average rewards over all episodes (vs. step). What do you see and why?
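
One possible way to organize this experiment (a sketch assuming the sample_averaging signature sketched above, initial action values 0, and Matplotlib for plotting):

import matplotlib.pyplot as plt

steps, episodes = 5000, 100

for eps in [0, 0.01, 0.1, 0.5, 1]:
    # average the reward curves over many episodes, each with a fresh environment
    mean_curve = np.zeros(steps)
    for _ in range(episodes):
        env = Env(stationary=True)
        mean_curve += np.array(sample_averaging(env, eps, 0, [0] * 10, steps))
    plt.plot(mean_curve / episodes, label=f'eps = {eps}')

plt.xlabel('step')
plt.ylabel('mean average reward')
plt.legend()
plt.show()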

Solution:

# your solution
# your observations

Task: Plot the average reward vs. step for 100 episodes with \(\varepsilon=0\) and, in a second plot, for 100 episodes with \(\varepsilon=1\) (5000 steps each). Use thin lines so that the individual episodes remain distinguishable. What do you learn from the plots about the averaged values in the previous task?
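
To keep 100 individual curves readable, a small linewidth (and some transparency) can be passed to Matplotlib, for example (again assuming the sketched sample_averaging and plt from above):

for _ in range(100):
    env = Env(stationary=True)
    plt.plot(sample_averaging(env, 0, 0, [0] * 10, 5000),
             linewidth=0.3, alpha=0.5)
plt.xlabel('step')
plt.ylabel('average reward')
plt.show()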

Solution:

# your solution
# your observations

Stationary Problem with Optimistic Initial Values#

Task: Plot the average reward vs. step for \(\varepsilon=0\) and optimistic initial values 0, 0.2, 0.4, …, 2. Use averages over 100 episodes as above.

What do you see and why?
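
The initial values enter only through the q_init argument. A sketch of the loop (assuming the sketched sample_averaging, 5000 steps as above, and averaging over 100 episodes):

steps, episodes = 5000, 100

for q0 in np.linspace(0, 2, 11):    # initial action values 0, 0.2, ..., 2
    mean_curve = np.zeros(steps)
    for _ in range(episodes):
        env = Env(stationary=True)
        mean_curve += np.array(sample_averaging(env, 0, 0, [q0] * 10, steps))
    plt.plot(mean_curve / episodes, label=f'q0 = {q0:.1f}')

plt.xlabel('step')
plt.ylabel('mean average reward')
plt.legend()
plt.show()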

Solution:

# your solution
# your observations

Task: Repeat the previous task with \(\alpha=0.5\).

Solution:

# your solution
# your observations

Non-Stationary Problem#

Task: Run 20 episodes with \(\varepsilon=0\), \(\alpha=0.5\) and 10000 steps in a non-stationary environment. Plot average reward vs. step for all episodes and explain what you see.
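
A possible setup (a sketch; a fresh non-stationary Env is created per episode so that every run starts from the same initial click rates, and the sketched sample_averaging and plt from above are assumed):

for _ in range(20):
    env = Env(stationary=False)    # click rates drift during the episode
    plt.plot(sample_averaging(env, 0, 0.5, [0] * 10, 10000), linewidth=0.5)
plt.xlabel('step')
plt.ylabel('average reward')
plt.show()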

Solution:

# your solution
# your observations

Task: Run episodes for \(\varepsilon\in\{0,0.01,0.1,0.2,1\}\) with \(\alpha=0.5\) and 10000 steps in a non-stationary environment. For each \(\varepsilon\), run 100 episodes and plot the mean of the average rewards over all episodes (vs. step). What do you see and why?

Solution:

# your solution
# your observations