Stateless Learning Tasks (Multi-armed Bandits)#

To introduce and demonstrate important concepts and ideas of reinforcement learning, we start with a simplified setting. We assume that the environment always presents the same state to the agent. Of course, this single state is irrelevant for the agent’s action selection. Thus, from the agent’s point of view the task is stateless.

[Figure: scheme of information flow between agent and environment]

Fig. 77 The agent chooses an action according to its policy. Then the environment sends a reward signal to the agent.#


The Detailed View#

Although the definition of stateless tasks has already been given in the introductory paragraph, we should be clearer about the consequences. Each action of the agent yields a reward (but no new state). At first glance, if the agent knows \(k\) actions, there seem to be at most \(k\) different rewards.

But take care: the reward may be different each time, even if the agent always chooses one and the same action. That the agent always sees the same state does NOT imply that the environment does not change; maybe the agent simply lacks the relevant sensors. Thus, rewards have to be regarded as random numbers with distributions depending on the chosen action. In other words, the sequence of rewards in a stateless task behaves like a sequence of random numbers drawn from \(k\) distributions, where the agent’s action selects the distribution to draw from.

Such settings are known as multi-armed bandits. The \(k\) actions correspond to the \(k\) levers of \(k\) one-armed bandits. Rewards correspond to the total coin output of the bandit machines after pulling one of the levers. Each one-armed bandit may have a different probability distribution for yielding coins. Sometimes the term k-armed bandit is used instead of multi-armed bandit.

[Figure: photo of four one-armed bandits in a row]

Fig. 78 A real-world four-armed bandit made of four one-armed bandits. Source: Wikipedia (public domain), edited by the author.#

Stationarity#

We already mentioned that although the agent always sees the same state, the environment may change as a result of the agent’s actions. But the environment may change due to other influences, too, especially over time. If the reward distributions are fixed, that is, they do not change over time, then the learning task is stationary. If the reward distributions may change over time, then the task is non-stationary.

Stationary tasks only need exploration until the reward distributions are known in sufficient detail to the agent. Then the gathered knowledge can be exploited without any further exploration. Further exploration would decrease the return (long-term accumulated rewards) without any benefit.

For non-stationary tasks exploration should never stop. Otherwise the agent’s behavior cannot adapt to changing reward distributions, and without exploration the return will become lower and lower over time.

Example: Online Advertising#

Let’s consider a simple online advertising model to demonstrate first methods for stateless reinforcement learning tasks. A webpage shows an ad every time some user loads the page. The ad to show is chosen from a pool of \(k\) ads. If the ad gets clicked by the user, the webpage provider earns some money. Of course, the webpage provider wants to show ads that get clicked more often than others. But how can successful ads be identified without prior knowledge of user behavior?

The setting is as follows:

  • actions: show one of \(k\) available ads (so we have \(k\) actions to choose from),

  • reward: 1 if the ad is clicked, 0 otherwise.

To make things precise (and stateless) we have to impose some restrictions:

  • We do not collect user-specific information and we do not care about webpage content. These two assumptions yield a stateless environment.

  • There are only few users. Thus, the next user does not visit the website before the previous user has decided whether or not to click the ad.

  • Users have independent click behavior (they do not talk to each other).

The click behavior for each ad may be represented by an unknown probability that the ad gets clicked. The \(k\) values may be constant over time (stationary task) or they may change due to external factors (non-stationary task). Here ‘external’ means that those factors are not included in the state information provided by the environment (in stateless tasks no state information is provided at all). Typical external factors are seasonal effects, for instance.
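
To make the setting concrete, here is a minimal Python sketch of such an environment. The class name `AdBandit` and the concrete click probabilities are chosen for illustration only; in a real application the probabilities are unknown to the agent.

```python
import random

class AdBandit:
    """Stateless environment: k ads, each with an unknown click probability."""

    def __init__(self, click_probs):
        # hypothetical, fixed click probabilities (stationary task);
        # the agent never gets to see these values
        self.click_probs = click_probs

    @property
    def k(self):
        return len(self.click_probs)

    def step(self, action):
        """Show ad number `action`; reward is 1 if it gets clicked, 0 otherwise."""
        return 1 if random.random() < self.click_probs[action] else 0

# illustrative instance with k = 4 ads
env = AdBandit([0.02, 0.05, 0.01, 0.03])
reward = env.step(1)   # show ad 1 and observe whether it was clicked
```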

There are several other real-world examples of stateless reinforcement learning tasks. An important one is routing in computer networks. Data packets try to find the fastest or the most reliable route to the destination machine. But travel times via different routes are unknown in advance and may change over time. Whether a transmission will be successful or not is unknown, too. The routing software has to learn properties (average speed, reliability) of the available routes over time.

Solution without RL (A/B Testing)#

A standard approach for solving stateless learning tasks is A/B testing. It’s a two-phase approach (sketched in code after the list):

  • In phase 1 all actions are chosen equally often and average rewards for each action are calculated. In this phase we have 100% exploration, but no exploitation.

  • In phase 2 the action with the highest average reward from phase 1 is chosen all the time. Now we have 100% exploitation, but no further exploration.
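
A minimal sketch of A/B testing with the illustrative `AdBandit` environment from above (the function name and parameters are chosen here, not prescribed by the method):

```python
def ab_testing(env, steps_per_action, total_steps):
    """Phase 1: choose every action equally often. Phase 2: always choose the apparent best."""
    k = env.k
    sums = [0.0] * k     # sum of rewards per action (collected in phase 1 only)
    counts = [0] * k     # number of times each action was chosen (phase 1 only)

    # phase 1: 100% exploration
    for a in range(k):
        for _ in range(steps_per_action):
            sums[a] += env.step(a)
            counts[a] += 1

    # phase 2: 100% exploitation of the action with the highest average reward
    best = max(range(k), key=lambda a: sums[a] / counts[a])
    total_reward = sum(sums)
    for _ in range(total_steps - k * steps_per_action):
        total_reward += env.step(best)
    return best, total_reward
```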

This approach has two important drawbacks:

  • In phase 1 actions with low average reward are chosen too often. This is a problem especially with long exploration phases, because the return (accumulated rewards) will be significantly lower than with reinforcement learning methods.

  • If average rewards change over time (non-stationary tasks), either the model is not able to adapt to the new situation or another exploration phase is required. Both variants decrease the return. Without adaptation the decrease will be smaller in the short run, but higher in the long run. With a new exploration phase the decrease of return will be high in the short run, but low (maybe there is no decrease at all) in the long run.

Let’s consider the everyday example of finding a route from your home to some new destination. If you use A/B testing to find the fastest route, you take every available route (including the slow ones!) several times. Then you stick to the fastest route for the rest of your life. Even if new and faster routes appear or if your chosen route becomes slower (more traffic, …), you won’t readjust your decision. That’s not what humans really do.

Solution with RL (Sample Averaging)#

To solve stateless learning tasks with reinforcement learning methods we have to formalize things a bit more. First, we restrict our attention to stationary tasks. Then we adapt the introduced method to non-stationary tasks.

Action Values#

As already mentioned in Maximization of Return we consider value function methods only. Thus, we have to define a state-value function or an action-value function. Of course, state values are useless in a stateless task (there is only one state and actions do not change it). Thus, we use action values.

Action values should express the return expected from choosing the corresponding action. If the return is simply the sum of all rewards over time and if we have a reward function taking only the values 0 and 1, like in the online advertising example above, then for continuing tasks the return will be infinite. For episodic tasks the return will be the product of the number of steps in the episode and the average reward. Thus, in both cases the average reward is a sensible action value:

\[\begin{equation*} q^\ast:A\rightarrow\mathbb{R},\qquad q^\ast(a):=\text{expected reward for action $a$}, \end{equation*}\]

where \(A\) denotes the set of all available actions. Note that by ‘expected reward’ we denote the theoretical average reward, that is, the mean of the underlying reward distribution. By ‘average reward’ we denote the observed average, that is, the empirical mean of the reward distribution.

If we knew \(q^\ast\), we could simply choose the action with the highest \(q^\ast(a)\) (that is, we could follow the greedy policy with respect to \(q^\ast\)). But \(q^\ast\) is unknown and has to be estimated from interaction with the environment.

In A/B testing we would follow the random choice policy in phase 1 (exploration) to get estimates for all \(q^\ast(a)\). Then, in phase 2 (exploitation), we would follow the greedy policy w.r.t. \(q^\ast\).

In reinforcement learning we use the \(\varepsilon\)-greedy policy w.r.t. some estimate of \(q^\ast\) and improve the estimate in each step.
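
As a building block for the algorithms below, here is one possible implementation of \(\varepsilon\)-greedy action selection over a table of estimated action values (a sketch; the function name and signature are not fixed by the text):

```python
import random

def epsilon_greedy(Q, epsilon):
    """With probability epsilon choose a random action, else a greedy one w.r.t. Q."""
    k = len(Q)
    if random.random() < epsilon:
        return random.randrange(k)            # explore
    best_value = max(Q)
    best_actions = [a for a in range(k) if Q[a] == best_value]
    return random.choice(best_actions)        # exploit (break ties at random)
```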

Estimating Action Values#

Assume that we are at time step \(t\in\mathbb{N}\), that is, we’ve already chosen \(t-1\) actions \(a_1,\ldots,a_{t-1}\) and we have observed corresponding \(t-1\) rewards \(r_1,\ldots,r_{t-1}\). Then a straightforward estimate for the inaccessible action-value function \(q^\ast\) at time \(t\) is

\[\begin{equation*} Q_t:A\rightarrow\mathbb{R},\qquad Q_t(a):=\frac{\text{sum of rewards obtained from choosing $a$ in first $t-1$ steps}}{\text{number of times $a$ has been chosen in first $t-1$ steps}}, \end{equation*}\]

where we set \(\frac{0}{0}:=0\). This is the empirical mean reward for each action.

Note that setting initial estimates \(Q_1(a)\) to zero for all actions \(a\) is somewhat arbitrary. Below we’ll discuss the influence of initial values on the agent’s behavior and how to choose appropriate ones.

If we combine this value function estimate with the \(\varepsilon\)-greedy policy, we obtain the following algorithm for solving stateless learning tasks:

Algorithm: sample averaging

  1. Choose \(\varepsilon\in(0,1)\).

  2. For \(t=1,2,\ldots\) repeat:

    1. With probability \(1-\varepsilon\) take the action maximizing \(Q_t\). With probability \(\varepsilon\) take a random action.

    2. Observe reward and calculate \(Q_{t+1}\).
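
A Python sketch of this sample-averaging algorithm, using the illustrative `AdBandit` environment and the `epsilon_greedy` helper introduced above; per-action reward sums and counts implement the definition of \(Q_t\) directly:

```python
def sample_averaging(env, epsilon=0.1, steps=10_000):
    """epsilon-greedy action selection with empirical mean rewards as action values."""
    k = env.k
    Q = [0.0] * k        # current estimates Q_t(a), initialized to zero
    sums = [0.0] * k     # sum of rewards obtained from each action so far
    counts = [0] * k     # how often each action has been chosen so far

    total_reward = 0.0
    for _ in range(steps):
        a = epsilon_greedy(Q, epsilon)
        r = env.step(a)
        total_reward += r
        sums[a] += r
        counts[a] += 1
        Q[a] = sums[a] / counts[a]    # empirical mean reward for action a
    return Q, total_reward
```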

Efficient Implementation#

Calculation of the value function estimate \(Q_t\) given by the formula above requires storing all actions taken and rewards observed so far. The sums to calculate become longer and longer over time. But a simple reformulation of the formula allows us to compute \(Q_t\) from \(Q_{t-1}\) without the need to store all actions and rewards.

Fix an action \(a\) and a time step \(t\). For \(t=1\) we set \(Q_1(a):=0\). For \(t=2,3,\ldots\) the value of \(Q_t(a)\) differs from \(Q_{t-1}(a)\) only if action \(a\) has been chosen in step \(t-1\). Now assume that in step \(t-1\) action \(a\) has been chosen and denote by \(t_1,t_2,\ldots,t_n\in\{2,\ldots,t-1\}\) the time steps where we already updated the \(Q\) value for \(a\) (that is, \(a\) has been chosen \(n\) times so far, in steps \(t_1-1,\ldots,t_n-1\)). Then

\[\begin{align*} Q_t(a)&=\frac{1}{n+1}\,\left(r_{t-1}+\sum_{i=1}^{n}r_{t_i-1}\right)\\ &=\frac{1}{n+1}\,\bigl(r_{t-1}+n\,Q_{t_n}(a)\bigr)\\ &=Q_{t_n}(a)+\frac{1}{n+1}\,\bigl(r_{t-1}-Q_{t_n}(a)\bigr). \end{align*}\]

Because \(Q_{t_n}(a)=Q_{t_n+1}(a)=\ldots=Q_{t-1}(a)\) we may rewrite this as

\[\begin{equation*} Q_t(a)=Q_{t-1}(a)+\frac{1}{n+1}\,\bigl(r_{t-1}-Q_{t-1}(a)\bigr). \end{equation*}\]

Take care with notation here: \(n\) counts how often action \(a\) has been chosen before step \(t\) and, thus, depends on \(t\) and \(a\). Time steps \(t_1,\ldots,t_n\) are specific to \(a\), too.

From the structure of the incremental formula for \(Q_t(a)\) we see that \(Q_t(a)\) adapts to the most recent reward for action \(a\) with damping factor \(\frac{1}{n+1}\), which decreases over time.
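
In code the incremental formula boils down to a single update per step; a sketch (variable names chosen for illustration):

```python
def update_sample_average(Q, counts, a, r):
    """Incremental update Q_t(a) = Q_{t-1}(a) + (r - Q_{t-1}(a)) / (n + 1)
    after action a has been chosen and reward r has been observed."""
    counts[a] += 1                    # counts[a] now equals n + 1
    Q[a] += (r - Q[a]) / counts[a]
```

Only the current estimates and the per-action counts have to be stored.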

The Non-stationary Case#

Sample averaging adapts only very slowly to changing reward probabilities, because the more steps we already did the less influence newly observed rewards will have.

There are two simple methods to make sample averaging more adaptive and, thus, more suited for non-stationary tasks:

  • Use moving averages, that is, only consider the last \(N\) rewards for estimating the value function \(q^\ast\) with some fixed \(N\).

  • Use weighted averages with weights that decrease with the age of the reward.

Implementation of the first idea is straightforward. We have to store the fixed number of most recent rewards and have to use the original formula for calculating average reward, not the incremental one.
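
A sketch of the moving-average estimator, storing the last \(N\) rewards per action in a bounded queue (the class name and the use of `collections.deque` are implementation choices, not prescribed by the method):

```python
from collections import deque

class MovingAverageEstimator:
    """Estimate action values from the N most recent rewards per action."""

    def __init__(self, k, N):
        self.windows = [deque(maxlen=N) for _ in range(k)]

    def update(self, a, r):
        self.windows[a].append(r)   # the oldest reward is dropped automatically if full

    def value(self, a):
        window = self.windows[a]
        return sum(window) / len(window) if window else 0.0
```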

The second idea turns out to be a simple, but not so obvious modification of the incremental formula for \(Q_t\). We simply have to replace the factor \(\frac{1}{n+1}\) by a constant \(\alpha\in(0,1]\):

\[\begin{equation*} Q_t(a)=Q_{t-1}(a)+\alpha\,\bigl(r_{t-1}-Q_{t-1}(a)\bigr). \end{equation*}\]

Writing this in a non-incremental way with the notation from above we see the weights:

\[\begin{align*} Q_t(a)&=Q_{t_n}(a)+\alpha\,\bigl(r_{t-1}-Q_{t_n}(a)\bigr)\\ &=\alpha\,r_{t-1}+(1-\alpha)\,Q_{t_n}(a)\\ &=\alpha\,r_{t-1}+(1-\alpha)\,\bigl(\alpha\,r_{t_n-1}+(1-\alpha)\,Q_{t_{n-1}}(a)\bigr)\\ &=\alpha\,r_{t-1}+\alpha\,(1-\alpha)\,r_{t_n-1}+(1-\alpha)^2\,Q_{t_{n-1}}(a)\\ &=\alpha\,r_{t-1}+\alpha\,(1-\alpha)\,r_{t_n-1}+(1-\alpha)^2\,\bigl(\alpha\,r_{t_{n-1}-1}+(1-\alpha)\,Q_{t_{n-2}}(a)\bigr)\\ &=\ldots\\ &=(1-\alpha)^{n+1}\,Q_1(a)+\alpha\,r_{t-1}+\sum_{i=0}^{n-1}\alpha\,(1-\alpha)^{i+1}\,r_{t_{n-i}-1}\\ &=\alpha\,r_{t-1}+\sum_{i=1}^n\alpha\,(1-\alpha)^{n+1-i}\,r_{t_i-1}. \end{align*}\]

The weights \(\alpha\,(1-\alpha)^{n+1-i}\) are the smaller the smaller \(i\) is, that is, old rewards have less influence on \(Q_t(a)\) than more recent rewards. If \(\alpha=1\), all weight is on the most recent reward. If \(\alpha\) is very close to \(0\), then all weights are almost identical.
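
The constant-step-size variant needs neither stored rewards nor counts; a minimal sketch:

```python
def update_constant_alpha(Q, a, r, alpha=0.1):
    """Exponential recency-weighted average: recent rewards get larger weights than old ones."""
    Q[a] += alpha * (r - Q[a])
```

Here \(\alpha=0.1\) is just an example value; the larger \(\alpha\), the faster old rewards are forgotten.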

Optimistic Initialization#

Initialization of \(Q_1\) to zero is somewhat arbitrary. Usually rewards are positive and initialization to zero corresponds to ‘no rewards up to now’. But we may use non-zero initial action values to foster exploration in early steps.

If we initialize \(Q_1\) to some fixed positive number greater than the reward values of all actions, then even with the greedy policy the agent will choose all actions several times before settling on the action with the highest average reward. The first action chosen (by chance, because all \(Q_1\) values are equal) will lower the high initial value for that action. In the second step all other actions then have higher action values than the action chosen first. Thus, another action will be chosen. Regular behavior of the policy does not start until all actions have been chosen several times and the influence of the initial action values has vanished.

The idea of using high initial action values is known as optimistic initialization. Of course, it’s not suited for non-stationary tasks, because the increased exploration tendency is restricted to the first steps. But for stationary tasks optimistic initialization can be an alternative to the \(\varepsilon\)-greedy policy or it may allow for a smaller \(\varepsilon\).
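
Optimistic initialization only changes the starting values of the estimates; a sketch under the assumption that, as in the advertising example, all rewards lie in \([0,1]\):

```python
def optimistic_Q(k, optimistic_value=5.0):
    """Initial action values well above any reachable reward, so that even a
    purely greedy policy tries every action several times at the start."""
    return [optimistic_value] * k

# hypothetical usage with the helpers from above:
# Q = optimistic_Q(env.k)
# a = epsilon_greedy(Q, epsilon=0.0)   # greedy policy, exploration driven by optimism only
```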