Monte Carlo Methods#


In statistics, ‘Monte Carlo’ means ‘do something random’. Monte Carlo methods in reinforcement learning use (more or less) random policies to explore the environment and average observed returns to obtain estimates of value functions. While Monte Carlo methods may be extended to continuing tasks, here we restrict our attention to episodic tasks.

Monte Carlo methods require running a relatively large number of episodes to find a good policy. Thus, whenever possible, we should use a simulator of the real environment to find a sufficiently good policy before running the agent in its real-world environment. If this isn’t possible (because there is no simulator), Monte Carlo methods aren’t a good choice.

The major question to answer when developing Monte Carlo methods is how to ensure that all states are visited sufficiently often to get accurate information about rewards and follow-up states (remember, the environment may have random features). Here we discuss three different approaches:

  • exploring starts (choose random start states and actions),

  • \(\varepsilon\)-soft policies (all actions have nonzero probability),

  • off-policy methods (use separate policies for exploration and exploitation).

Exploring Starts#

If we have a simulated environment at hand, for each episode we may choose the initial state and the first action at random. This ensures thorough exploration of the environment if the number of episodes run is high enough. Starting from some initial random policy, we alternate policy evaluation and policy improvement in the same way we did with the policy iteration algorithm in Dynamic Programming.

The policy improvement step chooses a greedy policy w.r.t. the estimated value function. We do not need to use an \(\varepsilon\)-greedy policy, because exploration is guaranteed by random initial states and actions. To estimate state or action values we compute mean returns over relevant episodes or subtrajectories thereof.

To estimate the value (expected return) \(q_\pi(s,a)\) of some state-action pair \((s,a)\) if following policy \(\pi\), take all (sub-)traces available so far starting in \(s\) with action \(a\), calculate corresponding returns, and set \(Q_\pi(s,a)\) to be the average of all those returns. Here, \(Q_\pi(s,a)\) is the estimate for the unknown exact value \(q_\pi(s,a)\). This procedure requires storing all traces, which may be infeasible due to memory limitations. Efficiency can be increased by processing a trace directly after finishing the episode and only storing observed returns for each state-action pair:

  1. For each state-action pair \((s,a)\) let \(G(s,a)\) be an empty list (of observed returns).

  2. After each episode with trace \(s_0,a_0,r_1,s_1,\ldots s_{T-1},a_{T-1},r_T,s_T\) do:

    1. Set \(g:=0\).

    2. For \(t=T-1,T-2,\ldots,0\) do:

      1. Replace the value of \(g\) by \(r_{t+1}+\gamma\,g\).

      2. Append \(g\) to \(G(s_t,a_t)\).

      3. Set \(Q_\pi(s_t,a_t)\) to the average of all returns in \(G(s_t,a_t)\).

Sometimes this procedure is formulated for the first occurrence of a state-action pair in a trace only. Ignoring repeated visits to a state-action pair within one episode makes theoretical investigation much simpler, but is disadvantageous in practice. We should exploit collected data as much as possible.
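To make the procedure above concrete, here is a minimal Python sketch of the every-visit variant with exploring starts. The simulator interface `run_episode(s0, a0, policy)`, which returns the trace as a list of (state, action, reward) triples, as well as `states` and `actions(s)`, are hypothetical and only serve illustration.

```python
import random
from collections import defaultdict

def mc_exploring_starts_evaluation(run_episode, states, actions, policy,
                                   num_episodes=10_000, gamma=0.9):
    """Every-visit Monte Carlo estimation of Q_pi with exploring starts (sketch)."""
    returns = defaultdict(list)   # G(s, a): list of observed returns
    Q = defaultdict(float)        # Q(s, a): current estimate

    for _ in range(num_episodes):
        s0 = random.choice(states)        # exploring start: random initial state ...
        a0 = random.choice(actions(s0))   # ... and random first action
        trace = run_episode(s0, a0, policy)

        g = 0.0
        for s, a, r in reversed(trace):   # process the trace backwards
            g = r + gamma * g             # return observed from (s, a) onwards
            returns[(s, a)].append(g)     # every visit counts
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```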

The exploring starts approach works with state values, too. But for the policy improvement step we then have to find a greedy policy w.r.t. the action value function, and calculating action values from state values requires (complete) knowledge or good estimates of the environment dynamics. Estimating the environment dynamics is more difficult than estimating action values directly. If the environment behaves deterministically, however, estimating state values is feasible and more efficient than estimating action values (there are far fewer state values than action values).

Depending on the concrete task to solve, policy evaluation may be reduced to only one trace per improvement step, if traces are sufficiently rich.

ε-Soft Policies#

If the exploring starts approach is not feasible (as in most real-world settings without a simulator) but one wants to stick to on-policy methods, one has to use policies that allow for exploration.

A policy \(\pi\) is \(\varepsilon\)-soft for some \(\varepsilon\in(0,1)\) if each action has nonzero probability bounded below by

\[\begin{equation*} \pi(a,s)\geq\frac{\varepsilon}{|\mathcal{A}(s)|}, \end{equation*}\]

where \(|\mathcal{A}(s)|\) denotes the number of actions available in state \(s\). Obviously, each \(\varepsilon\)-greedy policy is \(\varepsilon\)-soft.
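As a small sketch, checking whether a given policy is \(\varepsilon\)-soft amounts to verifying the bound above for every state-action pair; the representation of the policy as a mapping `pi[s][a]` of action probabilities is an assumption.

```python
def is_epsilon_soft(pi, epsilon):
    """Check pi(a, s) >= epsilon / |A(s)| for all states s and all actions a.

    `pi` is assumed to map each state to a dict of action probabilities.
    """
    return all(prob >= epsilon / len(action_probs)
               for action_probs in pi.values()
               for prob in action_probs.values())
```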

One can show that an \(\varepsilon\)-greedy policy w.r.t. the action value function \(q_\pi\) of an \(\varepsilon\)-soft policy \(\pi\) is always at least as good as \(\pi\). This is the ‘\(\varepsilon\)-version’ of the policy improvement theorem.

Here is a concrete Monte Carlo method based on \(\varepsilon\)-soft policies:

  1. Choose initial state \(s_0\) (same for all episodes).

  2. Choose some initial \(\varepsilon\)-soft policy \(\pi\) (uniformly random, for instance).

  3. Get an estimate \(Q_\pi\) of the value function \(q_\pi\) from running several episodes (identical to exploring starts approach).

  4. Replace \(\pi\) by an \(\varepsilon\)-greedy policy w.r.t. the value function estimate.

  5. Go to 3 as long as the policy changes.

If \(a_{\mathrm{g}}(s)\) maximizes \(Q_\pi(s,\cdot)\) in step 4, an \(\varepsilon\)-greedy policy is given by

\[\begin{equation*} \pi(a,s)=\begin{cases} 1-\varepsilon+\frac{\varepsilon}{|\mathcal{A}(s)|},&\text{if }a=a_{\mathrm{g}}(s),\\ \frac{\varepsilon}{|\mathcal{A}(s)|},&\text{else}. \end{cases} \end{equation*}\]

Learning should start with a relatively large \(\varepsilon\), which should then be decreased over time.
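A sketch of the policy improvement step 4, deriving the \(\varepsilon\)-greedy action probabilities from the current estimate; the representation of \(Q_\pi\) as a dictionary keyed by (state, action) pairs and the helpers `states` and `actions(s)` are assumptions.

```python
def epsilon_greedy_policy(Q, states, actions, epsilon):
    """Return action probabilities pi[s][a] that are epsilon-greedy w.r.t. Q."""
    pi = {}
    for s in states:
        acts = actions(s)
        greedy = max(acts, key=lambda a: Q[(s, a)])   # a_g(s)
        base = epsilon / len(acts)                    # epsilon / |A(s)|
        pi[s] = {a: base for a in acts}               # exploration mass
        pi[s][greedy] = 1.0 - epsilon + base          # extra mass on the greedy action
    return pi
```

Decreasing \(\varepsilon\) during learning then simply means calling this function with smaller values of \(\varepsilon\) in later iterations.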

Importance Sampling#

Importance sampling is a technique to estimate value functions for one policy from returns obtained by following another policy. This allows us to develop off-policy methods. Such methods use a more or less constant behavior policy to explore the environment and continuously improve a target policy from the collected data. After sufficient exploration the target policy can be used to control the agent. This is very similar to A/B testing in Stateless Learning Tasks (Multi-armed Bandits). The target policy does not explore at all. Thus, for non-stationary tasks exploration phases for updating the target policy from time to time have to be incorporated into the agent’s overall behavior.

The overall algorithm is like before: alternate policy evaluation (estimate value function) and policy improvement (replace policy by greedy policy). But in the policy evaluation step we now have to estimate the value function of the target policy from data collected via the behavior policy.

The advantage of off-policy methods is that they are able to find optimal policies. In contrast, on-policy methods only yield policies which always do some (non-optimal) exploration, too. Exploration in off-policy methods is faster, because the behavior policy may make the agent behave more randomly than the (iteratively optimized) policy in on-policy methods. On the other hand, the transformation of returns obtained from the behavior policy to returns for the target policy may introduce inaccuracies requiring more episodes than on-policy methods to obtain good results.

Estimating Action Values#

Let \(\pi\) denote the current target policy and \(b\) the current behavior policy. To get estimates for action values \(q_\pi\) from returns obtained by following \(b\), the behavior policy has to allow for at least all actions which may be chosen by \(\pi\):

\[\begin{equation*} \pi(a,s)>0\quad\Rightarrow\quad b(a,s)>0\qquad\text{for all $s\in\mathcal{S}$, $a\in\mathcal{A}(s)$}. \end{equation*}\]

This property is sometimes referred to as the coverage assumption.
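A quick sketch of checking the coverage assumption for policies given as probability functions `pi(a, s)` and `b(a, s)`; these interfaces and the helpers `states` and `actions(s)` are assumptions.

```python
def satisfies_coverage(pi, b, states, actions):
    """Coverage assumption: b(a, s) > 0 whenever pi(a, s) > 0."""
    return all(b(a, s) > 0
               for s in states
               for a in actions(s)
               if pi(a, s) > 0)
```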

To find an estimate \(Q_\pi(s,a)\) of the value \(q_\pi(s,a)\) of a state-action pair \((s,a)\), that is, an estimate of the expected return if starting at \(s\) with action \(a\) and following \(\pi\), we may collect sufficiently many traces starting at \((s,a)\) but following policy \(b\), compute their returns, and then use the standard formula for expected values:

\[\begin{equation*} q_\pi(s,a)\approx\sum_{\text{observed traces}}\text{probability of trace under $\pi$}\times\text{observed return}. \end{equation*}\]

But what is the ‘probability of trace under \(\pi\)’? For a concrete trace \(s_0,a_0,r_1,s_1,\ldots s_{T-1},a_{T-1},r_T,s_T\) with environment dynamics \(p\) we have

\[\begin{equation*} \text{probability of trace under $\pi$}=p(s_1,r_1,s_0,a_0)\,\pi(a_1,s_1)\,p(s_2,r_2,s_1,a_1)\cdots\pi(a_{T-1},s_{T-1})\,p(s_T,r_T,s_{T-1},a_{T-1}). \end{equation*}\]

The probability of observing exactly the same trace if we follow \(b\) instead of \(\pi\) (which is what we actually do) is

\[\begin{equation*} \text{probability of trace under $b$}=p(s_1,r_1,s_0,a_0)\,b(a_1,s_1)\,p(s_2,r_2,s_1,a_1)\cdots b(a_{T-1},s_{T-1})\,p(s_T,r_T,s_{T-1},a_{T-1}). \end{equation*}\]

Thus,

\[\begin{equation*} q_\pi(s,a)\approx\sum_{\text{observed traces}}\text{probability of trace under $b$}\times\frac{\pi(a_1,s_1)\cdots\pi(a_{T-1},s_{T-1})}{b(a_1,s_1)\cdots b(a_{T-1},s_{T-1})}\times\text{observed return}. \end{equation*}\]

The term \(\frac{\pi(a_1,s_1)\cdots\pi(a_{T-1},s_{T-1})}{b(a_1,s_1)\cdots b(a_{T-1},s_{T-1})}\) is the importance sampling ratio of the trace under consideration.

The probability of observing a trace under \(b\) is a theoretical value. In practice we have to replace it by the relative frequency of the trace within all episodes under consideration. If each trace is observed only once or if we use one summand per episode, even if multiple episodes yielded the same trace, then ‘probability of trace under \(b\)’ can be replaced by 1 over the number of episodes:

\[\begin{equation*} q_\pi(s,a)\approx\sum_{\text{observed traces}}\frac{1}{\text{number of episodes}}\times\frac{\pi(a_1,s_1)\cdots\pi(a_{T-1},s_{T-1})}{b(a_1,s_1)\cdots b(a_{T-1},s_{T-1})}\times\text{observed return}. \end{equation*}\]
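As a sketch, the importance sampling ratio of a single trace and the resulting ordinary importance sampling estimate could be computed as follows; the representation of a trace as a list of (state, action, reward) triples and the policy interfaces `pi(a, s)` and `b(a, s)` are assumptions.

```python
def importance_sampling_ratio(trace, pi, b):
    """Ratio pi(a_1, s_1)...pi(a_{T-1}, s_{T-1}) / (b(a_1, s_1)...b(a_{T-1}, s_{T-1})).

    `trace` is a list [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]; the first action a_0
    is fixed by the state-action pair under consideration and hence excluded.
    """
    ratio = 1.0
    for s, a, _ in trace[1:]:   # skip the given pair (s_0, a_0)
        ratio *= pi(a, s) / b(a, s)
    return ratio

def ordinary_is_estimate(traces, returns, pi, b):
    """Ordinary importance sampling: mean of ratio-weighted returns over all episodes."""
    n = len(traces)
    return sum(importance_sampling_ratio(trace, pi, b) * g
               for trace, g in zip(traces, returns)) / n
```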

Although the above derivation is straightforward, in practice the obtained estimate for \(q_\pi\) yields very poor results. The estimate often shows extremely large or small values. The reason lies in the difference between the probability of observing a trace and the relative frequency of the trace within the set of all observed traces. Importance sampling ratios transform probabilities of traces under \(b\) to probabilities of traces under \(\pi\). But in practice we transform relative frequencies to relative frequencies. For a small number of observed traces the transformed relative frequencies will not sum to 1. Thus, the transformed relative frequencies do not correspond to a probability distribution anymore. To overcome this problem we should divide the transformed relative frequencies by their total sum. With

\[\begin{equation*} h(\text{trace}):=\frac{\pi(a_1,s_1)\cdots\pi(a_{T-1},s_{T-1})}{b(a_1,s_1)\cdots b(a_{T-1},s_{T-1})} \end{equation*}\]

we thus set

\[\begin{equation*} Q_\pi(s,a):=\frac{\sum\limits_{\text{observed traces}}h(\text{trace})\times\text{observed return}}{\sum\limits_{\text{observed traces}}h(\text{trace})}. \end{equation*}\]

This way of estimating \(q_\pi(s,a)\) is known as weighted importance sampling. Without division by the sum of transformed relative frequencies it’s ordinary importance sampling.

If \(\pi\) is 0 for some state-action pair of an observed trace, then the whole trace is observed with probability 0 if following \(\pi\). Such traces may appear if following the behavior policy \(b\), but can be safely ignored when computing \(Q_\pi\), because such traces do not carry any information about expected returns for \(\pi\). If \(\pi\) is non-zero for all state-action pairs of a trace, the coverage assumption formulated above ensures that the importance sampling ratio is well-defined for that trace.
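The weighted variant only changes the normalization. A sketch reusing the hypothetical `importance_sampling_ratio` from above; traces that are impossible under \(\pi\) get ratio 0 and are thereby ignored automatically:

```python
def weighted_is_estimate(traces, returns, pi, b):
    """Weighted importance sampling estimate of q_pi(s, a) from traces starting at (s, a)."""
    num = den = 0.0
    for trace, g in zip(traces, returns):
        h = importance_sampling_ratio(trace, pi, b)   # 0 for traces impossible under pi
        num += h * g
        den += h
    return num / den if den > 0 else 0.0              # no informative trace observed yet
```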

The Algorithm#

Before we discuss some important limitations of the off-policy approach, we state the full algorithm for the off-policy Monte Carlo method with weighted importance sampling. A major difference to the derivation above is that we also consider all subtraces in addition to the complete trace of each episode to get as much information as possible from the available data.

  1. Let \(b\) be an arbitrary policy with \(b(a,s)>0\) for all \(s\) and \(a\).

  2. Set \(Q(s,a)\) to arbitrary values (usually all zeros).

  3. Let \(\pi\) be a greedy policy with some positive probability mass on all \(Q\)-maximizing actions.

  4. Set \(c(s,a):=0\) for all pairs \((s,a)\) (sum of importance sampling ratios of all traces starting with \((s,a)\)).

  5. Run episodes following \(b\). After each episode with trace \(s_0,a_0,r_1,s_1,\ldots s_{T-1},a_{T-1},r_T,s_T\) do:

    1. Set \(g:=0\) (return).

    2. Set \(h:=1\) (importance sampling ratio).

    3. For \(t=T-1,T-2,\ldots,0\) do:

      1. If \(\pi(a_t,s_t)=0\), stop iteration and proceed with the next episode.

      2. Replace the value of \(g\) by \(r_{t+1}+\gamma\,g\).

      3. Replace the value of \(h\) by \(\frac{\pi(a_{t},s_{t})}{b(a_{t},s_{t})}\,h\).

      4. Replace the value of \(c(s_t,a_t)\) by \(c(s_t,a_t)+h\).

      5. Replace \(Q_\pi(s_t,a_t)\) by \(Q_\pi(s_t,a_t)+\frac{h}{c(s_t,a_t)}\,(g-Q_\pi(s_t,a_t))\).

    4. Let \(\pi\) be a greedy policy with some positive probability mass on all \(Q\)-maximizing actions.

    5. Modify \(b\) if there is some reason to do so (depends on the application).

The algorithm may be stopped if changes in \(Q_\pi\) are below some prescribed bound.
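Putting the steps together, a compact Python sketch of the whole loop might look as follows. The helpers `run_episode(b)` (running one episode under the behavior policy and returning the trace as a list of (state, action, reward) triples) and `actions(s)` are assumptions, and for simplicity the sketch runs a fixed number of episodes instead of checking a convergence criterion.

```python
from collections import defaultdict

def off_policy_mc_control(run_episode, actions, b, num_episodes=100_000, gamma=0.9):
    Q = defaultdict(float)   # Q(s, a), arbitrarily initialized to zero
    c = defaultdict(float)   # c(s, a): sum of importance sampling ratios

    def pi(a, s):
        # greedy target policy with positive mass on all Q-maximizing actions
        acts = actions(s)
        best = max(Q[(s, x)] for x in acts)
        maximizers = [x for x in acts if Q[(s, x)] == best]
        return 1.0 / len(maximizers) if a in maximizers else 0.0

    for _ in range(num_episodes):
        trace = run_episode(b)            # run one episode following the behavior policy b
        g, h = 0.0, 1.0                   # return and importance sampling ratio
        for s, a, r in reversed(trace):   # process subtraces starting from the tail
            if pi(a, s) == 0.0:
                break                     # remaining subtraces carry no information about pi
            g = r + gamma * g
            h *= pi(a, s) / b(a, s)
            c[(s, a)] += h
            Q[(s, a)] += (h / c[(s, a)]) * (g - Q[(s, a)])
    return Q, pi
```

Since `pi` reads the current `Q`, the greedy target policy is effectively updated whenever `Q` changes, which matches (and slightly refines) the per-episode policy improvement in the listing above.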

Why we do not use an arbitrary greedy policy for \(\pi\) but a non-deterministic one (if possible) will be discussed in the next subsection.

For each episode we handle each subtrace as a trace on its own. Calculation of returns and importance sampling ratios is done in an incremental manner. Remember that \(Q_\pi(s,a)\) is the weighted average of returns of all (sub-)traces starting at \((s,a)\). If \(h_1,\ldots,h_n\) are the corresponding importance sampling ratios, \(c_k:=\sum_{l=1}^k h_l\) is the sum of the first \(k\) ratios, \(g_1,\ldots,g_n\) are the returns, and \(Q_k:=\frac{1}{c_k}\,\sum_{l=1}^k h_l\,g_l\), then

\[\begin{equation*} c_{k+1}=c_k+h_{k+1} \end{equation*}\]

and

\[\begin{align*} Q_{k+1} &=\frac{h_{k+1}\,g_{k+1}+\sum\limits_{l=1}^k h_l\,g_l}{c_{k+1}}\\ &=\frac{h_{k+1}}{c_{k+1}}\,g_{k+1}+\frac{c_k}{c_{k+1}}\,Q_k\\ &=\frac{h_{k+1}}{c_{k+1}}\,g_{k+1}+\frac{c_{k+1}-h_{k+1}}{c_{k+1}}\,Q_k\\ &=\frac{h_{k+1}}{c_{k+1}}\,g_{k+1}+\left(1-\frac{h_{k+1}}{c_{k+1}}\right)\,Q_k\\ &=Q_k+\frac{h_{k+1}}{c_{k+1}}\,(g_{k+1}-Q_k), \end{align*}\]

which is exactly what happens in the algorithm above.
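A quick numerical check of this identity with made-up ratios and returns:

```python
h = [0.5, 2.0, 1.25, 0.8]   # made-up importance sampling ratios
g = [1.0, -2.0, 0.5, 3.0]   # made-up returns

# batch computation: weighted average of returns
batch = sum(hk * gk for hk, gk in zip(h, g)) / sum(h)

# incremental computation as in the algorithm above
Q, c = 0.0, 0.0
for hk, gk in zip(h, g):
    c += hk
    Q += (hk / c) * (gk - Q)

assert abs(Q - batch) < 1e-12   # both ways agree (up to floating point error)
```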

Traces containing a state-action pair \((s,a)\) with \(\pi(a,s)=0\) do not modify the estimate \(Q_\pi(s,a)\). Thus, we stop processing the subtraces of an episode as soon as we recognize such a pair (remember that we process subtraces starting from the tail).

The policy \(b\) may be changed at any time, for instance to focus exploration on certain aspects or regions of the environment.

Limitations#

Having too many traces with \(\pi(a,s)=0\) for some state-action pair \((s,a)\) should be avoided, because such traces are useless for improving \(\pi\). If \(\pi\) is a deterministic policy, then a trace will contain a pair with zero \(\pi\) as soon as the trace differs from a trace obtained by following \(\pi\). This renders the off-policy approach useless, because all truly exploring traces will be dropped. Only traces identical to traces obtainable via \(\pi\) will have influence on \(Q_\pi\). But then we could use an on-policy approach directly.

The only way out is to keep \(\pi\) non-deterministic as long as possible. During the first iterations that’s not difficult. But after sufficiently many updates to \(Q_\pi\) there will be only one maximizing action for most states. In this stage the off-policy approach behaves almost like an on-policy technique, but with large overhead (most traces are dropped). For deterministic \(\pi\), the only (sub)traces that still contribute to \(Q_\pi\) are those of the following form:

  1. Take a random action in the first step (according to \(b\)).

  2. Follow \(\pi\) for all further steps.

Traces or subtraces not following this scheme will be dropped in case of deterministic \(\pi\).

The described problem may lead to very slow learning. But there also exist some modifications to (at least partially) overcome this limitation. See Sutton/Barto’s book, sections 5.8 and 5.9, for more information.