Policy Evaluation#

In preparation for developing concrete reinforcement learning algorithms, we consider strategies for computing the weights of a value function. We restrict our attention to state-action values (action values for short), because they are much more relevant in practice than state values. For state values everything looks quite similar and can easily be deduced from the considerations we describe for action values.

Given the environment (mapping state-action pairs to states and rewards) and a policy \(\pi\) we want to find the value function \(q_\pi\), that is, the expected return for all state-action pairs. Since we aim at approximate value function methods, we look for an estimate \(Q_w\) of \(q_\pi\), where \(w\in\mathbb{R}^p\) is the vector of weights by which \(Q_w\) is completely determined.

Supervised Learning#

Computing \(Q_w\) is a typical supervised learning problem: find weights \(w\) such that \(Q_w\) maps state-action pairs (inputs) to expected returns (outputs). We have to solve two problems:

  • Collect training samples.

  • Choose a concrete model for \(Q_w\) and calculate \(w\) from training samples.

Collecting training samples is equivalent to exploration in reinforcement learning. But the direct outcome of exploration is rewards, not expected returns. Thus, we need to give some thought to how to deduce expected returns from observed rewards.

In reinforcement learning the model \(Q_w\) is almost always an ANN. Thus, calculating the weights \(w\) requires a loss function, which then is minimized with respect to \(w\) via some gradient descent method.
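To make this concrete, here is a minimal sketch of what a model \(Q_w\) could look like, assuming that a state-action pair is encoded as a feature vector; the feature dimension, the layer sizes, and all names below are illustrative assumptions, not prescriptions. Taken together, the entries of the dictionary play the role of the weight vector \(w\in\mathbb{R}^p\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weights w, organized as a tiny one-hidden-layer network:
# 4 input features for the encoded pair (s, a), 8 hidden units, 1 output (the value).
n_features, n_hidden = 4, 8
w = {
    "W1": rng.normal(scale=0.1, size=(n_hidden, n_features)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(scale=0.1, size=n_hidden),
    "b2": 0.0,
}

def q(w, x):
    """Q_w evaluated at a feature vector x = x(s, a); returns a scalar action value."""
    h = np.tanh(w["W1"] @ x + w["b1"])   # hidden layer
    return w["W2"] @ h + w["b2"]         # scalar output

# Example: evaluate Q_w for some (hypothetical) encoded state-action pair.
x_sa = np.array([0.2, -1.0, 0.5, 1.0])
print(q(w, x_sa))
```

Any differentiable parametric model would do here; the point is only that \(Q_w\) is completely determined by the weight vector \(w\).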

Training Targets#

The training targets, that is, the desired outputs of the model \(Q_w\), are expected returns, but from exploration we only have rewards at hand. Thus, expected returns have to be estimated from observed rewards. We already solved this problem in several different ways for tabular methods. Here we may use exactly the same ideas (a small code sketch illustrating the resulting targets follows the list):

  • Monte Carlo methods: Run a full episode and calculate the actual return for each state-action pair visited in the episode. If a pair is visited multiple times (maybe in several episodes), use the mean of all observed returns as an estimate for the expected return, that is,

    \[\begin{equation*} q_\pi(s,a)\approx\text{mean of observed returns for episodes starting at $(s,a)$}. \end{equation*}\]
  • SARSA (on-policy TD learning): Look one step ahead, that is, if \((s,a)\) results in reward \(r\) and state-action pair \((s',a')\), use

    \[\begin{equation*} q_\pi(s,a)\approx r+\gamma\,Q_w(s',a'). \end{equation*}\]
  • Q-learning (off-policy TD learning): Look one step ahead and use a greedy policy for action selection, that is, if \((s,a)\) results in reward \(r\) and state \(s'\), use

    \[\begin{equation*} q_\pi(s,a)\approx r+\gamma\,\max_{a'}Q_w(s',a'). \end{equation*}\]
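The following Python sketch shows how the three kinds of targets could be computed from observed data; the placeholder Q, the concrete discount factor, and the function names are assumptions made for illustration only.

```python
gamma = 0.9  # discount factor (illustrative value)

def Q(w, s, a):
    """Placeholder for the current approximation Q_w(s, a); stands in for a real model."""
    return 0.0

def mc_target(rewards):
    """Monte Carlo: actual discounted return computed from the observed rewards
    following the visit of (s, a) until the end of the episode.
    For multiply visited pairs, the mean of such returns is used as the target."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def sarsa_target(w, r, s_next, a_next):
    """SARSA: one-step target using the next action actually chosen by the policy."""
    return r + gamma * Q(w, s_next, a_next)

def q_learning_target(w, r, s_next, actions):
    """Q-learning: one-step target using the greedy action in the next state."""
    return r + gamma * max(Q(w, s_next, a) for a in actions)

print(mc_target([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```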

Each of these three tabular methods gives rise to an approximate value function method. Full algorithms for SARSA and Q-learning will be developed in Policy Improvement and Deep Q-Learning, respectively.

Loss Function and Gradient#

Given training samples \((s_1,a_1,y_1),\ldots,(s_n,a_n,y_n)\) with inputs \((s_l,a_l)\) and targets \(y_l\) we want to find the weights \(w\) of \(Q_w\) by gradient descent. The loss function for training is the usual mean squared error, but with additional weight factors \(\mu_l\):

\[\begin{equation*} L(w):=\sum_{l=1}^n\mu_l\,\bigl(Q_w(s_l,a_l)-y_l\bigr)^2 \end{equation*}\]

The weights \(\mu_l\) account for the number of visits to \((s_l,a_l)\) during generation of the training samples. Alternatively, state-action pairs visited several times have to occur several times in the training data, too. Without weights or multiple occurrences of samples all state-action pairs would have identical influence on the training, yielding poor results, because the expected-return estimates for state-action pairs with only few visits suffer from high errors.
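As a minimal sketch, the weighted loss could be implemented as follows; the linear model \(Q_w(s,a)=w^\top x(s,a)\) and all concrete numbers are illustrative assumptions.

```python
import numpy as np

def weighted_mse_loss(w, X, y, mu, Q):
    """L(w) = sum_l mu_l * (Q_w(s_l, a_l) - y_l)^2,
    with X[l] an encoding of (s_l, a_l), y[l] the target, mu[l] the weight factor."""
    preds = np.array([Q(w, x) for x in X])
    return np.sum(mu * (preds - y) ** 2)

# Tiny usage example with a linear model Q_w(s, a) = w^T x(s, a).
Q_lin = lambda w, x: w @ x
w = np.array([0.5, -0.2])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # encoded pairs (s_l, a_l)
y = np.array([1.0, 0.0, 0.5])                        # estimated returns (targets)
mu = np.array([3.0, 1.0, 2.0])                       # visit counts
print(weighted_mse_loss(w, X, y, mu, Q_lin))
```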

Computing the gradient of \(L\) with respect to \(w\) is difficult because the targets \(y_l\) may depend on \(w\), too (in case of SARSA and Q-learning). A simple way out is to ignore this dependency completely:

\[\begin{equation*} \nabla L(w)=2\,\sum_{l=1}^n\mu_l\,\bigl(Q_w(s_l,a_l)-y_l\bigr)\,\nabla_w Q_w(s_l,a_l). \end{equation*}\]

At first glance this seems a bit dubious, but there’s good justification for this approach: To get training samples we had to replace the true targets (expected returns) by some approximation. If we swap this approximation step and the computation of the gradient, the targets in \(L\) do not depend on \(w\) anymore. Thus, computing the gradient of \(L\) is simple. But now the gradient contains expected returns, which are inaccessible in practice. So we replace them by one of the approximations given above. The outcome is completely equivalent to the original approach (approximate targets, then ignore the targets’ dependence on \(w\)).

Sometimes the term semi-gradient is used instead of gradient to emphasize that something non-standard is happening here.
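To illustrate, here is a minimal sketch of a single semi-gradient descent step, again assuming a linear model \(Q_w(s,a)=w^\top x(s,a)\), so that \(\nabla_w Q_w(s,a)=x(s,a)\), and SARSA-style targets; all names and numbers are illustrative. The targets are recomputed with the current \(w\), but treated as constants when differentiating.

```python
import numpy as np

gamma = 0.9   # discount factor (illustrative)
alpha = 0.05  # learning rate (illustrative)

def Q(w, x):
    """Linear model Q_w(s, a) = w^T x(s, a), so grad_w Q_w(s, a) = x(s, a)."""
    return w @ x

def semi_gradient(w, X, X_next, r, mu):
    """Semi-gradient of L(w) with SARSA-style targets y_l = r_l + gamma * Q_w(x'_l).
    The targets depend on w, but this dependence is ignored when differentiating."""
    y = r + gamma * np.array([Q(w, xn) for xn in X_next])  # targets, treated as constants
    errors = np.array([Q(w, x) for x in X]) - y
    return 2.0 * np.sum((mu * errors)[:, None] * X, axis=0)

# One gradient descent step on illustrative data.
w = np.zeros(2)
X      = np.array([[1.0, 0.0], [0.0, 1.0]])  # encodings of (s_l, a_l)
X_next = np.array([[0.0, 1.0], [1.0, 0.0]])  # encodings of (s'_l, a'_l)
r      = np.array([1.0, 0.0])                # observed rewards
mu     = np.array([2.0, 1.0])                # visit-count weights
w = w - alpha * semi_gradient(w, X, X_next, r, mu)
print(w)
```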