Policy Improvement#

For approximate value functions policy improvement should work as usual: get a good estimate of the value function for some initial policy, derive the corresponding (\(\varepsilon\)-)greedy policy, get an estimate of the value function for the new policy, and so on. This procedure should yield better policies (higher expected returns) step by step. But looking at the details we face several problems.

Problem 1: Large Action Space#

Given a value function estimate \(Q_w\), following the corresponding (\(\varepsilon\)-)greedy policy requires maximization of \(Q_w(s,a)\) with respect to \(a\) (with the current state \(s\) fixed). For large action spaces calculating \(Q_w(s,a)\) for all \(a\) is too expensive or even impossible (due to memory limitations). Thus, numerical optimization techniques have to be employed, which typically are slow, expensive and unreliable.

With ANNs in mind, gradient-based numerical optimization with respect to actions amounts to maximizing the ANN output with respect to (some of) the ANN inputs. This requires automatic differentiation as well as one ANN evaluation per gradient step.
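
To make this concrete, here is a minimal sketch of gradient ascent over the action. A hand-coded toy function with a known gradient stands in for \(Q_w\); with an ANN the gradient with respect to the action would come from automatic differentiation. All names and numbers are made up for illustration.

```python
# Toy stand-in for Q_w(s, a): a differentiable function of the action.
# With an ANN the gradient w.r.t. the action would be supplied by
# automatic differentiation; here it is known analytically.
def q(state, action):
    best = 0.3 * state                      # action with the highest value for this state
    return 1.0 - (action - best) ** 2

def q_grad_action(state, action):
    best = 0.3 * state
    return -2.0 * (action - best)

def greedy_action(state, a0=0.0, lr=0.1, steps=100):
    """Approximate argmax_a Q(state, a) by gradient ascent on the action."""
    a = a0
    for _ in range(steps):
        a += lr * q_grad_action(state, a)   # one gradient evaluation per step
    return a

print(greedy_action(2.0))   # converges towards 0.6 for this toy Q
```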

Examples for large action spaces are the steering angle in autonomous driving (continuous action space) and choosing web ads (each available ad is an action).

Up to now there is no commonly accepted reinforcement learning technique for large action spaces!

The only way out is to enforce sufficiently small discrete action spaces. Continuous action spaces have to be discretized and similar discrete actions may have to be merged into one action. For steering angles this could mean, for instance, that the only allowed angles are multiples of 10 degrees.
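
As a small illustration, a possible discretization of the steering example might look as follows. The angle range and the 10-degree grid are made-up choices, not values from the text above.

```python
import numpy as np

# Hypothetical action set: steering angles restricted to multiples of 10 degrees.
ALLOWED_ANGLES = np.arange(-30, 31, 10)        # [-30, -20, ..., 30]

def discretize(angle_deg):
    """Map a continuous steering angle to the nearest allowed discrete action."""
    return ALLOWED_ANGLES[np.argmin(np.abs(ALLOWED_ANGLES - angle_deg))]

print(discretize(17.3))   # -> 20
```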

Problem 2: No Optimal Policy#

For tabular methods we know that there always is an optimal policy. Remember that a policy is optimal if its value function is nowhere smaller than the value function of any other policy. For approximate value function methods this result no longer holds true.

The reason is not quite obvious. In approximate value function methods we first decide on a (then fixed) parametrization scheme. All value function estimates have to follow this scheme, that is, they all have the same structure with the same number of weights (parameters). The only difference is the concrete values of the weights. Even if an optimal value function exists, there is little chance that it can be represented by the chosen parametrization scheme. If the optimal value function is not representable by the chosen scheme, the value function estimates of all (!) policies will differ significantly from it. Even if we have found an optimal policy we won't recognize this fact, because its value function estimate will differ significantly from the optimal value function and, in addition, in practice we have no prior knowledge about the optimal value function.

To illustrate this problem take a reinforcement learning task with four different states \(s_1,s_2,s_3,s_4\) and five actions A, B, C, D, E. The figure below defines the behavior of the deterministic environment.


Fig. 80 The environment dynamics for the deterministic environment. Taking action B in state \(s_1\) yields reward -1 and brings the environment into state \(s_3\), for instance. In state \(s_1\) only actions A and B are available, in state \(s_2\) there’s only one action, and so on.#

Our model for value function estimates \(Q_w\) has three weights and the following structure:

\[\begin{align*} Q_w(s_1,\mathrm{A})&:=w_1\\ Q_w(s_1,\mathrm{B})&:=w_1\\ Q_w(s_2,\mathrm{C})&:=w_2\\ Q_w(s_3,\mathrm{D})&:=w_2\\ Q_w(s_4,\mathrm{E})&:=w_3 \end{align*}\]

Note that there are fewer weights than state-action pairs, which is the typical situation for large state and action spaces.

For simplicity we set the discounting parameter \(\gamma:=0\), that is, return equals immediate reward. The optimal value function then assigns immediate rewards to all state-action pairs:

\[\begin{align*} q_\ast(s_1,\mathrm{A})&:=1\\ q_\ast(s_1,\mathrm{B})&:=-1\\ q_\ast(s_2,\mathrm{C})&:=-1\\ q_\ast(s_3,\mathrm{D})&:=1\\ q_\ast(s_4,\mathrm{E})&:=0\\ \end{align*}\]

For the task under consideration there are only two (deterministic) policies, because there’s only one state (\(s_1\)) allowing for more than one action. Let \(\pi_{\mathrm{A}}\) be the policy choosing action A in \(s_1\) and let \(\pi_{\mathrm{B}}\) be the other policy. Denote the weight vectors for corresponding value function estimates by \(w^{\mathrm{A}}\) and \(w^{\mathrm{B}}\).

With \(\pi_{\mathrm{A}}\) the state-action pair \((s_1,\mathrm{B})\) is never visited. Thus, \(w^{\mathrm{A}}_1\) is determined by the reward for taking A in \(s_1\). The pair \((s_3,\mathrm{D})\) is never visited either (except when starting in \(s_3\)). Thus, \(w^{\mathrm{A}}_2\) is determined by the reward for taking C in \(s_2\). Analogous observations hold for \(w^{\mathrm{B}}_1\) (determined by taking B in \(s_1\)) and \(w^{\mathrm{B}}_2\) (determined by taking D in \(s_3\)). The value function estimates are:

\[\begin{align*} Q_{w^{\mathrm{A}}}(s_1,\mathrm{A})&:=1&\qquad\qquad Q_{w^{\mathrm{B}}}(s_1,\mathrm{A})&:=-1\\ Q_{w^{\mathrm{A}}}(s_1,\mathrm{B})&:=1&\qquad\qquad Q_{w^{\mathrm{B}}}(s_1,\mathrm{B})&:=-1\\ Q_{w^{\mathrm{A}}}(s_2,\mathrm{C})&:=-1&\qquad\qquad Q_{w^{\mathrm{B}}}(s_2,\mathrm{C})&:=1\\ Q_{w^{\mathrm{A}}}(s_3,\mathrm{D})&:=-1&\qquad\qquad Q_{w^{\mathrm{B}}}(s_3,\mathrm{D})&:=1\\ Q_{w^{\mathrm{A}}}(s_4,\mathrm{E})&:=0&\qquad\qquad Q_{w^{\mathrm{B}}}(s_4,\mathrm{E})&:=0 \end{align*}\]

Because

\[\begin{equation*} Q_{w^{\mathrm{A}}}(s_3,\mathrm{D})<q_\ast(s_3,\mathrm{D})\qquad\text{and}\qquad Q_{w^{\mathrm{B}}}(s_1,\mathrm{A})<q_\ast(s_1,\mathrm{A}), \end{equation*}\]

we cannot decide from the value function estimates which of the two policies is the optimal one: neither estimate coincides with the optimal value function, even though \(\pi_{\mathrm{A}}\) clearly is the better policy here.
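
The value function estimates above can be reproduced numerically. With \(\gamma=0\) the fitting target for every visited state-action pair is its immediate reward, so a simple stochastic gradient fit of the three-weight linear model over the pairs a policy visits yields the weight vectors \(w^{\mathrm{A}}\) and \(w^{\mathrm{B}}\). The sketch below hard-codes the visit sets from the argument above instead of simulating the environment (visits to \((s_4,\mathrm{E})\) are assumed for completeness); it is only an illustration.

```python
import numpy as np

# Linear model Q_w(s, a) = w . x(s, a) with the three-weight feature map from the text.
features = {("s1", "A"): np.array([1.0, 0.0, 0.0]),
            ("s1", "B"): np.array([1.0, 0.0, 0.0]),
            ("s2", "C"): np.array([0.0, 1.0, 0.0]),
            ("s3", "D"): np.array([0.0, 1.0, 0.0]),
            ("s4", "E"): np.array([0.0, 0.0, 1.0])}
rewards = {("s1", "A"): 1.0, ("s1", "B"): -1.0, ("s2", "C"): -1.0,
           ("s3", "D"): 1.0, ("s4", "E"): 0.0}

# With gamma = 0 the target is the immediate reward of each visited pair.
visited = {"pi_A": [("s1", "A"), ("s2", "C"), ("s4", "E")],
           "pi_B": [("s1", "B"), ("s3", "D"), ("s4", "E")]}

for policy, pairs in visited.items():
    w = np.zeros(3)
    for _ in range(2000):                        # stochastic gradient descent on the squared error
        pair = pairs[np.random.randint(len(pairs))]
        x = features[pair]
        w += 0.1 * (rewards[pair] - w @ x) * x
    print(policy, np.round(w, 2))                # pi_A: [ 1. -1.  0.],  pi_B: [-1.  1.  0.]
```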

To solve the problem that finding an optimal policy might be impossible with approximate value functions, we will have to weaken the notion of optimality, see below.

Problem 3: No Improvement#

Improving the value estimate for one state-action pair does not necessarily improve the corresponding (\(\varepsilon\)-)greedy policy, because other values will change, too. We may not even be able to say which policy is better, the original or the 'improved' one.

The only way out is to weaken the notion of ‘better’ when comparing policies.

Weakening the Notion of Optimality#

Motivated by problems 2 and 3 above we want to weaken the notion of ‘better’ when comparing policies, which then implies a weaker notion of optimality, too.

Remember that for tabular methods the notion of optimality of a policy is based on pointwise properties of state value functions. It is exactly this pointwise approach that leads to problems 2 and 3 above for policies based on approximate value functions. A way out is to find a quality measure for policies consisting of a single real number. Then we may compare the quality of policies via that number.

Given the state value function \(v_\pi\) of a policy \(\pi\) the average state value of \(\pi\) is

\[\begin{equation*} V(\pi):=\sum_{s\in\mathcal{S}}\mu_\pi(s)\,v_\pi(s), \end{equation*}\]

where \(\mu_\pi(s)\) is the probability of being in state \(s\) in the long run when following \(\pi\) (the stationary state distribution induced by \(\pi\)). In particular, \(\mu_\pi\) satisfies \(\mu_\pi(s')=\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{r\in\mathcal{R}}p(s',r,s,a)\), a property we will use below. In case of infinitely many discrete states \(V(\pi)\) is given by a series. In case of a continuous state space \(V(\pi)\) is given by an integral and \(\mu_\pi\) is a probability density function. In all that follows we assume that there are only finitely many states to simplify notation. But the derived algorithms will work for infinite and continuous state spaces, too.

We now may say that a policy \(\pi_1\) is at least as good as \(\pi_2\) if \(V(\pi_1)\geq V(\pi_2)\). A policy \(\pi\) is optimal if

\[\begin{equation*} V(\pi)\geq V(\tilde{\pi})\qquad\text{for all policies $\tilde{\pi}$.} \end{equation*}\]

With this weaker notion of optimality we can no longer see in which state a non-optimal policy has to be improved. But at least we are able to compare policies again, which in turn allows us to decide whether a policy can be improved further (or has been improved by some manipulation).

Deprecating Discounting#

A surprising consequence of weakening the notion of optimality is that the discounting factor \(\gamma\) used in the return definition for continuing tasks does not have any influence on the ordering of policies: The Bellman equations for state values yield

\[\begin{align*} V(\pi) &=\sum_{s\in\mathcal{S}}\mu_\pi(s)\,v_\pi(s)\\ &=\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\bigl(r+\gamma\,v_\pi(s')\bigr), \end{align*}\]

where \(p\) denotes the environment dynamics. With

\[\begin{equation*} r(\pi):=\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,r \end{equation*}\]

and reordering summation this becomes

\[\begin{align*} V(\pi) &=r(\pi)+\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\gamma\,v_\pi(s')\\ &=r(\pi)+\gamma\,\sum_{s'\in\mathcal{S}}v_\pi(s')\,\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{\substack{a\in\mathcal{A}(s)\\r\in\mathcal{R}}}\pi(a,s)\,p(s',r,s,a)\\ &=r(\pi)+\gamma\,\sum_{s'\in\mathcal{S}}v_\pi(s')\,\mu_\pi(s')\\ &=r(\pi)+\gamma\,V(\pi). \end{align*}\]

Isolating \(V(\pi)\) now yields

\[\begin{equation*} V(\pi)=\frac{1}{1-\gamma}\,r(\pi). \end{equation*}\]

The discounting parameter \(\gamma\) is just a scaling factor in average state values. Thus, if some policy \(\pi_1\) is better than \(\pi_2\) for small \(\gamma\) (long-term rewards unimportant), then \(\pi_1\) will also be better than \(\pi_2\) for \(\gamma\) close to 1 (long-term rewards important). Although \(\gamma\) no longer matters for the true value functions, for value function estimates the choice of \(\gamma\) may still influence the quality of the estimate in terms of convergence to the true value function.
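
The identity \(V(\pi)=r(\pi)/(1-\gamma)\) can be checked numerically on a small example. The sketch below fixes a policy and works directly with the induced Markov reward process: \(v_\pi\) is obtained from the Bellman equations, \(\mu_\pi\) as the stationary distribution, and both sides of the identity coincide. Transition probabilities and rewards are made up for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.3],          # P[s, s']: transition probabilities under the fixed policy
              [0.4, 0.6]])
r_exp = np.array([1.0, -0.5])      # expected immediate reward per state

# State values from the Bellman equations: v = r_exp + gamma * P v
v = np.linalg.solve(np.eye(2) - gamma * P, r_exp)

# Stationary distribution mu_pi of the induced Markov chain (mu P = mu)
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu /= mu.sum()

V = mu @ v                         # average state value
r_pi = mu @ r_exp                  # average reward

print(V, r_pi / (1 - gamma))       # both print the same number
```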

Differential Return#

The quantity

\[\begin{equation*} r(\pi):=\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,r \end{equation*}\]

in the above derivation is known as the average reward of the policy \(\pi\). It's the reward we can expect per time step in the long run when following \(\pi\).

Since the discounting factor \(\gamma\) does not influence the ordering of policies given by average state values \(V(\pi)\), we may use \(r(\pi)\) to define optimality in the case of approximate value functions: a policy is optimal if it maximizes the average reward, and \(\pi_1\) is at least as good as \(\pi_2\) if \(r(\pi_1)\geq r(\pi_2)\). With this, \(\gamma\) has vanished from the definition of optimality without changing its meaning.

The question is whether we can get rid of \(\gamma\) in the definition of returns, too. Remember that we introduced discounting to avoid infinite returns in continuing tasks. So we have to find an alternative way to avoid infinite returns. Indeed, instead of summing rewards directly we may sum differences between rewards and the average reward: given a trace \(s_0,a_0,r_1,s_1,a_1,r_2,\ldots\) we look at

\[\begin{equation*} G(\text{trace}):=r_1-r(\pi)+r_2-r(\pi)+\ldots. \end{equation*}\]

This quantity is known as the differential return. Each summand \(r_t-r(\pi)\) is close to zero in the sense that its expectation (or its average over many traces) is zero. Thus, there is a good chance (but no guarantee!) that \(G(\text{trace})<\infty\).
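
A tiny numerical illustration, assuming an alternating reward sequence \(1,0,1,0,\ldots\) with average reward \(r(\pi)=0.5\): the plain (undiscounted) sum of rewards grows without bound, while the partial sums of the differential return stay bounded.

```python
rewards = [1.0, 0.0] * 10      # made-up reward sequence with average reward 0.5
r_pi = 0.5

plain = differential = 0.0
for r in rewards:
    plain += r                 # ordinary undiscounted return: grows without bound
    differential += r - r_pi   # differential return: stays bounded here
print(plain, differential)     # 10.0 vs. 0.0
```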

The definition of value functions does not change. The value \(v_\pi(s)\) of a state \(s\) still is the expected return if starting in \(s\) and following \(\pi\). The value \(q_\pi(s,a)\) of a state-action pair \((s,a)\) still is the expected return if starting in \(s\) with action \(a\) and then following \(\pi\). But instead of discounted return we now use differential return.

The state value \(v_\pi(s)\) may be regarded as the difference between the expected return obtained from starting at \(s\) and the policy's average expected return \(V(\pi)\). Analogously, \(q_\pi(s,a)\) may be regarded as the difference between the expected return obtained from taking action \(a\) in start state \(s\) and the policy's average expected return \(V(\pi)\).

Discounted or Differential Return?#

We now have two settings for computing returns in continuing tasks: discounted return and differential return. The question of which one to prefer has no clear answer. Both variants are used in practice and no final decision has been made by the reinforcement learning community up to now.

We introduced differential return as a consequence of weakening the notion of optimality for policies, which in turn originated from considering approximate value functions. Although the discounted return setting leads to some trouble with the theory (possible non-representability of the optimal value function), algorithms based on discounting still work with approximate value functions. Of course, together with optimal policies we also lost the policy improvement theorem, which had been the motivation for all our algorithms. But for the differential return setting we do not have a policy improvement theorem either. We have to live with the fact that all algorithms described below for approximate value functions lack theoretical justification. We simply hope that the idea behind the policy improvement theorem also works sufficiently well for approximate value functions.

In practice, differential return does not require us to choose a discounting parameter. Having fewer parameters to choose is generally a very good thing. But in turn, with differential returns all past exploration data has identical influence on the agent's behavior. In the discounted setting the agent forgets the long past and adapts to changing environments more rapidly (depending on the choice of \(\gamma\)). From the theoretical point of view there is no difference between forgetting and not forgetting the past: a policy either is (weakly) optimal in both settings or in neither. But in practice, when expected returns are in fact estimates obtained from few observations, discounting may yield faster learning progress in some situations.

SARSA with Discounted Return#

Now we are ready to formulate concrete algorithms. Here we focus on on-policy methods (SARSA). Off-policy methods (Q-learning) will be discussed later on.

  1. Initialization

    1. Choose \(\gamma\in[0,1)\) (discounting), \(\alpha>0\) (step size for gradient descent) and \(\varepsilon\in(0,1)\) (for \(\varepsilon\)-greedy policy).

    2. Choose initial weights \(w\) for \(Q_w\) (at random, for instance).

    3. Choose initial state \(s\) and initial action \(a\) (at random, for instance).

  2. Step

    1. Take action \(a\) and observe reward \(r\) and new state \(s'\).

    2. Choose next action \(a'\) by \(\varepsilon\)-greedy policy w.r.t. \(Q_w\).

  3. Update

    1. Replace weights \(w\) by

      \[\begin{equation*} w+\alpha\,\bigl(r+\gamma\,Q_w(s',a')-Q_w(s,a)\bigr)\,\nabla_w\,Q_w(s,a). \end{equation*}\]
    2. Replace \(s\) by \(s'\) and \(a\) by \(a'\).

    3. Go to 2.1.

In step 3.1 the term \(r+\gamma\,Q_w(s',a')\) is the new return estimate. Cf. Policy Evaluation for the gradient of the loss function.

Note that there’s no stopping criterion, because in case of continuing tasks the algorithm shall run forever. If the task is episodic, the algorithm stops if an end state has been reached (no more action to choose from). The last weight update then uses \(r\) instead of \(r+\gamma\,Q_w(s',a')\), because return equals reward if reaching an end state. For each new episode the algorithm is rerun, but without steps 1.1 and 1.2.

SARSA with Differential Return#

Computing the differential return requires knowledge of the average reward \(r(\pi)\), which in turn requires knowledge of the environment dynamics. If we do not have such knowledge (the common situation), we have to estimate \(r(\pi)\) from data collected by the agent. If \(r_1,r_2,\ldots\) is the sequence of observed rewards, a simple estimate \(\overline{r}_t\) of \(r(\pi)\) after time step \(t\) is given by

\[\begin{equation*} \overline{r}_1:=r_1,\qquad\overline{r}_t:=\overline{r}_{t-1}+\beta\,(r_t-\overline{r}_{t-1})\quad\text{for $t>1$}, \end{equation*}\]

where \(\beta>0\) controls how fast the estimate adapts to the most recent rewards. Note that this is not exactly an estimate of \(r(\pi)\), because very old rewards have less influence on the estimate (cf. discounted return). But this estimate allows for better adaptation to changing environment dynamics in non-stationary tasks.

To simplify algorithms one may use \(\overline{r}_0:=0\) instead of \(\overline{r}_1:=r_1\) for initialization.
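
As a tiny sketch, the running estimate might be implemented like this (the reward values and \(\beta\) are made up):

```python
def update_average_reward(r_bar, r, beta=0.05):
    """One step of the running average-reward estimate with step size beta."""
    return r_bar + beta * (r - r_bar)

r_bar = 0.0                          # simplified initialization: r_bar_0 := 0
for r in [1.0, -1.0, 0.5, 0.5]:      # some observed rewards
    r_bar = update_average_reward(r_bar, r)
print(r_bar)
```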

Initialization and step are identical for SARSA with discounted return and SARSA with differential return (up to the additional initialization of \(\beta\) and \(\overline{r}\) and the absence of \(\gamma\)). Only the weight update differs, and for differential returns we additionally have to update the estimate of the average reward in each step.

  1. Initialization

    1. Choose \(\alpha>0\) (step size for gradient descent), \(\beta>0\) (step size for average reward update) and \(\varepsilon\in(0,1)\) (for \(\varepsilon\)-greedy policy).

    2. Choose initial weights \(w\) for \(Q_w\) (at random, for instance) and set \(\overline{r}:=0\).

    3. Choose initial state \(s\) and initial action \(a\) (at random, for instance).

  2. Step

    1. Take action \(a\) and observe reward \(r\) and new state \(s'\).

    2. Choose next action \(a'\) by \(\varepsilon\)-greedy policy w.r.t. \(Q_w\).

  3. Update

    1. Replace \(\overline{r}\) by \(\overline{r}+\beta\,(r-\overline{r})\).

    2. Replace weights \(w\) by

      \[\begin{equation*} w+\alpha\,\bigl(r-\overline{r}+Q_w(s',a')-Q_w(s,a)\bigr)\,\nabla_w\,Q_w(s,a). \end{equation*}\]
    3. Replace \(s\) by \(s'\) and \(a\) by \(a'\).

    4. Go to 2.1.
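
A corresponding sketch of the loop, reusing features, env_step, eps_greedy, alpha and eps from the discounted-return sketch above (all of them illustrative assumptions), could look like this; compared to the discounted variant only the average-reward update and the weight update change:

```python
# 1. Initialization (no gamma, but a step size beta and the average reward estimate r_bar)
beta = 0.01
r_bar = 0.0
w = np.zeros(N_STATES * N_ACTIONS)
s, a = np.random.randint(N_STATES), np.random.randint(N_ACTIONS)

for _ in range(10_000):
    # 2. Step
    r, s_next = env_step(s, a)                                        # 2.1
    a_next = eps_greedy(w, s_next, eps)                               # 2.2
    # 3. Update
    r_bar += beta * (r - r_bar)                                       # 3.1 average reward estimate
    delta = r - r_bar + w @ features(s_next, a_next) - w @ features(s, a)
    w += alpha * delta * features(s, a)                               # 3.2 weight update
    s, a = s_next, a_next                                             # 3.3
```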