Policy Gradient Methods#

All methods for reinforcement learning considered so far were based on value functions: they tried to find good estimates for value functions and then used the (\(\varepsilon\)-)greedy policy with respect to such an estimate. Now we take an introductory look at methods optimizing policies more directly.

In going from tabular methods to approximate value function methods like deep Q-learning we lost the important policy improvement theorem, which is the theoretical foundation of the tabular methods. Thus, for approximate methods we cannot be sure that they really work in all cases. For policy gradient methods the theoretical justification will be slightly better. Apart from this advantage, practice shows that for some tasks value function methods yield better results and for other tasks policy gradient methods work better; there is no general rule telling us which class to use.

The Idea#

Remember that a policy \(\pi\) is a function mapping a state-action pair \((s,a)\) to the probability that the agent chooses action \(a\) in state \(s\). This function can be represented as a table with \(|\mathcal{S}|\times|\mathcal{A}|\) cells whose values are to be chosen to maximize some quality measure for policies. In case of large state or action spaces we may replace the tabular approach by some kind of parametrized policy. Here we skip the less relevant tabular case and focus on ANN-based policies.

Let \(\pi_\theta\) be a policy represented by an ANN (or some other parametrization) with weight vector \(\theta\). Optimizing \(\pi_\theta\) to solve a given task means choosing good weights \(\theta\). By \(J\) we denote a quality measure mapping policies to real numbers. Good quality measures are all kinds of expected returns. For episodic tasks \(J\) is the mean total reward with respect to all possible traces:

\[\begin{equation*} J(\pi_\theta)=\sum_{\text{traces}}\text{probability of trace}\times\text{total reward of trace}. \end{equation*}\]

For continuing tasks we use a policy’s average reward \(r(\pi_\theta)\) (cf. Policy Improvement):

\[\begin{equation*} J(\pi_\theta)=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,r. \end{equation*}\]
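
For a small, fully known MDP this average reward can be computed directly from the formula. The following sketch does so for a hypothetical two-state, two-action MDP (the dynamics `P`, rewards `R`, and policy `pi` below are made-up illustrations); the state distribution \(\mu_{\pi_\theta}\) is obtained as the stationary distribution of the Markov chain induced by the policy.

```python
# Average reward J(pi) for a hypothetical two-state, two-action MDP.
import numpy as np

# Assumed dynamics: P[s, a, s'] are transition probabilities,
# R[s, a] are expected rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# An arbitrary stochastic policy pi(a, s), one row per state.
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# State-to-state transition matrix induced by pi.
P_pi = np.einsum('sa,sat->st', pi, P)

# mu_pi: stationary distribution, the left eigenvector of P_pi for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

# J(pi) = sum_s mu(s) sum_a pi(a, s) sum_{s', r} p(s', r, s, a) r.
J = float(mu @ np.sum(pi * R, axis=1))
print(J)
```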

The common idea of all kinds of policy gradient methods now is to maximize \(J(\pi_\theta)\) with respect to \(\theta\) via gradient ascent

\[\begin{equation*} \theta_{k+1}:=\theta_k+\alpha\,\nabla_\theta J(\pi_\theta), \end{equation*}\]

where \(\alpha>0\) is the step length. Thus, the major step in deriving concrete algorithms is the computation of the gradient \(\nabla_\theta J(\pi_\theta)\).
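
As a minimal illustration of the update rule itself (independent of how the gradient is obtained), here is a generic gradient ascent loop; the quadratic quality measure at the end is a made-up stand-in for \(J\).

```python
# Generic gradient ascent: theta <- theta + alpha * grad_J(theta).
import numpy as np

def gradient_ascent(theta, grad_J, alpha=0.1, steps=200):
    """Repeat the update rule with a fixed step length alpha."""
    for _ in range(steps):
        theta = theta + alpha * grad_J(theta)
    return theta

# Toy stand-in for J: J(theta) = -||theta||^2 with gradient -2 * theta,
# maximized at theta = 0.
theta = gradient_ascent(np.array([1.0, -2.0]), lambda th: -2.0 * th)
print(theta)  # close to [0, 0]
```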

Derivation of the Gradient#

We derive the gradient \(\nabla_\theta J(\pi_\theta)\) only for continuing tasks here. For episodic tasks the result differs only by a factor related to the average episode length. This factor is of little importance, because gradients get multiplied by the step length anyway.

For the derivation of the gradient we drop the subscript \(\theta\) to keep formulas simple. So we write \(\nabla\) instead of \(\nabla_\theta\), \(\pi\) instead of \(\pi_\theta\) and so on. We cannot go the straightforward way for deriving the gradient, because this would introduce terms \(\nabla\mu_\pi\) we cannot simplify further. Instead we use a little trick: we compute the gradient of some other quantity and then extract \(\nabla J\) from the corresponding formula. For the gradient of the differential-return-based state value function \(v_\pi\) at a fixed but arbitrary state \(s\) we have

\[\begin{align*} \nabla v_\pi(s) &=\nabla\left(\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,q_\pi(s,a)\right)\\ &=\sum_{a\in\mathcal{A}(s)}\bigl(\nabla\pi(a,s)\,q_\pi(s,a)+\pi(a,s)\,\nabla q_\pi(s,a)\bigr) \end{align*}\]

by the product rule. With

\[\begin{align*} \nabla q_\pi(s,a) &=\nabla\left(\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\bigl(r-J(\pi)+v_\pi(s')\bigr)\right)\\ &=\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\bigl(-\nabla J(\pi)+\nabla v_\pi(s')\bigr)\\ &=-\nabla J(\pi)+\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\nabla v_\pi(s') \end{align*}\]

we obtain

\[\begin{align*} \nabla v_\pi(s) &=\sum_{a\in\mathcal{A}(s)}\left(\nabla\pi(a,s)\,q_\pi(s,a)+\pi(a,s)\,\left(-\nabla J(\pi)+\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\nabla v_\pi(s')\right)\right)\\ &=-\nabla J(\pi)+\sum_{a\in\mathcal{A}(s)}\nabla\pi(a,s)\,q_\pi(s,a)+\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\nabla v_\pi(s'). \end{align*}\]

Consequently,

\[\begin{equation*} \nabla J(\pi)=-\nabla v_\pi(s)+\sum_{a\in\mathcal{A}(s)}\nabla\pi(a,s)\,q_\pi(s,a)+\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\nabla v_\pi(s') \end{equation*}\]

for all \(s\). Now we multiply this equation by \(\mu_\pi(s)\) and sum over all \(s\) (remember that the sum of all \(\mu_\pi(s)\) equals 1):

\[\begin{align*} \nabla J(\pi) &=-\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\nabla v_\pi(s)+\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\nabla\pi(a,s)\,q_\pi(s,a)\\ &\qquad\qquad+\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\nabla v_\pi(s'). \end{align*}\]

Together with

\[\begin{align*} \sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{\substack{s'\in\mathcal{S}\\r\in\mathcal{R}}}p(s',r,s,a)\,\nabla v_\pi(s')&\\ &\hspace{-6cm}=\sum_{s'\in\mathcal{S}}\nabla v_\pi(s')\,\sum_{s\in\mathcal{S}}\mu_\pi(s)\,\sum_{\substack{a\in\mathcal{A}(s)\\r\in\mathcal{R}}}\pi(a,s)\,p(s',r,s,a)\\ &\hspace{-6cm}=\sum_{s'\in\mathcal{S}}\nabla v_\pi(s')\,\mu_\pi(s') \end{align*}\]

we arrive at

\[\begin{equation*} \nabla_\theta J(\pi_\theta)=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}\nabla_\theta\pi_\theta(a,s)\,q_{\pi_\theta}(s,a). \end{equation*}\]

Here we see that the computation of \(\nabla_\theta J\) reduces to the computation of the policy’s gradient; hence the name ‘policy gradient methods’.

The other two quantities involved are the distribution \(\mu_{\pi_\theta}\) of visits to states, which will be automatically determined when computing estimates of \(\nabla_\theta J\) from an agent’s observations, and the action value function \(q_{\pi_\theta}\), which will be estimated from observations, too.
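
For a small, fully known MDP the formula can be checked numerically. The sketch below (again a hypothetical two-state, two-action MDP, now with a tabular softmax policy whose logits play the role of \(\theta\); all names and numbers are illustrative assumptions) computes \(\mu_{\pi_\theta}\), the differential value functions, and the right-hand side of the formula, and compares one component against a finite-difference approximation of \(\nabla_\theta J\).

```python
# Numerical check of the policy gradient formula (average reward setting)
# on a hypothetical two-state, two-action MDP with a tabular softmax policy.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # expected reward R[s, a]
              [0.0, 2.0]])

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)          # pi[s, a]

def stationary_distribution(P_pi):
    eigvals, eigvecs = np.linalg.eig(P_pi.T)
    mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return mu / mu.sum()

def average_reward(theta):
    pi = policy(theta)
    mu = stationary_distribution(np.einsum('sa,sat->st', pi, P))
    return mu @ np.sum(pi * R, axis=1)

def policy_gradient(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    mu = stationary_distribution(P_pi)
    r_pi = np.sum(pi * R, axis=1)
    J = mu @ r_pi
    # Differential state values: (I - P_pi) v = r_pi - J, determined up to a
    # constant; the constant cancels in the gradient formula.
    A = np.vstack([np.eye(2) - P_pi, np.ones(2)])
    v = np.linalg.lstsq(A, np.append(r_pi - J, 0.0), rcond=None)[0]
    q = R - J + P @ v                                 # q[s, a]
    # For softmax logits theta[s, b]:
    #   sum_a (d pi(a, s) / d theta[s, b]) q(s, a)
    #     = pi(b, s) * (q(s, b) - sum_a pi(a, s) q(s, a)).
    return mu[:, None] * pi * (q - np.sum(pi * q, axis=1, keepdims=True))

theta = np.array([[0.3, -0.2], [0.1, 0.4]])
grad = policy_gradient(theta)

# Finite-difference check of the component theta[0, 1].
eps = 1e-6
theta_eps = theta.copy()
theta_eps[0, 1] += eps
print(grad[0, 1], (average_reward(theta_eps) - average_reward(theta)) / eps)
```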

REINFORCE#

REINFORCE uses the idea of Monte Carlo Methods (sample empirical return many times) to estimate the gradient \(\nabla_\theta J(\pi_\theta)\) and is, thus, restricted to episodic tasks.

Assuming \(\pi_\theta(a,s)>0\) for all \(a\) and all \(s\), we may rewrite \(\nabla_\theta J(\pi_\theta)\) as follows:

\[\begin{align*} \nabla_\theta J(\pi_\theta) &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}q_{\pi_\theta}(s,a)\,\nabla_\theta\pi_\theta(a,s)\\ &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\,q_{\pi_\theta}(s,a)\,\frac{\nabla_\theta\pi_\theta(a,s)}{\pi_\theta(a,s)}\\ &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\,q_{\pi_\theta}(s,a)\,\nabla_\theta\ln\pi_\theta(a,s). \end{align*}\]

This is the expected value of \(q_{\pi_\theta}(s,a)\,\nabla_\theta\ln\pi_\theta(a,s)\) with respect to all state-action pairs \((s,a)\) visited under \(\pi_\theta\). So estimating \(\nabla_\theta J(\pi_\theta)\) reduces to observing some traces by following \(\pi_\theta\), computing an estimate for \(q_{\pi_\theta}\) (average observed return for \((s,a)\)), and computing an estimate for \(\nabla_\theta\ln\pi_\theta(a,s)\) (average over all visits to \((s,a)\)).
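
The factor \(\nabla_\theta\ln\pi_\theta(a,s)\) is particularly simple for a tabular softmax policy. The following sketch (the softmax-over-logits parametrization is an assumption for illustration) computes this gradient with respect to the logits of the visited state; the entries belonging to other states vanish and are omitted.

```python
# Score function grad_theta ln pi_theta(a, s) for a tabular softmax policy
# with logits theta[s, a] (assumed parametrization).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_policy(theta, s, a):
    """Gradient of ln pi_theta(a, s) w.r.t. the logits theta[s, :].

    For softmax logits it equals onehot(a) - pi(., s).
    """
    grad = -softmax(theta[s])
    grad[a] += 1.0
    return grad

theta = np.zeros((2, 3))                 # 2 states, 3 actions
print(grad_log_policy(theta, s=0, a=2))  # [-1/3, -1/3, 2/3]
```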

Here is the full algorithm:

  1. Choose step length \(\alpha>0\).

  2. Choose initial weights \(\theta\) (at random, for instance).

  3. For each episode (following \(\pi_\theta\)) with trace \(s_0,a_0,r_1,s_1,\ldots,r_T,s_T\) do:

    1. For each \(t=0,1,\ldots, T-1\) do:

      1. Update \(\theta\) to \(\theta+\alpha\,\left(\sum\limits_{k=t+1}^T r_k\right)\nabla_\theta\ln\pi_\theta(a_t,s_t)\).

Here we compute the gradient after each step, that is, ‘estimating’ reduces to one sample only.

Note that the assumption \(\pi_\theta(a,s)>0\) is always satisfied for visited pairs \((s,a)\); otherwise the pair would not have been visited.
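
A compact sketch of the algorithm follows. The environment (a short corridor with states 0 to 3, where action 1 moves right, action 0 moves left, and each step costs reward \(-1\) until state 3 is reached) and the softmax-over-logits policy are made-up illustrations; only the update in the inner loop comes from the algorithm above.

```python
# REINFORCE with a tabular softmax policy on a hypothetical corridor task.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(s, a):
    """Assumed environment: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 0.0 if s_next == GOAL else -1.0
    return s_next, reward, s_next == GOAL

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce(alpha=0.1, episodes=1000):
    theta = np.zeros((N_STATES, N_ACTIONS))            # policy logits
    for _ in range(episodes):
        # 3. Generate one episode following pi_theta.
        s, trace, done = 0, [], False
        while not done:
            a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
            s_next, r, done = step(s, a)
            trace.append((s, a, r))
            s = s_next
        # 3.1.1 Update theta for every step of the episode.
        for t, (s_t, a_t, _) in enumerate(trace):
            G = sum(r for (_, _, r) in trace[t:])      # sum_{k=t+1}^T r_k
            grad_log = -softmax(theta[s_t])
            grad_log[a_t] += 1.0                       # grad of ln pi(a_t, s_t)
            theta[s_t] += alpha * G * grad_log
    return theta

theta = reinforce()
print(softmax(theta[0]))  # moving right should come to dominate in state 0
```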

Actor-Critic Methods#

Actor-critic methods use the idea of Temporal Difference Learning (TD Learning) (look one step ahead for value function estimates) to estimate the gradient \(\nabla_\theta J(\pi_\theta)\) and are, thus, suitable for episodic and continuing tasks. Here we restrict our attention to a SARSA-like actor-critic method for continuing tasks with differential return.

In addition to \(\pi_\theta\), a parametrized state value function estimate \(V_w\) with weights \(w\) will be computed. The policy is the actor choosing actions. The value function is the critic telling us whether the policy does a good job and how much (but not in which way) the policy should be improved. In contrast to REINFORCE, \(q_{\pi_\theta}\) is not estimated from one trace, but from all observations collected so far. The full algorithm is as follows:

  1. Initialization

    1. Choose \(\alpha_1>0\) (step length for updating \(V_w\)), \(\alpha_2>0\) (step length for updating \(\pi_\theta\)), and \(\beta\in(0,1]\) (step length for updating the average reward).

    2. Choose initial weights \(w\) and \(\theta\) (at random, for instance).

    3. Choose initial average reward \(\overline{r}\) (zero, for instance).

    4. Choose initial state \(s\).

  2. Step

    1. Take action \(a\) according to \(\pi_\theta\).

    2. Observe reward \(r\) and next state \(s'\).

  3. Update

    1. Replace \(\overline{r}\) by \(\overline{r}+\beta\,(r-\overline{r})\).

    2. Replace \(w\) by \(w+\alpha_1\,\bigl(r-\overline{r}+V_w(s')-V_w(s)\bigr)\nabla_w V_w(s)\).

    3. Replace \(\theta\) by \(\theta+\alpha_2\,\bigl(r-\overline{r}+V_w(s')\bigr)\nabla_\theta\ln\pi_\theta(a,s)\).

    4. Replace \(s\) by \(s'\).

    5. Go to 2.1.

Learning progress of actor-critic methods tends to be more stable, because the estimates for \(q_{\pi_\theta}\) are more stable than with REINFORCE. But computational costs are higher due to the additional ANN representing the value function estimate \(V_w\).
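
The following sketch implements the steps above with a tabular softmax policy and a tabular value estimate \(V_w\) on a hypothetical two-state continuing MDP (dynamics, rewards, and step lengths are illustrative assumptions). The actor update is the one from step 3.3, without the baseline discussed in the next section.

```python
# SARSA-like actor-critic for a continuing task (differential return),
# with tabular softmax policy pi_theta and tabular value estimate V_w.
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (assumed dynamics)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # expected reward R[s, a]
              [0.0, 2.0]])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic(steps=50_000, alpha_1=0.05, alpha_2=0.01, beta=0.01):
    n_states, n_actions = R.shape
    w = np.zeros(n_states)                   # tabular V_w
    theta = np.zeros((n_states, n_actions))  # policy logits
    r_bar, s = 0.0, 0                        # steps 1.3 and 1.4
    for _ in range(steps):
        a = rng.choice(n_actions, p=softmax(theta[s]))          # 2.1
        s_next = rng.choice(n_states, p=P[s, a])                # 2.2
        r = R[s, a]
        r_bar += beta * (r - r_bar)                             # 3.1
        w[s] += alpha_1 * (r - r_bar + w[s_next] - w[s])        # 3.2 (tabular grad)
        grad_log = -softmax(theta[s])
        grad_log[a] += 1.0                                      # grad of ln pi(a, s)
        theta[s] += alpha_2 * (r - r_bar + w[s_next]) * grad_log  # 3.3
        s = s_next                                              # 3.4
    return theta, w, r_bar

theta, w, r_bar = actor_critic()
print(r_bar, softmax(theta[0]), softmax(theta[1]))
```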

Baselines#

When discussing REINFORCE, we derived the formula

\[\begin{equation*} \nabla_\theta J(\pi_\theta)=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\,q_{\pi_\theta}(s,a)\,\nabla_\theta\ln\pi_\theta(a,s) \end{equation*}\]

for the gradient of the quality measure to be maximized. Here we see that the length of \(\nabla_\theta J(\pi_\theta)\) and, thus, the behavior of gradient ascent depend on the values \(q_{\pi_\theta}(s,a)\). The range of those values is determined by the range of rewards, which can be chosen freely. Stability and convergence of gradient ascent therefore depend heavily on how we model the environment. To overcome potential problems resulting from this degree of freedom we introduce the concept of baselines.

A baseline is any function \(b\) mapping states to real numbers. A baseline is not allowed to depend on the chosen actions in any way. For a baseline \(b\) we have

\[\begin{align*} \sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\,b(s)\,\nabla_\theta\ln\pi_\theta(a,s) &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,b(s)\,\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\,\nabla_\theta\ln\pi_\theta(a,s)\\ &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,b(s)\,\sum_{a\in\mathcal{A}(s)}\nabla_\theta\pi_\theta(a,s)\\ &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,b(s)\,\nabla_\theta\left(\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\right)\\ &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,b(s)\,\nabla_\theta 1\\ &=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,b(s)\cdot 0\\ &=0. \end{align*}\]

Thus, we may introduce a baseline in \(\nabla_\theta J(\pi_\theta)\) without changing the gradient:

\[\begin{equation*} \nabla_\theta J(\pi_\theta)=\sum_{s\in\mathcal{S}}\mu_{\pi_\theta}(s)\,\sum_{a\in\mathcal{A}(s)}\pi_\theta(a,s)\,\bigl(q_{\pi_\theta}(s,a)-b(s)\bigr)\,\nabla_\theta\ln\pi_\theta(a,s). \end{equation*}\]
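
The key property used here, \(\sum_{a}\pi_\theta(a,s)\,\nabla_\theta\ln\pi_\theta(a,s)=0\), is easy to verify numerically, for instance for a softmax policy over a handful of actions (the random logits below are just an illustration):

```python
# Check that the baseline term vanishes: sum_a pi(a, s) grad ln pi(a, s) = 0.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=4)                 # logits theta for one state
pi = np.exp(logits) / np.exp(logits).sum()

# For softmax logits, grad ln pi(a) = onehot(a) - pi; row a of 'scores'.
scores = np.eye(4) - pi
print(pi @ scores)                          # numerically the zero vector
```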

A common choice is

\[\begin{equation*} b(s)=v_{\pi_\theta}(s), \end{equation*}\]

because then in REINFORCE and in the SARSA-based actor-critic method above the estimate for \(\nabla_\theta J(\pi_\theta)\) depends on the difference between new and old value estimate instead of on the value estimate itself.

To use this baseline with REINFORCE, an additional state value function estimate has to be introduced in the algorithm. For the actor-critic method the baseline fits in naturally. The only change is the update of weights \(\theta\), which becomes

\[\begin{equation*} \theta+\alpha_2\,\bigl(r-\overline{r}+V_w(s')-V_w(s)\bigr)\nabla_\theta\ln\pi_\theta(a,s), \end{equation*}\]

showing even stronger parallels to the update of \(w\).
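
To illustrate the REINFORCE variant, here is the corridor sketch from above extended by a tabular state value estimate serving as baseline \(b(s)\) (environment and parametrization remain made-up assumptions); the value estimate is moved toward the observed return, and its difference from the return replaces the plain return in the \(\theta\) update.

```python
# REINFORCE with a learned state-value baseline on the hypothetical
# corridor task (states 0..3, action 1 moves right, reward -1 per step).
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, (0.0 if s_next == GOAL else -1.0), s_next == GOAL

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_with_baseline(alpha=0.1, alpha_v=0.1, episodes=1000):
    theta = np.zeros((N_STATES, N_ACTIONS))   # policy logits
    V = np.zeros(N_STATES)                    # baseline b(s)
    for _ in range(episodes):
        s, trace, done = 0, [], False
        while not done:
            a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
            s_next, r, done = step(s, a)
            trace.append((s, a, r))
            s = s_next
        for t, (s_t, a_t, _) in enumerate(trace):
            G = sum(r for (_, _, r) in trace[t:])   # return from step t
            delta = G - V[s_t]                      # return minus baseline
            V[s_t] += alpha_v * delta               # move baseline toward return
            grad_log = -softmax(theta[s_t])
            grad_log[a_t] += 1.0
            theta[s_t] += alpha * delta * grad_log  # baseline-corrected update
    return theta, V

theta, V = reinforce_with_baseline()
print(softmax(theta[0]), V)
```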