Temporal Difference Learning (TD Learning)#

With Monte Carlo methods we have to wait until the end of the first episode before the agent learns anything. Especially for tasks with very long episodes it would be preferable if the agent could already learn something during the episode. Take a grid world with walls, where the agent obtains a positive reward for reaching a destination cell and a negative reward for hitting a wall. Learning about the location of the destination cell requires at least one full episode. But learning about the location of the walls (and how to avoid hitting them) should be possible during an episode. Thus, the agent shouldn’t hit the same wall more than once or twice. This is possible with temporal difference learning.

The overall setting is as usual: alternate policy evaluation and policy improvement. The difference is in the policy evaluation step, that is, in estimating the action value function \(q_\pi\).

Policy Evaluation#

If we have a trace \(s_0,a_0,r_1,s_1,a_1,r_2,s_2,\ldots\), then immediately after observing the reward \(r_{t+1}\) and the new state \(s_{t+1}\) and choosing the next action \(a_{t+1}\), the estimate \(Q_\pi(s_t,a_t)\) for \(q_\pi(s_t,a_t)\) is updated via

\[\begin{equation*} Q_\pi(s_t,a_t):=Q_\pi(s_t,a_t)+\alpha\,\bigl(r_{t+1}+\gamma\,Q_\pi(s_{t+1},a_{t+1})-Q_\pi(s_t,a_t)\bigr). \end{equation*}\]

Here \(\alpha\in(0,1]\) is a step size and \(r_{t+1}+\gamma\,Q_\pi(s_{t+1},a_{t+1})\) is an estimate of the action value \(q_\pi(s_t,a_t)\), that is, of the expected return following \((s_t,a_t)\).

This value function update works for both episodic and continuing tasks. One can show that, for a fixed policy \(\pi\), the estimate \(Q_\pi\) converges to \(q_\pi\) if the step size \(\alpha\) is decreased appropriately over time and each state-action pair is visited infinitely often.

Note that for TD learning each end state in an episodic task has to allow at least one (dummy) action; otherwise there is no \(Q\)-value to bootstrap from when updating the state-action pair leading to the end state. But this is more or less an implementation detail only.
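
As a concrete sketch, the update can be written down for a tabular \(Q\) estimate. The environment interface used below (`env.reset()`, `env.step(state, action)` returning `(reward, next_state, done)`, attributes `n_states` and `n_actions`) and the `policy` callable are assumptions made for illustration only:

```python
import numpy as np

def td0_policy_evaluation(env, policy, n_steps, gamma=1.0, alpha=0.1):
    """Tabular TD(0) estimation of q_pi for a fixed policy.

    Hypothetical interface: `env.reset()` returns a start state,
    `env.step(state, action)` returns `(reward, next_state, done)`,
    `policy(state)` returns an action; states and actions are small integers.
    """
    Q = np.zeros((env.n_states, env.n_actions))  # rows of end states are never updated
    s = env.reset()
    a = policy(s)
    for _ in range(n_steps):
        r, s_next, done = env.step(s, a)
        if done:
            # the end state has value zero, so the target reduces to r
            Q[s, a] += alpha * (r - Q[s, a])
            s = env.reset()
            a = policy(s)
        else:
            a_next = policy(s_next)
            # TD(0) update with target r + gamma * Q(s', a')
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q
```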

In TD learning the rewards flow through the set of state-action pairs in the reverse direction of the traces. Take a 1-by-5 grid world, for instance. The agent always starts at the leftmost position and the goal is at the rightmost position. The policy is to always go to the right and never to the left. The reward for reaching the goal is 1; all other rewards are zero. The table below shows the development of the state-action values for \(\gamma=1\) and \(\alpha=0.5\). Note that the agent never collects any information about what happens if it goes to the left, so we have no estimates for the corresponding state-action pairs.

Fig. 79: 1-by-5 grid world. The agent always moves from the left to the right.

| episode | new state | \(Q_\pi(1,\mathrm{R})\) | \(Q_\pi(2,\mathrm{R})\) | \(Q_\pi(3,\mathrm{R})\) | \(Q_\pi(4,\mathrm{R})\) | \(Q_\pi(5,\mathrm{R})\) |
|---|---|---|---|---|---|---|
| - | - | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5 | 0 | 0 | 0 | 0.5 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0.5 | 0 |
| 2 | 3 | 0 | 0 | 0 | 0.5 | 0 |
| 2 | 4 | 0 | 0 | 0.25 | 0.5 | 0 |
| 2 | 5 | 0 | 0 | 0.25 | 0.75 | 0 |
| 3 | 2 | 0 | 0 | 0.25 | 0.75 | 0 |
| 3 | 3 | 0 | 0.125 | 0.25 | 0.75 | 0 |
| 3 | 4 | 0 | 0.125 | 0.5 | 0.75 | 0 |
| 3 | 5 | 0 | 0.125 | 0.5 | 0.875 | 0 |
| 4 | 2 | 0.0625 | 0.125 | 0.5 | 0.875 | 0 |

For TD learning to work, the estimate \(Q_\pi\) has to be zero for all end states (it is initialized to zero there and never updated), so that the target for the last transition of an episode reduces to the reward \(r\).
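
The table can be recomputed with a few lines of code. The following minimal script hard-codes the 1-by-5 world and the always-right policy; printing after every step yields the rows of the table above:

```python
# 1-by-5 grid world: states 1..5, action "go right", reward 1 for reaching
# state 5, gamma = 1, alpha = 0.5. Everything is hard-coded for illustration.
alpha, gamma = 0.5, 1.0
Q = {s: 0.0 for s in range(1, 6)}  # Q(s, R); state 5 is the end state and stays 0

for episode in range(1, 5):
    s = 1
    while s < 5:
        s_next = s + 1
        r = 1.0 if s_next == 5 else 0.0
        Q[s] += alpha * (r + gamma * Q[s_next] - Q[s])
        s = s_next
        print(episode, s, [Q[i] for i in range(1, 6)])
```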

SARSA (On-Policy TD Learning)#

The SARSA algorithm alternates policy evaluation and policy improvement, but improves the policy after each step of the above policy evaluation iteration instead of waiting for the value function to converge. To ensure sufficient exploration an \(\varepsilon\)-greedy policy based on the current value function estimate is used. The name SARSA reflects the fact that the algorithm requires the quintuple \(s_t,a_t,r_{t+1},s_{t+1},a_{t+1}\) for the value function update.

  1. Choose \(\varepsilon>0\) and \(\alpha\in(0,1]\).

  2. Choose initial \(\pi\) (uniformly random, for instance) and initial \(Q_\pi\) (all zero).

  3. Run episodes (for continuing tasks there’s only one episode):

    1. Choose initial state \(s\) and initial action \(a\).

    2. For each step do:

      1. Take action \(a\) and observe reward \(r\) and new state \(s'\).

      2. Choose \(a'\) for the new state \(s'\) according to \(\pi\).

      3. Replace \(Q_\pi(s,a)\) by \(Q_\pi(s,a)+\alpha\,(r+\gamma\,Q_\pi(s',a')-Q_\pi(s,a))\).

      4. Set \(\pi\) to be \(\varepsilon\)-greedy w.r.t. \(Q_\pi\).

      5. Replace \((s,a)\) by \((s',a')\).

      6. If \(s'\) is an end state, go to next episode.

The exploration rate \(\varepsilon\) should be decreased over time.
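
A tabular SARSA sketch under the same hypothetical environment interface as in the policy evaluation code above (for brevity, \(\varepsilon\) is kept constant here):

```python
import numpy as np

def sarsa(env, n_episodes, gamma=1.0, alpha=0.1, epsilon=0.1, rng=None):
    """Tabular SARSA with an epsilon-greedy policy (hypothetical env interface)."""
    rng = rng or np.random.default_rng()
    Q = np.zeros((env.n_states, env.n_actions))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))  # explore
        return int(np.argmax(Q[s]))                  # exploit

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            r, s_next, done = env.step(s, a)
            a_next = eps_greedy(s_next)              # on-policy: a' comes from pi
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```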

Q-Learning (Off-Policy TD Learning)#

SARSA updates \(Q_\pi(s,a)\) by looking at the value of the next state-action pair \((s',a')\) suggested by the current (non-optimal) policy. Because the policy is not greedy (exploration!), \(a'\) won’t always be the highest rated action. But for the update of \(Q_\pi\) we do not have to stick to the (non-optimal) behavior suggested by \(\pi\). Instead we could simply use the best follow-up action to estimate the current state-action pair’s value:

\[\begin{equation*} Q(s,a):=Q(s,a)+\alpha\,\left(r+\gamma\,\max_{\tilde{a}\in\mathcal{A}(s')}Q(s',\tilde{a})-Q(s,a)\right). \end{equation*}\]

Note that we dropped the subscript \(\pi\) because \(Q\) no longer approximates the action values of the \(\varepsilon\)-greedy policy \(\pi\) used by the agent for choosing the next action. \(Q\) now approximates the action values of the greedy policy w.r.t. \(Q\) from the previous iteration. SARSA with this modified update formula is an off-policy method known as Q-learning. Convergence of \(Q\) to the optimal action value function is typically faster than for on-policy SARSA.
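
In code, only the bootstrap target changes compared to SARSA. A minimal sketch of the modified update, again for the hypothetical tabular setting used above:

```python
import numpy as np

# Q-learning differs from SARSA only in the update target: it bootstraps from
# the greedy action in s' instead of the action a' actually chosen by pi.
def q_learning_update(Q, s, a, r, s_next, done, gamma=1.0, alpha=0.1):
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```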

Double Q-Learning#

Q-learning tends to overestimate action values. SARSA chooses non-greedy actions at least in some steps, so \(Q(s',a')\) in the target \(r+\gamma\,Q(s',a')\) is sometimes below and sometimes above the true value. Q-learning, however, always bootstraps with the highest estimated (!) \(Q\) value, and the maximum of several noisy estimates is biased upwards. Thus, the estimates will be greater than the true action values on average.
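
The effect can be illustrated with a small numerical experiment (purely illustrative and independent of the grid world above): assume all true action values in \(s'\) are zero and we only have noisy estimates of them; the maximum of these estimates is then positive on average:

```python
import numpy as np

# 100000 repetitions of: 10 actions with true value 0, each estimated with
# standard normal noise. The mean of the per-repetition maxima is clearly
# positive (about 1.5 for 10 actions), illustrating the maximization bias.
rng = np.random.default_rng(0)
noisy_estimates = rng.normal(loc=0.0, scale=1.0, size=(100000, 10))
print(noisy_estimates.max(axis=1).mean())
```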

To overcome this problem we use two estimates \(Q_1\) and \(Q_2\). One of them is used to choose the best action; the other is then used to determine the value of the chosen action. Alternating both roles decouples action selection from action evaluation and effectively averages two estimates of the optimal action value. Thus, there is a fair chance that one estimate lies below and the other above the true value.

Double Q-learning looks like Q-learning, but in each step we randomly choose one of two possible updates:

\[\begin{equation*} Q_1(s,a):=Q_1(s,a)+\alpha\,\left(r+\gamma\,Q_1\left(s',\underset{\tilde{a}\in\mathcal{A}(s')}{\mathrm{argmax}}\,Q_2(s',\tilde{a})\right)-Q_1(s,a)\right) \end{equation*}\]

or

\[\begin{equation*} Q_2(s,a):=Q_2(s,a)+\alpha\,\left(r+\gamma\,Q_2\left(s',\underset{\tilde{a}\in\mathcal{A}(s')}{\mathrm{argmax}}\,Q_1(s',\tilde{a})\right)-Q_2(s,a)\right). \end{equation*}\]

The \(\varepsilon\)-greedy policy is then based on the average of \(Q_1\) and \(Q_2\).
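
A sketch of a single double Q-learning step matching the two update rules above (tabular arrays for \(Q_1\) and \(Q_2\), same hypothetical interface as before):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, done, gamma=1.0, alpha=0.1, rng=None):
    """One double Q-learning step on tabular estimates Q1 and Q2."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        Q_upd, Q_sel = Q1, Q2   # update Q1, let Q2 select the follow-up action
    else:
        Q_upd, Q_sel = Q2, Q1   # update Q2, let Q1 select the follow-up action
    if done:
        target = r
    else:
        a_star = int(np.argmax(Q_sel[s_next]))       # the other estimate chooses ...
        target = r + gamma * Q_upd[s_next, a_star]   # ... the updated one evaluates
    Q_upd[s, a] += alpha * (target - Q_upd[s, a])
```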

Note that the idea of double learning also works for SARSA, but SARSA does not tend to overestimate action values as much as Q-learning does.