Deep Q-Learning#

Tabular Q-learning works like tabular SARSA but uses the greedy action’s value instead of the chosen action’s value for updating the value function estimate \(Q\). This makes Q-learning an off-policy method, because \(Q\) approximates the value function of the greedy policy, while an \(\varepsilon\)-greedy policy is used for exploration.
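
As a reminder, the tabular update can be sketched in a few lines of Python (a minimal sketch; the state/action indexing and the step size are illustrative assumptions):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: bootstrap with the greedy action in s_next."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy action's value, not the chosen action's
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```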

In principle, Q-learning with approximate value functions can be implemented like SARSA for approximate value functions. The update formula then is

\[\begin{equation*} w+\alpha\,\bigl(r+\gamma\,\max_{a'\in\mathcal{A}(s')}Q_w(s',a')-Q_w(s,a)\bigr)\,\nabla_w\,Q_w(s,a). \end{equation*}\]
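For a linear approximation \(Q_w(s,a)=w^\top\phi(s,a)\) this update could be sketched as follows (a hypothetical illustration; the feature map `phi` and the action set `actions` are assumptions, not part of the text):

```python
import numpy as np

def semi_gradient_q_step(w, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning step for a linear Q_w(s, a) = w @ phi(s, a)."""
    q_sa = w @ phi(s, a)
    # greedy bootstrap value over all actions available in s_next
    q_next = max(w @ phi(s_next, a_next) for a_next in actions)
    td_error = r + gamma * q_next - q_sa
    # for linear Q_w the gradient w.r.t. w is simply phi(s, a)
    return w + alpha * td_error * phi(s, a)
```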

But it turns out that this naive approach does not work satisfactorily. In 2015 DeepMind (now Google DeepMind) suggested a successful Q-learning variant known as deep Q-learning.

Deep Q-learning implements some special features, discussed below in more detail:

  • more stable gradient descent,

  • more efficient ANN structure,

  • less correlated training samples,

  • reuse of training samples.

Two ANNs#

Approximate value function methods solve a supervised learning problem via gradient descent. Training samples consist of state-action pairs (inputs) and estimates for expected returns (outputs). During training, one and the same state-action pair may appear with many different outputs, because the estimates for expected return improve over time. Thus, the labels of the training samples for supervised learning are very noisy, which typically yields poor results in combination with gradient descent. Convergence is often slow and some gradient steps destroy the progress of previous steps. Although this is a problem for SARSA, too, it seems to be much more severe for Q-learning.

The deep Q-learning algorithm tries to stabilize the minimization procedure by using two ANNs, a prediction ANN and a target ANN. Both ANNs represent estimates for expected return. Denote the value function represented by the prediction ANN by \(Q_w\) and the one represented by the target ANN by \(Q_{w'}\).

\(Q_w\) is updated after each step as usual. \(Q_{w'}\) is only updated every \(C\) steps by copying current weights \(w\) of the prediction ANN. Here \(C\) is a parameter of the overall algorithm. The update formula for \(Q_w\) is

\[\begin{equation*} w+\alpha\,\bigl(r+\gamma\,\max_{a'\in\mathcal{A}(s')}{\color{red}Q_{w'}(s',a')}-Q_w(s,a)\bigr)\,\nabla_w\,Q_w(s,a), \end{equation*}\]

that is, \(Q_{w'}\) is used for computing expected return only. In other words, \(Q_{w'}\) determines the outputs of the training samples. Because \(Q_{w'}\) changes only every \(C\) steps, training samples now are less noisy and gradient descent tends to behave more stably.
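
A sketch of one gradient step with the two ANNs, assuming PyTorch networks `prediction_net` and `target_net` of identical structure and minibatches that also carry a terminal flag `done` (all names are illustrative, not prescribed by the algorithm):

```python
import torch

def dqn_update(prediction_net, target_net, optimizer, batch, step, C, gamma=0.99):
    """One deep Q-learning gradient step; the target net supplies the bootstrap values."""
    s, a, r, s_next, done = batch  # tensors of a sampled minibatch
    q_sa = prediction_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Q_{w'} is used only for computing the targets
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    loss = torch.nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # copy w to w' every C steps
    if step % C == 0:
        target_net.load_state_dict(prediction_net.state_dict())
```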

There is no proof that this procedure always works, but experience shows that it greatly improves the performance of gradient descent.

Note that in deep Q-learning the ANN structure is slightly different from the one used for SARSA. Only the state is used as input, and the output layer has one neuron per action. The advantage of this structure is that the values of all actions are obtained with a single forward pass through the ANN, saving computation time.
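
Such a network could look as follows in PyTorch (a minimal sketch; the hidden layer sizes are arbitrary assumptions):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action, so a single forward pass yields all actions."""

    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),   # one output neuron per action
        )

    def forward(self, state):
        return self.layers(state)
```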

Experience Replay#

Especially for reinforcement learning tasks with continuous state space (autonomous driving, for instance), training samples from neighboring time steps look very similar. But for supervised learning we want very different, uncorrelated samples. Otherwise learning progress will be slow.

One solution to this problem is to train several agents in parallel, each in its own environment, but using a common \(Q\) function. Then observations obtained from the agents will be quite different.

Deep Q-learning follows another approach, known as experience replay. Training samples are kept in memory and are reused many times. On the one hand, this solves the correlation problem to some extent, because older samples differ from newer ones. On the other hand, reusing training samples increases the information gained from explored data. Without experience replay each sample would be used for one gradient step only. With experience replay there are many gradient steps per sample. In each gradient step we may now provide a whole batch of samples to the ANN, which significantly increases convergence speed at little additional computational cost.

Training samples are kept in a fixed-size replay buffer. For each time step we have to store the tuple \((s,a,r,s')\). If the replay buffer is full, the oldest training sample is removed. For each gradient step we take a random batch of samples from the replay buffer.
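
A fixed-size replay buffer can be sketched, for instance, with a `deque`, which drops the oldest entry automatically when full (a simplified version; practical implementations usually also store a terminal flag alongside each transition):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions; old samples are discarded when capacity is reached."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Return a random batch of stored transitions for one gradient step."""
        return random.sample(self.buffer, batch_size)
```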