Dynamic Programming#

The term ‘dynamic programming’ is somewhat misleading. Here ‘programming’ is to be read as ‘optimization’, and ‘dynamic’ emphasizes the fact that we won’t solve one large optimization problem but a sequence of smaller ones, yielding a solution to the overall problem.

The general considerations in Bellman Equations and Optimal Policies give rise to two concrete algorithms:

  • Use the Bellman equations and the policy improvement theorem to iteratively improve an initial policy (policy iteration).

  • Solve the optimal Bellman equations to directly obtain an optimal policy (value iteration).

Both algorithms assume that we have complete knowledge of the environment dynamics \(p\), which is rarely the case in practice (except for grid worlds). But these two algorithms will be the starting point for developing data-driven (exploration!) variants of them in subsequent chapters. Almost all reinforcement learning algorithms we consider in this book will look quite similar to policy or value iteration.


Policy Iteration#

Based on the Bellman equations and the policy improvement theorem we may assemble the following algorithm:

  1. Choose a random initial policy.

  2. Solve the corresponding Bellman equations.

  3. Replace the current policy by a greedy policy w.r.t. the solution of step 2.

  4. Go back to step 2 as long as the policy has changed.

Step 2 is referred to as policy evaluation. Step 3 is the policy improvement step. The stopping criterion will be satisfied if and only if the policy is optimal (remember that a policy is optimal if and only if it’s a greedy policy w.r.t. its own value function).

To solve the Bellman equations in step 2 we could assemble the coefficient matrix of the system of linear equations and use a numerical solver. For large state spaces this matrix requires lots of memory, but most entries are zero. Employing sparse matrix representations and specialized solvers may help, but there is also a very simple and sufficiently efficient algorithm for solving Bellman equations based on the concrete structure of the system.
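To make the four steps concrete, here is a minimal NumPy sketch of policy iteration that performs step 2 by the direct linear solve just mentioned (the iterative scheme developed below is an alternative). The representation of the dynamics is an assumption made for illustration only: an array `P` with `P[s, a, s2]` \(=p(s_2\mid s,a)\), an array `R` of expected immediate rewards, and a discount factor `gamma` below one.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration with exact policy evaluation (linear solve).

    Assumed representation (not fixed by the text):
    P[s, a, s2] = p(s2 | s, a), R[s, a] = expected immediate reward,
    0 <= gamma < 1.
    """
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)      # 1. initial deterministic policy
                                                #    (here: always action 0; could be random)
    while True:
        # 2. policy evaluation: solve (I - gamma * P_pi) v = r_pi
        P_pi = P[np.arange(n_states), policy]   # transition matrix under pi
        r_pi = R[np.arange(n_states), policy]   # reward vector under pi
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

        # 3. greedy policy improvement w.r.t. v
        q = R + gamma * np.einsum("sat,t->sa", P, v)
        new_policy = q.argmax(axis=1)

        # 4. stop as soon as the policy no longer changes
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```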

The structure of the Bellman equations for state values is

\[\begin{equation*} \underline{v}_\pi=B_\pi(\underline{v}_\pi), \end{equation*}\]

where \(\underline{v}_\pi\in\mathbb{R}^{|\mathcal{S}|}\) is the vector of values for all states and \(B_\pi\) multiplies its argument by a matrix in \(\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}\) and then adds a constant vector to the result. In other words, the value vector \(\underline{v}_\pi\) for \(\pi\) is a fixed point of \(B_\pi\). From the proof of uniqueness of the solution we easily see that for an arbitrary vector \(\underline{v}\) we have

\[\begin{equation*} |\underline{v}_\pi-B_\pi(\underline{v})|_\infty=|B_\pi(\underline{v}_\pi)-B_\pi(\underline{v})|_\infty\leq\gamma\,|\underline{v}_\pi-\underline{v}|_\infty. \end{equation*}\]

If we start with some vector \(\underline{v}_0\) and repeatedly apply \(B_\pi\) to obtain a sequence \(\underline{v}_1,\underline{v}_2,\ldots\), then we have

\[\begin{equation*} |\underline{v}_\pi-\underline{v}_k|_\infty\leq\gamma^k\,|\underline{v}_\pi-\underline{v}_0|_\infty \end{equation*}\]

and \(\gamma^k\to 0\) as \(k\to\infty\) because \(\gamma<1\). This observation yields the following algorithm for (approximately) solving the Bellman equations in step 2 above:

  1. Choose an initial value vector \(\underline{v}_0\) (random, all zero,…).

  2. For \(k=1,2,\ldots\) iteratively compute \(\underline{v}_k:=B_\pi(\underline{v}_{k-1})\).

  3. Stop the iteration if \(|\underline{v}_k-\underline{v}_{k-1}|_\infty\leq\delta\) for some preset bound \(\delta>0\).

This algorithm always stops after finitely many steps, because the differences \(|\underline{v}_k-\underline{v}_{k-1}|_\infty\) tend to zero.
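For illustration, a minimal sketch of this fixed-point iteration under the same assumed array representation as above; the deterministic policy is assumed to be given as an array `policy` of chosen actions.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma, delta=1e-8):
    """Iterative policy evaluation: apply B_pi until the sup-norm
    change drops below delta.

    Assumed inputs: P[s, a, s2] = p(s2 | s, a), R[s, a] = expected
    immediate reward, policy[s] = action chosen in state s.
    """
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]       # transition matrix under pi
    r_pi = R[np.arange(n_states), policy]       # reward vector under pi

    v = np.zeros(n_states)                      # 1. initial value vector
    while True:
        v_new = r_pi + gamma * P_pi @ v         # 2. apply B_pi
        if np.max(np.abs(v_new - v)) <= delta:  # 3. stopping criterion
            return v_new
        v = v_new
```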

Value Iteration#

As an alternative to policy iteration we may solve the optimal Bellman equations to obtain an optimal policy more directly. The optimal Bellman equations are a system of nonlinear equations, but the solution vector \(\underline{v}_\ast\) is again a fixed point:

\[\begin{equation*} \underline{v}_\ast=B(\underline{v}_\ast), \end{equation*}\]

where \(B\) is the mapping defined by the right-hand side of the optimal Bellman equations. In complete analogy to solving the Bellman equations for policy iteration above we obtain the following algorithm:

  1. Choose an initial value vector \(\underline{v}_0\) (random, all zero,…).

  2. For \(k=1,2,\ldots\) iteratively compute \(\underline{v}_k:=B(\underline{v}_{k-1})\).

  3. Stop the iteration if \(|\underline{v}_k-\underline{v}_{k-1}|_\infty\leq\delta\) for some preset bound \(\delta>0\).

This yields (an approximation of) the optimal value function \(v_\ast\). Each greedy policy w.r.t. this value function is an optimal policy.
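A corresponding sketch of value iteration, again under the assumed array representation of the dynamics; the final greedy step extracts a policy from the (approximate) optimal value function.

```python
import numpy as np

def value_iteration(P, R, gamma, delta=1e-8):
    """Value iteration: fixed-point iteration for the optimal Bellman
    operator B, followed by one greedy step.

    Assumed inputs: P[s, a, s2] = p(s2 | s, a), R[s, a] = expected
    immediate reward, 0 <= gamma < 1.
    """
    v = np.zeros(P.shape[0])                    # 1. initial value vector
    while True:
        # 2. apply B: maximize the action values over all actions
        v_new = (R + gamma * np.einsum("sat,t->sa", P, v)).max(axis=1)
        if np.max(np.abs(v_new - v)) <= delta:  # 3. stopping criterion
            break
        v = v_new

    # greedy policy w.r.t. the approximate optimal value function
    q = R + gamma * np.einsum("sat,t->sa", P, v_new)
    return q.argmax(axis=1), v_new
```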

Efficiency Considerations#

Fixed-point iteration for systems of linear equations usually converges faster than for nonlinear equations. Policy iteration solves one system of linear equations per iteration, but often requires only a handful of iterations. Value iteration takes many more steps, but each step is very cheap (one application of \(B\)). Thus, it is not clear a priori which one is preferable.

For both algorithms we may increase efficiency with a simple trick: update \(\underline{v}\) in place and componentwise:

\[\begin{equation*} \underline{v}^{(l)}:=[B(\underline{v})]^{(l)}\qquad\text{for $l=1,\ldots,|\mathcal{S}|$}. \end{equation*}\]

More precisely:

\[\begin{align*} \underline{v}_k^{(1)}&:=[B(\underline{v}_{k-1})]^{(1)},\\ \underline{v}_k^{(l)}&:=\left[B\left(\begin{bmatrix}\underline{v}_k^{(1)}\\\vdots\\\underline{v}_k^{(l-1)}\\\underline{v}_{k-1}^{(l)}\\\vdots\\\underline{v}_{k-1}^{(|\mathcal{S}|)}\end{bmatrix}\right)\right]^{(l)}\qquad\text{for $l=2,\ldots,|\mathcal{S}|$}. \end{align*}\]

Thus, already updated components are used immediately when computing the next component, instead of holding the new values back until all components have been updated. As a side effect we need to store only one vector in memory instead of two, which may become relevant in case of large state spaces. Fixed-point iteration with this update rule is known as asynchronous policy evaluation.
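As a sketch of what one such in-place sweep looks like for value iteration (same assumed array representation), each component of `v` is overwritten immediately and reused for the remaining states:

```python
import numpy as np

def in_place_sweep(v, P, R, gamma):
    """One in-place sweep of value iteration: v is updated component by
    component, and updated entries are used for the remaining states.

    Assumed inputs as before; v is modified in place and also returned.
    """
    n_states = P.shape[0]
    for s in range(n_states):
        # [B(v)]^(s), computed with the partially updated vector v
        v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```

Repeating such sweeps until the sup-norm change of \(\underline{v}\) falls below \(\delta\) replaces the synchronous update in step 2 of the algorithms above.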

Both algorithms’ computation time is polynomial in the number of states and actions. Thus, for large state and/or action spaces they are very slow, but still much faster than exhaustively searching the whole policy space, whose size grows exponentially with the number of states.