Bellman Equations and Optimal Policies#

Given complete knowledge of the environment dynamics \(p\), we may compute value functions for policies without any exploration by the agent. Further, complete knowledge allows us to find (in a sense made precise below) optimal policies.

Computing Value Functions (Bellman Equations)#

Given a policy \(\pi\), the relations between state values and state-action values immediately yield the equations

\[\begin{equation*} v_\pi(s)=\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_\pi(s')\bigr) \end{equation*}\]

for all \(s\in\mathcal{S}\) and

\[\begin{equation*} q_\pi(s,a)=\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\left(r+\gamma\,\sum_{a'\in\mathcal{A}(s')}\pi(a',s')\,q_\pi(s',a')\right) \end{equation*}\]

for all \(s\in\mathcal{S}\) and all \(a\in\mathcal{A}(s)\). These are the Bellman equations for state values and state-action values, respectively.

The Bellman equations for state values and for state-action values are both systems of linear equations, which allow us to compute the value functions for all arguments without any need for exploration (if we know the environment dynamics \(p\)). In the case of state values there are as many equations as there are states; in the case of state-action values there are as many equations as there are state-action pairs.
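Since the Bellman equations for state values form a linear system with one equation per state, they can be solved directly with standard linear algebra. The following minimal sketch (Python with NumPy) assumes a hypothetical toy MDP stored as an array `p[s_next, r_idx, s, a]` together with a vector of possible reward values and a policy array `pi[a, s]`; this array layout and the toy MDP are illustrative choices, not something fixed by the text above.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions, 2 possible reward values.
# p[s_next, r_idx, s, a] is the probability of moving to s_next and
# receiving rewards[r_idx] when taking action a in state s.
rewards = np.array([0.0, 1.0])
p = np.zeros((2, 2, 2, 2))
p[0, 0, 0, 0] = 1.0   # in state 0, action 0: stay in 0, reward 0
p[1, 1, 0, 1] = 1.0   # in state 0, action 1: move to 1, reward 1
p[0, 1, 1, 0] = 1.0   # in state 1, action 0: move to 0, reward 1
p[1, 0, 1, 1] = 1.0   # in state 1, action 1: stay in 1, reward 0

pi = np.array([[0.5, 0.5],   # pi[a, s]: probability of action a in state s
               [0.5, 0.5]])
gamma = 0.9
n_s = p.shape[0]

# Rewrite the Bellman equations as v = r_pi + gamma * P_pi v with
#   P_pi[s, s'] = sum_a pi(a, s) sum_r p(s', r, s, a)
#   r_pi[s]     = sum_a pi(a, s) sum_{s', r} p(s', r, s, a) * r
P_pi = np.einsum('as,jrsa->sj', pi, p)
r_pi = np.einsum('as,jrsa,r->s', pi, p, rewards)

# Solve the linear system (I - gamma * P_pi) v = r_pi.
v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
print(v_pi)   # both states: 0.5 / (1 - 0.9) = 5.0
```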

The Bellman equations always have a solution, because for each policy there is a value function (the series in the explicit formula for value functions always converges). But could there be more than one solution? For \(\gamma<1\) the answer is ‘no’ (see the next section for a proof). For \(\gamma=1\) uniqueness cannot be assured. Note that existence of a solution for \(\gamma=1\) is only guaranteed for episodic tasks, not for continuing tasks.

Proof of Uniqueness of Solution to Bellman Equations#

Assume there are two solutions \(v_1\) and \(v_2\) to the Bellman equations for state values for some policy \(\pi\). Then for each \(s\in\mathcal{S}\) we have

\[\begin{align*} \bigl|v_1(s)-v_2(s)\bigr| &=\left|\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_1(s')\bigr)\right.\\ &\qquad\left.-\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_2(s')\bigr)\right|\\ &=\left|\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_1(s')-r-\gamma\,v_2(s')\bigr)\right|\\ &=\gamma\,\left|\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(v_1(s')-v_2(s')\bigr)\right|\\ &\leq\gamma\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl|v_1(s')-v_2(s')\bigr|\\ &\leq\gamma\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\max_{s''\in\mathcal{S}}\bigl|v_1(s'')-v_2(s'')\bigr|\\ &=\gamma\,\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,\max_{s''\in\mathcal{S}}\bigl|v_1(s'')-v_2(s'')\bigr|\\ &=\gamma\,\max_{s''\in\mathcal{S}}\bigl|v_1(s'')-v_2(s'')\bigr|. \end{align*}\]

Here we used that all \(p\)-values (for fixed \(s\) and \(a\)) sum to 1 and that \(\sum_a\pi(a,s)=1\). Now taking the maximum over all \(s\) we see

\[\begin{equation*} \max_{s\in\mathcal{S}}\bigl|v_1(s)-v_2(s)\bigr|\leq\gamma\max_{s\in\mathcal{S}}\bigl|v_1(s)-v_2(s)\bigr|. \end{equation*}\]

But for \(\gamma\in[0,1)\) this is only possible if \(v_1(s)=v_2(s)\) for all \(s\). Thus, there cannot be two different solutions to the Bellman equations.
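The contraction bound just derived also explains why simple fixed-point iteration on the Bellman equations converges to this unique solution for \(\gamma<1\): each application of the right-hand side shrinks the maximum distance between two candidate value functions by at least the factor \(\gamma\). A minimal sketch, reusing the hypothetical array layout `p[s_next, r_idx, s, a]` and `pi[a, s]` from the example above (the random toy MDP is again purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 3, 2
rewards = np.array([0.0, 1.0])
gamma = 0.9

# Random toy dynamics p[s_next, r_idx, s, a], normalized over (s_next, r_idx).
p = rng.random((n_s, len(rewards), n_s, n_a))
p /= p.sum(axis=(0, 1), keepdims=True)

# Random policy pi[a, s], normalized over actions.
pi = rng.random((n_a, n_s))
pi /= pi.sum(axis=0, keepdims=True)

def bellman_backup(v):
    """One application of the right-hand side of the Bellman equations."""
    return np.einsum('as,jrsa,r->s', pi, p, rewards) \
         + gamma * np.einsum('as,jrsa,j->s', pi, p, v)

# Two different starting points are driven to the same fixed point.
v1, v2 = np.zeros(n_s), 100.0 * np.ones(n_s)
for _ in range(500):
    v1, v2 = bellman_backup(v1), bellman_backup(v2)
print(np.max(np.abs(v1 - v2)))   # essentially 0: the solution is unique
```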

Policy Improvement#

Given a policy, we are able to compute the corresponding value functions. The next step is to derive a better policy from the state or state-action value function. Here we say that a policy \(\pi_1\) is at least as good as a policy \(\pi_2\) if

\[\begin{equation*} v_{\pi_1}(s)\geq v_{\pi_2}(s)\qquad\text{for all states $s\in\mathcal{S}$} \end{equation*}\]

(and it is better if strict inequality holds for at least one state).

It turns out that a greedy policy with respect to the state value function \(v_\pi\) is always at least as good as the original policy \(\pi\). This result is known as the policy improvement theorem.

Note that for episodic tasks with \(\gamma=1\) the policy improvement theorem only holds if both the original policy and the corresponding greedy policy always reach the end state within finitely many steps.
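Computing a greedy policy from a state value function only requires a one-step lookahead through the dynamics \(p\). The sketch below again assumes the hypothetical array layout `p[s_next, r_idx, s, a]` and `pi[a, s]` used in the earlier examples. By the policy improvement theorem, the returned policy is at least as good as the policy whose state value function `v` was passed in.

```python
import numpy as np

def greedy_policy(p, rewards, v, gamma):
    """Greedy policy w.r.t. a state value function v.

    p[s_next, r_idx, s, a] is the (hypothetical) dynamics array and
    rewards[r_idx] lists the possible reward values.  Returns an array
    pi[a, s] putting all probability on one greedy action per state.
    """
    # One-step lookahead: q[s, a] = sum_{s', r} p(s', r, s, a) * (r + gamma * v(s'))
    q = np.einsum('jrsa,r->sa', p, rewards) + gamma * np.einsum('jrsa,j->sa', p, v)
    n_s, n_a = q.shape
    pi = np.zeros((n_a, n_s))
    pi[q.argmax(axis=1), np.arange(n_s)] = 1.0   # pick one greedy action per state
    return pi
```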

Proof of the Policy Improvement Theorem#

The proof of the policy improvement theorem is quite simple. We give it for continuing tasks; for episodic tasks it is almost identical. Denote the greedy action with respect to \(v_\pi\) for state \(s\) by \(a_{\mathrm{g}}(s)\) (if there is more than one greedy action, choose any one of them). Then

\[\begin{equation*} q_\pi(s,a_{\mathrm{g}}(s))=\max_{a\in\mathcal{A}(s)}q_\pi(s,a)\qquad\text{for all $s\in\mathcal{S}$} \end{equation*}\]

and, thus,

\[\begin{align*} v_\pi(s) &=\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,q_\pi(s,a)\\ &\leq\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,q_\pi(s,a_{\mathrm{g}}(s))\\ &=q_\pi(s,a_{\mathrm{g}}(s)) \end{align*}\]

for all \(s\in\mathcal{S}\). With this estimate for each \(s\in\mathcal{S}\) we obtain

\[\begin{align*} v_\pi(s) &\leq q_\pi(s,a_{\mathrm{g}}(s))\\ &=\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,\bigl(r+\gamma\,v_\pi(s')\bigr)\\ &\leq\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,\bigl(r+\gamma\,q_\pi(s',a_{\mathrm{g}}(s'))\bigr)\\ &=\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,r+\gamma\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,q_\pi(s',a_{\mathrm{g}}(s'))\\ &=\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,r\\ &\qquad+\gamma\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,\sum_{s''\in\mathcal{S}}\sum_{r'\in\mathcal{R}}p(s'',r',s',a_{\mathrm{g}}(s'))\,\bigl(r'+\gamma\,v_\pi(s'')\bigr)\\ &=\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,r\\ &\qquad+\gamma\,\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}\sum_{s''\in\mathcal{S}}\sum_{r'\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,p(s'',r',s',a_{\mathrm{g}}(s'))\,r'\\ &\qquad+\gamma^2\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}\sum_{s''\in\mathcal{S}}\sum_{r'\in\mathcal{R}}p(s',r,s,a_{\mathrm{g}}(s))\,p(s'',r',s',a_{\mathrm{g}}(s'))\,v_\pi(s'')\\ &\leq\ldots, \end{align*}\]

that is, an estimate of the form

\[\begin{align*} v_\pi(s) &\leq\text{expected first reward if the first action is greedy}\\ &\qquad+\gamma\times\text{expected second reward if the first two actions are greedy}\\ &\qquad+\gamma^2\times\text{expected third reward if the first three actions are greedy}\\ &\qquad+\ldots. \end{align*}\]

The right-hand side is exactly the state value function of the greedy policy (w.r.t. \(v_\pi\)). Since the choice of greedy actions was arbitrary, each greedy policy is at least as good as the original policy \(\pi\).

Note that from a mathematical point of view the transition above from the finite sum to the infinite sum by repeating the step again and again requires care. For a complete and rigorous proof some details would have to be filled in there. From those details one would also see the troubles caused by \(\gamma=1\).

From the proof we also see that a greedy policy is strictly better (in at least one state) if the original policy \(\pi\) allows for non-greedy actions in some state, that is, if there are \(s\) and \(a\) such that \(\pi(a,s)>0\) and \(q_\pi(s,a)<q_\pi(s,a_{\mathrm{g}}(s))\).
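To make the improvement tangible, the following sketch evaluates a hypothetical random policy on a random toy MDP, builds a greedy policy w.r.t. its state value function, and checks the inequality \(v_{\text{greedy}}(s)\geq v_\pi(s)\) state by state. All names and the array layout `p[s_next, r_idx, s, a]`, `pi[a, s]` are assumptions carried over from the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a = 4, 3
rewards = np.array([-1.0, 0.0, 1.0])
gamma = 0.9

# Hypothetical random dynamics p[s_next, r_idx, s, a] and random policy pi[a, s].
p = rng.random((n_s, len(rewards), n_s, n_a))
p /= p.sum(axis=(0, 1), keepdims=True)
pi = rng.random((n_a, n_s))
pi /= pi.sum(axis=0, keepdims=True)

def evaluate(policy):
    """Solve the Bellman equations for the state values of `policy`."""
    P = np.einsum('as,jrsa->sj', policy, p)
    r = np.einsum('as,jrsa,r->s', policy, p, rewards)
    return np.linalg.solve(np.eye(n_s) - gamma * P, r)

v = evaluate(pi)

# Greedy policy w.r.t. v_pi via one-step lookahead.
q = np.einsum('jrsa,r->sa', p, rewards) + gamma * np.einsum('jrsa,j->sa', p, v)
pi_greedy = np.zeros_like(pi)
pi_greedy[q.argmax(axis=1), np.arange(n_s)] = 1.0

# Policy improvement: the greedy policy is at least as good in every state.
print(np.all(evaluate(pi_greedy) >= v - 1e-12))   # True
```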

Optimal Policies#

One can prove (see the next section) that there always is a best policy, that is, a policy \(\pi_\ast\) with

\[\begin{equation*} v_{\pi_\ast}(s)\geq v_\pi(s)\qquad\text{for all $s\in\mathcal{S}$ and for all policies $\pi$}. \end{equation*}\]

Thus, for each finite Markov decision process there is a best policy. Such an optimal policy yields expected return at least as high as that of any other policy, regardless of the initial state of the environment.

In other words, it is not possible to have two policies, one better in one state and the other better in another state, such that neither can be improved any further. In such situations there always is a third policy that is at least as good as both of them in every state.

As a consequence of the policy improvement theorem each optimal policy is a greedy policy w.r.t. its value function (else a corresponding greedy policy would be strictly better).

Given the existence of an optimal policy \(\pi_\ast\), the definition of optimal policies implies the following formulas for the optimal policies’ value functions:

\[\begin{equation*} v_\ast(s):=v_{\pi_\ast}(s)=\max_{\pi}v_\pi(s)\qquad\text{for all $s\in\mathcal{S}$}, \end{equation*}\]
\[\begin{equation*} q_\ast(s,a):=q_{\pi_\ast}(s,a)=\max_{\pi}q_\pi(s,a)\qquad\text{for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}(s)$}. \end{equation*}\]

Note that if there is more than one optimal policy, then all these optimal policies share one and the same value function \(v_\ast\) or \(q_\ast\) (by the definition of optimality). Thus, we may refer to \(v_\ast\) and \(q_\ast\) as the optimal value functions without referring to a concrete policy.

Proof of Existence of Optimal Policies#

There exist at least two different existence proofs for optimal policies, both requiring deeper knowledge of mathematics. Here we only sketch the basic ideas, to showcase the power of abstract mathematics. One proof exploits the Banach fixed-point theorem; the other is heavily based on Zorn’s lemma, which is equivalent to the hotly debated axiom of choice. We follow the second variant.

The set of all policies carries no linear order, because we cannot compare every policy to every other policy in terms of ‘better’ or ‘worse’ (pointwise for all states, as introduced above). But we may find pairs of policies for which we can say that one policy is at least as good as the other. From such pairs we may form increasing chains of policies (from bad to good). If we can show that every such chain of policies has an upper bound (a policy at least as good as every policy in the chain), then Zorn’s lemma yields the existence of a maximal element of the whole set of policies, that is, a policy that cannot be improved any further. Zorn’s lemma thus guarantees the existence of an optimal policy.

How can we show that each chain of policies has an upper bound? For finite chains the last element obviously is an upper bound. For infinite chains we argue as follows:

  • Each infinite chain is a bounded set of functions (policies take values in \([0,1]\)).

  • Bounded infinite sets in finite-dimensional spaces always contain a convergent sequence, say \(\pi_1,\pi_2,\ldots\) (that is the Bolzano-Weierstrass theorem).

  • Let \(\pi\) be the pointwise limit of the sequence, that is

    \[\begin{equation*} \pi(a,s):=\lim_{n\to\infty}\pi_n(a,s)\qquad\text{for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}(s)$}. \end{equation*}\]
  • Now \(\pi\) is a policy, too (values in \([0,1]\), sum over actions is 1).

  • The sequence of corresponding state value functions \(v_1,v_2,\ldots\) converges, too (the sequence can be chosen increasing along the chain, and bounded increasing sequences always converge). Denote the pointwise limit by \(v\).

  • From the Bellman equations we see that \(v\) is the value function of \(\pi\) (swap sums and limits in Bellman equations).

  • Thus, \(\pi\) is at least as good as each of the policies \(\pi_1,\pi_2,\ldots\) and, consequently, \(\pi\) is at least as good as every policy in the chain under consideration.

Computing Optimal Policies (Optimal Bellman Equations)#

Above we met the Bellman equations for computing value functions for given policies. All optimal policies share identical state and state-action value functions. Thus, we should try to find Bellman equations for computing the optimal value functions without referring to a concrete policy. From the optimal value function we then easily obtain an optimal policy: a greedy policy w.r.t. the optimal value function (this is a simple consequence of the policy improvement theorem).

The key ingredient for computing optimal value functions is the equation

\[\begin{equation*} v_\ast(s)=\max_{a\in\mathcal{A}(s)}q_\ast(s,a)\qquad\text{for all $s\in\mathcal{S}$}. \end{equation*}\]

This equation is, again, a simple consequence of the policy improvement theorem, because there is an optimal policy that is greedy w.r.t. \(v_\ast\), and for this greedy policy the equation obviously holds. Combining this equation with the equations relating state and state-action values (cf. the derivation of the Bellman equations) we obtain

\[\begin{equation*} v_\ast(s)=\max_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_\ast(s')\bigr)\qquad\text{for all $s\in\mathcal{S}$}, \end{equation*}\]
\[\begin{equation*} q_\ast(s,a)=\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\left(r+\gamma\,\max_{a'\in\mathcal{A}(s')}q_\ast(s',a')\right)\qquad\text{for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}(s)$}. \end{equation*}\]

These are the optimal Bellman equations for state values and for state-action values in the case of continuing tasks. For episodic tasks replace \(\gamma\) by 1. Note that the optimal Bellman equations are identical to the Bellman equations for the greedy policy w.r.t. the optimal value function.

For deriving the optimal Bellman equations we did not use optimality of \(v_\ast\) directly. Instead, we only needed that the underlying policy is greedy with respect to its own value function. Might there be non-optimal policies with this property, too? If so, the optimal Bellman equations would have multiple solutions, some of them being non-optimal value functions. Luckily, the answer is ‘no’ (see the next section for a proof). This uniqueness result for the optimal Bellman equations also implies that every policy which is greedy w.r.t. its own value function has to be optimal.

The optimal Bellman equations form a system of nonlinear equations, which in general cannot be solved in closed form. Instead, (iterative) methods for solving nonlinear equations have to be used. We know that there always is a solution to this system, because we have proven the existence of an optimal policy above.
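One standard iterative method is value iteration: use the optimal Bellman equation for state values as an update rule and repeat until the values stop changing; the contraction argument from the uniqueness proof below guarantees convergence for \(\gamma<1\). The following sketch again assumes the hypothetical array layout `p[s_next, r_idx, s, a]` from the earlier examples; an optimal policy is read off as a greedy policy w.r.t. the computed values.

```python
import numpy as np

def value_iteration(p, rewards, gamma, tol=1e-10):
    """Approximately solve the optimal Bellman equations for v_* by
    fixed-point iteration (value iteration)."""
    n_s = p.shape[0]
    v = np.zeros(n_s)
    while True:
        # q[s, a] = sum_{s', r} p(s', r, s, a) * (r + gamma * v(s'))
        q = np.einsum('jrsa,r->sa', p, rewards) + gamma * np.einsum('jrsa,j->sa', p, v)
        v_new = q.max(axis=1)                   # v_*(s) = max_a q_*(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)      # values and a greedy (optimal) policy
        v = v_new

# Hypothetical random toy MDP, as in the earlier sketches.
rng = np.random.default_rng(2)
n_s, n_a = 4, 3
rewards = np.array([-1.0, 0.0, 1.0])
p = rng.random((n_s, len(rewards), n_s, n_a))
p /= p.sum(axis=(0, 1), keepdims=True)

v_star, greedy_actions = value_iteration(p, rewards, gamma=0.9)
print(v_star, greedy_actions)
```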

If we know the environment dynamics \(p\), then the optimal Bellman equations allow for numerically computing an optimal policy without the need for exploration. If we do not know \(p\), we have to look for approximate solutions to the optimal Bellman equations based on data collected by the agent.

Proof of Uniqueness of Solution to Optimal Bellman Equations#

For the proof we need the inequality

\[\begin{equation*} \left|\max_x f(x)-\max_x g(x)\right|\leq\max_x\bigl|f(x)-g(x)\bigr| \end{equation*}\]

for arbitrary functions \(f\) and \(g\) taking arguments \(x\) from a finite set. To see that this inequality is true, assume \(\max_x f(x)\geq\max_x g(x)\) (else, switch the roles of \(f\) and \(g\)) and let \(\bar{x}\) be a maximizer of \(f(x)\). Then

\[\begin{align*} \left|\max_x f(x)-\max_x g(x)\right| &=f(\bar{x})-\max_x g(x)\\ &\leq f(\bar{x})-g(\bar{x})\\ &\leq\max_x\bigl|f(x)-g(x)\bigr|. \end{align*}\]

Now assume there are two solutions \(v_1\) and \(v_2\) to the optimal Bellman equations for state values. Then for each \(s\in\mathcal{S}\) we have

\[\begin{align*} \bigl|v_1(s)-v_2(s)\bigr| &=\left|\max_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_1(s')\bigr)\right.\\ &\qquad\left.-\max_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_2(s')\bigr)\right|\\ &\leq\max_{a\in\mathcal{A}(s)}\left|\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_1(s')-r-\gamma\,v_2(s')\bigr)\right|\\ &=\gamma\,\max_{a\in\mathcal{A}(s)}\left|\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(v_1(s')-v_2(s')\bigr)\right|\\ &\leq\gamma\,\max_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl|v_1(s')-v_2(s')\bigr|\\ &\leq\gamma\,\max_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\max_{s''\in\mathcal{S}}\bigl|v_1(s'')-v_2(s'')\bigr|\\ &=\gamma\,\max_{a\in\mathcal{A}(s)}\max_{s''\in\mathcal{S}}\bigl|v_1(s'')-v_2(s'')\bigr|\\ &=\gamma\,\max_{s''\in\mathcal{S}}\bigl|v_1(s'')-v_2(s'')\bigr|. \end{align*}\]

Here we used that all \(p\)-values (for fixed \(s\) and \(a\)) sum to 1. Now taking the maximum over all \(s\) we see

\[\begin{equation*} \max_{s\in\mathcal{S}}\bigl|v_1(s)-v_2(s)\bigr|\leq\gamma\max_{s\in\mathcal{S}}\bigl|v_1(s)-v_2(s)\bigr|. \end{equation*}\]

But for \(\gamma\in[0,1)\) this is only possible if \(v_1(s)=v_2(s)\) for all \(s\). Thus, there cannot be two different solutions to the optimal Bellman equations.

Note that for episodic tasks with \(\gamma=1\) uniqueness cannot be assured.