Basic Notions#

Finite Markov Decision Processes#

Decision processes are processes which can be modeled by agent-environment interactions. The agent takes actions according to a policy and the environment provides feedback (state and reward) to the agent. With discrete time steps \(t=0,1,\ldots\) this interaction yields a trajectory

\[\begin{equation*} s_0,a_0,r_1,s_1,a_1,r_2,s_2,a_2,\ldots, \end{equation*}\]

where \(a_t\), \(r_t\), \(s_t\) denote actions taken, rewards obtained, and states observed, respectively. Here \(s_0\) is the environment’s initial state and \(a_0\) is the agent’s first action, which yields the reward \(r_1\) and transforms the environment into state \(s_1\). Note that there are continuous decision processes, too, but we restrict our attention to the discrete setting without mentioning this every time.
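
To make this interaction concrete, here is a minimal Python sketch of how such a trajectory could be collected. The `env` object (with `reset` and `step` methods) and the `policy` function are hypothetical placeholders resembling common Gym-style interfaces, not part of the definition above.

```python
def collect_trajectory(env, policy, max_steps=100):
    """Roll out one trajectory s_0, a_0, r_1, s_1, a_1, r_2, ...

    Assumes a hypothetical environment with reset() -> state and
    step(action) -> (next_state, reward, done), similar to Gym-style APIs.
    """
    trajectory = []
    state = env.reset()                               # observe s_0
    for _ in range(max_steps):
        action = policy(state)                        # agent picks a_t
        next_state, reward, done = env.step(action)   # environment returns s_{t+1}, r_{t+1}
        trajectory.append((state, action, reward))    # record (s_t, a_t, r_{t+1})
        state = next_state
        if done:                                      # end of episode (episodic tasks)
            break
    return trajectory
```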

Such a decision process is called a Markov decision process (MDP) if it has the following property, known as the Markov property: at each time step \(t\) the obtained reward \(r_t\) and the current state \(s_t\) depend only on the previous state \(s_{t-1}\) and the previous action \(a_{t-1}\) (and chance). In other words: if we know \(s_{t-1}\) and \(a_{t-1}\), then we also know \(r_t\) and \(s_t\) (up to random influences).

The Markov property ensures that the state information is sufficiently rich: the environment does not have to look at the history of the process to determine the next state and reward from the current state and action. The Markov property can always be satisfied by appropriate modeling of the environment.

A finite Markov decision process is a Markov decision process with finite state set, finite action set, and a finite number of different reward values.

Below \(\mathcal{S}\) denotes the state set, \(\mathcal{A}(s)\) denotes the set of actions feasible in state \(s\), and \(\mathcal{R}\subseteq\mathbb{R}\) denotes the set of reward values.

Examples and Counterexamples#

Continuously evolving processes like many physical phenomena aren’t (discrete) Markov decision processes, because they lack discrete time steps. Processes with continuous time cannot be handled directly by the discrete-time algorithms considered here, although from the analytical point of view continuous time is not a problem. To treat continuous-time processes like driving a car with typical reinforcement learning algorithms we have to find ways to discretize time.

Imagine a grid world in which the agent obtains an extra reward if it reaches the destination cell without touching a wall on the way. This model does not have the Markov property, because the reward cannot be determined from the previous state and action alone. Instead, we would have to look at all previous states to determine whether the agent deserves the extra reward.

Similarly, the environment could react to the action ‘go to the right’ by moving the agent by one cell if it has hit a wall at some earlier step and by two cells if it hasn’t hit a wall so far. Again the Markov property is violated, because the next state (the agent’s position) cannot be determined from the previous state and action alone. Instead, the environment would have to look at all previous states to determine the next state.

The grid world counterexample can be turned into a Markov decision process by extending the state information: the environment’s state could contain a flag telling us whether the agent has touched a wall so far. When the agent reaches the destination cell, the extra reward is then determined by the current state alone.
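
A minimal sketch of this state extension, with hypothetical names and arbitrary reward values:

```python
from typing import NamedTuple, Tuple

class GridState(NamedTuple):
    """Hypothetical extended grid-world state: position plus a wall-contact flag."""
    position: Tuple[int, int]
    touched_wall: bool

def reward(state: GridState, destination: Tuple[int, int]) -> float:
    """Reward computed from the current (extended) state alone, so the Markov
    property holds; the concrete reward values are arbitrary."""
    if state.position == destination:
        return 10.0 if not state.touched_wall else 1.0   # extra reward only without wall contact
    return 0.0
```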

An example of a non-finite Markov decision process is autonomous driving. Turning the steering wheel by an angle can be modeled by a set of infinitely many actions (one per angle). Most infinite action sets can be turned into finite sets by discretization: instead of allowing arbitrary angles we could restrict the action set to integer angles, for instance.
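
A minimal sketch of such a discretization (the admissible range of ±45 degrees is an arbitrary assumption):

```python
# Hypothetical discretization of a continuous steering action:
# only integer angles (in degrees) within an assumed range of +/-45 degrees.
actions = list(range(-45, 46))

def discretize(angle: float) -> int:
    """Map an arbitrary steering angle to the nearest feasible integer angle."""
    return min(actions, key=lambda a: abs(a - angle))

print(discretize(12.3))   # 12
```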

Environment Dynamics#

Almost always we consider stochastic environments, that is, identical behavior of the agent may lead to different feedback (state, reward) from the environment at different times. The environment’s behavior (its ‘dynamics’) is completely described by the values

\[\begin{equation*} p(s',r,s,a)\in[0,1], \end{equation*}\]

which are to be interpreted as the probability that state \(s'\) and reward \(r\) are observed after taking action \(a\) in state \(s\).

If we knew \(p\) for all combinations of arguments, we could solve the corresponding reinforcement learning problem without exploration. But usually we don’t know \(p\). An example with complete knowledge of \(p\) is a simple grid world: here \(p\) encodes the map of the grid world.

For finite Markov decision processes \(p\) can be represented by a (four-dimensional) table with finitely many cells.
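
For illustration, here is a made-up two-state MDP whose dynamics are stored as nested dictionaries, a sparse variant of such a table; all states, actions, rewards, and probabilities are invented for this sketch.

```python
# Dynamics p of a made-up two-state MDP, stored sparsely as nested dictionaries:
# p[s][a] maps each possible outcome (s', r) to its probability p(s', r, s, a).
p = {
    "A": {"stay": {("A", 0.0): 1.0},
          "go":   {("B", 1.0): 0.8, ("A", 0.0): 0.2}},
    "B": {"stay": {("B", 0.0): 1.0},
          "go":   {("A", 1.0): 0.9, ("B", 0.0): 0.1}},
}

# Sanity check: for every state-action pair the outcome probabilities sum to 1.
for s, feasible_actions in p.items():
    for a, outcomes in feasible_actions.items():
        assert abs(sum(outcomes.values()) - 1.0) < 1e-12
```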

Goals and Return#

Reinforcement learning algorithms are designed to maximize the accumulated reward in the long run. Thus, defining sensible rewards is the only way to direct the agent towards a learning task’s goal! The accumulated reward is called the return. The return obtained after time step \(t\) will be denoted by \(g_t\).

For episodic tasks with \(T\) time steps the return is

\[\begin{align*} g_t&:=r_{t+1}+r_{t+2}+\cdots+r_T\\ &=r_{t+1}+g_{t+1}. \end{align*}\]

For continuing tasks a simple sum would have infinitely many summands, and the (infinitely many) rewards in the far future would outweigh the (finitely many) rewards in the near future. Thus, for continuing tasks we should use discounting:

\[\begin{align*} g_t&:=r_{t+1}+\gamma\,r_{t+2}+\gamma^2\,r_{t+3}+\gamma^3\,r_{t+4}+\ldots\\ &=r_{t+1}+\gamma\,g_{t+1}. \end{align*}\]

Here \(\gamma\in[0,1)\) is the discounting parameter. A small \(\gamma\) puts more weight on the near future, whereas a \(\gamma\) close to one also takes rewards obtained in the far future into account. An alternative is to use the simple sum of a fixed number of future rewards.
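
The recursion \(g_t=r_{t+1}+\gamma\,g_{t+1}\) translates directly into code. The following sketch computes all returns of a finished episode from its reward sequence; with \(\gamma=1\) it covers the episodic case.

```python
def returns(rewards, gamma=0.9):
    """Compute the returns g_0, ..., g_{T-1} from the rewards r_1, ..., r_T of a
    finished episode via the backward recursion g_t = r_{t+1} + gamma * g_{t+1}.
    With gamma = 1 this is the undiscounted episodic return."""
    g = 0.0
    gs = []
    for r in reversed(rewards):
        g = r + gamma * g
        gs.append(g)
    return list(reversed(gs))

print(returns([0.0, 0.0, 1.0], gamma=0.9))   # approximately [0.81, 0.9, 1.0]
```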

For finding good policies we aren’t interested in the concrete returns obtained by the agent in past episodes, but in the expected return. That is, we want to forecast the average return over many episodes (episodic tasks) or over long time spans (continuing tasks) obtained by following the policy under consideration. This is possible if we have complete knowledge of the environment dynamics. But often we don’t have this knowledge, and then we have to find good estimates of the expected return to properly direct the agent.

Policies and Value Functions#

A policy describes the agent’s behavior by mapping states to actions. To each state-action pair a probability-like score \(\pi(a,s)\) is assigned, which can be interpreted as the probability that action \(a\) is chosen in state \(s\). The goal of developing reinforcement learning algorithms is to find good policies (in terms of expected return).

Value functions express expected returns for all actions and/or states:

  • The state-value function \(v_\pi\) with respect to a policy \(\pi\) assigns to each state \(s\) the expected return when starting from \(s\) and following \(\pi\).

  • The action-value function \(q_\pi\) with respect to a policy \(\pi\) assigns to each state-action pair \((s,a)\) the expected return when starting from \(s\) with action \(a\), then following \(\pi\).

Writing down concrete formulas for value functions is quite difficult. We start with two implicit formulas showing the relation between state values and state-action values (\(p\) encodes environment dynamics, see above):

  • The expected return in state \(s\) is the mean (weighted by probability) of expected returns obtained from all possible actions in state \(s\):

    \[\begin{equation*} v_\pi(s)=\sum_{a\in\mathcal{A}(s)}\pi(a,s)\,q_\pi(s,a)\qquad\text{for all $s\in\mathcal{S}$}. \end{equation*}\]
  • The expected return for the state-action pair \((s,a)\) is the mean (weighted by probability) of expected returns of all possible follow-up states:

    \[\begin{equation*} q_\pi(s,a)=\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}p(s',r,s,a)\,\bigl(r+\gamma\,v_\pi(s')\bigr)\qquad\text{for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}(s)$}. \end{equation*}\]

    This formula is for continuing tasks only. For episodic tasks replace \(\gamma\) by 1.

From these equations we see that computing state values from state-action values is always possible, but computing state-action values from state values requires knowledge of the environment dynamics.
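
Both directions translate almost literally into code. The following sketch assumes tabular representations as nested dictionaries (`pi[s][a]` for the policy, `q[s][a]` and `v[s]` for the value functions) and dynamics `p` structured as in the two-state example above:

```python
def v_from_q(pi, q):
    """State values from action values: v(s) = sum_a pi(a, s) * q(s, a)."""
    return {s: sum(pi[s][a] * q[s][a] for a in pi[s]) for s in pi}

def q_from_v(p, v, gamma=0.9):
    """Action values from state values; this direction needs the dynamics p:
    q(s, a) = sum_{s', r} p(s', r, s, a) * (r + gamma * v(s'))."""
    return {s: {a: sum(prob * (r + gamma * v[s_next])
                       for (s_next, r), prob in outcomes.items())
                for a, outcomes in feasible_actions.items()}
            for s, feasible_actions in p.items()}
```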

The explicit formula for \(v_\pi(s)\) has the structure

\[\begin{align*} v_\pi(s)&=\text{expected reward after first action}\\ &\qquad+\gamma\times\text{expected reward after second action}\\ &\qquad+\gamma^2\times\text{expected reward after third action}\\ &\qquad+\ldots. \end{align*}\]

For continuing tasks this sum has infinitely many summands, but a finite value (expected rewards are bounded and \(1+\gamma+\gamma^2+\ldots\) is a convergent geometric series). For episodic tasks the sum has finitely many summands (and \(\gamma=1\)). Making ‘expected reward after … action’ explicit we obtain

\[\begin{align*} v_\pi(s) &=\sum_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}\pi(a,s)\,p(s',r,s,a)\,r\\ &\qquad+\gamma\,\sum_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}\sum_{a'\in\mathcal{A}(s')}\sum_{s''\in\mathcal{S}}\sum_{r'\in\mathcal{R}}\pi(a,s)\,p(s',r,s,a)\,\pi(a',s')\,p(s'',r',s',a')\,r'\\ &\qquad+\gamma^2\,\sum_{a\in\mathcal{A}(s)}\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}}\sum_{a'\in\mathcal{A}(s')}\sum_{s''\in\mathcal{S}}\sum_{r'\in\mathcal{R}}\sum_{a''\in\mathcal{A}(s'')}\sum_{s'''\in\mathcal{S}}\sum_{r''\in\mathcal{R}}\pi(a,s)\,p(s',r,s,a)\,\pi(a',s')\,p(s'',r',s',a')\,\pi(a'',s'')\,p(s''',r'',s'',a'')\,r''\\ &\qquad+\ldots. \end{align*}\]

For \(q_\pi(s,a)\) the explicit formula looks almost identical: simply remove \(\sum_{a\in\mathcal{A}(s)}\) and \(\pi(a,s)\) in each line.
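
In practice one does not evaluate this infinite sum term by term. A common alternative, sketched below with the same dictionary representations as before, is to iterate the implicit equations until the values stabilize (a fixed-point iteration known as iterative policy evaluation):

```python
def evaluate_policy(pi, p, gamma=0.9, tol=1e-8):
    """Approximate v_pi by repeatedly applying
    v(s) <- sum_a pi(a, s) sum_{s', r} p(s', r, s, a) * (r + gamma * v(s'))
    until the largest change per sweep falls below tol."""
    v = {s: 0.0 for s in p}
    while True:
        delta = 0.0
        for s in p:
            new_value = sum(pi[s][a] * sum(prob * (r + gamma * v[s_next])
                                           for (s_next, r), prob in p[s][a].items())
                            for a in pi[s])
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < tol:
            return v
```

For \(\gamma<1\) this iteration is guaranteed to converge; for episodic tasks with \(\gamma=1\) additional care is needed.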

On the one hand, value functions provide information about a policy’s quality. On the other hand, value functions can be used to define policies:

  • A greedy policy with respect to a value function only chooses actions with highest value (in case of an action-value function) or actions leading to a state with highest value (in case of a state-value function).

  • An \(\varepsilon\)-greedy policy with respect to a value function behaves like a greedy policy with probability \(1-\varepsilon\) and chooses a random action with probability \(\varepsilon\).
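
A minimal sketch of \(\varepsilon\)-greedy action selection based on a tabular action-value estimate `q[s][a]` (dictionary representation as above):

```python
import random

def epsilon_greedy(q, s, epsilon=0.1):
    """With probability epsilon choose a uniformly random action in state s,
    otherwise choose an action with highest estimated value q[s][a]."""
    feasible_actions = list(q[s])
    if random.random() < epsilon:
        return random.choice(feasible_actions)               # explore
    return max(feasible_actions, key=lambda a: q[s][a])      # exploit (greedy)
```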

Value function based reinforcement learning methods estimate value functions from information the agent gathers over time (exploration) resulting in a (value function based) policy solving the learning task (exploitation). If during exploration the agent is already controlled by the (iteratively changing) value function based policy to be optimized, the method is said to be on-policy (same policy for exploration and exploitation). If during exploration the agent is controlled by another policy (the random policy, for instance), then the method is off-policy (different policies for exploration and exploitation).

Whether to use state values or action values to define a policy depends on the concrete use case. Action-value functions (or estimates thereof) require more memory, because they have to store expected returns for all pairs of states and actions. State-value functions require less memory, but to choose an action based on state values we have to predict the next state, which may be impossible without complete knowledge of the environment dynamics.