Maximization of Return#

After modeling a reinforcement learning problem, including proper definitions of the agent, the environment, and the corresponding actions, states, and rewards, our aim is to find a policy which maximizes return, that is, accumulated rewards. To find a good policy the agent has to pursue two objectives:

  • exploration (collect information about the environment),

  • exploitation (use collected information to get high rewards and to maximize return).
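
Return is commonly formalized as the (possibly discounted) sum of future rewards. The notation below is a standard convention and our own assumption, since symbols have not been fixed in the text yet: with rewards \(R_{t+1}, R_{t+2}, \dots\) and a discount factor \(\gamma \in [0,1]\), the return from time step \(t\) is

\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.
\]

Maximizing return means maximizing this accumulated reward; choosing \(\gamma < 1\) weights rewards received soon more heavily than rewards received later.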

There exist several principal approaches for finding good policies. In this book we only consider value function methods.

Basic Idea of Value Function Methods#

Value function methods are based on scoring schemes (value functions) for states or state-action pairs. Rewards collected by the agent during exploration are used to compute or update scores for all states visited and actions taken so far. Based on these scores, future actions are chosen. Typically, a score is an estimate of the expected return if the corresponding action is chosen in the corresponding state. The more data about the environment and its reward function is available (exploration!), the more accurate the estimates become.
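
As a minimal sketch of this idea (the data structures and function names are our own choice, not a fixed API), scores for state-action pairs can be kept in a table and updated incrementally with the average of all returns observed so far:

```python
from collections import defaultdict

# Score table: maps (state, action) to the current estimate of expected return.
scores = defaultdict(float)
counts = defaultdict(int)

def update_score(state, action, observed_return):
    """Incrementally average the returns observed for a state-action pair."""
    key = (state, action)
    counts[key] += 1
    # Running mean: new_estimate = old_estimate + (sample - old_estimate) / n
    scores[key] += (observed_return - scores[key]) / counts[key]

def best_action(state, actions):
    """Choose the action with the highest current score in the given state."""
    return max(actions, key=lambda a: scores[(state, a)])
```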

Depending on the size of the state and action sets, we distinguish between

  • tabular methods (small state and action sets, fitting into memory; see the sketch after this list),

  • approximate methods (large state and/or action sets, scores cannot be kept in memory for all states/actions).
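
To make this distinction concrete, here is a small sketch (Python/NumPy, with hypothetical names and a placeholder feature encoding) contrasting a tabular score representation with a simple linear approximation:

```python
import numpy as np

# Tabular: one score per state-action pair (feasible only for small sets).
n_states, n_actions = 10, 4
q_table = np.zeros((n_states, n_actions))   # all scores kept in memory

# Approximate: scores are computed from a parameterized function of state features,
# so only the parameters have to be stored, not one value per state.
n_features = 8
weights = np.zeros((n_features, n_actions))

def features(state):
    """Hypothetical feature vector describing a (possibly huge) state."""
    rng = np.random.default_rng(state)       # placeholder encoding for this sketch
    return rng.standard_normal(n_features)

def approx_scores(state):
    """Approximate scores for all actions in the given state."""
    return features(state) @ weights
```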

On-Policy and Off-Policy Methods#

The agent may know only one policy, which is used for exploration (what to do next to gather more information about the environment?) and at the same time gets optimized to solve the desired task (exploitation). Such methods are said to be on-policy.

Alternatively, the agent may have two policies. One policy controls the agent’s behavior during exploration. The other policy gets optimized and won’t be used before it is good enough. Such methods are said to be off-policy.
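
A minimal sketch of the off-policy setting (all helper names are hypothetical): one function generates the exploratory behavior, a second one represents the policy that gets optimized and later deployed.

```python
import random

def behavior_policy(state, actions):
    """Controls the agent during exploration (here simply random, for illustration)."""
    return random.choice(actions)

def target_policy(state, actions, scores):
    """Gets optimized from the collected data; used only once it is good enough."""
    return max(actions, key=lambda a: scores[(state, a)])

# On-policy methods use one and the same policy for both roles;
# off-policy methods keep the two policies separate, as sketched here.
```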

Action-Values vs. State-Values#

Scoring states and/or actions in value function methods may follow one of two principal schemes:

  • Action-value methods: Score each state-action pair. Given a state, choose the action with the highest score.

  • State-value methods: Score each state. Given a state, choose the action leading to the reachable state with the highest score.

In principle, action-value methods can be transformed into state-value methods and vice versa. But results obtained from the two variants may differ slightly, because different actions (with different scores) may result in the same state (which has only one score) and, on the other hand, one action (with one score) may result in different states (with different scores) at different times (if the environment has random features).
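
The two selection rules can be sketched as follows (a hedged illustration; for the state-value case we assume, for simplicity, a deterministic environment with a known successor function next_state):

```python
def choose_by_action_values(state, actions, q):
    """Action-value scheme: pick the action with the highest score q[(s, a)]."""
    return max(actions, key=lambda a: q[(state, a)])

def choose_by_state_values(state, actions, v, next_state):
    """State-value scheme: pick the action leading to the reachable state
    with the highest score v[s']."""
    return max(actions, key=lambda a: v[next_state(state, a)])
```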

Standard Policies#

There exist two frequently used standard policies for value function methods: the greedy policy and the \(\varepsilon\)-greedy policy.

The greedy policy always chooses the action with the highest score (for action-value methods) or the action leading to the state with the highest score (for state-value methods). There is no dedicated exploration behavior. Decisions are based solely on the information collected so far. Actions with high return in the short run but low return in the long run may be preferred to actions with low short-run but high long-run return, because actions looking good at an early stage will be chosen again and again, while alternatives with better long-run return are never tried.

The \(\varepsilon\)-greedy policy behaves like the greedy policy with probability \(1-\varepsilon\), for some small fixed \(\varepsilon>0\). From time to time (with probability \(\varepsilon\)) it chooses a random action instead. Thus, the \(\varepsilon\)-greedy policy is able to find actions that look unfavorable in the short run but turn out to yield higher return in the long run than the currently higher-scored actions. In this sense, there is explicit exploration.
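
Both standard policies fit into a few lines. The following is a minimal sketch for the action-value case (function and variable names are our own):

```python
import random

def greedy(state, actions, scores):
    """Always choose the currently best-scored action (no dedicated exploration)."""
    return max(actions, key=lambda a: scores[(state, a)])

def epsilon_greedy(state, actions, scores, epsilon=0.1):
    """With probability epsilon choose a random action, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)   # explicit exploration step
    return greedy(state, actions, scores)
```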

In the short run the \(\varepsilon\)-greedy policy yields lower return than the greedy policy, because some actions chosen by the \(\varepsilon\)-greedy policy are simply wrong (they do not help in solving the task). But in the long run the return will be better than for the greedy policy, because the environment gets explored more extensively, yielding better options for solving the desired task.

Consider moving to a new city and looking for a route to walk to the university. If you follow a greedy policy, you settle on the route that looks shortest at first glance (possibly found by a routing algorithm) and never try a different one. If you instead follow an \(\varepsilon\)-greedy policy, from time to time you try new routes. Most of them will be longer, but you may even find routes that are faster than the initial one (for instance, due to fewer traffic lights or simpler, faster, or fewer road crossings). In the long run the \(\varepsilon\)-greedy approach will save you time.

The question whether it is better to spend more time on exploration or to fully exploit current knowledge without further exploration is known as the exploration-exploitation dilemma.