Principal Approach#

Tabular value function methods are well suited for reinforcement learning tasks with few actions and not too many different states. The table assigning values (expected returns) to all state-action pairs has to fit into memory. But what about tasks with very large, possibly continuous or otherwise infinite action and/or state spaces? There we cannot explicitly compute and store estimates of the expected return for all state-action pairs.

Approximate value function methods use parametrized functions to represent value functions. Instead of estimating the expected return for each state-action pair individually, we compute parameter values such that the corresponding function yields a good estimate of the expected return for all relevant state-action pairs. This way the memory consumption for storing a value function is independent of the sizes of the state and action spaces.
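In symbols (using the notation common in the literature, which is not introduced in the text above): for a policy \(\pi\) we look for a weight vector \(\mathbf{w} \in \mathbb{R}^d\) such that

\[
\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a) \quad \text{for all relevant state-action pairs } (s, a),
\]

where \(d\) is much smaller than the number of state-action pairs.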

All kinds of function approximators can be used here: linear or polynomial functions, linear combinations of radial basis functions, decision trees, artificial neural networks (ANNs), and so on. In applications the main focus is on ANNs. Thus, instead of ‘parameters’ we will use the word ‘weights’ to be consistent with the literature.
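As a minimal sketch (the class and method names are ours, and the scalar state together with polynomial features is just one illustrative assumption), a linear action-value approximator could look like this:

```python
import numpy as np

class LinearQ:
    """Minimal linear action-value approximator: q(s, a, w) = w[a] . x(s).

    Assumes a scalar continuous state and a small discrete action set; the
    polynomial features are only one of many possible choices."""

    def __init__(self, n_actions, degree=3):
        self.degree = degree
        # one small weight vector per action -- the memory needed is
        # independent of the (possibly infinite) number of states
        self.w = np.zeros((n_actions, degree + 1))

    def features(self, state):
        return np.array([state ** k for k in range(self.degree + 1)])

    def value(self, state, action):
        return float(self.w[action] @ self.features(state))

    def update(self, state, action, target, alpha=0.1):
        # stochastic gradient step moving the estimate toward the target,
        # e.g. an observed return or a bootstrapped estimate
        x = self.features(state)
        self.w[action] += alpha * (target - self.w[action] @ x) * x
```

Swapping the feature map or replacing the linear combination by an ANN changes only the `features`, `value`, and `update` methods; the overall idea stays the same.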

Challenges with Large State Spaces#

Large state spaces often induce further problems beyond memory consumption. Especially for continuous state spaces, the values of neighboring states heavily influence each other. On the one hand, we therefore cannot handle states independently. On the other hand, this allows us to deduce a state’s value from the values of its neighbors. Parametrized value functions exploit this generalization property of state values, because there are far fewer parameters than states. Each parameter influences many states, so states cannot be assigned arbitrary individual values.
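A tiny numerical illustration of this generalization (the radial-basis features, centers, and step size below are arbitrary choices made up for this example): a single update of the value estimate at one state also shifts the estimates of nearby states, while distant states remain almost unaffected.

```python
import numpy as np

# hypothetical radial-basis features on the state interval [0, 1]
centers = np.linspace(0.0, 1.0, 10)
width = 0.1

def phi(state):
    return np.exp(-((state - centers) ** 2) / (2 * width ** 2))

w = np.zeros_like(centers)

def v(state):
    return float(w @ phi(state))

# one gradient step toward a target of 1.0 at state 0.50 ...
alpha, target, s = 0.5, 1.0, 0.50
w += alpha * (target - v(s)) * phi(s)

# ... also raises the estimate at the neighboring state 0.52,
# while the estimate at the distant state 0.95 barely changes
print(v(0.50), v(0.52), v(0.95))
```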

Controlling the learning process, that is, exploring the environment and estimating state or state-action values, is much more difficult for large state and action spaces. The agent will explore only a tiny fraction of the whole environment. Thus, we have to take care that this tiny fraction is as relevant as possible for ‘understanding’ the environment’s structure.

Algorithm Structure#

Algorithms based on approximate value functions follow the same structure as algorithms based on tabular value functions. Starting with an initial policy, the policy’s value function is estimated by exploring the environment (policy evaluation). Then, based on the value function estimate, a new, improved policy can be derived, an \(\varepsilon\)-greedy policy, for instance (policy improvement).

As with tabular methods, policy evaluation and policy improvement can be combined in many different ways to obtain efficient algorithms, and both on-policy and off-policy approaches are available. One possible on-policy combination is sketched below.
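The following rough sketch shows an on-policy control loop in the style of semi-gradient SARSA. It reuses the hypothetical `LinearQ` approximator from above and assumes an environment with a Gymnasium-style `reset()`/`step()` interface; none of these names come from the text itself.

```python
import numpy as np

def epsilon_greedy(q, state, n_actions, epsilon=0.1):
    """Policy improvement step: act greedily with respect to the current
    value estimate, except for an epsilon fraction of random exploration."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax([q.value(state, a) for a in range(n_actions)]))

def run_episode(env, q, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode of an on-policy control loop (semi-gradient SARSA style).

    Assumes `env` offers `reset()` and `step(action)` in the spirit of the
    Gymnasium interface and `q` is an approximator like the LinearQ sketch
    above (providing `value` and `update` methods)."""
    state, _ = env.reset()
    action = epsilon_greedy(q, state, n_actions, epsilon)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        if done:
            target = reward  # no bootstrapping at episode end
        else:
            next_action = epsilon_greedy(q, next_state, n_actions, epsilon)
            # policy evaluation: bootstrap target from the current estimate
            target = reward + gamma * q.value(next_state, next_action)
        q.update(state, action, target, alpha)
        if not done:
            state, action = next_state, next_action
```

Here the bootstrapped target plays the role of policy evaluation, while acting \(\varepsilon\)-greedily with respect to the current estimate is the policy improvement step.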