Examples#

Reinforcement learning has a wide range of applications from simple board games to autonomous robots. The simpler ones, especially board games, are good toy examples for testing and understanding important concepts.

To describe how a task can be solved by reinforcement learning, we have to specify

  • the environment,

  • the agent,

  • the set of states the environment can attain (or the agent’s sensors can record),

  • the set of actions available to the agent,

  • calculation of rewards.

Further, we may specify whether

  • there is an end state with no more valid actions (episodic task) or

  • there is no end state (continuing task).

Information about the environment (state) may be

  • complete (observed state contains all relevant information) or

  • incomplete (observed state does not contain all relevant information).
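The components listed above can be sketched as a tiny environment with a one-step episode. All names (`GuessEnv`, `reset`, `step`) and the concrete task are illustrative assumptions, chosen only to show states, actions, rewards, an end state, and incomplete information in one place:

```python
# Toy environment: the agent guesses a hidden bit; the episode ends
# after one guess (episodic task with an end state).
class GuessEnv:
    def __init__(self, secret: int = 1):
        self.secret = secret      # hidden part of the environment's state
        self.done = False         # True once the end state is reached

    def reset(self) -> int:
        self.done = False
        return 0                  # observed state; incomplete: the secret stays hidden

    def step(self, action: int):
        assert action in (0, 1)   # the set of actions available to the agent
        reward = 1 if action == self.secret else 0  # calculation of rewards
        self.done = True          # end state: no more valid actions
        return 0, reward, self.done

env = GuessEnv()
state = env.reset()
_, reward, done = env.step(1)
```

Because the observed state never reveals `secret`, this is an example of incomplete information in the sense above.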

Board and Card Games in General#

Environment: The board and all other material of the game. In some games other players (humans or AI) are relevant for making decisions. Then they belong to the environment, too. That's especially the case for games where the facial expressions of other players may reveal secret information.

Agent: The computer player.

States: Current board situation and all other relevant information provided by the environment.

Actions: Every allowed move. If the agent does not know the game's rules completely but shall learn them, then moves not allowed by the rules may also belong to the set of actions the agent may take.

Rewards: Depends on the game. In the simplest case the reward is 1 if the action yields immediate victory and 0 otherwise. Other reward functions may also honour moves leading to a situation that is advantageous in some sense. For some games the aim is to collect as many points as possible; such games come with a canonical reward function for reinforcement learning (Carcassonne,…).

Board and card games are episodic (there is an end state).

For some games information about the environment is complete (chess, connect four, Mensch ärgere dich nicht,…). For others information is incomplete (most card games, Scotland Yard,…).

In board and card games there are almost always only finitely many actions and states.
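Connect four, mentioned below as a game with complete information, also illustrates the finite action set: there are at most seven actions, one per non-full column. A minimal sketch (the board encoding is an assumption):

```python
# Connect four board: board[r][c] == 0 means the cell is empty.
# An action is a column choice; a column is playable while its top cell is empty.
ROWS, COLS = 6, 7

def legal_actions(board):
    """Return the list of playable columns (the agent's current action set)."""
    return [c for c in range(COLS) if board[0][c] == 0]

empty = [[0] * COLS for _ in range(ROWS)]
print(legal_actions(empty))  # all 7 columns are playable on an empty board
```

The state set is finite as well, since each of the 42 cells can take only finitely many values.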

Autonomous Driving#

Environment: Real world including pedestrians, other cars, butterflies,…

Agent: Computer/controller driving the car.

States: Everything the agent can observe (sensor data).

Actions: Signals to actors, like braking, steering commands,…

Reward: E.g., -1 for a crash, 1 for reaching the destination,…

Autonomous driving often is a continuing task: there is no end state, and the agent shall work forever.

Information about the environment is incomplete because the environment is too complex for storing its state in a digital computer (resolution of camera images,…).

The set of states is infinite. The set of actions is almost always infinite, too (there is some freedom of modelling here, e.g., discretization).

Grid World#

Grid worlds are discrete and heavily simplified variants of autonomous driving settings. A rectangular grid of cells defines possible locations for the agent or objects. Cells may be of different types (empty, wall,…). The aim of reinforcement learning here is to train an agent that, starting at an arbitrary location, finds a (short) path to some destination cell.


Fig. 76 The agent has to discover the grid world and find a short path to the destination cell.#

The agent has to solve two typical problems in reinforcement learning:

  • discover the environment,

  • find a short path without hitting walls or violating other restrictions.

Grid worlds may be dynamic (e.g., moving walls).

Environment: Finite grid of cells, maybe of different type.

Agent: A robot, for instance.

States: Position of the agent in the grid and type of surrounding cells.

Actions: Step left, right, up or down.

Reward: E.g., -1 for each move, 1 for reaching the destination.

Grid worlds are often used in combination with episodic tasks.

Information about the environment typically is complete.

Action and state sets usually are finite (at least if there are only finitely many cell types).
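The grid world specification above fits in a few lines of code. The concrete layout and the reward values (-1 per move, 1 at the destination) follow the example numbers given above; everything else is an assumed encoding:

```python
# Grid world sketch: '.' empty cell, '#' wall, 'G' destination cell.
GRID = ["..#G",
        ".#..",
        "...."]

# The four actions: step up, down, left, or right.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, action):
    """Return (new_position, reward, done); walls and the border block movement."""
    r, c = pos
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == "#":
        return (r, c), -1, False   # blocked: stay put, still pay the move cost
    if GRID[nr][nc] == "G":
        return (nr, nc), 1, True   # destination reached: episodic end state
    return (nr, nc), -1, False

pos, reward, done = step((0, 1), "right")   # runs into the wall at (0, 2)
```

With finitely many cells and four actions, both the state set and the action set are finite, as noted above.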

Automated warehouses sometimes are organized as grid worlds, allowing for arbitrary placement of goods or shelves in a grid.

Online Advertising#

Targeted online advertising can be modeled as a reinforcement learning problem. Every time a user visits a website, an algorithm decides which ad to show to the user. Ad selection may depend on the individual user's behavior or on group behavior.

Environment: Customer websites and behavior of visiting users.

Agent: Ad selection algorithm.

States: Ads shown on websites, visiting users, users’ click behavior.

Actions: Show ad X to user on website Y.

Reward: E.g., 1 if the user clicks the ad, 10 if the user buys an advertised product.

Here we have a continuing task.

Whether the information available to the agent is complete or incomplete depends on the concrete setting.

Action and state sets are finite, but large.
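A simple ad selection algorithm, shown here only as an illustrative sketch, is epsilon-greedy selection based on observed click rates; the ad names and all statistics are made-up assumptions, and the text above does not prescribe this particular method:

```python
import random

# Observed statistics per ad (illustrative numbers).
clicks = {"ad_a": 30, "ad_b": 5}     # rewards collected: clicks per ad
shows = {"ad_a": 200, "ad_b": 100}   # how often each ad was shown

def select_ad(epsilon: float = 0.1) -> str:
    """Mostly show the ad with the best observed click rate; sometimes explore."""
    if random.random() < epsilon:
        return random.choice(list(clicks))                      # explore a random ad
    return max(clicks, key=lambda a: clicks[a] / shows[a])      # exploit the best ad

ad = select_ad(epsilon=0.0)   # epsilon 0 always exploits: click rate 0.15 beats 0.05
```

The exploration rate `epsilon` matters precisely because this is a continuing task: the agent keeps running and must keep gathering information about new ads and changing user behavior.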