Markov Decision Processes

Before solving these exercises, you should have read Markov Decision Processes.

We consider a tiny 3-by-3 grid world.

Figure: the 3-by-3 grid world, cells numbered 1 to 9 row by row, with cell 5 in the center.

The state contains only the agent's position. The actions are 'go left', 'go right', 'go up', and 'go down'. Invalid actions (those that would move the agent off the grid) do not change the state. Moving onto cell 5 yields reward 1 and places the agent uniformly at random on cell 2, 4, 6, or 8. All other rewards are 0.
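For concreteness, here is a minimal Python sketch of these dynamics. The cell numbering (1 to 9, row by row, cell 5 in the center) is taken from the figure; the uniform teleport probabilities are an assumption, made explicit in the comments.

import random

# Cells are numbered 1..9, row by row:
#   1 2 3
#   4 5 6
#   7 8 9
MOVES = {"left": -1, "right": +1, "up": -3, "down": +3}

def is_valid(s, a):
    """An action is invalid if it would move the agent off the grid."""
    if a == "left":
        return s % 3 != 1
    if a == "right":
        return s % 3 != 0
    if a == "up":
        return s > 3
    return s < 7  # "down"

def step(s, a):
    """One environment step; returns (next_state, reward)."""
    s_next = s + MOVES[a] if is_valid(s, a) else s
    if s_next == 5:
        # Entering cell 5 pays reward 1 and teleports the agent
        # (uniformly, by assumption) to cell 2, 4, 6 or 8.
        return random.choice([2, 4, 6, 8]), 1
    return s_next, 0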

Environment

Task: How many different states does this grid world have? Is moving in this grid world and collecting as much reward as possible an episodic or a continuing task? Is the task stationary or non-stationary?

Solution:

# your answer

Task: Write down the environment dynamics \(p(s', r \mid s, a)\) for the grid world defined above.

Solution:

# your answer
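As a format check (not the full tabulation): with the four-argument dynamics \(p(s', r \mid s, a)\), a deterministic move has a single entry of probability 1, for example \(p(2, 0 \mid 1, \text{right}) = 1\), while a move onto cell 5 spreads probability over the four teleport targets, for example \(p(s', 1 \mid 2, \text{down}) = \tfrac{1}{4}\) for each \(s' \in \{2, 4, 6, 8\}\) (assuming the cell numbering above).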

Policies

Task: Write down the uniformly random policy for the grid world.

Solution:

# your answer
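As a reminder of the definition: a uniformly random policy puts probability \(\pi(a \mid s) = \frac{1}{|\mathcal{A}(s)|}\) on each available action; with the four actions above this means \(\frac{1}{4}\) in every state.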

Task: Find a deterministic policy that maximizes the return (with discount factor \(\gamma < 1\)).

Solution:

# your answer

Task: Are there policies that maximize return for \(\gamma=1\), but not for \(\gamma<1\)?

Solution:

# your answer

Bellman Equations

Task: Write down the Bellman equations for the state values of the deterministic policy that cycles 1→2→3→6→9→8→7→4→5. Solve the equations for different values of \(\gamma\) (try \(\gamma = 1\), too). Then calculate the action values from the state values.

Solution:

# your answer
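One way to check your hand-derived solution: for a fixed policy the Bellman equations are linear, \(v_\pi = r_\pi + \gamma P_\pi v_\pi\), so for \(\gamma < 1\) they can be solved directly as \((I - \gamma P_\pi) v_\pi = r_\pi\). Below is a sketch for the cyclic policy above, using the cell numbering assumed earlier. Note that state 5 is never actually occupied (entering it teleports the agent), so its row of \(P_\pi\) is a convention here, and at \(\gamma = 1\) the matrix \(I - P_\pi\) becomes singular, consistent with the diverging return.

import numpy as np

# Successor of each state under the policy 1→2→3→6→9→8→7→4→5.
succ = {1: 2, 2: 3, 3: 6, 6: 9, 9: 8, 8: 7, 7: 4, 4: 5}

gamma = 0.9
P = np.zeros((9, 9))  # state-transition matrix under the policy
r = np.zeros(9)       # expected one-step reward in each state
for s, s_next in succ.items():
    if s_next == 5:
        # Moving onto cell 5: reward 1, uniform teleport to 2, 4, 6, 8.
        r[s - 1] = 1.0
        for t in (2, 4, 6, 8):
            P[s - 1, t - 1] = 0.25
    else:
        P[s - 1, s_next - 1] = 1.0
P[4, 4] = 1.0  # state 5 is unreachable; give its row a harmless self-loop

v = np.linalg.solve(np.eye(9) - gamma * P, r)
print({s: round(v[s - 1], 3) for s in range(1, 10)})

The action values then follow from \(q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_\pi(s')\bigr]\).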

Task: Calculate the state-value function for your return-maximizing policy from above. Prove that this policy is indeed optimal.

Solution:

# your answer
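A proof should argue via the Bellman optimality equation, \(v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]\): show that your policy's value function satisfies it. As a purely numerical sanity check (same assumed cell numbering as before), a few lines of value iteration recover \(v_*\) and a greedy policy to compare against:

ACTIONS = {"left": -1, "right": +1, "up": -3, "down": +3}

def outcomes(s, a):
    """All (probability, next_state, reward) triples for action a in cell s."""
    invalid = ((a == "left" and s % 3 == 1) or (a == "right" and s % 3 == 0)
               or (a == "up" and s <= 3) or (a == "down" and s >= 7))
    s_next = s if invalid else s + ACTIONS[a]
    if s_next == 5:
        # Entering cell 5: reward 1, uniform teleport to 2, 4, 6 or 8.
        return [(0.25, t, 1.0) for t in (2, 4, 6, 8)]
    return [(1.0, s_next, 0.0)]

def backup(s, a, v, gamma):
    """Expected one-step return of action a in state s under values v."""
    return sum(p * (rew + gamma * v[t]) for p, t, rew in outcomes(s, a))

gamma = 0.9
v = [0.0] * 10  # v[1..9]; index 0 unused
for _ in range(1000):  # value iteration: v(s) <- max_a E[R + gamma * v(S')]
    v = [0.0] + [max(backup(s, a, v, gamma) for a in ACTIONS) for s in range(1, 10)]

greedy = {s: max(ACTIONS, key=lambda a: backup(s, a, v, gamma)) for s in range(1, 10)}
print(greedy)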