Dynamic Programming#

Read Dynamic Programming before you start.

In this project (and all others in this series) we use the grid world simulator Frozen Lake, originally developed by OpenAI in their gym package (until they removed the ‘non’ in ‘non-profit organization’) and now maintained by the non-profit Farama Foundation in their gymnasium package. See Announcing The Farama Foundation for more background information on this transition.

Task: Read about Gymnasium and the Frozen Lake simulator. Then install Gymnasium.
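A minimal sanity check after installing might look as follows (assuming Gymnasium was installed with the toy-text extra, e.g. pip install "gymnasium[toy-text]", which includes Frozen Lake):

```python
# Quick check that Gymnasium and the Frozen Lake environment are available.
import gymnasium as gym

print(gym.__version__)
print(gym.spec("FrozenLake8x8-v1"))  # raises an error if the environment is not registered
```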

The Environment#

Task: Create the standard 8-by-8 Frozen Lake environment object and render the environment. Use is_slippery=True.

Solution:

# your solution
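One possible way to create and render the environment (a sketch; the ANSI render mode is chosen here so the map is printed as text):

```python
import gymnasium as gym

# Standard 8-by-8 map with stochastic ("slippery") transitions.
env = gym.make("FrozenLake8x8-v1", is_slippery=True, render_mode="ansi")
env.reset()
print(env.render())
```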

Task: Write your own render function to get a more pleasant rendering. It should take the environment as its argument. The frozen lake map is contained as a list of lists (rows) in env.desc. Note that we want to test dynamic programming, so we do not care about a starting position: we will solve the problem for all starting positions at once.

# your solution
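A minimal sketch of such a render function (the map characters in env.desc are byte strings, hence the decoding; the function name render_lake is arbitrary):

```python
def render_lake(env):
    """Print the lake map, one row per line, without any agent position."""
    desc = env.unwrapped.desc  # map as rows of byte characters
    for row in desc:
        print("".join(cell.decode("utf-8") for cell in row))

render_lake(env)
```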

Task: Get the environment dynamics \(p(s',r,s,a)\) for all arguments (a four-dimensional array). The relevant information is in env.P. Check that \(p(0,0,0,0)=2/3\); otherwise your solution is not correct.

# your solution
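One way to build the four-dimensional array from env.P, which maps a state and an action to a list of (probability, next state, reward, terminated) tuples. The index order p[s', r, s, a] and the variable names are assumptions:

```python
import numpy as np

n_states = env.observation_space.n
n_actions = env.action_space.n
reward_values = [0.0, 1.0]  # the only rewards occurring in Frozen Lake

# p[s', r_index, s, a] holds p(s', r, s, a)
p = np.zeros((n_states, len(reward_values), n_states, n_actions))
for s in range(n_states):
    for a in range(n_actions):
        for prob, s_next, r, _ in env.unwrapped.P[s][a]:
            p[s_next, reward_values.index(r), s, a] += prob

print(p[0, 0, 0, 0])  # should be 2/3 (up to floating point) for the slippery lake
```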

Value Iteration#

Task: Implement asynchronous value iteration to get the optimal state-value function \(v_\ast\). Use \(\gamma=1\). Show values in an 8-by-8 grid.

# your solution
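A sketch of asynchronous (in-place) value iteration over the array p built above; the tolerance and the function name are assumptions:

```python
def value_iteration(p, reward_values, gamma=1.0, tol=1e-10):
    n_states, n_rewards, _, n_actions = p.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # q(s, a) = sum over s', r of p(s', r, s, a) * (r + gamma * v(s'))
            q = [sum(p[s2, ri, s, a] * (reward_values[ri] + gamma * v[s2])
                     for s2 in range(n_states) for ri in range(n_rewards))
                 for a in range(n_actions)]
            v_new = max(q)
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new  # updating in place makes the sweep asynchronous
        if delta < tol:
            break
    return v

v_star = value_iteration(p, reward_values, gamma=1.0)
print(np.round(v_star.reshape(8, 8), 3))
```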

Task: From the optimal state values, compute the optimal action values.

# your solution
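Computing action values from the optimal state values is one application of the Bellman equation, \(q_\ast(s,a)=\sum_{s',r} p(s',r,s,a)\,(r+\gamma v_\ast(s'))\). A sketch, with assumed names:

```python
def action_values(v, p, reward_values, gamma=1.0):
    """q(s, a) from state values v via the Bellman equation."""
    n_states, n_rewards, _, n_actions = p.shape
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q[s, a] = sum(p[s2, ri, s, a] * (reward_values[ri] + gamma * v[s2])
                          for s2 in range(n_states) for ri in range(n_rewards))
    return q

q_star = action_values(v_star, p, reward_values, gamma=1.0)
```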

Task: Get an optimal policy from the optimal action values.

# your solution
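A policy that is greedy with respect to the optimal action values is optimal, so one line suffices (a sketch):

```python
# One (deterministic) optimal policy: pick a greedy action in every state.
policy = q_star.argmax(axis=1)
```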

Task: Visualize the policy.

# your solution
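One possible visualization prints an arrow per cell on top of the lake map; the action encoding 0=left, 1=down, 2=right, 3=up follows the Frozen Lake documentation, and holes and the goal are kept as letters (a sketch):

```python
def render_policy(env, policy):
    """Print one arrow per cell; keep holes (H) and the goal (G) as letters."""
    arrows = {0: "<", 1: "v", 2: ">", 3: "^"}
    desc = env.unwrapped.desc
    n_cols = desc.shape[1]
    for i, row in enumerate(desc):
        line = ""
        for j, cell in enumerate(row):
            c = cell.decode("utf-8")
            line += c if c in "HG" else arrows[policy[i * n_cols + j]]
        print(line)

render_policy(env, policy)
```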

Task: Run your code with the following parameters and explain what you see (optimal value function, optimal policy):

  • is_slippery=False, \(\gamma=0.5\)

  • is_slippery=False, \(\gamma=0\)

  • is_slippery=False, \(\gamma=1\)

  • is_slippery=True, \(\gamma=0.5\)

  • is_slippery=True, \(\gamma=0\)

  • is_slippery=True, \(\gamma=1\)

# your answer

Policy Iteration#

Task: Implement policy iteration for state values. Use a randomly chosen deterministic policy as the initial policy.

# your solution
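A sketch of policy iteration built from iterative policy evaluation and greedy improvement, starting from a randomly drawn deterministic policy. The seed, tolerance, and function names are assumptions, and action_values is the helper sketched above:

```python
def policy_evaluation(policy, p, reward_values, gamma=1.0, tol=1e-10):
    """Iterative (in-place) evaluation of a deterministic policy."""
    n_states, n_rewards, _, _ = p.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            v_new = sum(p[s2, ri, s, a] * (reward_values[ri] + gamma * v[s2])
                        for s2 in range(n_states) for ri in range(n_rewards))
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < tol:
            break
    return v

def policy_iteration(p, reward_values, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n_states, _, _, n_actions = p.shape
    policy = rng.integers(n_actions, size=n_states)  # deterministic policy drawn at random
    while True:
        v = policy_evaluation(policy, p, reward_values, gamma)
        new_policy = action_values(v, p, reward_values, gamma).argmax(axis=1)
        if np.array_equal(new_policy, policy):  # policy stable: done
            return policy, v
        policy = new_policy

policy_pi, v_pi = policy_iteration(p, reward_values, gamma=1.0)
print(np.round(v_pi.reshape(8, 8), 3))
```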