Policy Gradient Method

Read Policy Gradient Methods and also have a look at The Cart Pole Environment project before you start.

Task: Implement the SARSA-based actor-critic method with baseline for the cart pole environment. Represent the value function estimate by a two-layer ANN with 30 neurons per layer. For the policy, use an ANN of the same size. Train for about 100 episodes and print the obtained return after each episode.

Use TensorFlow’s Adam optimizer instead of manually implementing the ANN weight update.

Note that the cart pole environment always yields reward 1, even if the pole has fallen down. It may be better to use a reward of 0 in that case.

Solution:

# your solution
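
One possible sketch is given below. It assumes Gymnasium's `CartPole-v1` environment and TensorFlow 2.x with Keras; the learning rates, the `tanh` activations, and the helper `make_net` are choices made for this sketch, not prescribed by the task.

```python
# Sketch only: SARSA-style one-step actor-critic with the state value as baseline,
# assuming Gymnasium's CartPole-v1 and TensorFlow 2.x.
import gymnasium as gym
import numpy as np
import tensorflow as tf

gamma = 0.99
env = gym.make("CartPole-v1")
n_actions = env.action_space.n
obs_dim = env.observation_space.shape[0]

def make_net(n_out, activation=None):
    # two hidden layers with 30 neurons each, as required by the task
    return tf.keras.Sequential([
        tf.keras.layers.Dense(30, activation="tanh", input_shape=(obs_dim,)),
        tf.keras.layers.Dense(30, activation="tanh"),
        tf.keras.layers.Dense(n_out, activation=activation),
    ])

critic = make_net(1)                     # state-value estimate V(s)
actor = make_net(n_actions, "softmax")   # action probabilities pi(a|s)

critic_opt = tf.keras.optimizers.Adam(1e-3)   # Adam instead of manual weight updates
actor_opt = tf.keras.optimizers.Adam(1e-3)

def train_step(state, action, reward, next_state, terminated):
    s = tf.constant(state[None, :], dtype=tf.float32)
    s_next = tf.constant(next_state[None, :], dtype=tf.float32)
    with tf.GradientTape(persistent=True) as tape:
        v = critic(s)[0, 0]
        # semi-gradient TD(0) target: do not bootstrap past a terminal state
        v_next = 0.0 if terminated else critic(s_next)[0, 0]
        delta = reward + gamma * tf.stop_gradient(v_next) - v   # TD error
        critic_loss = tf.square(delta)
        log_prob = tf.math.log(actor(s)[0, action] + 1e-8)
        # policy gradient step; the TD error acts as the baselined advantage
        actor_loss = -tf.stop_gradient(delta) * log_prob
    critic_opt.apply_gradients(
        zip(tape.gradient(critic_loss, critic.trainable_variables),
            critic.trainable_variables))
    actor_opt.apply_gradients(
        zip(tape.gradient(actor_loss, actor.trainable_variables),
            actor.trainable_variables))
    del tape

for episode in range(100):
    state, _ = env.reset()
    done, ep_return = False, 0.0
    while not done:
        probs = actor(state[None, :].astype(np.float32))
        action = int(tf.random.categorical(tf.math.log(probs), 1)[0, 0])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ep_return += reward
        if terminated:
            reward = 0.0   # pole fell (or cart left the track): replace reward 1 by 0
        train_step(state, action, float(reward), next_state, terminated)
        state = next_state
    print(f"episode {episode + 1}: return = {ep_return}")
```

Sampling actions from the softmax output keeps the policy exploratory during training; the `tf.stop_gradient` calls ensure the critic update is a semi-gradient TD step and the actor update only differentiates through the log-probability.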

Task: After sufficiently long training, run an episode and render the agent's behavior.

Solution:

# your solution
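
A possible sketch, again assuming Gymnasium: it reuses the trained `actor` network from the sketch above, opens a fresh environment with `render_mode="human"`, and acts greedily with respect to the learned policy (acting greedily here is a choice of this sketch; sampling would also work).

```python
# Sketch only: render one episode with the trained actor, assuming Gymnasium.
import numpy as np
import gymnasium as gym

render_env = gym.make("CartPole-v1", render_mode="human")
state, _ = render_env.reset()
done, ep_return = False, 0.0
while not done:
    probs = actor(state[None, :].astype(np.float32)).numpy()[0]
    action = int(np.argmax(probs))   # greedy action w.r.t. the learned policy
    state, reward, terminated, truncated, _ = render_env.step(action)
    done = terminated or truncated
    ep_return += reward
print("return of rendered episode:", ep_return)
render_env.close()
```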