Policy Gradient Method
Read Policy Gradient Methods and also have a look at The Cart Pole Environment project before you start.
Task: Implement the SARSA-based actor-critic method with baseline for the cart pole environment. Represent value function estimates by a two-layer ANN with 30 neurons per layer. For the policy, use an ANN of the same size. Train for about 100 episodes and print the return obtained after each episode.
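As a reminder of what has to be computed in each step, here is a sketch of the update rules. It assumes that "SARSA-based" refers to the one-step bootstrapped target and that the value estimate serves as the baseline. The symbols ($w$ for the value network weights, $\theta$ for the policy network weights, step sizes $\alpha_w$ and $\alpha_\theta$, discount factor $\gamma$) are assumptions and may differ from the notation in Policy Gradient Methods.

```latex
\begin{aligned}
\delta_t &= r_{t+1} + \gamma\,\hat{v}(s_{t+1}, w) - \hat{v}(s_t, w), \\
w &\leftarrow w + \alpha_w\,\delta_t\,\nabla_w \hat{v}(s_t, w), \\
\theta &\leftarrow \theta + \alpha_\theta\,\delta_t\,\nabla_\theta \ln \pi(a_t \mid s_t, \theta).
\end{aligned}
```

If $s_{t+1}$ is a terminal state, $\hat{v}(s_{t+1}, w)$ is taken to be 0.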
Use TensorFlow’s Adam optimizer instead of manually implementing the ANN weight update.
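The following is a minimal sketch of how Adam can take over the weight update, not a solution to the task. It assumes that "two-layer" means two hidden layers, that ReLU activations are acceptable, and that the network maps the four cart pole state components to a single value estimate. The state, the target, and the learning rate are made up for illustration.

```python
import tensorflow as tf

# Assumed architecture: 4 state components in, two hidden layers
# with 30 neurons each, 1 value estimate out.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

state = tf.constant([[0.0, 0.1, -0.05, 0.2]])  # made-up state, batch of one
target = tf.constant([[1.0]])                  # made-up regression target


def loss_fn():
    """Squared error between the network output and the target."""
    return tf.reduce_mean(tf.square(model(state) - target))


loss_before = float(loss_fn())
for _ in range(50):
    # Record the forward pass so TensorFlow can differentiate the loss.
    with tf.GradientTape() as tape:
        loss = loss_fn()
    grads = tape.gradient(loss, model.trainable_variables)
    # Adam performs the actual weight update from the gradients.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
loss_after = float(loss_fn())

print(loss_before, loss_after)
```

Adam minimizes the quantity whose gradient it receives. For an update that should increase a quantity (gradient ascent), pass the gradient of its negative.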
Note that the cart pole environment always yields reward 1, even for the step in which the pole falls down. It may be better to use reward 0 in that case.
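One possible way to apply this adjustment is a small helper like the one below. The function name `adjusted_reward` and the flag `terminated` are hypothetical. The sketch assumes your environment's step function reports whether the episode ended by failure (in Gymnasium this is the `terminated` value returned by `env.step`; older Gym versions return a single `done` flag that also covers time limits).

```python
def adjusted_reward(reward, terminated):
    """Return 0 if the episode ended by failure, else the original reward."""
    return 0.0 if terminated else reward


# Usage with made-up values:
print(adjusted_reward(1.0, False))
print(adjusted_reward(1.0, True))
```

Note that `terminated` is also set when the cart leaves the track, so this helper treats both kinds of failure the same way.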
Solution:
# your solution
Task: After sufficiently long training, run an episode and render the agent’s behavior.
Solution:
# your solution