merve HF staff commited on
Commit
56f105f
·
1 Parent(s): c98dba1

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +2 -2
README.md CHANGED
@@ -12,6 +12,8 @@ This repo contains the model and the notebook [to this Keras example on PPO for
12
 
13
  Full credits to: Ilias Chrysovergis
14
 
 
 
15
  ## Background Information
16
  ### CartPole-v0
17
  A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. After 200 steps the episode ends. Thus, the highest return we can get is equal to 200.
@@ -19,5 +21,3 @@ A pole is attached by an un-actuated joint to a cart, which moves along a fricti
19
  ### Proximal Policy Optimization
20
  PPO is a policy gradient method and can be used for environments with either discrete or continuous action spaces. It trains a stochastic policy in an on-policy way. Also, it utilizes the actor critic method. The actor maps the observation to an action and the critic gives an expectation of the rewards of the agent for the observation given. Firstly, it collects a set of trajectories for each epoch by sampling from the latest version of the stochastic policy. Then, the rewards-to-go and the advantage estimates are computed in order to update the policy and fit the value function. The policy is updated via a stochastic gradient ascent optimizer, while the value function is fitted via some gradient descent algorithm. This procedure is applied for many epochs until the environment is solved.
21
 
22
-
23
- ![cartpole_gif](https://i.imgur.com/tKhTEaF.gif)
 
12
 
13
  Full credits to: Ilias Chrysovergis
14
 
15
+ ![cartpole_gif](https://i.imgur.com/tKhTEaF.gif)
16
+
17
  ## Background Information
18
  ### CartPole-v0
19
  A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. After 200 steps the episode ends. Thus, the highest return we can get is equal to 200.
 
21
  ### Proximal Policy Optimization
22
  PPO is a policy gradient method and can be used for environments with either discrete or continuous action spaces. It trains a stochastic policy in an on-policy way. Also, it utilizes the actor critic method. The actor maps the observation to an action and the critic gives an expectation of the rewards of the agent for the observation given. Firstly, it collects a set of trajectories for each epoch by sampling from the latest version of the stochastic policy. Then, the rewards-to-go and the advantage estimates are computed in order to update the policy and fit the value function. The policy is updated via a stochastic gradient ascent optimizer, while the value function is fitted via some gradient descent algorithm. This procedure is applied for many epochs until the environment is solved.
23