"""
2
Title: Proximal Policy Optimization
3
Author: [Ilias Chrysovergis](https://twitter.com/iliachry)
4
Date created: 2021/06/24
5
Last modified: 2024/03/12
6
Description: Implementation of a Proximal Policy Optimization agent for the CartPole-v1 environment.
7
Accelerator: None
8
"""

"""
## Introduction

This code example solves the CartPole-v1 environment using a Proximal Policy Optimization (PPO) agent.

### CartPole-v1

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The system is controlled by applying a force of +1 or -1 to the cart.
The pendulum starts upright, and the goal is to prevent it from falling over.
A reward of +1 is provided for every timestep that the pole remains upright.
The episode ends when the pole is more than 12 degrees from vertical or the cart moves more than 2.4 units from the center.
After 500 steps the episode is truncated, so the highest return we can get is 500.

[CartPole-v1](https://gymnasium.farama.org/environments/classic_control/cart_pole/)
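
As a quick sanity check of the environment API used throughout this example, the sketch below runs
one episode with a random policy. It is purely illustrative and not part of the agent; it only
demonstrates the observation, reward, and termination signals described above:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, _ = env.reset(seed=0)
done, episode_return = False, 0.0
while not done:
    # Action 0 pushes the cart to the left, action 1 pushes it to the right.
    observation, reward, terminated, truncated, _ = env.step(env.action_space.sample())
    episode_return += reward  # +1 for every step the pole stays upright
    done = terminated or truncated
print(episode_return)  # a random policy typically survives only a few dozen steps
env.close()
```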
25
26
### Proximal Policy Optimization
27
28
PPO is a policy gradient method and can be used for environments with either discrete or continuous action spaces.
29
It trains a stochastic policy in an on-policy way. Also, it utilizes the actor critic method. The actor maps the
30
observation to an action and the critic gives an expectation of the rewards of the agent for the observation given.
31
Firstly, it collects a set of trajectories for each epoch by sampling from the latest version of the stochastic policy.
32
Then, the rewards-to-go and the advantage estimates are computed in order to update the policy and fit the value function.
33
The policy is updated via a stochastic gradient ascent optimizer, while the value function is fitted via some gradient descent algorithm.
34
This procedure is applied for many epochs until the environment is solved.
35
36
![Algorithm](https://i.imgur.com/rd5tda1.png)
37
38
- [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
39
- [OpenAI Spinning Up docs - PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
40
41
### Note
42
43
This code example uses Keras and Tensorflow v2. It is based on the PPO Original Paper,
44
the OpenAI's Spinning Up docs for PPO, and the OpenAI's Spinning Up implementation of PPO using Tensorflow v1.
45
46
[OpenAI Spinning Up Github - PPO](https://github.com/openai/spinningup/blob/master/spinup/algos/tf1/ppo/ppo.py)
47
"""

"""
## Libraries

For this example the following libraries are used:

1. `numpy` for n-dimensional arrays
2. `tensorflow` and `keras` for building the deep RL PPO agent
3. `gymnasium` for the CartPole-v1 environment
4. `scipy.signal` for calculating the discounted cumulative sums of vectors
"""
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
from keras import layers

import numpy as np
import tensorflow as tf
import gymnasium as gym
import scipy.signal

"""
## Functions and class
"""


def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]
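

"""
The `scipy.signal.lfilter` call above is just a vectorized way of computing the discounted sum
`c[t] = x[t] + discount * x[t + 1] + discount**2 * x[t + 2] + ...`. A small illustrative check
(not used by the agent) against that definition:

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])
print(discounted_cumulative_sums(x, 0.5))  # -> [1.75, 1.5, 1.0]
```
"""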


class Buffer:
    # Buffer for storing trajectories
    def __init__(self, observation_dimensions, size, gamma=0.99, lam=0.95):
        # Buffer initialization
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros(size, dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        # Append one step of agent-environment interaction
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        # Finish the trajectory by computing advantage estimates and rewards-to-go
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)

        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]

        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]

        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        self.pointer, self.trajectory_start_index = 0, 0
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer),
            np.std(self.advantage_buffer),
        )
        self.advantage_buffer = (self.advantage_buffer - advantage_mean) / advantage_std
        return (
            self.observation_buffer,
            self.action_buffer,
            self.advantage_buffer,
            self.return_buffer,
            self.logprobability_buffer,
        )
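

"""
`finish_trajectory` implements Generalized Advantage Estimation (GAE), as described in the
Spinning Up docs referenced above. In the notation of the code (a sketch, not executed here):

```python
# One-step TD residuals and their discounted sums, for each step t of a trajectory:
# deltas[t]     = rewards[t] + gamma * values[t + 1] - values[t]
# advantages[t] = deltas[t] + (gamma * lam) * deltas[t + 1] + (gamma * lam) ** 2 * deltas[t + 2] + ...
# returns[t]    = rewards[t] + gamma * rewards[t + 1] + gamma ** 2 * rewards[t + 2] + ...
```

The `last_value` argument bootstraps the value of the final observation when a trajectory is cut
off before reaching a terminal state, and is 0 when the episode actually ended.
"""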


def mlp(x, sizes, activation=keras.activations.tanh, output_activation=None):
    # Build a feedforward neural network
    for size in sizes[:-1]:
        x = layers.Dense(units=size, activation=activation)(x)
    return layers.Dense(units=sizes[-1], activation=output_activation)(x)


def logprobabilities(logits, a):
    # Compute the log-probabilities of taking actions a by using the logits (i.e. the output of the actor)
    logprobabilities_all = keras.ops.log_softmax(logits)
    logprobability = keras.ops.sum(
        keras.ops.one_hot(a, num_actions) * logprobabilities_all, axis=1
    )
    return logprobability
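

"""
`logprobabilities` turns the actor's raw logits into the log-probability of the action that was
actually taken: `log_softmax` normalizes the logits, and multiplying by the one-hot action mask and
summing selects the entry for that action. A toy check with two actions (illustrative only):

```python
import numpy as np

logits = np.array([[2.0, 0.0]])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(np.log(probs[0, 0]))  # log-probability of action 0, about -0.127
```
"""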


seed_generator = keras.random.SeedGenerator(1337)


# Sample action from actor
@tf.function
def sample_action(observation):
    logits = actor(observation)
    action = keras.ops.squeeze(
        keras.random.categorical(logits, 1, seed=seed_generator), axis=1
    )
    return logits, action


# Train the policy by maximizing the PPO-Clip objective
@tf.function
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        ratio = keras.ops.exp(
            logprobabilities(actor(observation_buffer), action_buffer)
            - logprobability_buffer
        )
        min_advantage = keras.ops.where(
            advantage_buffer > 0,
            (1 + clip_ratio) * advantage_buffer,
            (1 - clip_ratio) * advantage_buffer,
        )

        policy_loss = -keras.ops.mean(
            keras.ops.minimum(ratio * advantage_buffer, min_advantage)
        )
    policy_grads = tape.gradient(policy_loss, actor.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor.trainable_variables))

    kl = keras.ops.mean(
        logprobability_buffer
        - logprobabilities(actor(observation_buffer), action_buffer)
    )
    kl = keras.ops.sum(kl)
    return kl
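

"""
A note on `train_policy`: the `keras.ops.where` / `keras.ops.minimum` pair above is an algebraically
equivalent way of writing the clipped surrogate objective. An explicit-clip variant (a sketch reusing
the same names as inside `train_policy`, shown only for comparison) would look like this:

```python
policy_loss = -keras.ops.mean(
    keras.ops.minimum(
        ratio * advantage_buffer,
        keras.ops.clip(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantage_buffer,
    )
)
```

The returned `kl` is a sample estimate of the KL divergence between the old and updated policies,
used only for early stopping of the policy updates.
"""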


# Train the value function by regression on mean-squared error
@tf.function
def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = keras.ops.mean((return_buffer - critic(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic.trainable_variables))


"""
## Hyperparameters
"""

# Hyperparameters of the PPO algorithm
steps_per_epoch = 4000
epochs = 30
gamma = 0.99
clip_ratio = 0.2
policy_learning_rate = 3e-4
value_function_learning_rate = 1e-3
train_policy_iterations = 80
train_value_iterations = 80
lam = 0.97
target_kl = 0.01
hidden_sizes = (64, 64)

# Set to True to render the environment while training
render = False
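
"""
Note that with `gymnasium`, `env.render()` only produces output if the environment was created with a
render mode. If you set `render = True`, also change the `gym.make` call below accordingly, for
example (an illustrative variant, not the default used in this example):

```python
env = gym.make("CartPole-v1", render_mode="human")
```
"""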

"""
## Initializations
"""

# Initialize the environment and get the dimensionality of the
# observation space and the number of possible actions
env = gym.make("CartPole-v1")
observation_dimensions = env.observation_space.shape[0]
num_actions = env.action_space.n

# Initialize the buffer
buffer = Buffer(observation_dimensions, steps_per_epoch)

# Initialize the actor and the critic as keras models
observation_input = keras.Input(shape=(observation_dimensions,), dtype="float32")
logits = mlp(observation_input, list(hidden_sizes) + [num_actions])
actor = keras.Model(inputs=observation_input, outputs=logits)
value = keras.ops.squeeze(mlp(observation_input, list(hidden_sizes) + [1]), axis=1)
critic = keras.Model(inputs=observation_input, outputs=value)

# Initialize the policy and the value function optimizers
policy_optimizer = keras.optimizers.Adam(learning_rate=policy_learning_rate)
value_optimizer = keras.optimizers.Adam(learning_rate=value_function_learning_rate)

# Initialize the observation, episode return and episode length
observation, _ = env.reset()
episode_return, episode_length = 0, 0

"""
## Train
"""
# Iterate over the number of epochs
for epoch in range(epochs):
    # Initialize the sum of the returns, lengths and number of episodes for each epoch
    sum_return = 0
    sum_length = 0
    num_episodes = 0

    # Iterate over the steps of each epoch
    for t in range(steps_per_epoch):
        if render:
            env.render()

        # Get the logits and action, and take one step in the environment
        observation = observation.reshape(1, -1)
        logits, action = sample_action(observation)
        observation_new, reward, done, _, _ = env.step(action[0].numpy())
        episode_return += reward
        episode_length += 1

        # Get the value and log-probability of the action
        value_t = critic(observation)
        logprobability_t = logprobabilities(logits, action)

        # Store obs, act, rew, v_t, logp_pi_t
        buffer.store(observation, action, reward, value_t, logprobability_t)

        # Update the observation
        observation = observation_new

        # Finish the trajectory if a terminal state is reached
        terminal = done
        if terminal or (t == steps_per_epoch - 1):
            last_value = 0 if done else critic(observation.reshape(1, -1))
            buffer.finish_trajectory(last_value)
            sum_return += episode_return
            sum_length += episode_length
            num_episodes += 1
            observation, _ = env.reset()
            episode_return, episode_length = 0, 0

    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

    # Update the policy and implement early stopping using KL divergence
    for _ in range(train_policy_iterations):
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        if kl > 1.5 * target_kl:
            # Early Stopping
            break

    # Update the value function
    for _ in range(train_value_iterations):
        train_value_function(observation_buffer, return_buffer)

    # Print mean return and length for each epoch
    print(
        f" Epoch: {epoch + 1}. Mean Return: {sum_return / num_episodes}. Mean Length: {sum_length / num_episodes}"
    )
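
"""
Once training has made progress, you can watch the trained agent with a greedy evaluation rollout
like the sketch below (illustrative only; it simply picks the most probable action from the trained
`actor` and assumes a display is available for the "human" render mode):

```python
eval_env = gym.make("CartPole-v1", render_mode="human")
obs, _ = eval_env.reset()
done, total_reward = False, 0.0
while not done:
    # Greedy action: take the argmax of the actor's logits instead of sampling.
    action = int(keras.ops.argmax(actor(obs.reshape(1, -1)), axis=1)[0])
    obs, reward, terminated, truncated, _ = eval_env.step(action)
    total_reward += reward
    done = terminated or truncated
print("Greedy episode return:", total_reward)
eval_env.close()
```
"""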

"""
## Visualizations

Before training:

![Imgur](https://i.imgur.com/rKXDoMC.gif)

After 8 epochs of training:

![Imgur](https://i.imgur.com/M0FbhF0.gif)

After 20 epochs of training:

![Imgur](https://i.imgur.com/tKhTEaF.gif)
"""