# GitHub Repository: keras-team/keras-io
# Path: blob/master/examples/rl/ppo_cartpole.py
"""
Title: Proximal Policy Optimization
Author: [Ilias Chrysovergis](https://twitter.com/iliachry)
Date created: 2021/06/24
Last modified: 2024/03/12
Description: Implementation of a Proximal Policy Optimization agent for the CartPole-v1 environment.
Accelerator: None
"""

"""
## Introduction

This code example solves the CartPole-v1 environment using a Proximal Policy Optimization (PPO) agent.

### CartPole-v1

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The system is controlled by applying a force of +1 or -1 to the cart.
The pendulum starts upright, and the goal is to prevent it from falling over.
A reward of +1 is provided for every timestep that the pole remains upright.
The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.
The episode is truncated after 500 steps. Thus, the highest return we can get is equal to 500.

[CartPole-v1](https://gymnasium.farama.org/environments/classic_control/cart_pole/)

### Proximal Policy Optimization

PPO is a policy gradient method and can be used for environments with either discrete or continuous action spaces.
It trains a stochastic policy in an on-policy way and uses the actor-critic method: the actor maps an
observation to an action, and the critic estimates the expected return of the agent from that observation.
In each epoch, it first collects a set of trajectories by sampling from the latest version of the stochastic policy.
Then, the rewards-to-go and the advantage estimates are computed in order to update the policy and fit the value function.
The policy is updated by stochastic gradient ascent on the PPO objective, while the value function is fitted by gradient descent on a regression loss.
This procedure is repeated for many epochs until the environment is solved.
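
To make the policy update more concrete, the clipped objective (PPO-Clip) from the paper linked below can be evaluated for a single sample as follows.
This is only a minimal pure-Python sketch, not the training code of this example; the helper name `ppo_clip_term` and the numbers are made up for illustration.

```python
def ppo_clip_term(ratio, advantage, clip_ratio=0.2):
    # Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    # where r is the probability ratio between the new and the old policy.
    clipped_ratio = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the ratio is capped at 1 + clip_ratio.
print(ppo_clip_term(ratio=1.5, advantage=2.0))
# Negative advantage: the gain from lowering the ratio is capped at 1 - clip_ratio.
print(ppo_clip_term(ratio=0.5, advantage=-1.0))
# Inside the clipping range the objective is simply ratio * advantage.
print(ppo_clip_term(ratio=1.1, advantage=2.0))
```

The clipping removes the incentive to move the new policy far away from the policy that collected the data.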

![Algorithm](https://i.imgur.com/rd5tda1.png)

- [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
- [OpenAI Spinning Up docs - PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)

### Note

This code example uses Keras and TensorFlow 2. It is based on the original PPO paper,
OpenAI's Spinning Up docs for PPO, and OpenAI's Spinning Up implementation of PPO using TensorFlow 1.

[OpenAI Spinning Up GitHub - PPO](https://github.com/openai/spinningup/blob/master/spinup/algos/tf1/ppo/ppo.py)
"""

"""
## Libraries

For this example the following libraries are used:

1. `numpy` for n-dimensional arrays
2. `tensorflow` and `keras` for building the deep RL PPO agent
3. `gymnasium` for getting everything we need about the environment
4. `scipy.signal` for calculating the discounted cumulative sums of vectors
"""
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
from keras import layers

import numpy as np
import tensorflow as tf
import gymnasium as gym
import scipy.signal

"""
## Functions and class
"""


def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]
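
"""
The filter call in `discounted_cumulative_sums` is compact but not easy to read.
As a rough illustration, the same recursion, `out[t] = x[t] + discount * out[t + 1]`,
can be written as an explicit loop. The helper name `discounted_cumulative_sums_loop`
is hypothetical and the function is not used by the rest of this example.

```python
def discounted_cumulative_sums_loop(x, discount):
    # Walk backwards so that each entry can reuse the sum of everything after it
    out = [0.0] * len(x)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out


print(discounted_cumulative_sums_loop([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

With rewards as input and `gamma` as discount, this gives the rewards-to-go.
"""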


class Buffer:
    # Buffer for storing trajectories
    def __init__(self, observation_dimensions, size, gamma=0.99, lam=0.95):
        # Buffer initialization
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros(size, dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        # Append one step of agent-environment interaction
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        # Finish the trajectory by computing advantage estimates and rewards-to-go
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)

        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]

        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]

        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        self.pointer, self.trajectory_start_index = 0, 0
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer),
            np.std(self.advantage_buffer),
        )
        self.advantage_buffer = (self.advantage_buffer - advantage_mean) / advantage_std
        return (
            self.observation_buffer,
            self.action_buffer,
            self.advantage_buffer,
            self.return_buffer,
            self.logprobability_buffer,
        )
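
"""
The arithmetic in `finish_trajectory` can be followed on a tiny trajectory.
The sketch below repeats the same steps with plain Python lists; the helper names
`discounted_sums` and `gae_and_returns` and all numbers are assumptions made for
illustration. With `gamma = lam = 1` the advantage reduces to the return minus the
value estimate, which makes the result easy to check by hand.

```python
def discounted_sums(x, discount):
    # out[t] = x[t] + discount * out[t + 1]
    out, running = [0.0] * len(x), 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out


def gae_and_returns(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Append the bootstrap value, as `finish_trajectory` does
    rewards = list(rewards) + [last_value]
    values = list(values) + [last_value]
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards) - 1)
    ]
    advantages = discounted_sums(deltas, gamma * lam)
    returns = discounted_sums(rewards, gamma)[:-1]
    return advantages, returns


advantages, returns = gae_and_returns(
    rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5], last_value=0.0, gamma=1.0, lam=1.0
)
print(advantages)  # [2.5, 1.5, 0.5]
print(returns)  # [3.0, 2.0, 1.0]
```
"""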


def mlp(x, sizes, activation=keras.activations.tanh, output_activation=None):
    # Build a feedforward neural network
    for size in sizes[:-1]:
        x = layers.Dense(units=size, activation=activation)(x)
    return layers.Dense(units=sizes[-1], activation=output_activation)(x)


def logprobabilities(logits, a):
    # Compute the log-probabilities of taking actions a by using the logits (i.e. the output of the actor)
    logprobabilities_all = keras.ops.log_softmax(logits)
    logprobability = keras.ops.sum(
        keras.ops.one_hot(a, num_actions) * logprobabilities_all, axis=1
    )
    return logprobability
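
"""
For a single observation, `logprobabilities` amounts to taking the log-softmax of the
logits and picking the entry of the chosen action. A minimal pure-Python sketch of that
idea, with hypothetical helper names `log_softmax` and `action_logprobability`:

```python
import math


def log_softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(logits)
    shifted = [z - peak for z in logits]
    log_norm = math.log(sum(math.exp(z) for z in shifted))
    return [z - log_norm for z in shifted]


def action_logprobability(logits, action):
    # Same result as sum(one_hot(action) * log_softmax(logits)) for one observation
    return log_softmax(logits)[action]


# Equal logits give equal probabilities, so this is log(0.5)
print(action_logprobability([0.0, 0.0], 1))
# Exponentiating recovers the probability of action 0
print(math.exp(action_logprobability([2.0, 1.0, 0.0], 0)))
```
"""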


seed_generator = keras.random.SeedGenerator(1337)


# Sample action from actor
@tf.function
def sample_action(observation):
    logits = actor(observation)
    action = keras.ops.squeeze(
        keras.random.categorical(logits, 1, seed=seed_generator), axis=1
    )
    return logits, action
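
"""
Sampling an action from logits means drawing an index with probability given by the
softmax of the logits. The sketch below shows this with the standard library only;
it illustrates the idea and is not how `keras.random.categorical` is implemented.
The helper name `sample_from_logits` and the example logits are assumptions.

```python
import math
import random


def sample_from_logits(logits, rng):
    # Unnormalized softmax weights, shifted by the maximum for numerical stability
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    # random.choices normalizes the weights and draws one index
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(1337)
logits = [math.log(0.8), math.log(0.2)]
draws = [sample_from_logits(logits, rng) for _ in range(10000)]
# Fraction of draws that picked action 1; it should be close to 0.2
print(sum(draws) / len(draws))
```
"""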


# Train the policy by maximizing the PPO-Clip objective
@tf.function
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        ratio = keras.ops.exp(
            logprobabilities(actor(observation_buffer), action_buffer)
            - logprobability_buffer
        )
        min_advantage = keras.ops.where(
            advantage_buffer > 0,
            (1 + clip_ratio) * advantage_buffer,
            (1 - clip_ratio) * advantage_buffer,
        )

        policy_loss = -keras.ops.mean(
            keras.ops.minimum(ratio * advantage_buffer, min_advantage)
        )
    policy_grads = tape.gradient(policy_loss, actor.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor.trainable_variables))

    # Approximate KL divergence between the old and the updated policy (a scalar)
    kl = keras.ops.mean(
        logprobability_buffer
        - logprobabilities(actor(observation_buffer), action_buffer)
    )
    return kl
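
"""
The value returned by `train_policy` is a sample estimate of the KL divergence between
the policy that collected the data and the updated policy. The training loop stops the
policy updates early once this estimate exceeds `1.5 * target_kl`. A small pure-Python
sketch of that check; the helper names and the log-probabilities are made up for illustration.

```python
def approximate_kl(old_logprobabilities, new_logprobabilities):
    # Mean difference of log-probabilities over actions sampled from the old policy
    differences = [
        old - new for old, new in zip(old_logprobabilities, new_logprobabilities)
    ]
    return sum(differences) / len(differences)


def should_stop_early(kl, target_kl=0.01):
    # Same threshold as in the training loop
    return kl > 1.5 * target_kl


old = [-0.70, -0.65, -0.72]
slightly_updated = [-0.70, -0.66, -0.71]
strongly_updated = [-0.90, -0.50, -0.95]
print(should_stop_early(approximate_kl(old, slightly_updated)))  # False
print(should_stop_early(approximate_kl(old, strongly_updated)))  # True
```
"""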


# Train the value function by regression on mean-squared error
@tf.function
def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = keras.ops.mean((return_buffer - critic(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic.trainable_variables))
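
"""
To see what regression on the mean-squared error does, consider a toy critic that
predicts one constant value for every observation. The sketch below fits it with plain
gradient descent (not Adam, which this example uses) and pure Python; the helper names
and the returns are assumptions made for illustration.

```python
def mse_loss(returns, predictions):
    return sum((r - v) ** 2 for r, v in zip(returns, predictions)) / len(returns)


def fit_constant_value(returns, learning_rate=0.1, iterations=80):
    # Toy critic: one scalar prediction shared by all observations
    value = 0.0
    for _ in range(iterations):
        # d/dv mean((r - v)^2) = -2 * mean(r - v)
        gradient = -2.0 * sum(r - value for r in returns) / len(returns)
        value -= learning_rate * gradient
    return value


returns = [10.0, 12.0, 14.0]
value = fit_constant_value(returns)
# The fitted value approaches the mean return, 12.0
print(value)
print(mse_loss(returns, [value] * len(returns)))
```

The minimizer of the squared error is the mean of the targets, which is why the critic
learns to predict the expected return.
"""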


"""
## Hyperparameters
"""

# Hyperparameters of the PPO algorithm
steps_per_epoch = 4000
epochs = 30
gamma = 0.99
clip_ratio = 0.2
policy_learning_rate = 3e-4
value_function_learning_rate = 1e-3
train_policy_iterations = 80
train_value_iterations = 80
lam = 0.97
target_kl = 0.01
hidden_sizes = (64, 64)

# True if you want to render the environment
render = False

"""
## Initializations
"""

# Initialize the environment and get the dimensionality of the
# observation space and the number of possible actions.
# Gymnasium needs the render mode when the environment is created.
env = gym.make("CartPole-v1", render_mode="human" if render else None)
observation_dimensions = env.observation_space.shape[0]
num_actions = env.action_space.n

# Initialize the buffer with the discount factors defined in the hyperparameters
buffer = Buffer(observation_dimensions, steps_per_epoch, gamma=gamma, lam=lam)

# Initialize the actor and the critic as keras models
observation_input = keras.Input(shape=(observation_dimensions,), dtype="float32")
logits = mlp(observation_input, list(hidden_sizes) + [num_actions])
actor = keras.Model(inputs=observation_input, outputs=logits)
value = keras.ops.squeeze(mlp(observation_input, list(hidden_sizes) + [1]), axis=1)
critic = keras.Model(inputs=observation_input, outputs=value)

# Initialize the policy and the value function optimizers
policy_optimizer = keras.optimizers.Adam(learning_rate=policy_learning_rate)
value_optimizer = keras.optimizers.Adam(learning_rate=value_function_learning_rate)

# Initialize the observation, episode return and episode length
observation, _ = env.reset()
episode_return, episode_length = 0, 0

"""
## Train
"""
# Iterate over the number of epochs
for epoch in range(epochs):
    # Initialize the sum of the returns, lengths and number of episodes for each epoch
    sum_return = 0
    sum_length = 0
    num_episodes = 0

    # Iterate over the steps of each epoch
    for t in range(steps_per_epoch):
        if render:
            env.render()

        # Get the logits, action, and take one step in the environment.
        # Gymnasium reports failure (`terminated`) and the time limit (`truncated`) separately.
        observation = observation.reshape(1, -1)
        logits, action = sample_action(observation)
        observation_new, reward, terminated, truncated, _ = env.step(
            action[0].numpy()
        )
        episode_return += reward
        episode_length += 1

        # Get the value and log-probability of the action
        value_t = critic(observation)
        logprobability_t = logprobabilities(logits, action)

        # Store obs, act, rew, v_t, logp_pi_t
        buffer.store(observation, action, reward, value_t, logprobability_t)

        # Update the observation
        observation = observation_new

        # Finish the trajectory if the episode ended or the epoch is over
        terminal = terminated or truncated
        if terminal or (t == steps_per_epoch - 1):
            # Bootstrap from the critic unless the episode ended in a failure state
            last_value = 0 if terminated else critic(observation.reshape(1, -1))
            buffer.finish_trajectory(last_value)
            sum_return += episode_return
            sum_length += episode_length
            num_episodes += 1
            observation, _ = env.reset()
            episode_return, episode_length = 0, 0

    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

    # Update the policy and implement early stopping using KL divergence
    for _ in range(train_policy_iterations):
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        if kl > 1.5 * target_kl:
            # Early Stopping
            break

    # Update the value function
    for _ in range(train_value_iterations):
        train_value_function(observation_buffer, return_buffer)

    # Print mean return and length for each epoch
    print(
        f" Epoch: {epoch + 1}. Mean Return: {sum_return / num_episodes}. Mean Length: {sum_length / num_episodes}"
    )
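
"""
In the loop above, the `last_value` passed to `finish_trajectory` is 0 when the episode
has ended and the critic's estimate when the trajectory is merely cut off. The sketch
below shows how this choice changes the rewards-to-go; the helper name `rewards_to_go`
and the numbers are assumptions made for illustration.

```python
def rewards_to_go(rewards, last_value, gamma=0.99):
    # Discounted sums of the remaining rewards, seeded with the bootstrap value
    out, running = [0.0] * len(rewards), last_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


rewards = [1.0, 1.0]
# Episode really ended: nothing follows the last reward
print(rewards_to_go(rewards, last_value=0.0, gamma=0.5))  # [1.5, 1.0]
# Trajectory was cut off: the critic's estimate (here 4.0) stands in for the future
print(rewards_to_go(rewards, last_value=4.0, gamma=0.5))  # [2.5, 3.0]
```

Without the bootstrap value, a trajectory that was only cut off would look as if the
pole had fallen at the last stored step.
"""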

"""
## Visualizations

Before training:

![Imgur](https://i.imgur.com/rKXDoMC.gif)

After 8 epochs of training:

![Imgur](https://i.imgur.com/M0FbhF0.gif)

After 20 epochs of training:

![Imgur](https://i.imgur.com/tKhTEaF.gif)
"""