GitHub Repository: keras-team/keras-io
Path: blob/master/examples/rl/actor_critic_cartpole.py
"""
Title: Actor Critic Method
Author: [Apoorv Nandan](https://twitter.com/NandanApoorv)
Date created: 2020/05/13
Last modified: 2024/02/22
Description: Implement the Actor Critic Method in the CartPole environment.
Accelerator: NONE
Converted to Keras 3 by: [Sitam Meur](https://github.com/sitamgithub-MSIT)
"""
"""
## Introduction

This script shows an implementation of the Actor Critic method in the CartPole-v0 environment.

### Actor Critic Method

As an agent takes actions and moves through an environment, it learns to map
the observed state of the environment to two possible outputs:

1. Recommended action: A probability value for each action in the action space.
The part of the agent responsible for this output is called the **actor**.
2. Estimated rewards in the future: Sum of all rewards it expects to receive in the
future. The part of the agent responsible for this output is the **critic**.
(A short worked example of this sum of discounted future rewards follows this introduction.)

The actor and critic learn to perform their tasks such that the recommended actions
from the actor maximize the rewards.

### CartPole-v0

A pole is attached to a cart placed on a frictionless track. The agent has to apply
force to move the cart. It is rewarded for every time step the pole
remains upright. The agent, therefore, must learn to keep the pole from falling over.

### References

- [Environment documentation](https://gymnasium.farama.org/environments/classic_control/cart_pole/)
- [CartPole paper](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)
- [Actor Critic Method](https://hal.inria.fr/hal-00840470/document)
"""
"""
## Setup
"""

import os

os.environ["KERAS_BACKEND"] = "tensorflow"
import gym
import numpy as np
import keras
from keras import ops
from keras import layers
import tensorflow as tf

# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for past rewards
max_steps_per_episode = 10000
# Adding `render_mode='human'` will show the attempts of the agent
env = gym.make("CartPole-v0")  # Create the environment
env.reset(seed=seed)
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0
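"""
A quick sketch of the environment API (not part of the original example): each observation is a
4-dimensional vector (cart position, cart velocity, pole angle, pole angular velocity) and every
step yields a reward of +1 while the pole stays upright. A separate `demo_env` instance is used
here so the training environment above is left untouched; the unpacking mirrors how `env.step`
is called in the training loop below.
"""
demo_env = gym.make("CartPole-v0")
demo_state = demo_env.reset(seed=seed)[0]
print("Observation:", demo_state)  # 4 values describing the cart and the pole
for _ in range(3):
    demo_action = demo_env.action_space.sample()  # random action: push the cart left (0) or right (1)
    demo_state, demo_reward, demo_done, *_ = demo_env.step(demo_action)
    print("reward:", demo_reward, "done:", demo_done)
demo_env.close()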
"""
## Implement Actor Critic network

This network learns two functions:

1. Actor: This takes as input the state of our environment and returns a
probability value for each action in its action space.
2. Critic: This takes as input the state of our environment and returns
an estimate of total rewards in the future.

In our implementation, they share the initial layer.
"""
num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])
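"""
A quick sanity check (not part of the original example): passing a single dummy state through the
network shows the two heads, an actor output with one probability per action and a critic output
with a single value estimate.
"""
dummy_state = ops.expand_dims(ops.convert_to_tensor(np.zeros(num_inputs, dtype="float32")), 0)
dummy_action_probs, dummy_critic_value = model(dummy_state)
print(dummy_action_probs.shape, dummy_critic_value.shape)  # (1, 2) and (1, 1)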
"""
## Train
"""
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:  # Run until solved
    state = env.reset()[0]
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):

            state = ops.convert_to_tensor(state)
            state = ops.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(ops.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, *_ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards in the past are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(ops.expand_dims(value, 0), ops.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Clear the loss and reward history
    action_probs_history.clear()
    critic_value_history.clear()
    rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
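"""
An optional follow-up sketch (not part of the original example): evaluate the trained policy for a
few greedy episodes on a fresh environment instance. As noted in the setup section, passing
`render_mode="human"` to `gym.make` would additionally display the attempts.
"""
eval_env = gym.make("CartPole-v0")
eval_rewards = []
for _ in range(5):
    eval_state = eval_env.reset()[0]
    eval_episode_reward = 0.0
    for _ in range(200):  # CartPole-v0 episodes are capped at 200 steps
        eval_probs, _ = model(ops.expand_dims(ops.convert_to_tensor(eval_state), 0))
        eval_action = int(np.argmax(np.squeeze(eval_probs)))  # pick the most probable action
        eval_state, eval_reward, eval_done, *_ = eval_env.step(eval_action)
        eval_episode_reward += eval_reward
        if eval_done:
            break
    eval_rewards.append(eval_episode_reward)
print("Mean evaluation reward over 5 episodes:", np.mean(eval_rewards))
eval_env.close()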
"""
## Visualizations
In early stages of training:
![Imgur](https://i.imgur.com/5gCs5kH.gif)

In later stages of training:
![Imgur](https://i.imgur.com/5ziiZUD.gif)
"""