CoCalc -- actor_critic

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/rl/actor_critic_cartpole.py
³⁵⁰⁷ views
1
"""
2
Title: Actor Critic Method
3
Author: [Apoorv Nandan](https://twitter.com/NandanApoorv)
4
Date created: 2020/05/13
5
Last modified: 2024/02/22
6
Description: Implement Actor Critic Method in CartPole environment.
7
Accelerator: NONE
8
Converted to Keras 3 by: [Sitam Meur](https://github.com/sitamgithub-MSIT)
9
"""
10

11
"""
12
## Introduction
13

14
This script shows an implementation of Actor Critic method on CartPole-V0 environment.
15

16
### Actor Critic Method
17

18
As an agent takes actions and moves through an environment, it learns to map
19
the observed state of the environment to two possible outputs:
20

21
1. Recommended action: A probability value for each action in the action space.
22
   The part of the agent responsible for this output is called the **actor**.
23
2. Estimated rewards in the future: Sum of all rewards it expects to receive in the
24
   future. The part of the agent responsible for this output is the **critic**.
25

26
Agent and Critic learn to perform their tasks, such that the recommended actions
27
from the actor maximize the rewards.
28

29
### CartPole-V0
30

31
A pole is attached to a cart placed on a frictionless track. The agent has to apply
32
force to move the cart. It is rewarded for every time step the pole
33
remains upright. The agent, therefore, must learn to keep the pole from falling over.
34

35
### References
36

37
- [Environment documentation](https://gymnasium.farama.org/environments/classic_control/cart_pole/)
38
- [CartPole paper](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)
39
- [Actor Critic Method](https://hal.inria.fr/hal-00840470/document)
40
"""
41
"""
42
## Setup
43
"""
44

45
import os
46

47
os.environ["KERAS_BACKEND"] = "tensorflow"
48
import gym
49
import numpy as np
50
import keras
51
from keras import ops
52
from keras import layers
53
import tensorflow as tf
54

55
# Configuration parameters for the whole setup
56
seed = 42
57
gamma = 0.99  # Discount factor for past rewards
58
max_steps_per_episode = 10000
59
# Adding `render_mode='human'` will show the attempts of the agent
60
env = gym.make("CartPole-v0")  # Create the environment
61
env.reset(seed=seed)
62
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0
63

64
"""
65
## Implement Actor Critic network
66

67
This network learns two functions:
68

69
1. Actor: This takes as input the state of our environment and returns a
70
probability value for each action in its action space.
71
2. Critic: This takes as input the state of our environment and returns
72
an estimate of total rewards in the future.
73

74
In our implementation, they share the initial layer.
75
"""
76

77
num_inputs = 4
78
num_actions = 2
79
num_hidden = 128
80

81
inputs = layers.Input(shape=(num_inputs,))
82
common = layers.Dense(num_hidden, activation="relu")(inputs)
83
action = layers.Dense(num_actions, activation="softmax")(common)
84
critic = layers.Dense(1)(common)
85

86
model = keras.Model(inputs=inputs, outputs=[action, critic])
87

88
"""
89
## Train
90
"""
91

92
optimizer = keras.optimizers.Adam(learning_rate=0.01)
93
huber_loss = keras.losses.Huber()
94
action_probs_history = []
95
critic_value_history = []
96
rewards_history = []
97
running_reward = 0
98
episode_count = 0
99

100
while True:  # Run until solved
101
    state = env.reset()[0]
102
    episode_reward = 0
103
    with tf.GradientTape() as tape:
104
        for timestep in range(1, max_steps_per_episode):
105

106
            state = ops.convert_to_tensor(state)
107
            state = ops.expand_dims(state, 0)
108

109
            # Predict action probabilities and estimated future rewards
110
            # from environment state
111
            action_probs, critic_value = model(state)
112
            critic_value_history.append(critic_value[0, 0])
113

114
            # Sample action from action probability distribution
115
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
116
            action_probs_history.append(ops.log(action_probs[0, action]))
117

118
            # Apply the sampled action in our environment
119
            state, reward, done, *_ = env.step(action)
120
            rewards_history.append(reward)
121
            episode_reward += reward
122

123
            if done:
124
                break
125

126
        # Update running reward to check condition for solving
127
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward
128

129
        # Calculate expected value from rewards
130
        # - At each timestep what was the total reward received after that timestep
131
        # - Rewards in the past are discounted by multiplying them with gamma
132
        # - These are the labels for our critic
133
        returns = []
134
        discounted_sum = 0
135
        for r in rewards_history[::-1]:
136
            discounted_sum = r + gamma * discounted_sum
137
            returns.insert(0, discounted_sum)
138

139
        # Normalize
140
        returns = np.array(returns)
141
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
142
        returns = returns.tolist()
143

144
        # Calculating loss values to update our network
145
        history = zip(action_probs_history, critic_value_history, returns)
146
        actor_losses = []
147
        critic_losses = []
148
        for log_prob, value, ret in history:
149
            # At this point in history, the critic estimated that we would get a
150
            # total reward = `value` in the future. We took an action with log probability
151
            # of `log_prob` and ended up receiving a total reward = `ret`.
152
            # The actor must be updated so that it predicts an action that leads to
153
            # high rewards (compared to critic's estimate) with high probability.
154
            diff = ret - value
155
            actor_losses.append(-log_prob * diff)  # actor loss
156

157
            # The critic must be updated so that it predicts a better estimate of
158
            # the future rewards.
159
            critic_losses.append(
160
                huber_loss(ops.expand_dims(value, 0), ops.expand_dims(ret, 0))
161
            )
162

163
        # Backpropagation
164
        loss_value = sum(actor_losses) + sum(critic_losses)
165
        grads = tape.gradient(loss_value, model.trainable_variables)
166
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
167

168
        # Clear the loss and reward history
169
        action_probs_history.clear()
170
        critic_value_history.clear()
171
        rewards_history.clear()
172

173
    # Log details
174
    episode_count += 1
175
    if episode_count % 10 == 0:
176
        template = "running reward: {:.2f} at episode {}"
177
        print(template.format(running_reward, episode_count))
178

179
    if running_reward > 195:  # Condition to consider the task solved
180
        print("Solved at episode {}!".format(episode_count))
181
        break
182
"""
183
## Visualizations
184
In early stages of training:
185
![Imgur](https://i.imgur.com/5gCs5kH.gif)
186

187
In later stages of training:
188
![Imgur](https://i.imgur.com/5ziiZUD.gif)
189
"""
190

191
Product

Resources

Company