"""
Title: Timeseries anomaly detection using an Autoencoder
Author: [pavithrasv](https://github.com/pavithrasv)
Date created: 2020/05/31
Last modified: 2020/05/31
Description: Detect anomalies in a timeseries using an Autoencoder.
Accelerator: GPU
"""

"""
## Introduction

This script demonstrates how you can use a reconstruction convolutional
autoencoder model to detect anomalies in timeseries data.
"""

"""
## Setup
"""

import numpy as np
import pandas as pd
import keras
from keras import layers
from matplotlib import pyplot as plt

"""
28
## Load the data
29
30
We will use the [Numenta Anomaly Benchmark(NAB)](
31
https://www.kaggle.com/boltzmannbrain/nab) dataset. It provides artificial
32
timeseries data containing labeled anomalous periods of behavior. Data are
33
ordered, timestamped, single-valued metrics.
34
35
We will use the `art_daily_small_noise.csv` file for training and the
36
`art_daily_jumpsup.csv` file for testing. The simplicity of this dataset
37
allows us to demonstrate anomaly detection effectively.
38
"""

master_url_root = "https://raw.githubusercontent.com/numenta/NAB/master/data/"

df_small_noise_url_suffix = "artificialNoAnomaly/art_daily_small_noise.csv"
df_small_noise_url = master_url_root + df_small_noise_url_suffix
df_small_noise = pd.read_csv(
    df_small_noise_url, parse_dates=True, index_col="timestamp"
)

df_daily_jumpsup_url_suffix = "artificialWithAnomaly/art_daily_jumpsup.csv"
df_daily_jumpsup_url = master_url_root + df_daily_jumpsup_url_suffix
df_daily_jumpsup = pd.read_csv(
    df_daily_jumpsup_url, parse_dates=True, index_col="timestamp"
)

"""
55
## Quick look at the data
56
"""
57
58
print(df_small_noise.head())
59
60
print(df_daily_jumpsup.head())
61
62
"""
63
## Visualize the data
64
### Timeseries data without anomalies
65
66
We will use the following data for training.
67
"""
68
fig, ax = plt.subplots()
69
df_small_noise.plot(legend=False, ax=ax)
70
plt.show()
71
72
"""
73
### Timeseries data with anomalies
74
75
We will use the following data for testing and see if the sudden jump up in the
76
data is detected as an anomaly.
77
"""
78
fig, ax = plt.subplots()
79
df_daily_jumpsup.plot(legend=False, ax=ax)
80
plt.show()
81
82
"""
83
## Prepare training data
84
85
Get data values from the training timeseries data file and normalize the
86
`value` data. We have a `value` for every 5 mins for 14 days.
87
88
- 24 * 60 / 5 = **288 timesteps per day**
89
- 288 * 14 = **4032 data points** in total
90
"""

# Normalize and save the mean and std we get,
# for normalizing test data.
training_mean = df_small_noise.mean()
training_std = df_small_noise.std()
df_training_value = (df_small_noise - training_mean) / training_std
print("Number of training samples:", len(df_training_value))

"""
101
### Create sequences
102
Create sequences combining `TIME_STEPS` contiguous data values from the
103
training data.
104
"""
105
106
TIME_STEPS = 288


# Generate training sequences for use in the model.
def create_sequences(values, time_steps=TIME_STEPS):
    output = []
    for i in range(len(values) - time_steps + 1):
        output.append(values[i : (i + time_steps)])
    return np.stack(output)


x_train = create_sequences(df_training_value.values)
print("Training input shape: ", x_train.shape)

"""
## Build a model

We will build a convolutional reconstruction autoencoder model. The model will
take input of shape `(batch_size, sequence_length, num_features)` and return
output of the same shape. In this case, `sequence_length` is 288 and
`num_features` is 1.
"""

model = keras.Sequential(
    [
        layers.Input(shape=(x_train.shape[1], x_train.shape[2])),
        layers.Conv1D(
            filters=32,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Dropout(rate=0.2),
        layers.Conv1D(
            filters=16,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Conv1DTranspose(
            filters=16,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Dropout(rate=0.2),
        layers.Conv1DTranspose(
            filters=32,
            kernel_size=7,
            padding="same",
            strides=2,
            activation="relu",
        ),
        layers.Conv1DTranspose(filters=1, kernel_size=7, padding="same"),
    ]
)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
model.summary()
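
"""
A quick check on why the output shape matches the input: the two stride-2
`Conv1D` layers downsample the 288-step sequence to 144 and then 72 steps,
and the two stride-2 `Conv1DTranspose` layers upsample it back to 144 and
then 288. A minimal sanity check (an illustrative assertion we add here,
not part of the original example):
"""

# With `padding="same"` and `strides=2`, each Conv1D halves the sequence
# length and each Conv1DTranspose doubles it: 288 -> 144 -> 72 -> 144 -> 288.
assert model.output_shape == (None, x_train.shape[1], 1)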

"""
## Train the model

Please note that we are using `x_train` as both the input and the target
since this is a reconstruction model.
"""

history = model.fit(
    x_train,
    x_train,
    epochs=50,
    batch_size=128,
    validation_split=0.1,
    callbacks=[
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, mode="min")
    ],
)
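
"""
Note that `EarlyStopping` as configured above keeps the weights from the
last epoch it ran, not necessarily the best-performing ones. A commonly
used variation (an illustrative sketch, not part of the original example)
restores the weights from the epoch with the lowest validation loss:
"""

# Hypothetical variant of the callback above: roll back to the best
# validation-loss weights when training stops early.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, mode="min", restore_best_weights=True
)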

"""
Let's plot training and validation loss to see how the training went.
"""

plt.plot(history.history["loss"], label="Training Loss")
plt.plot(history.history["val_loss"], label="Validation Loss")
plt.legend()
plt.show()

"""
## Detecting anomalies

We will detect anomalies by determining how well our model can reconstruct
the input data.

1. Find MAE loss on training samples.
2. Find the max MAE loss value. This is the worst our model has performed trying
to reconstruct a sample. We will make this the `threshold` for anomaly
detection.
3. If the reconstruction loss for a sample is greater than this `threshold`
value then we can infer that the model is seeing a pattern that it isn't
familiar with. We will label this sample as an `anomaly`.
"""

# Get train MAE loss.
x_train_pred = model.predict(x_train)
train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1)

plt.hist(train_mae_loss, bins=50)
plt.xlabel("Train MAE loss")
plt.ylabel("No. of samples")
plt.show()

# Get reconstruction loss threshold.
threshold = np.max(train_mae_loss)
print("Reconstruction error threshold: ", threshold)

"""
### Compare reconstruction

Just for fun, let's see how our model has reconstructed the first sample.
This is the 288 timesteps from day 1 of our training dataset.
"""

# Checking how the first sequence is learnt
plt.plot(x_train[0])
plt.plot(x_train_pred[0])
plt.show()

"""
### Prepare test data
"""

df_test_value = (df_daily_jumpsup - training_mean) / training_std
fig, ax = plt.subplots()
df_test_value.plot(legend=False, ax=ax)
plt.show()

# Create sequences from test values.
x_test = create_sequences(df_test_value.values)
print("Test input shape: ", x_test.shape)

# Get test MAE loss.
x_test_pred = model.predict(x_test)
test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)
test_mae_loss = test_mae_loss.reshape((-1))

plt.hist(test_mae_loss, bins=50)
plt.xlabel("Test MAE loss")
plt.ylabel("No. of samples")
plt.show()

# Detect all the samples which are anomalies.
anomalies = test_mae_loss > threshold
print("Number of anomaly samples: ", np.sum(anomalies))
print("Indices of anomaly samples: ", np.where(anomalies))

"""
## Plot anomalies

We now know the samples of the data which are anomalies. With this, we will
find the corresponding `timestamps` from the original test data. We will be
using the following method to do that:

Let's say `time_steps = 3` and we have 10 training values. Our `x_train` will
look like this:

- 0, 1, 2
- 1, 2, 3
- 2, 3, 4
- 3, 4, 5
- 4, 5, 6
- 5, 6, 7
- 6, 7, 8
- 7, 8, 9

All except the initial and the final `time_steps - 1` data values will appear
in `time_steps` samples. So, if we know that the samples
[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are anomalies, we can say that the data point
5 is an anomaly.
"""

# Data point i is an anomaly if samples [(i - timesteps + 1) to (i)] are anomalies.
anomalous_data_indices = []
for data_idx in range(TIME_STEPS - 1, len(df_test_value) - TIME_STEPS + 1):
    # The slice must include sample `data_idx` itself (hence the `+ 1` on the
    # slice end), so the check matches the comment above.
    if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx + 1]):
        anomalous_data_indices.append(data_idx)
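
"""
To see the mapping on the toy example above: with `time_steps = 3`, data
point 5 is covered by the samples starting at indices 3, 4 and 5, and it is
flagged only when all three of those samples are anomalous. A minimal check
using the toy numbers from the description (illustrative only):
"""

# Toy example: 10 values, time_steps = 3, and samples (3, 4, 5), (4, 5, 6),
# (5, 6, 7) anomalous -> only data point 5 should be flagged.
toy_time_steps = 3
toy_anomalies = np.zeros(8, dtype=bool)
toy_anomalies[[3, 4, 5]] = True
toy_indices = []
for i in range(toy_time_steps - 1, 10 - toy_time_steps + 1):
    if np.all(toy_anomalies[i - toy_time_steps + 1 : i + 1]):
        toy_indices.append(i)
print("Toy anomalous data indices:", toy_indices)  # expected: [5]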

"""
Let's overlay the anomalies on the original test data plot.
"""

df_subset = df_daily_jumpsup.iloc[anomalous_data_indices]
fig, ax = plt.subplots()
df_daily_jumpsup.plot(legend=False, ax=ax)
df_subset.plot(legend=False, ax=ax, color="r")
plt.show()