CoCalc -- timeseries_anomaly

GitHub Repository: keras-team/keras-io
Path: blob/master/examples/timeseries/timeseries_anomaly_detection.py
³⁵⁰⁷ views
1
"""
2
Title: Timeseries anomaly detection using an Autoencoder
3
Author: [pavithrasv](https://github.com/pavithrasv)
4
Date created: 2020/05/31
5
Last modified: 2020/05/31
6
Description: Detect anomalies in a timeseries using an Autoencoder.
7
Accelerator: GPU
8
"""
9

10
"""
11
## Introduction
12

13
This script demonstrates how you can use a reconstruction convolutional
14
autoencoder model to detect anomalies in timeseries data.
15
"""
16

17
"""
18
## Setup
19
"""
20

21
import numpy as np
22
import pandas as pd
23
import keras
24
from keras import layers
25
from matplotlib import pyplot as plt
26

27
"""
28
## Load the data
29

30
We will use the [Numenta Anomaly Benchmark(NAB)](
31
https://www.kaggle.com/boltzmannbrain/nab) dataset. It provides artificial
32
timeseries data containing labeled anomalous periods of behavior. Data are
33
ordered, timestamped, single-valued metrics.
34

35
We will use the `art_daily_small_noise.csv` file for training and the
36
`art_daily_jumpsup.csv` file for testing. The simplicity of this dataset
37
allows us to demonstrate anomaly detection effectively.
38
"""
39

40
master_url_root = "https://raw.githubusercontent.com/numenta/NAB/master/data/"
41

42
df_small_noise_url_suffix = "artificialNoAnomaly/art_daily_small_noise.csv"
43
df_small_noise_url = master_url_root + df_small_noise_url_suffix
44
df_small_noise = pd.read_csv(
45
    df_small_noise_url, parse_dates=True, index_col="timestamp"
46
)
47

48
df_daily_jumpsup_url_suffix = "artificialWithAnomaly/art_daily_jumpsup.csv"
49
df_daily_jumpsup_url = master_url_root + df_daily_jumpsup_url_suffix
50
df_daily_jumpsup = pd.read_csv(
51
    df_daily_jumpsup_url, parse_dates=True, index_col="timestamp"
52
)
53

54
"""
55
## Quick look at the data
56
"""
57

58
print(df_small_noise.head())
59

60
print(df_daily_jumpsup.head())
61

62
"""
63
## Visualize the data
64
### Timeseries data without anomalies
65

66
We will use the following data for training.
67
"""
68
fig, ax = plt.subplots()
69
df_small_noise.plot(legend=False, ax=ax)
70
plt.show()
71

72
"""
73
### Timeseries data with anomalies
74

75
We will use the following data for testing and see if the sudden jump up in the
76
data is detected as an anomaly.
77
"""
78
fig, ax = plt.subplots()
79
df_daily_jumpsup.plot(legend=False, ax=ax)
80
plt.show()
81

82
"""
83
## Prepare training data
84

85
Get data values from the training timeseries data file and normalize the
86
`value` data. We have a `value` for every 5 mins for 14 days.
87

88
-   24 * 60 / 5 = **288 timesteps per day**
89
-   288 * 14 = **4032 data points** in total
90
"""
91

92

93
# Normalize and save the mean and std we get,
94
# for normalizing test data.
95
training_mean = df_small_noise.mean()
96
training_std = df_small_noise.std()
97
df_training_value = (df_small_noise - training_mean) / training_std
98
print("Number of training samples:", len(df_training_value))
99

100
"""
101
### Create sequences
102
Create sequences combining `TIME_STEPS` contiguous data values from the
103
training data.
104
"""
105

106
TIME_STEPS = 288
107

108

109
# Generated training sequences for use in the model.
110
def create_sequences(values, time_steps=TIME_STEPS):
111
    output = []
112
    for i in range(len(values) - time_steps + 1):
113
        output.append(values[i : (i + time_steps)])
114
    return np.stack(output)
115

116

117
x_train = create_sequences(df_training_value.values)
118
print("Training input shape: ", x_train.shape)
119

120
"""
121
## Build a model
122

123
We will build a convolutional reconstruction autoencoder model. The model will
124
take input of shape `(batch_size, sequence_length, num_features)` and return
125
output of the same shape. In this case, `sequence_length` is 288 and
126
`num_features` is 1.
127
"""
128

129
model = keras.Sequential(
130
    [
131
        layers.Input(shape=(x_train.shape[1], x_train.shape[2])),
132
        layers.Conv1D(
133
            filters=32,
134
            kernel_size=7,
135
            padding="same",
136
            strides=2,
137
            activation="relu",
138
        ),
139
        layers.Dropout(rate=0.2),
140
        layers.Conv1D(
141
            filters=16,
142
            kernel_size=7,
143
            padding="same",
144
            strides=2,
145
            activation="relu",
146
        ),
147
        layers.Conv1DTranspose(
148
            filters=16,
149
            kernel_size=7,
150
            padding="same",
151
            strides=2,
152
            activation="relu",
153
        ),
154
        layers.Dropout(rate=0.2),
155
        layers.Conv1DTranspose(
156
            filters=32,
157
            kernel_size=7,
158
            padding="same",
159
            strides=2,
160
            activation="relu",
161
        ),
162
        layers.Conv1DTranspose(filters=1, kernel_size=7, padding="same"),
163
    ]
164
)
165
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
166
model.summary()
167

168
"""
169
## Train the model
170

171
Please note that we are using `x_train` as both the input and the target
172
since this is a reconstruction model.
173
"""
174

175
history = model.fit(
176
    x_train,
177
    x_train,
178
    epochs=50,
179
    batch_size=128,
180
    validation_split=0.1,
181
    callbacks=[
182
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, mode="min")
183
    ],
184
)
185

186
"""
187
Let's plot training and validation loss to see how the training went.
188
"""
189

190
plt.plot(history.history["loss"], label="Training Loss")
191
plt.plot(history.history["val_loss"], label="Validation Loss")
192
plt.legend()
193
plt.show()
194

195
"""
196
## Detecting anomalies
197

198
We will detect anomalies by determining how well our model can reconstruct
199
the input data.
200

201

202
1.   Find MAE loss on training samples.
203
2.   Find max MAE loss value. This is the worst our model has performed trying
204
to reconstruct a sample. We will make this the `threshold` for anomaly
205
detection.
206
3.   If the reconstruction loss for a sample is greater than this `threshold`
207
value then we can infer that the model is seeing a pattern that it isn't
208
familiar with. We will label this sample as an `anomaly`.
209

210

211
"""
212

213
# Get train MAE loss.
214
x_train_pred = model.predict(x_train)
215
train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1)
216

217
plt.hist(train_mae_loss, bins=50)
218
plt.xlabel("Train MAE loss")
219
plt.ylabel("No of samples")
220
plt.show()
221

222
# Get reconstruction loss threshold.
223
threshold = np.max(train_mae_loss)
224
print("Reconstruction error threshold: ", threshold)
225

226
"""
227
### Compare recontruction
228

229
Just for fun, let's see how our model has recontructed the first sample.
230
This is the 288 timesteps from day 1 of our training dataset.
231
"""
232

233
# Checking how the first sequence is learnt
234
plt.plot(x_train[0])
235
plt.plot(x_train_pred[0])
236
plt.show()
237

238
"""
239
### Prepare test data
240
"""
241

242

243
df_test_value = (df_daily_jumpsup - training_mean) / training_std
244
fig, ax = plt.subplots()
245
df_test_value.plot(legend=False, ax=ax)
246
plt.show()
247

248
# Create sequences from test values.
249
x_test = create_sequences(df_test_value.values)
250
print("Test input shape: ", x_test.shape)
251

252
# Get test MAE loss.
253
x_test_pred = model.predict(x_test)
254
test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)
255
test_mae_loss = test_mae_loss.reshape((-1))
256

257
plt.hist(test_mae_loss, bins=50)
258
plt.xlabel("test MAE loss")
259
plt.ylabel("No of samples")
260
plt.show()
261

262
# Detect all the samples which are anomalies.
263
anomalies = test_mae_loss > threshold
264
print("Number of anomaly samples: ", np.sum(anomalies))
265
print("Indices of anomaly samples: ", np.where(anomalies))
266

267
"""
268
## Plot anomalies
269

270
We now know the samples of the data which are anomalies. With this, we will
271
find the corresponding `timestamps` from the original test data. We will be
272
using the following method to do that:
273

274
Let's say time_steps = 3 and we have 10 training values. Our `x_train` will
275
look like this:
276

277
- 0, 1, 2
278
- 1, 2, 3
279
- 2, 3, 4
280
- 3, 4, 5
281
- 4, 5, 6
282
- 5, 6, 7
283
- 6, 7, 8
284
- 7, 8, 9
285

286
All except the initial and the final time_steps-1 data values, will appear in
287
`time_steps` number of samples. So, if we know that the samples
288
[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are anomalies, we can say that the data point
289
5 is an anomaly.
290
"""
291

292
# data i is an anomaly if samples [(i - timesteps + 1) to (i)] are anomalies
293
anomalous_data_indices = []
294
for data_idx in range(TIME_STEPS - 1, len(df_test_value) - TIME_STEPS + 1):
295
    if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx]):
296
        anomalous_data_indices.append(data_idx)
297

298
"""
299
Let's overlay the anomalies on the original test data plot.
300
"""
301

302
df_subset = df_daily_jumpsup.iloc[anomalous_data_indices]
303
fig, ax = plt.subplots()
304
df_daily_jumpsup.plot(legend=False, ax=ax)
305
df_subset.plot(legend=False, ax=ax, color="r")
306
plt.show()
307

308
Product

Resources

Company