Path: blob/master/examples/timeseries/timeseries_anomaly_detection.py
3507 views
"""1Title: Timeseries anomaly detection using an Autoencoder2Author: [pavithrasv](https://github.com/pavithrasv)3Date created: 2020/05/314Last modified: 2020/05/315Description: Detect anomalies in a timeseries using an Autoencoder.6Accelerator: GPU7"""89"""10## Introduction1112This script demonstrates how you can use a reconstruction convolutional13autoencoder model to detect anomalies in timeseries data.14"""1516"""17## Setup18"""1920import numpy as np21import pandas as pd22import keras23from keras import layers24from matplotlib import pyplot as plt2526"""27## Load the data2829We will use the [Numenta Anomaly Benchmark(NAB)](30https://www.kaggle.com/boltzmannbrain/nab) dataset. It provides artificial31timeseries data containing labeled anomalous periods of behavior. Data are32ordered, timestamped, single-valued metrics.3334We will use the `art_daily_small_noise.csv` file for training and the35`art_daily_jumpsup.csv` file for testing. The simplicity of this dataset36allows us to demonstrate anomaly detection effectively.37"""3839master_url_root = "https://raw.githubusercontent.com/numenta/NAB/master/data/"4041df_small_noise_url_suffix = "artificialNoAnomaly/art_daily_small_noise.csv"42df_small_noise_url = master_url_root + df_small_noise_url_suffix43df_small_noise = pd.read_csv(44df_small_noise_url, parse_dates=True, index_col="timestamp"45)4647df_daily_jumpsup_url_suffix = "artificialWithAnomaly/art_daily_jumpsup.csv"48df_daily_jumpsup_url = master_url_root + df_daily_jumpsup_url_suffix49df_daily_jumpsup = pd.read_csv(50df_daily_jumpsup_url, parse_dates=True, index_col="timestamp"51)5253"""54## Quick look at the data55"""5657print(df_small_noise.head())5859print(df_daily_jumpsup.head())6061"""62## Visualize the data63### Timeseries data without anomalies6465We will use the following data for training.66"""67fig, ax = plt.subplots()68df_small_noise.plot(legend=False, ax=ax)69plt.show()7071"""72### Timeseries data with anomalies7374We will use the following data for testing and see if the sudden jump up in the75data is detected as an anomaly.76"""77fig, ax = plt.subplots()78df_daily_jumpsup.plot(legend=False, ax=ax)79plt.show()8081"""82## Prepare training data8384Get data values from the training timeseries data file and normalize the85`value` data. We have a `value` for every 5 mins for 14 days.8687- 24 * 60 / 5 = **288 timesteps per day**88- 288 * 14 = **4032 data points** in total89"""909192# Normalize and save the mean and std we get,93# for normalizing test data.94training_mean = df_small_noise.mean()95training_std = df_small_noise.std()96df_training_value = (df_small_noise - training_mean) / training_std97print("Number of training samples:", len(df_training_value))9899"""100### Create sequences101Create sequences combining `TIME_STEPS` contiguous data values from the102training data.103"""104105TIME_STEPS = 288106107108# Generated training sequences for use in the model.109def create_sequences(values, time_steps=TIME_STEPS):110output = []111for i in range(len(values) - time_steps + 1):112output.append(values[i : (i + time_steps)])113return np.stack(output)114115116x_train = create_sequences(df_training_value.values)117print("Training input shape: ", x_train.shape)118119"""120## Build a model121122We will build a convolutional reconstruction autoencoder model. The model will123take input of shape `(batch_size, sequence_length, num_features)` and return124output of the same shape. In this case, `sequence_length` is 288 and125`num_features` is 1.126"""127128model = keras.Sequential(129[130layers.Input(shape=(x_train.shape[1], x_train.shape[2])),131layers.Conv1D(132filters=32,133kernel_size=7,134padding="same",135strides=2,136activation="relu",137),138layers.Dropout(rate=0.2),139layers.Conv1D(140filters=16,141kernel_size=7,142padding="same",143strides=2,144activation="relu",145),146layers.Conv1DTranspose(147filters=16,148kernel_size=7,149padding="same",150strides=2,151activation="relu",152),153layers.Dropout(rate=0.2),154layers.Conv1DTranspose(155filters=32,156kernel_size=7,157padding="same",158strides=2,159activation="relu",160),161layers.Conv1DTranspose(filters=1, kernel_size=7, padding="same"),162]163)164model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")165model.summary()166167"""168## Train the model169170Please note that we are using `x_train` as both the input and the target171since this is a reconstruction model.172"""173174history = model.fit(175x_train,176x_train,177epochs=50,178batch_size=128,179validation_split=0.1,180callbacks=[181keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, mode="min")182],183)184185"""186Let's plot training and validation loss to see how the training went.187"""188189plt.plot(history.history["loss"], label="Training Loss")190plt.plot(history.history["val_loss"], label="Validation Loss")191plt.legend()192plt.show()193194"""195## Detecting anomalies196197We will detect anomalies by determining how well our model can reconstruct198the input data.1992002011. Find MAE loss on training samples.2022. Find max MAE loss value. This is the worst our model has performed trying203to reconstruct a sample. We will make this the `threshold` for anomaly204detection.2053. If the reconstruction loss for a sample is greater than this `threshold`206value then we can infer that the model is seeing a pattern that it isn't207familiar with. We will label this sample as an `anomaly`.208209210"""211212# Get train MAE loss.213x_train_pred = model.predict(x_train)214train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1)215216plt.hist(train_mae_loss, bins=50)217plt.xlabel("Train MAE loss")218plt.ylabel("No of samples")219plt.show()220221# Get reconstruction loss threshold.222threshold = np.max(train_mae_loss)223print("Reconstruction error threshold: ", threshold)224225"""226### Compare recontruction227228Just for fun, let's see how our model has recontructed the first sample.229This is the 288 timesteps from day 1 of our training dataset.230"""231232# Checking how the first sequence is learnt233plt.plot(x_train[0])234plt.plot(x_train_pred[0])235plt.show()236237"""238### Prepare test data239"""240241242df_test_value = (df_daily_jumpsup - training_mean) / training_std243fig, ax = plt.subplots()244df_test_value.plot(legend=False, ax=ax)245plt.show()246247# Create sequences from test values.248x_test = create_sequences(df_test_value.values)249print("Test input shape: ", x_test.shape)250251# Get test MAE loss.252x_test_pred = model.predict(x_test)253test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)254test_mae_loss = test_mae_loss.reshape((-1))255256plt.hist(test_mae_loss, bins=50)257plt.xlabel("test MAE loss")258plt.ylabel("No of samples")259plt.show()260261# Detect all the samples which are anomalies.262anomalies = test_mae_loss > threshold263print("Number of anomaly samples: ", np.sum(anomalies))264print("Indices of anomaly samples: ", np.where(anomalies))265266"""267## Plot anomalies268269We now know the samples of the data which are anomalies. With this, we will270find the corresponding `timestamps` from the original test data. We will be271using the following method to do that:272273Let's say time_steps = 3 and we have 10 training values. Our `x_train` will274look like this:275276- 0, 1, 2277- 1, 2, 3278- 2, 3, 4279- 3, 4, 5280- 4, 5, 6281- 5, 6, 7282- 6, 7, 8283- 7, 8, 9284285All except the initial and the final time_steps-1 data values, will appear in286`time_steps` number of samples. So, if we know that the samples287[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are anomalies, we can say that the data point2885 is an anomaly.289"""290291# data i is an anomaly if samples [(i - timesteps + 1) to (i)] are anomalies292anomalous_data_indices = []293for data_idx in range(TIME_STEPS - 1, len(df_test_value) - TIME_STEPS + 1):294if np.all(anomalies[data_idx - TIME_STEPS + 1 : data_idx]):295anomalous_data_indices.append(data_idx)296297"""298Let's overlay the anomalies on the original test data plot.299"""300301df_subset = df_daily_jumpsup.iloc[anomalous_data_indices]302fig, ax = plt.subplots()303df_daily_jumpsup.plot(legend=False, ax=ax)304df_subset.plot(legend=False, ax=ax, color="r")305plt.show()306307308