Anomaly Detection
In this exercise, you will implement the anomaly detection algorithm and apply it to detect failing servers on a network.
Outline
1 - Packages
First, let's run the cell below to import all the packages that you will need during this assignment.
numpy is the fundamental package for working with matrices in Python.
matplotlib is a popular library for plotting graphs in Python.
utils.py contains helper functions for this assignment. You do not need to modify code in this file.
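A typical import cell for this assignment looks something like this (a minimal sketch; the exact contents of utils.py are specific to the course files):

```python
import numpy as np
import matplotlib.pyplot as plt
from utils import *
```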
2 - Anomaly detection
2.1 Problem Statement
In this exercise, you will implement an anomaly detection algorithm to detect anomalous behavior in server computers.
The dataset contains two features -
throughput (mb/s) and
latency (ms) of response of each server.
While your servers were operating, you collected examples of how they were behaving, and thus have an unlabeled dataset $\{x^{(1)}, \ldots, x^{(m)}\}$.
You suspect that the vast majority of these examples are “normal” (non-anomalous) examples of the servers operating normally, but there might also be some examples of servers acting anomalously within this dataset.
You will use a Gaussian model to detect anomalous examples in your dataset.
You will first start on a 2D dataset that will allow you to visualize what the algorithm is doing.
On that dataset you will fit a Gaussian distribution and then find values that have very low probability and hence can be considered anomalies.
After that, you will apply the anomaly detection algorithm to a larger dataset with many dimensions.
2.2 Dataset
You will start by loading the dataset for this task.
The load_data() function shown below loads the data into the variables X_train, X_val and y_val.
- You will use X_train to fit a Gaussian distribution
- You will use X_val and y_val as a cross validation set to select a threshold and determine anomalous vs. normal examples
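For reference, load_data() might be structured as follows (a sketch only; the data file paths shown are hypothetical placeholders, not the actual assignment files):

```python
import numpy as np

def load_data():
    # Hypothetical file paths; the actual assignment files may differ.
    X_train = np.load("data/X_part1.npy")    # unlabeled training examples
    X_val = np.load("data/X_val_part1.npy")  # cross validation examples
    y_val = np.load("data/y_val_part1.npy")  # cross validation labels (1 = anomaly)
    return X_train, X_val, y_val
```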
View the variables
Let's get more familiar with your dataset.
A good place to start is to just print out each variable and see what it contains.
The code below prints the first five elements of each of the variables:
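A sketch of that cell (assuming the variables have been loaded as above):

```python
print("The first 5 elements of X_train are:\n", X_train[:5])
print("The first 5 elements of X_val are:\n", X_val[:5])
print("The first 5 elements of y_val are:", y_val[:5])
```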
Check the dimensions of your variables
Another useful way to get familiar with your data is to view its dimensions.
The code below prints the shape of X_train, X_val and y_val.
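For example (a minimal sketch):

```python
print("The shape of X_train is:", X_train.shape)
print("The shape of X_val is:", X_val.shape)
print("The shape of y_val is:", y_val.shape)
```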
Visualize your data
Before starting on any task, it is often useful to understand the data by visualizing it.
For this dataset, you can use a scatter plot to visualize the data (X_train), since it has only two properties to plot (throughput and latency).
Your plot should look similar to the one below.
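A plotting sketch along these lines (the assignment of latency and throughput to the two columns of X_train is an assumption):

```python
# Create a scatter plot of the training data
plt.scatter(X_train[:, 0], X_train[:, 1], marker='x', c='b')
plt.title("The first dataset")
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (mb/s)")
plt.show()
```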
2.3 Gaussian distribution
To perform anomaly detection, you will first need to fit a model to the data’s distribution.
Given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ (where $x^{(i)} \in \mathbb{R}^n$), you want to estimate the Gaussian distribution for each of the features $x_i$.
Recall that the Gaussian distribution is given by

$$ p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$

where $\mu$ is the mean and $\sigma^2$ controls the variance.
For each feature $i = 1 \ldots n$, you need to find parameters $\mu_i$ and $\sigma_i^2$ that fit the data in the $i$-th dimension $\{x_i^{(1)}, \ldots, x_i^{(m)}\}$ (the $i$-th dimension of each example).
2.3.1 Estimating parameters for a Gaussian
Implementation:
Your task is to complete the code in estimate_gaussian below.
Exercise 1
Please complete the estimate_gaussian function below to calculate mu (mean for each feature in X) and var (variance for each feature in X).
You can estimate the parameters ($\mu_i$, $\sigma_i^2$) of the $i$-th feature by using the following equations. To estimate the mean, you will use:

$$ \mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)} $$

and for the variance you will use:

$$ \sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2 $$
If you get stuck, you can check out the hints presented after the cell below to help you with the implementation.
Click for hints
You can implement this function in two ways:
1. by having two nested for loops - one looping over the columns of X (each feature) and then looping over each data point
2. in a vectorized manner by using np.sum() with the axis = 0 parameter (since we want the sum for each column)
Here's how you can structure the overall implementation of this function for the vectorized implementation:
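(The following is a sketch of such a vectorized implementation, assuming numpy has been imported as np.)

```python
def estimate_gaussian(X):
    """
    Calculates the mean and variance of all features in the dataset X (m x n).
    """
    m, n = X.shape
    mu = np.sum(X, axis=0) / m               # mean of each column (feature)
    var = np.sum((X - mu) ** 2, axis=0) / m  # variance of each column
    return mu, var
```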
You can check if your implementation is correct by running the following test code:
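A minimal check (assuming X_train has been loaded earlier):

```python
mu, var = estimate_gaussian(X_train)
print("Mean of each feature:", mu)
print("Variance of each feature:", var)
```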
Expected Output:
Mean of each feature: [14.11222578 14.99771051]
Variance of each feature: [1.83263141 1.70974533]
Now that you have completed the code in estimate_gaussian, we will visualize the contours of the fitted Gaussian distribution.
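A sketch of this step; multivariate_gaussian and visualize_fit are assumed to be helpers provided in utils.py:

```python
# Evaluate the density of the fitted Gaussian at each training example
p = multivariate_gaussian(X_train, mu, var)

# Plot the data together with the contours of the fitted distribution
visualize_fit(X_train, mu, var)
```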
You should get a plot similar to the figure below.
From your plot you can see that most of the examples are in the region with the highest probability, while the anomalous examples are in the regions with lower probabilities.
2.3.2 Selecting the threshold
Now that you have estimated the Gaussian parameters, you can investigate which examples have a very high probability given this distribution and which examples have a very low probability.
The low probability examples are more likely to be the anomalies in our dataset.
One way to determine which examples are anomalies is to select a threshold based on a cross validation set.
In this section, you will complete the code in select_threshold
to select the threshold $\varepsilon$ using the $F_1$ score on a cross validation set.
For this, we will use a cross validation set $\{(x_{\rm cv}^{(1)}, y_{\rm cv}^{(1)}), \ldots, (x_{\rm cv}^{(m_{\rm cv})}, y_{\rm cv}^{(m_{\rm cv})})\}$, where the label $y = 1$ corresponds to an anomalous example, and $y = 0$ corresponds to a normal example.
For each cross validation example, we will compute $p(x_{\rm cv}^{(i)})$. The vector of all of these probabilities $p(x_{\rm cv}^{(1)}), \ldots, p(x_{\rm cv}^{(m_{\rm cv})})$ is passed to select_threshold in the vector p_val.
The corresponding labels $y_{\rm cv}^{(1)}, \ldots, y_{\rm cv}^{(m_{\rm cv})}$ are passed to the same function in the vector y_val.
Exercise 2
Please complete the select_threshold function below to find the best threshold to use for selecting outliers based on the results from a validation set (p_val) and the ground truth (y_val).
In the provided code select_threshold, there is already a loop that will try many different values of $\varepsilon$ and select the best $\varepsilon$ based on the $F_1$ score.
You need to implement code to calculate the F1 score from choosing epsilon as the threshold and place the value in F1.
Recall that if an example $x$ has a low probability $p(x) < \varepsilon$, then it is classified as an anomaly.
Then, you can compute precision and recall by:

$$ \mathit{prec} = \frac{tp}{tp + fp} \qquad \mathit{rec} = \frac{tp}{tp + fn} $$

where
- $tp$ is the number of true positives: the ground truth label says it's an anomaly and our algorithm correctly classified it as an anomaly.
- $fp$ is the number of false positives: the ground truth label says it's not an anomaly, but our algorithm incorrectly classified it as an anomaly.
- $fn$ is the number of false negatives: the ground truth label says it's an anomaly, but our algorithm incorrectly classified it as not being anomalous.

The $F_1$ score is computed using precision ($\mathit{prec}$) and recall ($\mathit{rec}$) as follows:

$$ F_1 = \frac{2 \cdot \mathit{prec} \cdot \mathit{rec}}{\mathit{prec} + \mathit{rec}} $$
Implementation Note: In order to compute $tp$, $fp$ and $fn$, you may be able to use a vectorized implementation rather than a loop over all the examples.
If you get stuck, you can check out the hints presented after the cell below to help you with the implementation.
Click for hints
Here's how you can structure the overall implementation of this function for the vectorized implementation:
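(The following is a sketch of such a vectorized implementation, assuming numpy has been imported as np.)

```python
def select_threshold(y_val, p_val):
    """
    Finds the best threshold (epsilon) for flagging outliers, based on
    the densities of the validation examples (p_val) and the ground
    truth labels (y_val).
    """
    best_epsilon = 0
    best_F1 = 0

    step_size = (max(p_val) - min(p_val)) / 1000
    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        predictions = (p_val < epsilon)  # 1 where the example is flagged anomalous

        tp = np.sum((predictions == 1) & (y_val == 1))  # true positives
        fp = np.sum((predictions == 1) & (y_val == 0))  # false positives
        fn = np.sum((predictions == 0) & (y_val == 1))  # false negatives

        # Skip thresholds that flag nothing or catch nothing, to avoid 0/0
        if tp + fp == 0 or tp + fn == 0:
            continue

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        F1 = 2 * prec * rec / (prec + rec)

        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon

    return best_epsilon, best_F1
```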
If you have a binary vector v, you can find out how many values in this vector are 0 by using: np.sum(v == 0).
You can also apply a logical and operator to such binary vectors. For instance, predictions is a binary vector the size of your cross validation set, where the $i$-th element is 1 if your algorithm considers $x_{\rm cv}^{(i)}$ an anomaly, and 0 otherwise.
- You can compute the number of false positives using: fp = np.sum((predictions == 1) & (y_val == 0))
- You can compute tp as tp = np.sum((predictions == 1) & (y_val == 1))
- You can compute fn as fn = np.sum((predictions == 0) & (y_val == 1))
You can check your implementation using the code below
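A minimal check along these lines (multivariate_gaussian is assumed to be a helper from utils.py, and mu, var come from the earlier fit):

```python
# Density of each cross validation example under the fitted Gaussian
p_val = multivariate_gaussian(X_val, mu, var)
epsilon, F1 = select_threshold(y_val, p_val)

print("Best epsilon found using cross-validation: %e" % epsilon)
print("Best F1 on Cross Validation Set: %f" % F1)
```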
Expected Output:
Best epsilon found using cross-validation: 8.99e-05
Best F1 on Cross Validation Set: 0.875
Now we will run your anomaly detection code and circle the anomalies in the plot (Figure 3 below).
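One way that step might look (a sketch; visualize_fit is the assumed plotting helper from utils.py):

```python
# Find the outliers in the training set
outliers = p < epsilon

# Plot the fit again and circle the detected anomalies in red
visualize_fit(X_train, mu, var)
plt.plot(X_train[outliers, 0], X_train[outliers, 1], 'ro',
         markersize=10, markerfacecolor='none', markeredgewidth=2)
plt.show()
```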
2.4 High dimensional dataset
Now, we will run the anomaly detection algorithm that you implemented on a more realistic and much harder dataset.
In this dataset, each example is described by 11 features, capturing many more properties of your compute servers.
Let's start by loading the dataset.
The load_data() function shown below loads the data into the variables X_train_high, X_val_high and y_val_high.
- _high is meant to distinguish these variables from the ones used in the previous part
- We will use X_train_high to fit a Gaussian distribution
- We will use X_val_high and y_val_high as a cross validation set to select a threshold and determine anomalous vs. normal examples
Check the dimensions of your variables
Let's check the dimensions of these new variables to become familiar with the data.
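For example (a minimal sketch):

```python
print("The shape of X_train_high is:", X_train_high.shape)
print("The shape of X_val_high is:", X_val_high.shape)
print("The shape of y_val_high is:", y_val_high.shape)
```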
Anomaly detection
Now, let's run the anomaly detection algorithm on this new dataset.
The code below will use your code to:
- Estimate the Gaussian parameters ($\mu_i$ and $\sigma_i^2$)
- Evaluate the probabilities for both the training data X_train_high, from which you estimated the Gaussian parameters, as well as for the cross-validation set X_val_high
- Finally, it will use select_threshold to find the best threshold $\varepsilon$
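Put together, that cell might look like this (a sketch; multivariate_gaussian is the assumed utils.py helper):

```python
# Estimate the Gaussian parameters on the high-dimensional training set
mu_high, var_high = estimate_gaussian(X_train_high)

# Evaluate the probabilities for the training and cross validation sets
p_high = multivariate_gaussian(X_train_high, mu_high, var_high)
p_val_high = multivariate_gaussian(X_val_high, mu_high, var_high)

# Find the best threshold on the cross validation set
epsilon_high, F1_high = select_threshold(y_val_high, p_val_high)

print("Best epsilon found using cross-validation: %e" % epsilon_high)
print("Best F1 on Cross Validation Set: %f" % F1_high)
print("# anomalies found: %d" % np.sum(p_high < epsilon_high))
```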
Expected Output:
Best epsilon found using cross-validation: 1.38e-18
Best F1 on Cross Validation Set: 0.615385
# anomalies found: 117