Anomaly detection, a short tutorial using Python

Anomaly detection is the problem of identifying data points that don't conform to expected (normal) behaviour. Unexpected data points are also known as outliers or exceptions. Anomaly detection is crucially important in a wide variety of domains because it provides critical and actionable information. For example, an anomaly in an MRI scan could be an indication of a malignant tumour, and an anomalous reading from a production plant sensor may indicate a faulty component.

Simply put, anomaly detection is the task of defining a boundary around normal data points so that they can be distinguished from outliers. But several factors make this notion of defining normality very challenging. For example, normal behaviour usually evolves in certain domains, and the notion of what is considered normal at present could change in the future. Moreover, defining the normal region that separates outliers from normal data points is not straightforward in itself.

In this tutorial, we'll implement an anomaly detection algorithm (in Python) to detect outliers in computer servers. This algorithm is discussed by Andrew Ng in his Machine Learning course on Coursera. To keep things simple we'll use two features for each server: 1) throughput in MB/s and 2) latency of response in ms. A Gaussian model is used to learn the underlying pattern of the dataset, in the hope that our features follow a Gaussian distribution. After that, we'll find the data points with very low probabilities of being normal, which can hence be considered outliers. For the training set, we first learn the Gaussian distribution of the features, for which the mean and variance of the features are required. NumPy provides methods to calculate both the mean and the variance (covariance matrix) efficiently. Similarly, the SciPy library provides a method to estimate the Gaussian distribution.
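To make the "probability of being normal" concrete: for k features, the multivariate Gaussian density is p(x) = exp(-0.5 * (x - mu)^T Sigma^-1 (x - mu)) / sqrt((2*pi)^k * det(Sigma)). The following is a minimal sketch that evaluates this formula by hand and compares it with SciPy's multivariate_normal. The helper name gaussian_density and the parameter values are illustrative assumptions, not taken from the tutorial's dataset.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gaussian_density(x, mu, sigma):
    """Evaluate the multivariate Gaussian density at each row of x."""
    k = mu.shape[0]
    diff = x - mu
    # Solve sigma @ y = diff.T instead of inverting sigma explicitly.
    solved = np.linalg.solve(sigma, diff.T).T
    exponent = -0.5 * np.sum(diff * solved, axis=1)
    norm_const = np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))
    return np.exp(exponent) / norm_const


# Illustrative parameters and points (assumed values, not the server data).
mu = np.array([14.0, 15.0])
sigma = np.array([[1.8, -0.2],
                  [-0.2, 1.7]])
points = np.array([[14.0, 15.0],   # at the mean
                   [20.0, 5.0]])   # far from the mean

manual = gaussian_density(points, mu, sigma)
library = multivariate_normal(mean=mu, cov=sigma).pdf(points)

print(manual)
print(np.allclose(manual, library))
```

The point far from the mean receives a much smaller density than the point at the mean, which is what the thresholding step later relies on.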

Let's get started by importing the required libraries and defining functions for reading data, mean-normalizing features, and estimating the Gaussian distribution.

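import matplotlib.pyplot as plt
import numpy as np
# Jupyter notebook magic; omit this line when running as a plain script.
%matplotlib inline

from numpy import genfromtxt
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

def read_dataset(filePath, delimiter=','):
    # Load a delimited text file into a NumPy array.
    return genfromtxt(filePath, delimiter=delimiter)

def feature_normalize(dataset):
    # Subtract the per-feature mean and divide by the per-feature standard deviation.
    mu = np.mean(dataset, axis=0)
    sigma = np.std(dataset, axis=0)
    return (dataset - mu) / sigma

def estimateGaussian(dataset):
    # np.cov expects one variable per row, hence the transpose.
    mu = np.mean(dataset, axis=0)
    sigma = np.cov(dataset.T)
    return mu, sigma

def multivariateGaussian(dataset, mu, sigma):
    # Density of every row of dataset under the fitted Gaussian.
    p = multivariate_normal(mean=mu, cov=sigma)
    return p.pdf(dataset)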
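The np.cov call on the transposed dataset is easy to get wrong, so here is a small sketch of what it returns. It uses synthetic data as a stand-in for the server measurements (the means and spreads are assumed values) and checks two things: that passing rowvar=False is equivalent to transposing, and that the diagonal of the covariance matrix holds the per-feature variances computed with ddof=1, since np.cov divides by N - 1 by default.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the server data: 500 rows, 2 feature columns.
data = rng.normal(loc=[14.0, 15.0], scale=[1.3, 1.2], size=(500, 2))

mu = np.mean(data, axis=0)   # one mean per feature, shape (2,)
sigma = np.cov(data.T)       # variables must be in rows, hence .T

# Equivalent call that avoids the transpose.
sigma_alt = np.cov(data, rowvar=False)

# The diagonal of the covariance matrix holds the per-feature variances.
# np.cov divides by N - 1 by default, so compare against ddof=1.
variances = np.var(data, axis=0, ddof=1)

print(mu.shape, sigma.shape)
print(np.allclose(sigma, sigma_alt))
print(np.allclose(np.diag(sigma), variances))
```

Note that np.std and np.var default to ddof=0, so the normalization helper and the covariance estimate use slightly different denominators; with hundreds of samples the difference is negligible.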

Next, define a function to find the optimal value of the threshold (epsilon) that can be used to differentiate between normal and anomalous data points. To learn the optimal value of epsilon, we'll try different values in the range of learned probabilities on a cross-validation set. The F-score is calculated for the predicted anomalies based on the available ground truth data. The epsilon value with the highest F-score is selected as the threshold, i.e. the probabilities that lie below the selected threshold are considered anomalous.

def selectThresholdByCV(probs, gt):
    best_epsilon = 0
    best_f1 = 0
    f = 0
    # Scan 1000 evenly spaced candidate thresholds between the smallest
    # and largest density seen on the cross-validation set.
    stepsize = (max(probs) - min(probs)) / 1000
    epsilons = np.arange(min(probs), max(probs), stepsize)
    for epsilon in epsilons:
        # Points whose density falls below epsilon are predicted anomalous.
        predictions = (probs < epsilon)
        f = f1_score(gt, predictions, average="binary")
        if f > best_f1:
            best_f1 = f
            best_epsilon = epsilon
    return best_f1, best_epsilon
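To see what a single iteration of the threshold search computes, the following sketch works out the F-score by hand for one candidate epsilon. The density values, labels, and epsilon are made-up numbers chosen for illustration; the F-score is computed directly from true positives, false positives, and false negatives rather than through scikit-learn.

```python
import numpy as np

# Hypothetical cross-validation densities and ground-truth labels (1 = anomaly).
probs = np.array([0.30, 0.25, 0.02, 0.28, 0.01, 0.04, 0.27, 0.31])
gt = np.array([0, 0, 1, 0, 1, 0, 0, 0])

epsilon = 0.05                 # one candidate threshold
predictions = probs < epsilon  # True where a point is flagged as anomalous

tp = int(np.sum(predictions & (gt == 1)))   # flagged and truly anomalous
fp = int(np.sum(predictions & (gt == 0)))   # flagged but actually normal
fn = int(np.sum(~predictions & (gt == 1)))  # anomalous but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)
print(precision, recall, f1)
```

Here both true anomalies are caught and one normal point is flagged by mistake, so precision is 2/3, recall is 1, and the F-score is about 0.8. The search simply repeats this calculation for every candidate epsilon and keeps the best one.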

We have all the required pieces; next, let's call the functions defined above to find anomalies in the dataset. Also, as we are dealing with only two features here, plotting helps us visualize the anomalous data points.

# Training data, cross-validation data, and cross-validation ground-truth labels.
tr_data = read_dataset('tr_server_data.csv')
cv_data = read_dataset('cv_server_data.csv')
gt_data = read_dataset('gt_server_data.csv')

n_training_samples = tr_data.shape[0]
n_dim = tr_data.shape[1]

# Scatter plot of the raw training data.
plt.figure()
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (MB/s)")
plt.plot(tr_data[:,0], tr_data[:,1], "bx")
plt.show()

# Fit the Gaussian on the training set and score the training points.
mu, sigma = estimateGaussian(tr_data)
p = multivariateGaussian(tr_data, mu, sigma)

# Choose epsilon on the cross-validation set, then flag low-density training points.
p_cv = multivariateGaussian(cv_data, mu, sigma)
fscore, ep = selectThresholdByCV(p_cv, gt_data)
outliers = np.asarray(np.where(p < ep))

# Same scatter plot with the detected outliers drawn as red circles.
plt.figure()
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (MB/s)")
plt.plot(tr_data[:,0], tr_data[:,1], "bx")
plt.plot(tr_data[outliers,0], tr_data[outliers,1], "ro")
plt.show()
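For readers without the CSV files at hand, the same pipeline can be tried on synthetic data. This is only a sketch under stated assumptions: the cluster centre, spread, number of injected anomalies, and random seed are invented for illustration, and f1_binary is a hypothetical NumPy stand-in for scikit-learn's f1_score so that the example needs only NumPy and SciPy.

```python
import numpy as np
from scipy.stats import multivariate_normal


def f1_binary(gt, predictions):
    """F1 score for boolean predictions; returns 0.0 when nothing is caught."""
    tp = np.sum(predictions & (gt == 1))
    fp = np.sum(predictions & (gt == 0))
    fn = np.sum(~predictions & (gt == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = np.random.default_rng(42)
mean = [14.0, 15.0]
cov = [[1.7, 0.0], [0.0, 1.6]]

# Synthetic training set: normal points only (assumed values, not the real CSVs).
train = rng.multivariate_normal(mean, cov, size=300)

# Synthetic cross-validation set: 200 normal points plus 10 injected anomalies.
cv_normal = rng.multivariate_normal(mean, cov, size=200)
cv_anomalies = rng.uniform(low=[0.0, 0.0], high=[5.0, 5.0], size=(10, 2))
cv = np.vstack([cv_normal, cv_anomalies])
gt = np.concatenate([np.zeros(200, dtype=int), np.ones(10, dtype=int)])

# Fit the Gaussian on the training set and score the cross-validation set.
mu = np.mean(train, axis=0)
sigma = np.cov(train.T)
p_cv = multivariate_normal(mean=mu, cov=sigma).pdf(cv)

# Scan 1000 candidate thresholds and keep the one with the best F1 score.
best_f1, best_epsilon = 0.0, 0.0
for epsilon in np.linspace(p_cv.min(), p_cv.max(), 1000):
    f1 = f1_binary(gt, p_cv < epsilon)
    if f1 > best_f1:
        best_f1, best_epsilon = f1, epsilon

# Indices of cross-validation points flagged as anomalous.
flagged = np.where(p_cv < best_epsilon)[0]
print(best_f1, best_epsilon)
print(flagged)
```

Because the injected anomalies sit far from the normal cluster, the selected threshold should flag all of them (indices 200 to 209), possibly together with a few normal points from the tail of the distribution.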

We implemented a very simple anomaly detection algorithm. To gain more in-depth knowledge, please consult the following resource: Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: a survey." ACM Computing Surveys (CSUR) 41.3 (2009): 15.

The complete code (Python notebook) and the dataset are available at the following link.

Last updated: 24/1/2017 http://aqibsaeed.github.io/2016-07-17-anomaly-detection/
