Anomaly detection, a short tutorial using Python


Last Update:2018-07-24 Source: Internet

Author: User


Anomaly detection is the problem of identifying data points that don't conform to expected (normal) behaviour. Unexpected data points are also known as outliers or exceptions. Anomaly detection is crucial in a wide variety of domains because it provides critical and actionable information. For example, an anomaly in an MRI scan could be an indication of a malignant tumour, and an anomalous reading from a production plant sensor may indicate a faulty component.

Put simply, anomaly detection is the task of defining a boundary around normal data points so that they can be distinguished from outliers. Several factors make defining normality very challenging. For example, normal behaviour usually evolves in certain domains, and what is considered normal at present could change in future. Moreover, defining the normal region that separates outliers from normal data points is not straightforward in itself.

In this tutorial, we'll implement an anomaly detection algorithm (in Python) to detect outliers in computer servers. This algorithm is discussed by Andrew Ng in his Machine Learning course on Coursera. To keep things simple we'll use two features for each server: 1) throughput in MB/s and 2) latency of response in ms. A Gaussian model is used to learn the underlying pattern of the dataset, in the hope that our features follow the Gaussian distribution. After that, we'll find the data points with very low probabilities of being normal, which can therefore be considered outliers. For the training set, we first learn the Gaussian distribution of the features, which requires their mean and variance. NumPy provides methods to calculate both the mean and the variance (covariance matrix) efficiently. Similarly, the SciPy library provides a method to evaluate the Gaussian distribution.

Let's get started by importing the required libraries and defining functions for reading data, mean-normalizing features, and estimating the Gaussian distribution.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

from numpy import genfromtxt
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

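def read_dataset(filepath, delimiter=','):
    # Load a delimited text file into a NumPy array
    return genfromtxt(filepath, delimiter=delimiter)

def feature_normalize(dataset):
    # Scale each feature to zero mean and unit standard deviation
    mu = np.mean(dataset, axis=0)
    sigma = np.std(dataset, axis=0)
    return (dataset - mu) / sigma

def estimateGaussian(dataset):
    # Per-feature mean and the covariance matrix of the features
    mu = np.mean(dataset, axis=0)
    sigma = np.cov(dataset.T)
    return mu, sigma

def multivariateGaussian(dataset, mu, sigma):
    # Probability density of each row under the fitted Gaussian
    p = multivariate_normal(mean=mu, cov=sigma)
    return p.pdf(dataset)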
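
To make the model concrete, the following sketch estimates the mean and covariance of a small synthetic dataset and evaluates the multivariate Gaussian density by hand, using only NumPy. The helper name gaussian_pdf and the synthetic data are assumptions made for illustration and are not part of the original notebook. For a non-singular covariance matrix it is intended to compute the same quantity as the pdf method of scipy.stats.multivariate_normal.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian at each row of x (hypothetical helper)."""
    x = np.atleast_2d(x)
    k = mu.shape[0]
    diff = x - mu
    # Solve cov @ z = diff.T instead of forming the explicit inverse.
    z = np.linalg.solve(cov, diff.T)             # shape (k, n)
    mahalanobis_sq = np.sum(diff.T * z, axis=0)  # squared distance per row
    norm_const = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis_sq) / norm_const

# Synthetic stand-in for the server data (two features per row).
rng = np.random.default_rng(0)
data = rng.normal(loc=[14.0, 15.0], scale=[1.0, 2.0], size=(500, 2))

mu = np.mean(data, axis=0)
cov = np.cov(data.T)  # 2 x 2 sample covariance matrix

# Density at the centre of the data and at a point far away from it.
p_center = gaussian_pdf(mu, mu, cov)[0]
p_far = gaussian_pdf(mu + np.array([6.0, 12.0]), mu, cov)[0]
print(p_center, p_far)
```

A point far from the mean receives a much smaller density than a point at the mean, which is the property the threshold in the next step relies on.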

Next, define a function to find the optimal value of the threshold (epsilon) that can be used to differentiate between normal and anomalous data points. To learn the optimal value of epsilon, we'll try different values within the range of learned probabilities on a cross-validation set. The F-score is calculated for the predicted anomalies against the ground truth data. The epsilon value with the highest F-score is selected as the threshold, i.e. the probabilities that lie below the selected threshold are considered anomalous.

def selectThresholdByCV(probs, gt):
    best_epsilon = 0
    best_f1 = 0
    f = 0
    # Try 1000 evenly spaced thresholds across the range of probabilities
    stepsize = (max(probs) - min(probs)) / 1000
    epsilons = np.arange(min(probs), max(probs), stepsize)
    for epsilon in epsilons:
        # Points with probability below epsilon are predicted anomalous
        predictions = (probs < epsilon)
        f = f1_score(gt, predictions, average="binary")
        if f > best_f1:
            best_f1 = f
            best_epsilon = epsilon
    return best_f1, best_epsilon
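
The F-score used above is the harmonic mean of precision and recall for the anomalous (positive) class. The sketch below computes it by hand with NumPy on made-up probabilities and labels, so the effect of moving epsilon is visible. The helper f1_binary and the toy arrays are illustrative assumptions; the helper is intended to match sklearn's f1_score with average="binary", returning 0 when there are no true positives.

```python
import numpy as np

def f1_binary(gt, predictions):
    """F1 score for the positive class, computed from counts (hypothetical helper)."""
    gt = np.asarray(gt).astype(bool)
    predictions = np.asarray(predictions).astype(bool)
    tp = np.sum(predictions & gt)    # anomalies correctly flagged
    fp = np.sum(predictions & ~gt)   # normal points wrongly flagged
    fn = np.sum(~predictions & gt)   # anomalies that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy cross-validation set: label 1 marks a true anomaly.
probs = np.array([0.30, 0.25, 0.28, 0.001, 0.27, 0.002, 0.05, 0.31])
gt = np.array([0, 0, 0, 1, 0, 1, 0, 0])

for epsilon in (0.0015, 0.01, 0.1):
    predictions = probs < epsilon
    print(epsilon, f1_binary(gt, predictions))
```

On this toy data the smallest threshold misses one anomaly, the largest one also flags a normal point, and the middle value separates the two classes perfectly, so it gets the highest score.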

We now have all the required pieces, so let's call the functions defined above to find anomalies in the dataset. Also, since we are dealing with only two features here, plotting helps us visualize the anomalous data points.

tr_data = read_dataset('tr_server_data.csv')
cv_data = read_dataset('cv_server_data.csv')
gt_data = read_dataset('gt_server_data.csv')

n_training_samples = tr_data.shape[0]
n_dim = tr_data.shape[1]

plt.figure()
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (MB/s)")
plt.plot(tr_data[:,0], tr_data[:,1], "bx")
plt.show()

mu, sigma = estimateGaussian(tr_data)
p = multivariateGaussian(tr_data, mu, sigma)

p_cv = multivariateGaussian(cv_data, mu, sigma)
fscore, ep = selectThresholdByCV(p_cv, gt_data)
outliers = np.asarray(np.where(p < ep))

plt.figure()
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (MB/s)")
plt.plot(tr_data[:,0], tr_data[:,1], "bx")
plt.plot(tr_data[outliers,0], tr_data[outliers,1], "ro")
plt.show()
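
A note on how the outliers are indexed: np.where returns a tuple of index arrays, so wrapping it in np.asarray gives a 2-D array of shape (1, k) for k flagged points. A boolean mask is an equivalent and arguably simpler way to select the same rows. The densities, threshold and data points below are invented for illustration only.

```python
import numpy as np

# Made-up densities for five training points and a made-up threshold.
p = np.array([0.12, 0.0004, 0.09, 0.15, 0.0001])
ep = 0.001
tr_data = np.array([[14.1, 15.0],
                    [23.3, 8.7],
                    [13.8, 14.6],
                    [14.4, 15.3],
                    [4.2, 27.1]])

# Index form used in the tutorial: array of shape (1, k).
outliers = np.asarray(np.where(p < ep))

# Boolean-mask form: selects the same rows directly.
mask = p < ep
flagged = tr_data[mask]

print(outliers)  # indices of the flagged points
print(flagged)   # feature values of the flagged points
```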

We implemented a very simple anomaly detection algorithm. To gain more in-depth knowledge, please consult the following resource:

Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys (CSUR) 41.3 (2009): 15.

The complete code (Python notebook) and the dataset are available at the following link.

Last updated:24/1/2017 http://aqibsaeed.github.io/2016-07-17-anomaly-detection/

