Overview of anomaly detection: the Isolation Forest and Local Outlier Factor algorithms give the best results

Source: Internet
Author: User
Tags: SVM

From the blog: http://www.infosec-wiki.com/?p=140760

I. About anomaly detection

Anomaly detection (outlier detection) plays an important role in scenarios such as:

    • Data preprocessing
    • Virus Trojan Detection
    • Industrial Manufacturing Product Testing
    • Network traffic detection

In these scenarios, anomalous data make up only a very small fraction of the total, so classification algorithms such as SVM and logistic regression are not applicable, because:

Supervised learning algorithms are suitable when there are large numbers of both positive and negative samples, enough for the algorithm to learn their characteristics, and when future samples follow the same distribution as the training samples.

The typical application scopes of anomaly detection and supervised learning are:

Anomaly detection: Credit card fraud, manufacturing product anomaly detection, data center machine anomaly detection, intrusion detection

Supervised learning: spam identification, news classification

II. Anomaly detection algorithms

1. Based on statistics and data distribution

Suppose that the data set follows a normal distribution, i.e.:

$x \sim N(\mu, \sigma^2), \qquad f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

where the mean of the distribution is μ and the variance is σ².

When the training data fit the standard normal distribution, a value of x greater than 4 or less than -4 (that is, more than four standard deviations from the mean) can be considered an outlier.

The following is an example using the trading volume of stock "600680":

import tushare
from matplotlib import pyplot as plt

df = tushare.get_hist_data("600680")
v = df[-90:].volume
v.plot(kind="kde")
plt.show()

In the last three months, a trading volume greater than 200,000 can be considered an anomaly (on days like that, well, pay attention to the risk...).

Algorithm Example:
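The original example is not reproduced here; the following is a minimal sketch of the idea, assuming a 1-D NumPy array of observations and flagging points more than four standard deviations from the mean (the threshold used above):

import numpy as np

def sigma_outliers(values, k=4.0):
    """Flag values more than k standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()
    # standardize, then compare against the k-sigma threshold
    z = (values - mu) / sigma
    return np.abs(z) > k

# example: one injected anomaly among standard-normal data
data = np.append(np.random.randn(1000), 8.0)
print(np.where(sigma_outliers(data))[0])  # index of the anomalous point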

2. Box plot analysis

Box plots need little explanation:

import tushare
from matplotlib import pyplot as plt

df = tushare.get_hist_data("600680")
v = df[-90:].volume
v.plot(kind="box")
plt.show()

Figure: box plot of the last 90 days of trading volume.

From the box plot, we can see that when the volume falls below 20,000 or rises above 80,000, vigilance should be increased!
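The rule behind the box plot, flagging points more than 1.5 times the interquartile range beyond the quartiles, can also be computed directly. A minimal sketch, assuming the pandas Series v from the example above:

def iqr_outliers(series, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the box plot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# example: anomalous trading days and their volumes
print(v[iqr_outliers(v)])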

3. Distance/density-based

The typical algorithm is the Local Outlier Factor (LOF). It finds anomalies by introducing the concepts of k-distance, k-distance neighborhood, reachability distance, local reachability density, and local outlier factor. For details, see: Anomaly/outlier detection algorithm: LOF (wangyibo0201's blog, CSDN.NET).
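LOF is available in scikit-learn; a minimal usage sketch, assuming two dense clusters plus a few uniformly scattered outliers:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_inliers = 0.3 * rng.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]

# fit_predict returns -1 for outliers and 1 for inliers
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)

# negative_outlier_factor_ holds the negated LOF score of each training point
print("flagged as outliers:", np.sum(y_pred == -1))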

4. Based on the idea of partitioning

The typical algorithm is the Isolation Forest; the idea is:

Suppose we cut (split) the data space with a random hyperplane; one cut produces two subspaces (imagine slicing a cake in two with one stroke). We then keep cutting each subspace with a random hyperplane, looping until every subspace contains only a single data point. Intuitively, very dense clusters can be cut many times before the cutting stops, while points in very sparse regions end up isolated in their own subspace very early.

The algorithm flow uses hyperplanes to split subspaces recursively, building a process similar to constructing a binary tree (the original figure is not reproduced here).
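For reference, the anomaly score defined in the original Isolation Forest paper is:

$s(x, n) = 2^{-E(h(x))/c(n)}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772$

where h(x) is the path length of point x in one tree, E(h(x)) is its average over all trees, and c(n) normalizes by the average path length of an unsuccessful search in a binary search tree with n points. Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points.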

For details, see:

iForest (Isolation Forest) anomaly detection introduction, and the scikit-learn IsolationForest example.

Example code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)  # seed value lost in the original; 42 as in the scikit-learn docs example

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 1, X - 3, X - 5, X + 6]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-8, high=8, size=(20, 2))

# Fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# Plot the decision function, the samples, and the outliers
xx, yy = np.meshgrid(np.linspace(-8, 8, 50), np.linspace(-8, 8, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red')
plt.axis('tight')
plt.xlim((-8, 8))
plt.ylim((-8, 8))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations",
            "new abnormal observations"],
           loc="upper left")
plt.show()

The result is as follows: red points are the anomalies, white points are the training set, and green points are the test data.

Note: Isolation Forest is not suited to particularly high-dimensional data. Since each split randomly selects a single dimension, a large amount of dimensional information remains unused after the trees are built, reducing the algorithm's reliability. High-dimensional spaces may also contain many noisy or irrelevant dimensions (irrelevant attributes) that interfere with tree construction. The Isolation Forest algorithm has linear time complexity. Because it is an ensemble method, it can be applied to datasets containing large amounts of data. More trees usually make the algorithm more stable, and since each tree is built independently, construction can be distributed across large-scale systems to accelerate computation.
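As a usage sketch of those last points (assuming scikit-learn's implementation), the number of trees and parallel tree construction are controlled by the n_estimators and n_jobs parameters:

from sklearn.ensemble import IsolationForest

# more trees -> more stable scores; n_jobs=-1 builds trees on all CPU cores,
# which works because each tree is grown independently of the others
clf = IsolationForest(n_estimators=200, max_samples=256, n_jobs=-1, random_state=42)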

5. Other algorithms

Includes: One-class SVM and Elliptic Envelope.

Reference: 2.7. Novelty and Outlier Detection (scikit-learn documentation). A brief sketch of both follows.
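A minimal sketch of both algorithms, assuming the same kind of 2-D toy data as in the Isolation Forest example:

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(200, 2) + 2

# One-Class SVM: learns a boundary around the training data;
# nu bounds the fraction of training points treated as outliers
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1).fit(X_train)

# Elliptic Envelope: fits a robust Gaussian and flags low-probability points
ee = EllipticEnvelope(contamination=0.05).fit(X_train)

X_new = np.array([[2.0, 2.0], [-6.0, 6.0]])  # one normal point, one outlier
print(ocsvm.predict(X_new))  # 1 = inlier, -1 = outlier
print(ee.predict(X_new))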

6. Worth mentioning

Among these algorithms, the Isolation Forest and the Local Outlier Factor algorithm give the best results by comparison.

Article references:

    • Isolation Forest
    • LOF: Identifying Density-Based Local Outliers
    • kamidox.com: Anomaly Detection
    • How to implement these five powerful probability distributions in Python (Python, Bole Online blog channel, CSDN.NET)
    • 2.7. Novelty and Outlier Detection
    • iForest (Isolation Forest) anomaly detection
    • Outlier detection algorithm (I)
    • Outlier detection algorithm (II)
    • Anomaly detection algorithm (III)
    • A survey of anomaly detection algorithms
