Summary of anomaly detection algorithm

Source: Internet
Author: User
Tags: svm

Anomaly detection, sometimes called outlier detection (in English usually referred to as novelty detection or outlier detection), is a fairly common class of unsupervised learning algorithms. This post gives a summary of the main anomaly detection algorithms.

1. Usage scenarios for anomaly detection algorithms

When do we need an anomaly detection algorithm? There are three common cases. First, during feature engineering, abnormal data needs to be filtered out so that it does not distort the results of normalization and similar preprocessing. Second, when the feature data has no labeled output, anomaly detection can be used to find the abnormal samples directly. Third, when doing binary classification on labeled feature data where one class has very few training samples and the classes are severely imbalanced, an unsupervised anomaly detection algorithm can also be considered.

2. Common categories of anomaly detection algorithms

The purpose of anomaly detection is to find the samples in a data set that differ significantly from the majority of the data. Common anomaly detection algorithms fall into three categories.

The first category handles anomalous data with statistical methods. These typically build a probability distribution model, compute the probability that each object conforms to the model, and treat objects with very low probability as outliers. For example, the RobustScaler method in feature engineering scales feature values using the quantile distribution of the data: the data is divided into segments by quantile and only the middle segment is used for scaling, e.g. only the data between the 25th and 75th percentiles. This reduces the influence of anomalous data, as sketched below.
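A minimal sketch, using made-up toy data, of how RobustScaler limits the influence of outliers compared with ordinary standardization: it centers on the median and scales by the interquartile range (25th to 75th percentile) instead of the mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy 1-D feature with one obvious outlier (assumed data for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(X)
standard = StandardScaler().fit_transform(X)

print(robust.ravel())    # inliers keep a reasonable spread; the outlier still stands out
print(standard.ravel())  # the outlier inflates the mean/std and compresses the inliers
```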

The second category detects anomalies based on clustering. This is easy to understand: most clustering algorithms rely on the distribution of the data features, so if clustering produces a cluster whose sample count is much smaller than the other clusters, and whose feature values are distributed very differently from the other clusters, then the points in that cluster are most likely anomalies. For example, the BIRCH clustering algorithm and the DBSCAN density-based clustering algorithm can detect outliers as a by-product of clustering; a small sketch follows.
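A minimal sketch, on made-up toy data, of clustering-based outlier detection with DBSCAN: points that do not belong to any dense cluster are labeled -1 (noise), and those are the candidate anomalies.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
inliers = rng.normal(loc=0.0, scale=0.5, size=(100, 2))   # one dense cluster
outliers = np.array([[5.0, 5.0], [-6.0, 4.0]])            # two isolated points
X = np.vstack([inliers, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(X[labels == -1])  # the isolated points are reported as noise (label -1)
```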

The third category consists of dedicated anomaly detection algorithms. Unlike clustering algorithms, for which anomaly detection is only a by-product, the sole purpose of these algorithms is to detect anomalies. Representative algorithms of this kind are One Class SVM and Isolation Forest.

The following sections discuss One Class SVM and Isolation Forest in detail.

3. One Class SVM algorithm

One Class SVM also belongs to the large family of support vector machines, but unlike the traditional classification and regression support vector machines, which are based on supervised learning, it is an unsupervised learning method: it does not require labeled outputs for the training set.

So without category labels, how do we find the dividing boundary and the support vectors? One Class SVM has many approaches to this problem. Here we only introduce one particular line of thinking, SVDD. In SVDD, we expect all non-anomalous samples to belong to the positive class, and instead of a separating hyperplane it uses a hypersphere to do the division: the algorithm obtains a spherical boundary around the data in feature space and minimizes the volume of this hypersphere, thereby minimizing the influence of outlier data.

Assume the resulting hypersphere has center $o$ and radius $r > 0$, and we minimize the hypersphere volume $V(r)$; the center $o$ is a linear combination of the support vectors. Similar to the traditional SVM approach, we could require that the distance from every training point $x_i$ to the center be strictly less than $r$, but we also introduce slack variables $\xi_i$ with penalty coefficient $C$. The optimization problem is:

$$\min_{r,o}\; V(r) + C\sum\limits_{i=1}^m \xi_i$$
$$\|x_i - o\|_2 \leq r + \xi_i,\;\; i=1,2,...,m$$
$$\xi_i \geq 0,\;\; i=1,2,...,m$$

Similar to the earlier support vector machine derivations, after solving with the Lagrangian dual we can judge whether a new data point $z$ belongs to the class: if the distance from $z$ to the center is less than or equal to the radius $r$, it is not an anomaly; if it falls outside the hypersphere, it is an anomaly.

In Sklearn, we can use OneClassSVM in the svm package to do anomaly detection. OneClassSVM also supports kernel functions, so the usual ideas for tuning an SVM apply here as well. A minimal usage sketch follows.
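A minimal sketch, on made-up toy data, of OneClassSVM in Sklearn. The parameter values here (nu=0.05, RBF kernel with gamma="scale") are illustrative assumptions: nu upper-bounds the fraction of training points treated as outliers, and gamma can be tuned like an ordinary SVM.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # "normal" training data
X_new = np.array([[0.1, -0.2], [4.0, 4.0]])              # one inlier, one outlier

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(clf.predict(X_new))            # +1 = inlier, -1 = anomaly
print(clf.decision_function(X_new))  # signed distance to the learned boundary
```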

4. Isolation Forest algorithm

Isolation Forest (hereinafter IForest) was proposed by students of Professor Zhou Zhihua. It mainly uses ensemble learning ideas to do anomaly detection and has by now become almost the default choice among anomaly detection algorithms. I also briefly introduced the idea of IForest in section 4.3 of my earlier post on the principles of the bagging and random forest algorithms; it is a member of the random forest family.

The algorithm itself is not complex and consists of three main steps. The first step is to train and build the decision trees of the corresponding random forest; these decision trees are generally called iTrees. The second step is to compute, for a data point $x$ to be tested, the depth $h_t(x)$ at which it finally lands in each of the $T$ iTrees, from which we obtain the average depth $h(x)$ of $x$ over all trees. The third step is to decide whether $x$ is an anomaly based on $h(x)$.

For the first step, building the decision trees, the method differs from an ordinary random forest.

In an ordinary random forest, when the training samples for each decision tree are drawn, the number of samples drawn equals the size of the training set. IForest does not need to sample that many: in general, the number of samples drawn is much smaller than the training set size. The reason is that our goal is anomaly detection, and usually only a subset of the samples is enough to distinguish the anomalous points.

In addition, when making the splitting decision in a decision tree, we cannot compute splitting criteria such as the Gini coefficient or the variance, because we have no labeled output. Instead, we randomly select a feature and then randomly select a split threshold on that feature to split the tree, continuing until the depth of the tree reaches the threshold limit or only one sample remains in a node. A minimal sketch of growing one such tree follows.
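A minimal sketch, not the library implementation, of how a single iTree could be grown under the rules just described: at each node a feature and a split threshold are chosen uniformly at random, and recursion stops when a node holds at most one sample or the depth limit is reached. The depth limit of 8 and the toy data are assumptions for illustration.

```python
import numpy as np

def build_itree(X, depth=0, max_depth=8, rng=np.random):
    """Grow one isolation tree (iTree) by random splits; no labels are used."""
    n, d = X.shape
    if n <= 1 or depth >= max_depth:
        return {"size": n}                      # leaf: record how many samples landed here
    feature = rng.randint(d)                    # randomly selected feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:
        return {"size": n}                      # feature is constant, cannot split further
    threshold = rng.uniform(lo, hi)             # randomly selected split threshold
    left = X[X[:, feature] < threshold]
    right = X[X[:, feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_itree(left, depth + 1, max_depth, rng),
        "right": build_itree(right, depth + 1, max_depth, rng),
    }

# Example: grow one iTree on toy data with a single far-away point.
X = np.vstack([np.random.normal(size=(100, 2)), [[6.0, 6.0]]])
tree = build_itree(X)
```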

The second step computes the average depth $h(x)$ of the point to be tested over all trees. We first traverse each iTree and obtain the depth $h_t(x)$ at which the data point $x$ ends up in the $t$-th iTree. This $h_t(x)$ represents the depth within the tree: the closer to the root node, the smaller $h_t(x)$; the closer to the bottom, the larger $h_t(x)$. The depth of the root node is 0.

The third step is to decide whether $x$ is an anomaly according to $h(x)$. We generally use the following formula to compute the anomaly probability score of $x$:
$$s(x,m) = 2^{-\frac{h(x)}{c(m)}}$$
The range of $s(x,m)$ is $[0,1]$; the closer the value is to 1, the higher the probability that the point is an anomaly. Here $m$ is the number of samples, and $c(m)$ is given by:
$$c(m) = 2\ln(m-1) + \xi - \frac{2(m-1)}{m}$$
where $\xi$ is the Euler-Mascheroni constant.

From the expression of $s(x,m)$ we can see that if the average depth $h(x) \to 0$, then $s(x,m) \to 1$, i.e. the point is almost certainly an anomaly; if $h(x) \to m-1$, then $s(x,m) \to 0$, i.e. it is almost certainly not an anomaly; and if $h(x) \to c(m)$, then $s(x,m) \to 0.5$, i.e. the probability of it being an anomaly is about 50%. In practice we can set a threshold on $s(x,m)$ and tune it, treating points whose scores exceed the threshold as anomalies. A small numeric check follows.
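A small numeric check of the score formula as written in the text, using an assumed subsample size of m = 256: a very shallow average depth pushes the score toward 1, while a depth equal to c(m) gives exactly 0.5.

```python
import math

def c(m, xi=0.5772156649):
    """Normalization constant c(m) as given in the text; xi is the Euler-Mascheroni constant."""
    return 2 * math.log(m - 1) + xi - 2 * (m - 1) / m

def score(h, m):
    """Anomaly score s(x, m) = 2^(-h(x)/c(m))."""
    return 2 ** (-h / c(m))

m = 256
print(round(c(m), 3))            # normalization constant for m samples
print(round(score(1.0, m), 3))   # shallow average depth -> score pushed toward 1
print(round(score(c(m), m), 3))  # average depth equal to c(m) -> score exactly 0.5
```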

In Sklearn, we can use IsolationForest in the ensemble package to do anomaly detection. A minimal usage sketch follows.
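A minimal sketch, on made-up toy data, of IsolationForest in Sklearn. The parameter max_samples controls the subsampling described above ("auto" uses min(256, n_samples)), and contamination, set here to an assumed 0.02, is the expected fraction of anomalies used to set the decision threshold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(1000, 2))              # mostly "normal" data
X_new = np.array([[0.0, 0.1], [6.0, -5.0]])       # one inlier, one outlier

clf = IsolationForest(n_estimators=100, max_samples="auto",
                      contamination=0.02, random_state=0).fit(X_train)
print(clf.predict(X_new))        # +1 = normal, -1 = anomaly
print(clf.score_samples(X_new))  # lower (more negative) scores are more anomalous
```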

5. Anomaly Detection Algorithm Summary

IForest is currently one of the most commonly used anomaly detection algorithms, and its advantages are striking: it has linear time complexity, and because it is a random-forest-style method it can be used on data sets containing large amounts of data. The more trees there are, the more stable the algorithm usually is. Since each tree is built independently of the others, training can be deployed on large-scale distributed systems to speed up computation. Given the current trend toward big data analysis, its popularity is well deserved.

But IForest also has some drawbacks. For example, it does not work well on particularly high-dimensional data: each cut of the data space uses one randomly selected dimension and a random split value on that dimension, so after the trees are built a large number of dimensions may never have been used, which reduces the reliability of the algorithm. In such cases, consider using One Class SVM instead.

In addition, IForest is only sensitive to globally sparse points and is not good at handling locally sparse points, so its detection of some local anomalies may not be very accurate.

One Class SVM, on the other hand, is suited to small and medium-sized data analysis; especially when the amount of training data is not particularly large, it is often handier to use than IForest, which makes it more suitable for prototyping analysis.

(Reprinting is welcome; please indicate the source. Questions and discussion are welcome at: [email protected])
