Anomaly Detection Algorithm: Isolation Forest

Source: Internet
Author: User

In 2010, Zhi-Hua Zhou of Nanjing University and his collaborators proposed the anomaly detection algorithm Isolation Forest. It is very practical in industry: the algorithm performs well, is time-efficient, and can effectively handle high-dimensional and massive data. Below is a brief summary of the algorithm.

iTree

"Forest" naturally suggests trees — after all, a forest is composed of trees. Before looking at Isolation Forest (iForest for short), let us first look at how an Isolation Tree (iTree for short) is constructed. An iTree is a random binary tree: each node either has exactly two children or is a leaf node with no children. Given a data set D, where all attributes of D are continuous variables, an iTree is constructed as follows:

    • Randomly select an attribute attr;
    • Randomly select a split value of this attribute (between its minimum and maximum);
    • Partition the records by attr: records whose attr is less than the value go to the left child, records greater than or equal to the value go to the right child;
    • Recursively construct the left child and the right child until one of the following stopping conditions is met:
    • The incoming data set has only one record, or all of its records are identical;
    • The height of the tree reaches the limit height.
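The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pseudo-code; the `Node` class and function names are my own:

```python
import random

class Node:
    def __init__(self, left=None, right=None, attr=None, split=None, size=0):
        self.left, self.right = left, right   # child subtrees (None for a leaf)
        self.attr, self.split = attr, split   # split attribute index and value
        self.size = size                      # number of records in a leaf

def build_itree(data, height, height_limit):
    # stop: height limit reached, or a single / identical set of records
    if height >= height_limit or len(data) <= 1:
        return Node(size=len(data))
    attr = random.randrange(len(data[0]))     # randomly select an attribute
    values = [row[attr] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:                              # all records identical on attr
        return Node(size=len(data))
    split = random.uniform(lo, hi)            # randomly select a split value
    left = [row for row in data if row[attr] < split]
    right = [row for row in data if row[attr] >= split]
    return Node(build_itree(left, height + 1, height_limit),
                build_itree(right, height + 1, height_limit),
                attr, split)
```

For example, `build_itree(data, 0, 8)` builds one tree over a small data set; every record ends up in exactly one leaf.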

  

Once the iTree is built, it can be used for prediction: a test record is passed down the iTree to see which leaf node it falls on. The hypothesis behind iTree-based anomaly detection is that anomalies are generally very rare, so in an iTree they are quickly isolated into leaf nodes close to the root. The path length h(x) from the root to the leaf node can therefore be used to judge whether a record x is an anomaly. For a data set containing n records, the constructed tree has a minimum height of log(n) and a maximum of n-1. The paper notes that normalizing directly by log(n) or n-1 cannot guarantee a bounded score and makes comparison inconvenient, so a slightly more involved normalization formula is used:

$$s(x,n) = 2^{-\frac{h(x)}{c(n)}}$$

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \quad \text{where } H(k) = \ln(k) + \xi, \ \xi \text{ is Euler's constant}$$

$s(x,n)$ is the iTree anomaly score of record x in training data of n samples. Its range is [0,1]: the closer to 1, the more likely x is an anomaly; the closer to 0, the more likely x is a normal point. If the majority of the training samples have $s(x,n)$ close to 0.5, there are no distinct anomalies in the entire data set.
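The two formulas above translate directly into code (a straightforward transcription; the function names are my own):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, xi in the formula above

def H(k):
    # harmonic number approximation: H(k) = ln(k) + xi
    return math.log(k) + EULER_GAMMA

def c(n):
    # average path length of an unsuccessful BST search; normalizes h(x)
    if n <= 1:
        return 0.0
    return 2.0 * H(n - 1) - 2.0 * (n - 1) / n

def s(h_x, n):
    # anomaly score in [0, 1]: near 1 -> likely anomaly, near 0 -> normal
    return 2.0 ** (-h_x / c(n))
```

Note the behavior at the boundary: when h(x) equals c(n) (an average-length path), the score is exactly 0.5; shorter paths push the score toward 1.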

With an attribute chosen at random and a split value chosen at random, a single tree built this casually is certainly not reliable on its own — but the combination of many such trees becomes strong;

iForest

With the iTree figured out, let us see how an iForest is constructed given a data set D containing N records. The iForest method is similar to Random Forest: each tree is built from a random sample of the data set, which guarantees differences between the trees. Unlike RF, however, the sample size $\psi$ does not need to equal N and can be far smaller. The paper notes that sample sizes above 256 bring little further improvement, and that larger samples only waste computation time. Why isn't more data better here, as it is for other algorithms? Consider the following two figures:

The left shows the original data, the right shows the sampled data; blue points are normal samples, red points are anomalous samples. Before sampling, normal and anomalous samples overlap and are difficult to separate; after sampling, the anomalous and normal samples can be clearly separated.

In addition to limiting the sample size, a maximum height $l = \lceil \log_2 \psi \rceil$ is set for each iTree. Because anomalous records are relatively few and their path lengths relatively short, and we only need to distinguish normal records from anomalous ones, it suffices to care about the part of the tree below the average height; this makes the algorithm more efficient. After this adjustment, though, the calculation of $h(x)$ needs a small improvement (the paper gives pseudo-code for the iForest construction).
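Pulling the pieces together, here is a minimal self-contained sketch of training and scoring with subsampling, the $\lceil \log_2 \psi \rceil$ height limit, and the leaf-size adjustment to h(x) described below. The dict-based tree representation and all names are my own, not from the paper:

```python
import math
import random

GAMMA = 0.5772156649  # Euler's constant

def c(n):
    # average path length of an unsuccessful BST search; normalizes h(x)
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + GAMMA) - 2.0 * (n - 1) / n

def build(data, height, limit):
    # leaf: height limit reached, or a single / identical set of records
    if height >= limit or len(data) <= 1:
        return {"size": len(data)}
    q = random.randrange(len(data[0]))            # random attribute
    vals = [row[q] for row in data]
    lo, hi = min(vals), max(vals)
    if lo == hi:
        return {"size": len(data)}
    p = random.uniform(lo, hi)                    # random split value
    return {"q": q, "p": p,
            "left":  build([r for r in data if r[q] < p], height + 1, limit),
            "right": build([r for r in data if r[q] >= p], height + 1, limit)}

def path_length(x, node, e=0):
    if "q" not in node:                           # leaf reached
        return e + c(node["size"])                # e + c(size) adjustment
    nxt = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, nxt, e + 1)

def iforest(data, n_trees=100, psi=256):
    psi = min(psi, len(data))
    limit = math.ceil(math.log2(psi))             # height limit l = ceil(log2 psi)
    return [build(random.sample(data, psi), 0, limit)
            for _ in range(n_trees)], psi

def score(x, trees, psi):
    # ensemble score: s(x, psi) = 2^{-E(h(x)) / c(psi)}
    e_h = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-e_h / c(psi))
```

On a tight cluster of normal points plus a distant outlier, the outlier's score should come out clearly higher than that of a point inside the cluster.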

After the iForest is constructed, the results of the individual trees need to be combined when predicting on test data, so $$s(x,n) = 2^{-\frac{E(h(x))}{c(n)}}$$

$E(h(x))$ is the mean of the heights of record x over all trees. The calculation of h(x) itself also needs improving: when a leaf node is created, the algorithm records the number of records contained in that leaf, and this quantity $size$ is used to estimate the average remaining height of the subtree that was never built. h(x) is then calculated as $h(x) = e + c(size)$, where $e$ is the number of edges traversed from the root to the leaf that x falls in.

Working with high-dimensional data

When processing high-dimensional data, the algorithm can be improved: after sampling, instead of using all attributes, the kurtosis coefficient is used to pick out the more valuable attributes before constructing each iTree. This makes it even more like Random Forest: randomly select records, then select attributes.
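Kurtosis-based attribute selection might be sketched as follows (a pure-Python illustration; the paper does not prescribe this exact code, and `top_k_attrs` is a name of my own):

```python
def kurtosis(xs):
    # Pearson kurtosis: E[(x - mu)^4] / sigma^4. Normal data gives about 3;
    # heavy-tailed attributes, which isolate anomalies well, score higher.
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) if m2 > 0 else 0.0

def top_k_attrs(data, k):
    # rank attributes by the kurtosis of their sampled values; keep the k highest
    d = len(data[0])
    ranked = sorted(range(d),
                    key=lambda j: kurtosis([row[j] for row in data]),
                    reverse=True)
    return ranked[:k]
```

For instance, an attribute that is constant except for a single extreme value has very high kurtosis and would be ranked ahead of a near-uniform attribute.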

Using only normal samples

This algorithm is essentially unsupervised learning and does not need class labels. Sometimes anomalous data is so scarce that we would rather keep the few anomalous samples for testing, leaving none for training. The paper mentions that constructing the iForest from normal samples only is feasible: the effectiveness drops somewhat but is still good, and it can be improved by tuning the sample size appropriately.

End of article. When reproducing, please credit the source: http://www.cnblogs.com/fengfenggirl/p/iForest.html

