Data preprocessing-outlier recognition

Source: Internet
Author: User

Original post (Chinese): http://shataowei.com/2017/08/09/%e6%95%b0%e6%8d%ae%e9%a2%84%e5%a4%84%e7%90%86-%e5%bc%82%e5%b8%b8%e5%80%bc%e8%af%86%e5%88%ab/

This post systematically summarizes common methods of outlier recognition, organized as follows:

Spatial recognition: quantile identification

The representative implementation is the box plot:

The upper quartile Q3, also called the 75th percentile of the ascending sequence.
The lower quartile Q1, also called the 25th percentile of the ascending sequence.
The box-plot test flags data greater than Q3 + 3/2*(Q3-Q1) or less than Q1 - 3/2*(Q3-Q1) as outliers. It works whenever the full sample is known; the disadvantage is the cost of sorting when the amount of data is large.
The quantile function in R and the percentile function in Python (e.g. numpy.percentile) implement this directly.
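As a minimal sketch of this rule (the sample values below are made up; numpy.percentile does the quantile computation):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers (box-plot rule)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

data = np.array([9.2, 10.1, 9.8, 10.4, 9.9, 30.0, 10.2])
print(iqr_outliers(data))   # only the 30.0 entry is flagged
```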

Distance recognition

The most common is the Euclidean distance.
For example, the Euclidean distance between two n-dimensional vectors A (x11, x12, ..., x1n) and B (x21, x22, ..., x2n) is d(A, B) = sqrt((x11-x21)^2 + (x12-x22)^2 + ... + (x1n-x2n)^2).

Intuitively, measuring the distance from the blue point B in the figure, the red points A1, A2 and A3 are close to B, while A4 is a distant anomaly.


But there is a hidden problem here: we have made the mistake of judging each point in isolation without considering the overall distribution. Look at the following picture:


In the same picture, the orange background shows the distribution of the original data set. Relative to this distribution, A4 is actually closer to the base point B than A1 and A3 are. So when there are anomalies that are inconsistent with the data distribution, the Mahalanobis distance can be used instead of the Euclidean distance to judge whether a point is an outlier.


The Mahalanobis distance is D(x) = sqrt((x - μ)^T * Σ^(-1) * (x - μ)), where μ is the mean of the features, x is the observed value, and Σ is the covariance matrix of the features.
Besides deciding whether a point is abnormal, the Mahalanobis distance can also be used to measure the similarity of two data sets, a very common application in image recognition and anti-fraud. Its problem is the heavy reliance on Σ: different base populations give different Σ, so it is not very stable.
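A short numpy sketch of the Mahalanobis distance just described; the correlated sample data and the two test points are invented for illustration:

```python
import numpy as np

def mahalanobis_distances(X, data):
    """Mahalanobis distance of each row of X from the distribution of `data`."""
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = X - mu
    # d_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu)), computed row by row
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3.0, 2.5], [2.5, 3.0]], size=500)
points = np.array([[3.0, 3.0],    # along the correlation direction: small distance
                   [3.0, -3.0]])  # against it: large distance, likely an outlier
print(mahalanobis_distances(points, data))
```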

Density identification

There are many density-based identification methods; here are two classic ones. The first is the well-known density-based clustering algorithm DBSCAN; only the idea is discussed here, and the detailed algorithm will be introduced separately.
In simple terms, the following diagram helps to understand it.

We randomly pick a starting point, manually set a minimum neighbourhood radius a and a minimum number of points b that must fall inside that radius, and then expand through density-reachable points. The blue region formed this way is the normal data area; the remaining points in the yellow region are the anomalies.
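The sketch below uses scikit-learn's DBSCAN on assumed synthetic data; points labelled -1 are those left outside every dense region, i.e. the anomalies:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # dense "blue" region
outliers = rng.uniform(low=-8, high=8, size=(10, 2))     # sparse "yellow" points
X = np.vstack([normal, outliers])

# eps plays the role of the neighbourhood radius a, min_samples the minimum
# point count b; points that are not density-reachable get the label -1.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print("flagged as outliers:", np.sum(labels == -1))
```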

In addition, there is another density-based approach, which looks at the density of reference points around a single point. The pseudocode is as follows:

1. Choose two dimensions from the feature set that have not been selected before
2. Map the original point set onto that two-dimensional plane and mark the centre point a of the point set
3. Draw a circle centred at a with radius x, expanding x until the ratio of covered points to the total number of points reaches the minimum threshold
4. Label the set of points that are not covered as Outlier_label
5. Repeat 1-4
6. Count how many times each point received Outlier_label
7. The points with the highest counts are the outliers

This borrows the idea of CLIQUE-style grid mapping plus the DENCLUE density distribution function. The point to watch is the large amount of computation, so it is better suited to small samples. For large samples with many features, you can draw sampled subsets, run steps 1-5 on each subset, and then aggregate with steps 6-7; in practice the result still agrees with the un-sampled result more than 85% of the time.
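A rough, runnable interpretation of steps 1-7 (the 95% coverage threshold, the random data and the planted outlier are assumptions for illustration, not the original author's code):

```python
import numpy as np
from itertools import combinations

def coverage_outlier_counts(X, coverage=0.95, n_rounds=None, seed=0):
    """Count, per point, how often it falls outside the smallest circle (around the
    2-D centroid) covering `coverage` of the points, over many dimension pairs."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pairs = list(combinations(range(p), 2))
    rng.shuffle(pairs)
    pairs = pairs[:n_rounds] if n_rounds else pairs
    counts = np.zeros(n, dtype=int)
    for i, j in pairs:                              # step 1: pick two unused dimensions
        plane = X[:, [i, j]]
        center = plane.mean(axis=0)                 # step 2: centre point a
        dist = np.linalg.norm(plane - center, axis=1)
        radius = np.quantile(dist, coverage)        # step 3: expand x until coverage met
        counts += dist > radius                     # step 4: label uncovered points
    return counts                                   # steps 6-7: high counts = outliers

X = np.random.default_rng(1).normal(size=(200, 6))
X[0] += 10                                          # plant an obvious outlier
print(coverage_outlier_counts(X).argsort()[-3:])    # indices flagged most often
```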

Pauta criterion (3σ rule)

This method is more statistical in nature; since it involves some distance calculation, it is reluctantly placed under spatial recognition.

This criterion applies only to sample data that follows a normal or approximately normal distribution, and it assumes a sufficient amount of data; when the data volume is small, it is best not to use it.

The normal (Gaussian) distribution is one of the most commonly used probability distributions; it is characterized by two parameters, the mean μ and the standard deviation σ. N(0, 1) is the standard normal distribution.


As the graph above shows, when a value deviates from the mean by more than 3 standard deviations, its probability of occurring is less than 0.3%, so data outside the range mean ± 3 standard deviations can be directly flagged as outliers. But once again: this assumes a large amount of data, and when the data volume is small it is best not to use this criterion.
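A minimal sketch of the 3σ rule on assumed synthetic data:

```python
import numpy as np

def three_sigma_outliers(x):
    """Pauta criterion: flag values more than 3 standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) > 3 * x.std()

x = np.random.default_rng(0).normal(loc=50, scale=5, size=10_000)
x[:3] = [120, -40, 95]                      # inject a few extreme values
print(np.where(three_sigma_outliers(x))[0]) # the injected points, plus any natural 3-sigma tail points
```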

Statistical test identification: G-test or likelihood ratio method

The G-test is used mostly in medicine, typically to check whether observed variable values conform to a theoretically expected ratio. It is now also used in e-commerce, travel and search to test the quality of unsupervised models and of the data itself.

When a new model goes live and some users' feedback looks particularly abnormal, and we are not sure whether that data is truly anomalous and should be removed from later analysis, we can use this statistical method to decide.

The statistic is G = 2 * Σ O_i * ln(O_i / E_i), where O is the observed value and E is the expected value. If the hourly order volume on our site over 24 hours is stable, we can compute an expected mean for each interval, E1, E2, ..., E24. After the new model goes live, the problem user group has an observed 24-hour order distribution O1, O2, ..., O24. Applying the formula above gives a G value.

Then, comparing the G value against the reference distribution of the G-test, we obtain the confidence level at which the target users deviate from expectation and check whether it meets our minimum requirement. The likelihood ratio method is similar; the related papers can be searched directly.
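A sketch of the G-test with scipy (power_divergence with lambda_="log-likelihood" is the G-test); every number below is invented purely for illustration:

```python
import numpy as np
from scipy.stats import power_divergence

# Hypothetical hourly order counts: the expected 24-hour profile (from stable history)
# and the observed profile of the flagged user group.
expected = np.array([30, 22, 15, 10, 8, 9, 14, 25, 40, 55, 60, 62,
                     65, 60, 58, 55, 57, 60, 70, 75, 68, 55, 45, 35], dtype=float)
observed = np.array([28, 25, 14, 12, 9, 8, 16, 24, 38, 52, 63, 60,
                     200, 58, 55, 57, 54, 62, 72, 73, 70, 53, 47, 33], dtype=float)

# Rescale expectations so the totals match, as the chi-square family assumes.
expected *= observed.sum() / expected.sum()

# lambda_="log-likelihood" turns power_divergence into the G-test.
g_stat, p_value = power_divergence(f_obs=observed, f_exp=expected,
                                   lambda_="log-likelihood")
print(g_stat, p_value)   # a small p-value suggests the group deviates from expectation
```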

Model Fitting

These are simple supervised methods, including Bayesian classification, decision trees, linear regression and so on.

They require knowing two sets of data in advance, normal data and abnormal data, fitting a curve to their corresponding features as well as possible, and then using that curve directly to judge whether new data is normal.
As an example:
In financial lending we have a number of normal borrowers and a number of overdue borrowers. We build a scorecard model on the features of these known users, assuming that Zhima credit score, length of mobile phone usage and gender are the key features. To decide whether new, unlabelled data belongs to a normal user, we simply feed it into the previously fitted scorecard curve and get a 0-1 probability estimate.
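A toy version of this idea using logistic regression (all features, values and labels below are fabricated; a real scorecard also involves binning and WOE encoding, which is omitted here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per the example: credit score, phone-usage length (months),
# gender flag. Labels: 1 = overdue user, 0 = normal user.
rng = np.random.default_rng(0)
X_normal = np.column_stack([rng.normal(700, 40, 500), rng.normal(36, 10, 500),
                            rng.integers(0, 2, 500)])
X_overdue = np.column_stack([rng.normal(580, 50, 80), rng.normal(12, 6, 80),
                             rng.integers(0, 2, 80)])
X = np.vstack([X_normal, X_overdue])
y = np.r_[np.zeros(500), np.ones(80)]

model = LogisticRegression(max_iter=1000).fit(X, y)

# 0-1 probability estimate for a new, unlabelled applicant
new_user = np.array([[640.0, 20.0, 1.0]])
print(model.predict_proba(new_user)[:, 1])   # probability of being an overdue user
```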

However, model fitting has rather limited applicability: for the vast majority of anomaly-recognition problems there is no pre-existing labelled history, or the labelled data does not cover all the potential cases, which leads to large judgment errors. It is therefore generally used only as one module of an ensemble model, and relying mainly on labels is not recommended.

The University of Minnesota group has a survey of anomaly-detection papers, covering supervised, semi-supervised and unsupervised models; if you are interested, see the survey: http://cucis.ece.northwestern.edu/projects/dms/publications/anomalydetection.pdf

Variable dimension recognition

First, let's take a look at the pseudo-code of PCA

1. Subtract the mean, to simplify the subsequent covariance calculation
2. Compute the covariance matrix and its eigenvalues and eigenvectors
3. Sort the eigenvalues from large to small; an eigenvalue reflects the variance contribution, and the larger the eigenvalue the greater the contribution
4. Keep the largest n eigenvalues and their corresponding eigenvectors
5. Map the data into the new space spanned by those n eigenvectors
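A minimal numpy sketch of these five steps, on assumed random data:

```python
import numpy as np

def pca(X, n_components):
    """PCA following the five steps above: centre, covariance, eigen-decompose,
    keep the top eigenvectors, project."""
    X_centered = X - X.mean(axis=0)                 # 1. subtract the mean
    cov = np.cov(X_centered, rowvar=False)          # 2. covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)        # 2. eigenvalues / eigenvectors
    order = np.argsort(eig_vals)[::-1]              # 3. sort from large to small
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    P = eig_vecs[:, :n_components]                  # 4. keep the top n eigenvectors
    return X_centered @ P, eig_vals                 # 5. map to the new space

X = np.random.default_rng(0).normal(size=(100, 5))
new_data, eig_vals = pca(X, n_components=2)
print(new_data.shape, eig_vals)
```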

The core idea of PCA is to replace the original features with as few combined features (principal components) as possible while preserving the maximum variance of the original data.
PCA can be used to obtain the first principal component, the second principal component ....
For an ordinary data set, the amount of normal data is much larger than the amount of abnormal data, so the variance of the normal data dominates. The top-ranked principal components produced by PCA explain most of the variance of the original data, so in theory the first principal components mainly reflect the variance of the normal values, while the last principal components reflect the variance of the anomalies. After the original data is mapped onto the principal components, the relative properties of the normal samples and the abnormal samples do not change.

Let orgin_data be a data set with p dimensions and X its covariance matrix; by eigen (singular value) decomposition we can write X = P * D * transpose(P):


where D is a diagonal matrix whose entries are the eigenvalues of X, and each column of P is the corresponding eigenvector of X. The eigenvalues in D are arranged from large to small, and the column vectors of P are reordered accordingly.
We select the top-j eigenvalues in D and their corresponding eigenvectors in P, forming the (p, j)-dimensional matrix Pj, and then map the target data set: new_data = orgin_data * Pj, so new_data is an (n, j)-dimensional matrix. If we map back (from the principal-component space to the original space), the reconstructed data set is back_data = transpose(Pj * transpose(new_data)) = new_data * transpose(Pj), the data set rebuilt from the top-j principal components, which is an (n, p)-dimensional matrix.

So we define the outlier scores as follows:

Score(Xi) = Σ_{j=1..p} ( ||Xi - back_data_j(Xi)|| * ev(j) )

with the weight ev(j) given by

ev(j) = ( Σ_{k=1..j} λk ) / ( Σ_{k=1..p} λk )

To explain the two formulas above: for each j we take the Euclidean norm of the difference between a row of orgin_data and its reconstruction from the top-j principal components mapped back to the original space, and then weight that term by ev(j). The first principal components mostly represent normal data, so they get small weights; when j reaches the last principal component the weight is largest, equal to 1.


The larger the outlier score, the more likely the point is an anomaly.
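A sketch of this weighted reconstruction-error score in numpy (random data with one planted anomaly; it follows the formulas above rather than any official implementation):

```python
import numpy as np

def pca_outlier_scores(X):
    """Outlier score: for each j, the norm of (x - reconstruction from top-j components),
    weighted by the cumulative explained-variance ratio ev(j)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    ev = np.cumsum(eig_vals) / eig_vals.sum()       # ev(j), reaching 1 at j = p
    p = X.shape[1]
    scores = np.zeros(len(X))
    for j in range(1, p + 1):
        Pj = eig_vecs[:, :j]                        # (p, j) matrix of top-j eigenvectors
        back = Xc @ Pj @ Pj.T                       # map to the new space and back
        scores += np.linalg.norm(Xc - back, axis=1) * ev[j - 1]
    return scores

X = np.random.default_rng(0).normal(size=(200, 4))
X[0] = [8, -8, 8, -8]                               # planted anomaly
print(pca_outlier_scores(X).argsort()[-3:])         # highest scores = most anomalous
```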

Neural network Identification

Neural networks, which have been very popular lately, can likewise be used for supervised anomaly recognition; here we introduce Replicator Neural Networks (RNNs).


From the figure we can see:
1. The input layer has the same number of variables as the output layer
2. The middle layers have fewer nodes than the input and output layers
3. The whole training process is one of compression followed by decompression

Conventionally, we measure the reconstruction error of the model with the mean squared error, MSE = (1/n) * Σ (x_i - o_i)^2, where x is the input and o the reconstructed output.

Let's take a general look at how the replicator network works. The leftmost layer is the input layer holding the original data, and the rightmost layer is the output layer producing the reconstructed data. The activation functions differ between the intermediate layers, and so do the learned parameters, but within one layer the activation function is the same.
For our anomaly recognition, in the second and fourth layers (k = 2, 4) the activation function is chosen as tanh:

The tanh curve is shown below; it compresses the raw data into the range -1 to 1, so the original data becomes bounded.

For the middle layer (k = 3), the activation function is a step-like (staircase) function.

where N is the number of steps and a3 controls how sharply each step rises; the larger N is, the more levels the output is divided into.

For example, with N = 5 it looks like this:

And with N = 3:


The advantage is that, as N grows, normal points and outliers are pushed into a discrete set of step levels, so anomalies concentrate in their own ranges.
By training the replicator network in this supervised way, we obtain a classifier of abnormal samples that identifies the outliers.
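As a rough sketch of the idea (not the exact architecture from the paper): a plain bottleneck autoencoder trained to reproduce its input, with large reconstruction error marking likely outliers. scikit-learn has no staircase activation, so tanh layers stand in for it, and the data is assumed:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
X[:10] += 6                                   # planted anomalies

Xs = StandardScaler().fit_transform(X)

# Bottleneck network: narrow middle layer, tanh activations, trained to copy its input.
net = MLPRegressor(hidden_layer_sizes=(6, 3, 6), activation='tanh',
                   max_iter=2000, random_state=42)
net.fit(Xs, Xs)                               # learn to reproduce the input

mse = np.mean((net.predict(Xs) - Xs) ** 2, axis=1)   # per-sample reconstruction error
print(np.argsort(mse)[-10:])                  # largest errors = likely outliers
```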

Isolation Forest

In 2008, Professor Zhou Zhihua of Nanjing University and his collaborators proposed an outlier-recognition algorithm based on binary trees. It works very well in industry; recently I also built a churned-user model with it, and the measured results were excellent.

Like a random forest, an isolation forest is made up of isolation trees. First, the logic of a single isolation tree:

Method:
1. Randomly select an attribute feature from the original data;
2. Randomly select a split value of that attribute from the data;
3. Split the records on feature at value: records whose feature is less than value go to the left subset, records greater than or equal to value go to the right subset;
4. Repeat 1-3 until:
4.1. the incoming data set has only one record, or all of its records are identical;
4.2. the tree reaches the height limit.
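A compact Python sketch of this single-tree logic (for clarity it only tracks the path of one query point; the split value is drawn uniformly between the attribute's min and max, as in the standard isolation-forest paper, and the data is assumed):

```python
import numpy as np

def itree_path_length(x, X, height_limit, depth=0, rng=None):
    """Depth h(x) at which point x is isolated by one random isolation tree built on X."""
    rng = rng or np.random.default_rng()
    # stopping rules 4.1 / 4.2: a single (or all-identical) record set, or the height limit
    if depth >= height_limit or len(X) <= 1 or np.all(X == X[0]):
        return depth
    feature = rng.integers(X.shape[1])                    # 1. random attribute
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                          # nothing left to split on
        return depth
    value = rng.uniform(lo, hi)                           # 2. random split value
    mask = X[:, feature] < value                          # 3. split: < value left, >= value right
    subset = X[mask] if x[feature] < value else X[~mask]  # follow the branch x falls into
    return itree_path_length(x, subset, height_limit, depth + 1, rng)

X = np.random.default_rng(0).normal(size=(256, 3))
X[0] = [9.0, 9.0, 9.0]                                    # planted outlier
limit = int(np.ceil(np.log2(len(X))))                     # height limit = ceiling(log2(n))
print(itree_path_length(X[0], X, limit),                  # outlier: usually isolated early
      itree_path_length(X[50], X, limit))                 # normal point: usually hits the limit
```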

The general idea is illustrated in the figure below:

In theory, outliers are few and lie far from the bulk of the data, so they tend to be isolated into leaf nodes after only a few splits. Therefore, by computing the depth h(x) at which each sample ends up, we can judge how likely the data is to be abnormal. The paper measures this with the score s(x, n) = 2^(-h(x)/c(n)).

where h(x) is the depth of the node that x falls into, c(n) is a normalization term determined by the sample size n, and s(x, n) lies in [0, 1]. Normal data usually has s(x, n) below about 0.8; the closer s(x, n) is to 1, the more likely the data is anomalous.

A single tree is not reliable enough, so we use an ensemble, a whole forest of trees, to improve accuracy.
For the isolation forest, the original formula for s(x, n) is modified by replacing h(x) with E(h(x)), the average of h(x) over all the trees.
Meanwhile:
1. Because the number of trees is large, we need to control the computational cost, so each tree is built on a sampled subset far smaller than the original data set; according to Professor Zhou's paper, increasing the sample size beyond 256 brings little further improvement.
2. We can also limit the depth, giving every tree a maximum depth of length = ceiling(log2(sample size)); once a tree grows deeper than this, most of the remaining nodes are normal data, and further splitting loses its value for anomaly detection.
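For practical use, scikit-learn's IsolationForest implements this ensemble directly; a short usage sketch on assumed data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X[:5] += 8                                   # a few planted anomalies

# max_samples=256 matches the subsampling size discussed above; contamination is
# the assumed fraction of outliers and is only a guess here.
clf = IsolationForest(n_estimators=100, max_samples=256,
                      contamination=0.01, random_state=42)
labels = clf.fit_predict(X)                  # -1 = anomaly, 1 = normal
scores = -clf.score_samples(X)               # higher = more anomalous
print(np.where(labels == -1)[0][:10])
```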
