In machine learning, anomaly detection and handling is a relatively small branch, or rather a by-product of machine learning. In a typical prediction problem, the model is an expression of the structure of the overall sample data and captures the general properties of the sample as a whole; points whose behavior is completely inconsistent with those properties are what we call anomalies. Anomalies are usually unwelcome in prediction problems, because a prediction problem cares about the properties of the overall sample, while the mechanism that generates the anomalies is entirely inconsistent with it. If the algorithm is sensitive to anomalies, the resulting model will not represent the overall sample well, and its predictions will be inaccurate.
On the other hand, in some scenarios the anomalies themselves are of great interest to the analyst, for example in disease prediction. Healthy people's physical indicators are usually similar along certain dimensions; if a person's indicators become abnormal, their physical condition has certainly changed in some respect. Of course, this change is not necessarily caused by disease (such changes are often just noise), but the occurrence and detection of anomalies is an important starting point for disease prediction. Similar scenarios apply to credit fraud, cyber attacks, and so on.
Common outlier detection approaches include statistics-based methods, clustering-based methods, and some specialized outlier detection methods; these are described below.
If we use pandas, we can call describe() directly to view a statistical summary of the data (just a cursory look at statistics such as the count, mean, quartiles, and extremes), as follows:
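A minimal sketch with hypothetical data (the column name `value` and the numbers are made up for illustration):

```python
import pandas as pd

# A small numeric sample with one suspicious value (hypothetical data)
df = pd.DataFrame({"value": [3.2, 3.5, 3.1, 3.4, 3.3, 9.8]})

# describe() reports count, mean, std, min, quartiles and max;
# a max far above the 75% quantile hints at a possible outlier
print(df.describe())
```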
Alternatively, a simple scatter plot also makes the existence of outliers very clear, as shown below.
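As an illustrative sketch, such a plot can be produced with matplotlib by plotting hypothetical one-dimensional data against its index; isolated points far from the main band are the outlier candidates:

```python
import matplotlib.pyplot as plt

# Hypothetical data with a few extreme values injected
values = [3.2, 3.5, 3.1, 3.4, 3.3, 9.8, 3.2, 3.6, -2.5, 3.4]

# Plot each value against its index; points far from the main band
# stand out immediately
plt.scatter(range(len(values)), values)
plt.xlabel("index")
plt.ylabel("value")
plt.show()
```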
This principle has a precondition: the data must follow a normal distribution. Under the 3σ principle, a value that deviates from the mean by more than 3 standard deviations can be considered an outlier. Since the probability of falling within μ ± 3σ is 99.7%, the probability of a value lying more than 3σ from the mean is P(|x − μ| > 3σ) ≈ 0.003, an extremely rare event. If the data do not follow a normal distribution, outliers can still be described by how many standard deviations they lie from the mean.
The red arrows point to the outliers.
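A minimal sketch of the 3σ rule, assuming normally distributed data (the sample here is synthetic, with two outliers injected):

```python
import numpy as np

# Synthetic normally distributed data plus two injected outliers
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 1000), [8.0, -7.5]])

mean, std = data.mean(), data.std()
# Flag points farther than 3 standard deviations from the mean
outliers = data[np.abs(data - mean) > 3 * std]
print(outliers)
```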
This method uses the interquartile range (IQR) of the box plot to detect outliers, and is also known as Tukey's test. The box plot is defined as follows:
The interquartile range (IQR) is the difference between the upper quartile and the lower quartile. Taking 1.5 times the IQR as the standard, we stipulate: points more than 1.5 × IQR above the upper quartile, or more than 1.5 × IQR below the lower quartile, are outliers. Below is a Python implementation, which mainly uses numpy's percentile method.
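A sketch of such an implementation (the data values are made up for illustration):

```python
import numpy as np

data = np.array([2.1, 2.4, 2.3, 2.2, 2.6, 2.5, 9.0, 2.3, -4.2])

# Upper and lower quartiles via numpy's percentile method
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR outside the quartiles are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [ 9.  -4.2]
```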
You can also implement this visually with seaborn's boxplot method:
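For example, reusing the same hypothetical data as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

data = [2.1, 2.4, 2.3, 2.2, 2.6, 2.5, 9.0, 2.3, -4.2]

# seaborn draws points beyond 1.5 * IQR as individual fliers
sns.boxplot(x=data)
plt.show()
```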
The red arrows point to the outliers.
The above are simple, commonly used methods for identifying outliers. Next we introduce some more complex outlier detection algorithms; given the breadth of the material, only the core ideas are presented, and interested readers can study them further on their own.
This approach usually constructs a probability distribution model, computes the probability that each object conforms to the model, and treats objects with low probability as anomalies. If the model is a collection of clusters, an anomaly is an object that does not clearly belong to any cluster; if the model is a regression, an anomaly is an object relatively far from its predicted value.
Probabilistic definition of an outlier: an outlier is an object that has low probability under the probability distribution model of the data. The premise is that you must know what distribution the dataset follows; if the distribution is estimated incorrectly (for example, the data actually follow a heavy-tailed distribution), the test can be misleading.
For example, the RobustScaler method in feature engineering: when scaling feature values, it uses quantiles of the data's distribution to divide the data into segments and only uses the middle segment for scaling, for example only the data from the 25th percentile to the 75th percentile. This reduces the influence of anomalous data.
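A minimal sketch using the scikit-learn implementation of RobustScaler, with made-up data:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100.0 is an outlier

# Centers on the median and scales by the 25%-75% quantile range,
# so the extreme value barely distorts the scaling of normal points
scaler = RobustScaler(quantile_range=(25.0, 75.0))
print(scaler.fit_transform(X))
```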
Advantages and disadvantages: (1) there is a solid statistical theoretical foundation, and when there is sufficient data and knowledge of the type of test to use, these tests can be very effective; (2) for multivariate data fewer options are available, and for high-dimensional data these tests perform poorly.
Statistical methods use the distribution of the data to spot outliers, and some even require certain distributional assumptions; since real data rarely satisfy such assumptions exactly, these methods have some limitations in practice.
It is often easier to define a meaningful proximity measure for a dataset than to determine its statistical distribution. In this approach, an object's outlier score is given by the distance to its k nearest neighbors (KNN), which makes the method more general and easier to apply than statistical methods.
Note that the outlier score is highly sensitive to the value of k. If k is too small, a small group of nearby outliers can give each other low outlier scores; if k is too large, all objects in a cluster with fewer than k points may become outliers. To make the scheme more robust to the choice of k, the average distance to the k nearest neighbors can be used, as in the sketch below.
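A minimal sketch of this averaged KNN-distance score, using scikit-learn's NearestNeighbors on synthetic data (k chosen arbitrarily):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (100, 2)), [[6.0, 6.0]]])

k = 5
# k + 1 because each point's nearest neighbor is itself
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nbrs.kneighbors(X)

# Average distance to the k nearest neighbors as the outlier score
scores = dist[:, 1:].mean(axis=1)
print(X[np.argmax(scores)])  # the injected point [6. 6.]
```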
Advantages and disadvantages: (1) it is simple; (2) proximity-based methods require O(m²) time and are not suitable for large datasets; (3) the method is sensitive to the choice of parameters; (4) it cannot handle datasets with regions of different density, because it uses a global threshold and cannot account for such density variation.
From the density-based point of view, outliers are objects in low-density regions. Density-based outlier detection is closely related to proximity-based detection, since density is usually defined in terms of proximity. One common definition of density is the reciprocal of the average distance to the k nearest neighbors: if that distance is small, the density is high, and vice versa. Another is the density definition used by the DBSCAN clustering algorithm, namely the number of objects within a specified distance d of the object.
Advantages and disadvantages: (1) it gives a quantitative measure of how much of an outlier an object is, and it handles data with regions of differing density well; (2) like distance-based methods, these methods necessarily have O(m²) time complexity, although for low-dimensional data specific data structures can achieve O(m log m); (3) parameter selection is difficult; although the LOF algorithm deals with this by trying different k values and taking the maximum outlier score, upper and lower bounds for these values still have to be chosen.
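A minimal sketch of LOF via scikit-learn's LocalOutlierFactor, on synthetic data (n_neighbors chosen arbitrarily):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (100, 2)), [[5.0, 5.0]]])

# LOF compares each point's local density with that of its neighbors;
# fit_predict marks outliers with -1 and inliers with 1
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
print(X[labels == -1])
```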
Clustering-based outliers: an object is a clustering-based outlier if it does not strongly belong to any cluster.
Effect of outliers on the initial clustering: if outliers are detected by clustering, there is a problem in that the outliers themselves affect the clustering, so it is questionable whether the resulting structure is valid. This is also a weakness of the k-means algorithm, which is sensitive to outliers. To deal with this problem, you can cluster the objects, delete the outliers, then cluster again (this is not guaranteed to produce optimal results); a sketch of scoring points by their distance to the cluster centers follows below.
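As a hedged illustration of the cluster-then-score idea, each point can be scored by its distance to its assigned k-means center (synthetic data; the 3-standard-deviations cutoff is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (50, 2)),
                    rng.normal(8, 1, (50, 2)),
                    [[4.0, 20.0]]])  # injected outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance from each point to its assigned cluster center
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Points far from every center are cluster-based outlier candidates
print(X[dist > dist.mean() + 3 * dist.std()])
```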
Advantages and disadvantages: (1) clustering techniques with linear or near-linear complexity (such as k-means) can discover outliers very efficiently; (2) the definition of a cluster is usually the complement of the outliers, so clusters and outliers may be found simultaneously; (3) the set of outliers and their scores may depend heavily on the number of clusters used and on the presence of outliers in the data; (4) the quality of the clusters produced by the clustering algorithm strongly affects the quality of the outliers the method produces.
In fact, the original purpose of the clustering methods above is unsupervised classification, not finding outliers; it just happens that their capabilities can be used for outlier detection, which is regarded as a derived function.
In addition to the methods mentioned above, there are two methods designed specifically for anomaly detection: One Class SVM and Isolation Forest. Their details are not studied in depth here.
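As one hedged example, scikit-learn ships an IsolationForest; a minimal sketch on synthetic data (the contamination rate is an assumption, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (200, 2)), [[6.0, -6.0]]])

# Isolation Forest isolates points with random splits; anomalies
# need fewer splits, and fit_predict marks them with -1
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)
print(X[labels == -1])
```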
Once outliers have been detected, we need to handle them. Common methods for handling outliers can be broadly divided into the following types:
• Delete records that contain outliers: drop such records directly;
• Treat them as missing values: regard outliers as missing values and apply missing-value handling methods (see the sketch after this list);
• Average correction: correct an outlier with the average of the two observations before and after it;
• Do not process: perform data mining directly on the dataset, outliers included.
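A minimal sketch of the "treat as missing, then correct from the neighboring observations" approach, with made-up numbers and an arbitrary cutoff:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 3.2, 9.9, 3.1, 3.3])  # 9.9 is the outlier

# Treat the outlier as missing, then repair it with the mean of
# the neighboring observations (linear interpolation)
s[np.abs(s - s.median()) > 3] = np.nan
print(s.interpolate())  # 9.9 becomes (3.2 + 3.1) / 2 = 3.15
```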
Whether to delete outliers can be decided according to the actual situation. Some models are not sensitive to outliers, so even if outliers remain they do not affect the model's performance; other models, such as logistic regression (LR), are sensitive to outliers, and if they are not handled the results may be very poor.
The above is a summary of outlier detection and processing methods.
The detection methods above can find outliers, but their results are not absolutely correct; the actual situation still needs to be judged with business understanding. Likewise, how to handle an outlier, whether to delete it, correct it, or leave it alone, must also be weighed against the actual situation; there is no fixed rule.
"Fundamentals of Python Data Analysis": Outlier Detection and processing