1 outlier and outlier analysis 1.2 outliers of type A. Global outliers
Deviate significantly from the rest of the data set, the simplest class of outliers.
Detection method: Find a suitable deviation measure
B. Contextual outliers
Outliers are dependent on context. Divided into contextual attributes (defining the context of an object) and behavior attributes (defining the characteristics of an object)
C. Group Outliers
Subsets of Data Objects form collective outliers, if these objects deviate significantly from the entire data set as a whole.
1.3 The challenge of outlier detection
The boundary between normal data and abnormal data is not obvious;
≠ noise from outlier points
2 Outlier Detection method
Two main categories:
A. Depending on whether the data sample used for the analysis is provided by a domain expert, it can be used to construct the label of the outlier detection model and classify outlier detection methods:
2.1 Supervision, semi-supervision, unsupervised a. Method of supervision:
The expert marks a normal object, constructs a classifier on it, and other objects that do not match the normal object model are considered outliers.
Challenge: Class imbalance problem; capturing as many outliers as possible is more important than the normal object Wudang outliers.
B. Unsupervised methods:
Without tags, assume that "normal objects are clustered in some way."
Central idea: Find clusters first, and then objects that are not part of any cluster are detected as outliers.
Two problems: objects that do not belong to any cluster may be noise, not outliers; finding clusters first can be too expensive to find outliers.
C. Semi-supervised approach
B. Each method is grouped according to the assumptions of the methods with respect to normal objects and outliers:
3 Statistical methods
Statistics: It is assumed that normal data objects are produced by a statistical model, and that normal objects appear in high probability regions of the stochastic model, while objects in low probability regions are outliers.
Parameter method:
One-element outlier detection based on normal distribution:
A. Maximum likelihood detection unary outliers;
B.grubb test (maximum standard residual error test);
Multivariate outliers: (Core idea: Transform multiple outlier detection tasks into one-element outlier detection problem).
A. Detection of multivariate outliers from Mahalanobis;
Multivariate outlier detection of b.x² statistics;
Using Mixed parameter distributions
A. Assume that normal data objects are generated by multiple normal distributions;
B. Detection of multivariate outliers using multiple clusters;
Non-parametric method:
Histogram detect outliers
Disadvantage: It is difficult to choose a suitable box size, the box is too small, easy to be mistaken for outliers, the box is too large, outliers are easily mistaken for normal.
In order to solve this problem, we can use kernel density estimation to estimate the probability density distribution of the data. Consider each observation object as a high probability density indicator in the surrounding area. The probability density on a point depends on the distance from the point to the observed object. Use kernel functions to model the impact of a sample point on its neighborhood. The kernel function is a non-negative real numerical integrable function.
4 proximity-based approach
Suppose an object is an outlier, and if its nearest neighbor in the feature space is also away from it, that is, the object and its nearest neighbor deviate significantly from the proximity of other objects in the dataset to their neighbors
Distance-based outlier detection and nested loop Method--a study of the neighborhood of a given radius of an object
The method--cell based on grid
Density-based outlier detection--examining the object and its neighboring density
5 Clustering-based approach
It is assumed that the normal data objects belong to large dense clusters, while outliers belong to small or sparse clusters, or not to clusters.
- Clustering-based outlier detection using distances to the nearest cluster;
- Intrusion detection through cluster-based outlier detection;--cblof
- Detection of outliers in small clusters;
Advantages:
Non-supervised
Disadvantages:
The effectiveness relies on the clustering method used, which is very expensive
6 classification-based approach
Points are labeled and can be used to build classifiers: using SVM to construct decision boundaries
7 Mining scenarios outliers and collective outliers 7.1 contextual outliers---> traditional outlier Detection 7.2 about context-modeling of normal behavior 7.3 mining collective outliers
- Identify structural units
- Modeling the expected behavior of a structural unit directly
8 outlier detection in high-dimensional data
Challenge:
- Explanation of Outliers
- Sparsity of data
- Data sub-space
- Scalability of dimensions
8.1 Extended traditional outlier detection
eg. Hilout algorithm
Ideas: high-dimensional protocols to low-dimensional, using traditional outlier detection methods
PCA principal component analysis can be used to reduce dimension
8.2 Outliers in Discovery subspace 8.3 high-dimensional outlier modeling
eg. can calculate the angle
"Reading notes-data mining concepts and techniques" outlier detection