"Reading notes-data mining concepts and techniques" outlier detection

Source: Internet
Author: User

1 Outliers and outlier analysis

1.2 Types of outliers

A. Global outliers

A global outlier deviates significantly from the rest of the data set; this is the simplest type of outlier.

Detection method: find a suitable measure of deviation.

B. Contextual outliers

Whether an object is an outlier depends on its context. Attributes are divided into contextual attributes (which define the context of an object) and behavioral attributes (which define the characteristics of an object and are evaluated within that context).

C. Collective outliers

A subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set, even though the individual objects may not be outliers.

1.3 The challenge of outlier detection

The boundary between normal data and outliers is often not clear-cut;

Noise is not the same as outliers and must be distinguished from them.

2 Outlier detection methods

Methods can be categorized in two main ways:

A. By whether the data sample used for analysis comes with labels provided by domain experts, which can be used to build the outlier detection model:

2.1 Supervised, semi-supervised, and unsupervised methods

a. Supervised methods:

Experts label a sample of the data. A classifier is built on the objects labeled as normal, and objects that do not match this model of normal objects are considered outliers.

Challenges: class imbalance (outliers are rare); capturing as many outliers as possible (recall) matters more than avoiding mislabeling normal objects as outliers.

b. Unsupervised methods:

No labels are available. These methods assume that normal objects are clustered in some way.

Central idea: find clusters first; objects that do not belong to any cluster are then detected as outliers.

Two problems: objects that do not belong to any cluster may be noise rather than outliers; and finding clusters first, then outliers, is often too expensive.

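c. Semi-supervised methods:

Only a small set of labeled objects is available; it is used together with the unlabeled data to build the detection model.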

B. By the assumptions the methods make about normal objects versus outliers:

3 Statistical methods

Statistical methods assume that normal data objects are generated by a statistical (stochastic) model. Normal objects fall in high-probability regions of the model, while objects in low-probability regions are outliers.

Parametric methods:

Univariate outlier detection based on the normal distribution:

A. Detecting univariate outliers using maximum likelihood estimation;

B. Grubbs' test (maximum normed residual test);
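A minimal sketch of the two univariate ideas above, in plain Python. It assumes the data are roughly normal; the function names, the sample readings, and the 3-sigma cutoff are illustrative choices, not taken from the book. Note that with the maximum likelihood standard deviation, the largest possible z-score among n values is sqrt(n - 1), so a 3-sigma rule can only fire when n is larger than 10.

```python
import math


def mle_normal(values):
    """Maximum likelihood estimates (mean, std) for a normal distribution."""
    n = len(values)
    mu = sum(values) / n
    # The MLE of the variance divides by n, not n - 1.
    var = sum((x - mu) ** 2 for x in values) / n
    return mu, math.sqrt(var)


def three_sigma_outliers(values, k=3.0):
    """Values more than k estimated standard deviations from the mean."""
    mu, sigma = mle_normal(values)
    return [x for x in values if abs(x - mu) > k * sigma]


def grubbs_statistic(values):
    """Largest absolute z-score, using the sample standard deviation (n - 1)."""
    n = len(values)
    mu = sum(values) / n
    s = math.sqrt(sum((x - mu) ** 2 for x in values) / (n - 1))
    return max(abs(x - mu) for x in values) / s


# Hypothetical readings: fifteen values near 10 and one suspicious value.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9,
            10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 25.0]
print(three_sigma_outliers(readings))
print(grubbs_statistic(readings))
```

For Grubbs' test proper, the statistic would then be compared against a critical value derived from the t-distribution at a chosen significance level; that lookup is omitted here because it needs a statistics library.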

Multivariate outliers (core idea: transform the multivariate outlier detection task into a univariate outlier detection problem):

A. Detecting multivariate outliers using the Mahalanobis distance;

B. Detecting multivariate outliers using the χ² statistic;
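A rough sketch of the Mahalanobis idea: each object is reduced to a single number, its squared Mahalanobis distance to the mean, and that number can then be examined with a univariate method. To stay within the standard library, this sketch is restricted to two dimensions so the covariance matrix can be inverted by hand; the function names and sample points are made up for illustration.

```python
def mean_and_cov_2d(points):
    """Sample mean and covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq_2d(p, mean, cov):
    """Squared Mahalanobis distance (p - mean)^T S^-1 (p - mean) in 2-D."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero (non-degenerate data)
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # S^-1 = (1 / det) * [[syy, -sxy], [-sxy, sxx]]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical data: points close to the line y = x, plus one point off it.
points = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2),
          (5, 4.8), (6, 6.1), (7, 6.9), (2, 6)]
mean, cov = mean_and_cov_2d(points)
distances = [mahalanobis_sq_2d(p, mean, cov) for p in points]
suspect = points[distances.index(max(distances))]
print(suspect)
```

The point (2, 6) is not extreme in either coordinate alone; it stands out only because it breaks the correlation between the two attributes, which is what the covariance term captures.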

Using a mixture of parametric distributions:

A. Assume that normal data objects are generated by multiple normal distributions;

B. Detection of multivariate outliers using multiple clusters;

Nonparametric methods:

Detecting outliers using a histogram

Disadvantage: it is hard to choose a suitable bin size. If the bins are too small, many normal objects fall into empty or rare bins and are mistaken for outliers; if the bins are too large, outliers fall into frequent bins and are mistaken for normal objects.
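The bin-size problem is easy to reproduce. The sketch below flags values that fall into sparsely populated bins; the function name, the `min_count` threshold, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def histogram_outliers(values, bin_width, min_count=2):
    """Flag values whose histogram bin holds fewer than min_count values."""
    bins = Counter(int(v // bin_width) for v in values)
    return [v for v in values if bins[int(v // bin_width)] < min_count]


# Hypothetical purchase amounts.
amounts = [12, 15, 18, 22, 25, 27, 31, 33, 95]

print(histogram_outliers(amounts, bin_width=10))   # reasonable bins
print(histogram_outliers(amounts, bin_width=1))    # bins too small
print(histogram_outliers(amounts, bin_width=100))  # bins too large
```

With a width of 10 only 95 is flagged; with a width of 1 every value sits alone in its bin and all nine are flagged; with a width of 100 everything shares one bin and nothing is flagged.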

To address this problem, kernel density estimation can be used to estimate the probability density distribution of the data. Each observed object is treated as an indicator of high probability density in its surrounding region. The probability density at a point depends on the distances from that point to the observed objects. A kernel function models the influence of a sample point within its neighborhood; it is a non-negative, real-valued, integrable function.
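As a sketch, the usual one-dimensional estimate is f(x) = (1 / (n h)) * sum of K((x - x_i) / h) over all samples x_i, where h is the bandwidth. The code below uses a standard Gaussian kernel; the bandwidth value and the sample data are arbitrary choices for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel: non-negative and integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x with bandwidth h."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


# Hypothetical sample: five values near 1 and one far away.
samples = [1.0, 1.2, 0.9, 1.1, 1.0, 5.0]
densities = [kde(x, samples, h=0.5) for x in samples]

# The object with the lowest estimated density is the outlier candidate.
candidate = samples[densities.index(min(densities))]
print(candidate)
```

The bandwidth plays a role similar to the bin size of a histogram, but the estimate is smooth and does not depend on where bin boundaries happen to fall.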

4 Proximity-based approach

An object is assumed to be an outlier if its nearest neighbors in the feature space are far away from it, that is, if the proximity of the object to its neighbors deviates significantly from the proximity of most other objects in the data set to their neighbors.

Distance-based outlier detection and the nested-loop method: examine the neighborhood of an object, defined by a given radius

The grid-based method (CELL): partition the data space into a grid of cells
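A minimal sketch of the nested-loop idea for distance-based outliers. The convention used here is an assumption: an object is reported if at most `pi * n` other objects lie within distance `r` of it. The inner loop stops as soon as an object has too many neighbors to be an outlier, which is what keeps the nested loop practical when outliers are rare. Parameter names and sample points are illustrative.

```python
import math


def distance_based_outliers(points, r, pi):
    """Objects with at most pi * n other objects within distance r."""
    n = len(points)
    max_neighbors = pi * n
    outliers = []
    for i, p in enumerate(points):
        count = 0
        is_outlier = True
        for j, q in enumerate(points):
            if i == j:
                continue
            if math.dist(p, q) <= r:
                count += 1
                if count > max_neighbors:
                    # Too many close neighbors: stop scanning early.
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(p)
    return outliers


# Hypothetical data: a small cluster near the origin and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(distance_based_outliers(points, r=2.0, pi=0.2))
```

The worst case is still quadratic in the number of objects, which is the cost that grid-based pruning aims to reduce.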

Density-based outlier detection: compare the density around an object with the density around its neighbors
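The local outlier factor (LOF) is one well-known way to make this comparison. The sketch below is a simplified version under stated assumptions: it uses exactly k nearest neighbors (ties broken by index rather than keeping all tied objects), and it assumes no more than k duplicate points, since duplicates would make a density infinite. Names and data are illustrative.

```python
import math


def lof_scores(points, k):
    """Simplified local outlier factor for each point (about 1 = normal)."""
    n = len(points)
    # k nearest neighbors of each point as (distance, index) pairs.
    knn = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        knn.append(dists[:k])
    # Distance from each point to its k-th nearest neighbor.
    k_dist = [neighbors[-1][0] for neighbors in knn]

    def local_reachability_density(i):
        # Reachability distance from i to j: max(k_dist(j), dist(i, j)).
        total = sum(max(k_dist[j], d) for d, j in knn[i])
        return len(knn[i]) / total

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for _, j in knn[i]) / (len(knn[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: four corners of a unit square and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print(scores)
```

Scores close to 1 mean an object is about as dense as its neighbors; scores well above 1 mean it sits in a sparser region than they do.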

5 Clustering-based approach

Normal data objects are assumed to belong to large, dense clusters, while outliers belong to small or sparse clusters, or to no cluster at all.

    • Clustering-based outlier detection using the distance to the nearest cluster;
    • Intrusion detection through clustering-based outlier detection; CBLOF
    • Detecting outliers in small clusters;
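A sketch of the first idea in the list, distance to the nearest cluster. It assumes the clustering step has already been run and that its cluster centers are given; each object is then scored by its distance to the nearest center divided by the average such distance within that cluster. The function names, centers, and points are hypothetical.

```python
import math


def nearest_center(p, centers):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))


def cluster_distance_scores(points, centers):
    """Distance to nearest center, relative to that cluster's average."""
    assign = [nearest_center(p, centers) for p in points]
    dists = [math.dist(p, centers[a]) for p, a in zip(points, assign)]
    totals = [0.0] * len(centers)
    counts = [0] * len(centers)
    for a, d in zip(assign, dists):
        totals[a] += d
        counts[a] += 1
    scores = []
    for a, d in zip(assign, dists):
        avg = totals[a] / counts[a]
        # Guard against a cluster whose members all sit on the center.
        scores.append(d / avg if avg > 0 else 0.0)
    return scores


# Hypothetical data: two tight clusters and one object between them.
centers = [(0, 0), (5, 5)]
points = [(0, 0.1), (0.1, 0), (-0.1, 0), (0, -0.1),
          (5, 5.1), (5.1, 5), (4.9, 5), (5, 4.9),
          (2, 2)]
scores = cluster_distance_scores(points, centers)
print(points[scores.index(max(scores))])
```

A larger ratio means the object is farther from its cluster than is typical for that cluster. Note that the outlier itself inflates the average of the cluster it is assigned to, so in practice the centers and averages would usually be computed more robustly.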

Advantages:

Unsupervised; no labeled data is required

Disadvantages:

Effectiveness depends heavily on the clustering method used, and clustering is often expensive on large data sets

6 Classification-based approach

When labeled points are available, they can be used to build a classifier, for example using an SVM to construct the decision boundary between normal objects and outliers

7 Mining contextual outliers and collective outliers

7.1 Transforming contextual outlier detection into conventional outlier detection

7.2 Modeling normal behavior with respect to contexts

7.3 Mining collective outliers

    • Identify structural units
    • Modeling the expected behavior of a structural unit directly

8 Outlier detection in high-dimensional data

Challenges:

    • Interpretation of outliers
    • Data sparsity
    • Data subspaces
    • Scalability with respect to dimensionality

8.1 Extending conventional outlier detection

e.g., the HilOut algorithm

Idea: reduce the high-dimensional data to a lower-dimensional space, then apply conventional outlier detection methods

Principal component analysis (PCA) can be used for dimensionality reduction

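8.2 Finding outliers in subspaces

8.3 Modeling high-dimensional outliers

e.g., angle-based outlier detection: compute the angles between pairs of other objects as seen from each object; an object whose angles vary little is likely an outlier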
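A rough sketch of the angle idea. For each object o, it takes every pair of other objects x and y, computes the dot product of the vectors ox and oy divided by the product of their squared lengths, and reports the variance of these values. The distance weighting follows the angle-based outlier factor as usually stated; treat the exact weighting, the function names, and the data as assumptions. The sketch assumes there are no duplicate points (a zero-length vector would divide by zero) and costs cubic time in the number of objects.

```python
from itertools import combinations
from statistics import pvariance


def angle_based_factor(o, others):
    """Variance of distance-weighted angle terms as seen from o."""
    values = []
    for x, y in combinations(others, 2):
        ox = [a - b for a, b in zip(x, o)]
        oy = [a - b for a, b in zip(y, o)]
        dot = sum(a * b for a, b in zip(ox, oy))
        len_sq_x = sum(a * a for a in ox)
        len_sq_y = sum(a * a for a in oy)
        values.append(dot / (len_sq_x * len_sq_y))
    return pvariance(values)


def angle_based_scores(points):
    """Factor for every point; a small value suggests an outlier."""
    return [
        angle_based_factor(p, points[:i] + points[i + 1:])
        for i, p in enumerate(points)
    ]


# Hypothetical data: a small cluster and one point far outside it.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = angle_based_scores(points)
print(points[scores.index(min(scores))])
```

Seen from the distant point, all other objects lie in nearly the same direction, so its values barely vary; seen from inside the cluster, the other objects are spread in many directions and the variance is much larger.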
