"Reading notes-data mining concepts and techniques" outlier detection

Last Update:2015-04-08 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

1 outlier and outlier analysis 1.2 outliers of type A. Global outliers

Deviate significantly from the rest of the data set, the simplest class of outliers.

Detection method: Find a suitable deviation measure

B. Contextual outliers

Outliers are dependent on context. Divided into contextual attributes (defining the context of an object) and behavior attributes (defining the characteristics of an object)

C. Group Outliers

Subsets of Data Objects form collective outliers, if these objects deviate significantly from the entire data set as a whole.

1.3 The challenge of outlier detection

The boundary between normal data and abnormal data is not obvious;

≠ noise from outlier points

2 Outlier Detection method

Two main categories:

A. Depending on whether the data sample used for the analysis is provided by a domain expert, it can be used to construct the label of the outlier detection model and classify outlier detection methods:

2.1 Supervision, semi-supervision, unsupervised a. Method of supervision:

The expert marks a normal object, constructs a classifier on it, and other objects that do not match the normal object model are considered outliers.

Challenge: Class imbalance problem; capturing as many outliers as possible is more important than the normal object Wudang outliers.

B. Unsupervised methods:

Without tags, assume that "normal objects are clustered in some way."

Central idea: Find clusters first, and then objects that are not part of any cluster are detected as outliers.

Two problems: objects that do not belong to any cluster may be noise, not outliers; finding clusters first can be too expensive to find outliers.

C. Semi-supervised approach

B. Each method is grouped according to the assumptions of the methods with respect to normal objects and outliers:

3 Statistical methods

Statistics: It is assumed that normal data objects are produced by a statistical model, and that normal objects appear in high probability regions of the stochastic model, while objects in low probability regions are outliers.

Parameter method:

One-element outlier detection based on normal distribution:

A. Maximum likelihood detection unary outliers;

B.grubb test (maximum standard residual error test);

Multivariate outliers: (Core idea: Transform multiple outlier detection tasks into one-element outlier detection problem).

A. Detection of multivariate outliers from Mahalanobis;

Multivariate outlier detection of b.x² statistics;

Using Mixed parameter distributions

A. Assume that normal data objects are generated by multiple normal distributions;

B. Detection of multivariate outliers using multiple clusters;

Non-parametric method:

Histogram detect outliers

Disadvantage: It is difficult to choose a suitable box size, the box is too small, easy to be mistaken for outliers, the box is too large, outliers are easily mistaken for normal.

In order to solve this problem, we can use kernel density estimation to estimate the probability density distribution of the data. Consider each observation object as a high probability density indicator in the surrounding area. The probability density on a point depends on the distance from the point to the observed object. Use kernel functions to model the impact of a sample point on its neighborhood. The kernel function is a non-negative real numerical integrable function.

4 proximity-based approach

Suppose an object is an outlier, and if its nearest neighbor in the feature space is also away from it, that is, the object and its nearest neighbor deviate significantly from the proximity of other objects in the dataset to their neighbors

Distance-based outlier detection and nested loop Method--a study of the neighborhood of a given radius of an object

The method--cell based on grid

Density-based outlier detection--examining the object and its neighboring density

5 Clustering-based approach

It is assumed that the normal data objects belong to large dense clusters, while outliers belong to small or sparse clusters, or not to clusters.

Clustering-based outlier detection using distances to the nearest cluster;
Intrusion detection through cluster-based outlier detection;--cblof
Detection of outliers in small clusters;

Advantages:

Non-supervised

Disadvantages:

The effectiveness relies on the clustering method used, which is very expensive

6 classification-based approach

Points are labeled and can be used to build classifiers: using SVM to construct decision boundaries

7 Mining scenarios outliers and collective outliers 7.1 contextual outliers---> traditional outlier Detection 7.2 about context-modeling of normal behavior 7.3 mining collective outliers

Identify structural units
Modeling the expected behavior of a structural unit directly

8 outlier detection in high-dimensional data

Challenge:

Explanation of Outliers
Sparsity of data
Data sub-space
Scalability of dimensions

8.1 Extended traditional outlier detection

eg. Hilout algorithm

Ideas: high-dimensional protocols to low-dimensional, using traditional outlier detection methods

PCA principal component analysis can be used to reduce dimension

8.2 Outliers in Discovery subspace 8.3 high-dimensional outlier modeling

eg. can calculate the angle

"Reading notes-data mining concepts and techniques" outlier detection

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

"Reading notes-data mining concepts and techniques" outlier detection

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

"Reading notes-data mining concepts and techniques" outlier detection

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support