[Translation] Data Mining with R (Part 6)

Source: Internet
Author: User

Outlier Detection

I. Experiment Description

1. Environment Login

Login is automatic and requires no password. The system user name is Shiyanlou and the password is Shiyanlou.

2. Introduction to the Environment

The experiment runs in an Ubuntu Linux environment with a desktop. It uses the following programs:

1. LX Terminal (lxterminal): the Linux command-line terminal. Opening it starts a bash session in which you can run Linux commands.
2. GVim: a very useful editor. For basic usage, see the course [Vim Editor](http://www.shiyanlou.com/courses/2).
3. R: enter 'R' at the command line to start the interactive environment. All of the code below is run in that interactive environment.

3. Use of the environment

Enter the code and files required for the experiment in the R interactive environment, and use LX Terminal (lxterminal) to run any required commands.

After completing the experiment, you can click "Experiment" at the top of the desktop to save the results and share them on Weibo, showing your friends how your study is progressing. Shiyanlou provides a back-end system that can verify that you really completed the experiment.

The experiment records page can be viewed from My Home. It lists each experiment and its notes, together with the effective learning time of each experiment (the time spent operating the experiment desktop; periods without any activity are recorded as idle time). These records serve as proof of your study.

II. Course Introduction

This section explains how to detect outliers with R. The main topics are:
1. Univariate outlier detection
2. Outlier detection with the local outlier factor (LOF)
3. Outlier detection by clustering
4. Outlier detection in time series data

III. Course Contents

1. Univariate Outlier Detection

This section covers univariate outlier detection and shows how to apply it to multivariate data. Univariate detection uses the function boxplot.stats(), which returns the statistics from which a box plot is drawn. The returned list has a component named out, which holds the outliers; more specifically, it holds the data points that lie beyond the ends of the whiskers. The argument coef controls how far the whiskers extend from the box. For more details about the function, enter '?boxplot.stats'.
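As a small illustration of the coef argument described above, the following sketch compares the outliers reported with the default whisker length and with a longer one. The variable names out.default and out.wide and the value coef = 3 are chosen here only for illustration.

```r
set.seed(3147)   # any seed works; this one matches the example in this section
x <- rnorm(100)

# Default: whiskers extend at most 1.5 times the box length from the box
out.default <- boxplot.stats(x)$out

# Longer whiskers (coef = 3): fewer, or the same number of, points lie beyond them
out.wide <- boxplot.stats(x, coef = 3)$out

length(out.default)
length(out.wide)
```

Every point reported with the larger coef is also reported with the default, so raising coef is one way to keep only the most extreme values.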

Drawing a box plot:

> set.seed(3147)
# generate 100 values that follow a normal distribution
> x <- rnorm(100)
> summary(x)
# output the outliers
> boxplot.stats(x)$out
# draw the box plot
> boxplot(x)

The box plot is as follows:

The four circles beyond the whiskers represent four outliers. Next, we try to detect outliers in multivariate data.

> y <- rnorm(100)
# build a data frame df with columns named x and y
> df <- data.frame(x, y)
> rm(x, y)
> head(df)
# attach the data frame df
> attach(df)
# row indices of the outliers in x
> (a <- which(x %in% boxplot.stats(x)$out))
# row indices of the outliers in y
> (b <- which(y %in% boxplot.stats(y)$out))
# detach the data frame
> detach(df)
# rows that are outliers in both x and y
> (outlier.list1 <- intersect(a, b))
> plot(df)
# mark these points
> points(df[outlier.list1,], col="red", pch="+", cex=2.5)
# rows that are outliers in x or y
> (outlier.list2 <- union(a, b))
> plot(df)
> points(df[outlier.list2,], col="blue", pch="x", cex=2)

When an application has three or more variables, the final list of outliers should be derived by combining the outliers detected for each variable separately. In real-world applications, both the theory and the results produced by the program should be taken into account when deciding which points are the most plausible outliers.
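One possible way to combine the per-variable results for three or more variables is to count, for each row, how many variables flag it. The sketch below is only an illustration under assumed data: the data frame df3, its columns, the planted row 7, and the thresholds are all made up here.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
df3 <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

# Plant one row that is extreme in two of the three variables
df3$x[7] <- 10
df3$y[7] <- -10

# One logical column per variable: TRUE where the value is a box plot outlier
flags <- sapply(df3, function(v) v %in% boxplot.stats(v)$out)

# Number of variables in which each row is flagged
flag.count <- rowSums(flags)

outlier.any   <- which(flag.count >= 1)   # outlier in at least one variable
outlier.multi <- which(flag.count >= 2)   # outlier in two or more variables

outlier.any
outlier.multi
```

The threshold on flag.count plays the same role as choosing between intersect() and union() in the two-variable case.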

2. Outlier Detection with LOF (Local Outlier Factor)

LOF (local outlier factor) is a density-based algorithm for identifying outliers. It compares the local density of a point with the local densities of its neighbors. If the former is significantly lower than the latter, the point lies in a sparser region than its neighbors, which indicates that it is an outlier. A drawback of LOF is that it works only on numeric data.
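To make the density comparison concrete, here is a minimal base-R sketch of the LOF computation. It is not the implementation used by any package: the function name lof.sketch is hypothetical, ties in distance are broken by position rather than handled as a full implementation would, and the data are synthetic with one planted point.

```r
# Hypothetical helper: simplified LOF using exactly k nearest neighbours
lof.sketch <- function(data, k) {
  d <- as.matrix(dist(data))   # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf               # a point is not its own neighbour

  # nn[i, ] holds the indices of the k nearest neighbours of point i
  nn <- matrix(0L, n, k)
  for (i in 1:n) nn[i, ] <- order(d[i, ])[1:k]

  # k-distance: distance from each point to its k-th nearest neighbour
  kdist <- d[cbind(1:n, nn[, k])]

  # Local reachability density: inverse of the mean reachability distance,
  # where reach-dist(i, j) = max(k-distance(j), d(i, j))
  lrd <- numeric(n)
  for (i in 1:n) {
    reach <- pmax(kdist[nn[i, ]], d[i, nn[i, ]])
    lrd[i] <- 1 / mean(reach)
  }

  # LOF: mean density of the neighbours divided by the point's own density
  lof <- numeric(n)
  for (i in 1:n) lof[i] <- mean(lrd[nn[i, ]]) / lrd[i]
  lof
}

set.seed(1)                                  # arbitrary seed
pts <- rbind(matrix(rnorm(100), ncol = 2),   # 50 points in one cluster
             c(10, 10))                      # row 51: planted far-away point
scores <- lof.sketch(pts, k = 5)
which.max(scores)
```

Points inside the cluster get scores near 1 because their density matches that of their neighbors, while the planted point gets a much larger score.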

The local outlier factor can be computed with the function lofactor(), which is provided by the packages 'DMwR' and 'dprep'.

> library(DMwR)
# remove the categorical column "Species" from the iris data
> iris2 <- iris[,1:4]
# k is the number of neighbors used to compute the local outlier factor
> outlier.scores <- lofactor(iris2, k=5)
# plot the density of the outlier scores
> plot(density(outlier.scores))
# pick the five observations with the highest scores as outliers
> outliers <- order(outlier.scores, decreasing=T)[1:5]
# print the outliers
> print(outliers)

Next, principal component analysis is applied to the iris data, and the outliers are displayed in a biplot of the first two principal components.

> n <- nrow(iris2)
> labels <- 1:n
# label every observation except the outliers with '.'
> labels[-outliers] <- "."
> biplot(prcomp(iris2), cex=.8, xlabs=labels)

The output results are as follows:

In the code above, prcomp() performs principal component analysis on the data set iris2, and biplot() draws a biplot from the first two columns of the result, that is, from the first two principal components. The x and y axes represent the first and second principal components respectively, the arrows represent the original variables, and the five outliers are labeled with their row numbers.
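The quantities that the biplot is built from can also be inspected directly. The following sketch extracts the first two principal component scores and the share of variance they explain; the names pca, pc.scores, and var.explained are chosen here for illustration.

```r
iris2 <- iris[, 1:4]

# Principal component analysis on the four numeric columns
pca <- prcomp(iris2)

# Coordinates of each observation on the first two principal components;
# these are the positions that the biplot displays
pc.scores <- pca$x[, 1:2]
head(pc.scores)

# Proportion of the total variance explained by each component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var.explained, 3)
```

Because the first two components carry most of the variance in this data set, a two-dimensional plot loses little information about how far apart the observations are.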

We can also use the pairs() function to draw a scatterplot matrix that displays the outliers, which are marked with a red '+':

# use rep() to generate n copies of '.'
> pch <- rep(".", n)
> pch[outliers] <- "+"
> col <- rep("black", n)
> col[outliers] <- "red"
> pairs(iris2, pch=pch, col=col)

Scatterplot matrix:

The package Rlof provides the function lof(), a parallel implementation of the LOF algorithm. Its usage is similar to that of lofactor(), but lof() offers two additional features: k can be a vector, and several distance measures can be chosen. Here is an example of using lof():

> library(Rlof)
> outlier.scores <- lof(iris2, k=5)
# try several values of k
> outlier.scores <- lof(iris2, k=c(5:10))

3. Outlier Detection by Clustering

Another way to detect outliers is clustering. First group the data into clusters, then treat the data that do not belong to any cluster as outliers. For example, the density-based clustering algorithm DBSCAN puts objects into the same cluster when they are densely connected to one another, so objects that are not assigned to any cluster are isolated from the rest and are treated as outliers.

The k-means algorithm can also be used to detect outliers. First, the data are partitioned into k groups by assigning each point to its nearest cluster center. Then the distance (or dissimilarity) between each object and its cluster center is computed, and the objects with the largest distances are picked as outliers.
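The two steps described above can be sketched on synthetic data before applying them to a real data set. Everything in this example is assumed for illustration: the two simulated clusters, the planted point in row 101, the seed, and the use of nstart = 10 to make the clustering less sensitive to the random starting centers.

```r
set.seed(7)                                  # arbitrary seed
a <- matrix(rnorm(100, mean = 0),  ncol = 2) # 50 points around (0, 0)
b <- matrix(rnorm(100, mean = 10), ncol = 2) # 50 points around (10, 10)
pts <- rbind(a, b, c(0, 8))                  # row 101: planted outlier

# Step 1: partition the data into k = 2 groups
km <- kmeans(pts, centers = 2, nstart = 10)

# Step 2: distance from every point to the center of its own cluster.
# The differences are squared first, then summed per row, then rooted.
centers   <- km$centers[km$cluster, ]
distances <- sqrt(rowSums((pts - centers)^2))

# Rows with the three largest distances
order(distances, decreasing = TRUE)[1:3]
```

Note the placement of the parentheses in the distance line: squaring must happen inside rowSums(), otherwise the result is the absolute value of a row sum rather than a Euclidean distance.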

The following code uses the k-means algorithm to detect outliers in the iris data set:

> iris2 <- iris[,1:4]
> kmeans.result <- kmeans(iris2, centers=3)
# output the cluster centers
> kmeans.result$centers
# cluster assignment of each observation
> kmeans.result$cluster
# compute the distance between each observation and its cluster center
> centers <- kmeans.result$centers[kmeans.result$cluster, ]
> distances <- sqrt(rowSums((iris2 - centers)^2))
# pick the 5 largest distances
> outliers <- order(distances, decreasing=T)[1:5]
# print the outliers
> print(outliers)
> print(iris2[outliers,])
# plot the clustering result
> plot(iris2[,c("Sepal.Length", "Sepal.Width")], pch="o",
+ col=kmeans.result$cluster, cex=0.3)
# plot the cluster centers with '*' markers
> points(kmeans.result$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3,
+ pch=8, cex=1.5)
# plot the outliers with '+' markers
> points(iris2[outliers, c("Sepal.Length", "Sepal.Width")], pch="+", col=4, cex=1.5)

The results are shown below:

4. Outlier Detection in Time Series Data

This section describes how to detect outliers in time series data. First, the function stl() decomposes the time series using robust regression, and then the outliers are identified from the result. The code is as follows:

# decompose the series using robust fitting
> f <- stl(AirPassengers, "periodic", robust=TRUE)
> (outliers <- which(f$weights<1e-8))
# set up the plot layout
> op <- par(mar=c(0, 4, 0, 3), oma=c(5, 0, 4, 0), mfcol=c(4, 1))
> plot(f, set.pars=NULL)
> sts <- f$time.series
# plot the outliers with red 'x' markers
> points(time(sts)[outliers], 0.8*sts[,"remainder"][outliers], pch="x", col="red")
# reset the layout
> par(op)

In the resulting plot, the remainder panel shows the residual component, that is, the noise that is left after the seasonal and trend components have been removed by the decomposition.
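The decomposition can also be examined numerically, without plotting. The sketch below checks that the three components add back up to the original series and lists the observations whose robustness weight is practically zero; the names components, recon, and low.weight are chosen here for illustration.

```r
# Decompose the built-in AirPassengers series with robust fitting
f <- stl(AirPassengers, "periodic", robust = TRUE)

# The three components: seasonal, trend, remainder
components <- f$time.series
colnames(components)

# Seasonal + trend + remainder reproduces the original series
recon <- rowSums(components)
max(abs(recon - as.numeric(AirPassengers)))

# Robustness weights lie between 0 and 1; observations with a weight of
# practically zero were treated as outliers during the robust fit
low.weight <- which(f$weights < 1e-8)
low.weight
time(AirPassengers)[low.weight]
```

The threshold 1e-8 is the one used in the code of this section; a larger threshold would flag more observations.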

5. Questions to Consider

Try to think of other outlier detection algorithms, and find out whether other R packages can detect outliers well.
