R language ︱ abnormal value detection, outlier analysis, abnormal value handling

Last Update:2018-05-31 Source: Internet

Author: User

First, abnormal value detection

Abnormal values include missing values, outliers, duplicate values, and inconsistent data.

1. Basic functions

summary() displays the number of missing values (NA's) for each variable.
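
As a minimal sketch of this, assuming a small made-up data frame (the column names are illustrative only), summary() prints an "NA's" entry for each column with missing values, and colSums(is.na()) returns the same counts as a vector:

```r
# Toy data frame; column names and values are invented for illustration
df <- data.frame(
  sales  = c(120, NA, 135, 150, NA),
  visits = c(10, 12, NA, 9, 11)
)

summary(df)                      # the "NA's" entry reports missing values per column
na_counts <- colSums(is.na(df))  # the same counts as a named vector
na_counts
```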

2. Missing value detection

Missing value detection should cover: the number of missing values, the proportion of missing values, and filtering the rows that contain missing values from the complete rows.

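# Missing value detection
sum(complete.cases(saledata))          # number of complete rows (is.na(saledata) gives the cell-level view)
sum(!complete.cases(saledata))         # number of rows with at least one missing value
mean(!complete.cases(saledata))        # proportion of rows with missing values (1/201 in this data)
saledata[!complete.cases(saledata), ]  # filter out the rows that contain missing values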

3. Detecting outliers with a box plot

Box plot detection combines: quartile detection (the box plot itself) + one standard deviation (1σ) above and below the mean + the outlier data points.

A convenient property of boxplot() is that its return value already contains the outliers (sp$out in the code below); they are determined from the upper and lower bounds computed while drawing the box plot.

The upper and lower bounds are Q3 + 1.5 × (Q3 - Q1) and Q1 - 1.5 × (Q3 - Q1) respectively, where 1.5 is the default value of the range argument of boxplot().
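
As a sketch of how these bounds arise, assuming made-up values: by default boxplot() flags points lying more than 1.5 times the distance between the hinges beyond the hinges. The bounds can be recomputed by hand with fivenum() and compared with boxplot.stats(), the function boxplot() relies on for its $out component:

```r
x <- c(10, 12, 11, 13, 12, 14, 11, 13, 12, 100)  # made-up values

h <- fivenum(x)            # min, lower hinge, median, upper hinge, max
iqr <- h[4] - h[2]         # spread between the hinges
lower <- h[2] - 1.5 * iqr  # lower bound
upper <- h[4] + 1.5 * iqr  # upper bound

manual_out <- x[x < lower | x > upper]  # points outside the bounds
auto_out <- boxplot.stats(x)$out        # what boxplot() stores in $out
```

Note that the hinges can differ slightly from the quartiles returned by quantile(), so bounds computed from quantile() may not match boxplot() exactly on small samples.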

sp = boxplot(saledata$"Sales", boxwex = 0.7)   # sp$out holds the detected outliers
title("Sales Outlier Detection Box Plot")
xi = 1.1
sd.s = sd(saledata[complete.cases(saledata), ]$"Sales")
mn.s = mean(saledata[complete.cases(saledata), ]$"Sales")
points(xi, mn.s, col = "red", pch = 18)        # mark the mean
# the arrowhead angle value is missing in the source, so the arrows() default is used
arrows(xi, mn.s - sd.s, xi, mn.s + sd.s, code = 3, col = "pink", length = .1)
# label each outlier, alternating left/right and above/below to reduce overlap
text(rep(c(1.05, 1.05, 0.95, 0.95), length = length(sp$out)), labels = sp$out[order(sp$out)],
     sp$out[order(sp$out)] + rep(c(150, -150, 150, -150), length = length(sp$out)), col = "red")

In the code, text() is called in the form text(x, labels, y, col); points() adds the mean point, and arrows() adds arrows spanning one standard deviation (1σ) above and below the mean.

Binning also comes in equal-width and equal-depth variants; see another blog post: R language ︱ noise data processing, data grouping -- the binning method (discretization, grading)

4. Data deduplication

Data deduplication differs from grouping and merging: deduplication concerns rows in which all variables are duplicated, whereas grouping and merging may be needed because only some primary key is duplicated.

Data deduplication includes duplicate detection (the table and unique functions) and duplicate handling (unique/duplicated).

unique() and duplicated() are the functions commonly used on data frames; duplicated() returns a logical vector.
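
A minimal sketch of both cases, assuming a made-up data frame in which the hypothetical column id plays the role of a primary key:

```r
# Toy data: row 3 repeats row 2 in full; id 3 appears twice with different sales
df <- data.frame(
  id    = c(1, 2, 2, 3, 3),
  sales = c(100, 200, 200, 300, 350)
)

table(df$id)                        # a frequency above 1 means the key is repeated
dup <- duplicated(df)               # TRUE for rows that repeat an earlier row in full
deduped <- df[!dup, ]               # drop full-row duplicates, like unique(df)
by_key <- df[!duplicated(df$id), ]  # keep only the first row for each id
```

Whether to keep the first row per key, as by_key does, or to aggregate the rows instead depends on the data; this sketch only shows the mechanics.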

Second, abnormal value handling

Common methods for handling abnormal values are deletion, substitution (the mean for continuous variables; the mode or the median for discrete variables), and imputation (regression imputation, multiple imputation).

Besides deleting them directly, you can first convert the abnormal values to missing values and then handle them with the missing value methods that follow.
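
As a sketch of that two-step idea, assuming made-up values and using the box plot rule to decide what counts as abnormal (any other rule could be substituted):

```r
x <- c(10, 12, 11, 13, 12, 14, 11, 13, 12, 100)  # made-up values

out <- boxplot.stats(x)$out  # values beyond the box plot bounds
x_na <- x
x_na[x_na %in% out] <- NA    # step 1: mark the abnormal values as missing

x_filled <- x_na
x_filled[is.na(x_filled)] <- mean(x_na, na.rm = TRUE)  # step 2: e.g. mean substitution
x_filled
```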

In practice, abnormal values are generally either set to NA or sent back to the company for correction (having the data reworked is the main approach).

1. Abnormal value identification

Abnormal values are detected graphically, using a box plot.

# Abnormal value identification
par(mfrow = c(1, 2))                               # split the plotting window into 1 row and 2 columns to show two plots side by side
dotchart(inputfile$sales)                          # draw a univariate dot chart
pc = boxplot(inputfile$sales, horizontal = TRUE)   # draw a horizontal box plot

The code comes from the fourth section of "R Language Data Analysis and Mining".

2. Capping method

Values in the data frame above the 99th percentile or below the 1st percentile are replaced: values above the 99th percentile are set to the 99th percentile value, and values below the 1st percentile are set to the 1st percentile value.

(The illustration is from the CDA DSC L2 R language course, as the teacher often said.)

# Abnormal data handling
q1 <- quantile(result$tot_derog, 0.01)    # value at the 1st percentile
q99 <- quantile(result$tot_derog, 0.99)   # value at the 99th percentile
# Index the column directly: result[cond, ]$tot_derog <- q1 stops with
# "replacement has 1 row, data has 0" when no row matches, i.e. nothing needed replacing
result$tot_derog[result$tot_derog < q1] <- q1
result$tot_derog[result$tot_derog > q99] <- q99
summary(result$tot_derog)                 # inspect the data after capping
fix(inputfile)                            # show the data as a table
which(inputfile$sales == 6607.4)          # find the row index of the extreme value
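
The same capping idea can be wrapped in a small function. This is only a sketch: cap_quantiles is a hypothetical helper name, the 1% and 99% defaults follow the description above, and the data is made up.

```r
# Hypothetical helper: cap a numeric vector at two quantiles
cap_quantiles <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])  # raise low values to q[1], lower high values to q[2]
}

x <- c(1:99, 1000)           # made-up data with one extreme value
capped <- cap_quantiles(x)
range(capped)                # the extreme value has been pulled in
```

Because pmin() and pmax() keep NA as NA, missing values pass through unchanged and can be handled separately.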

Split the data into a data set with missing values and a data set without missing values.

# Handling missing values
inputfile$date = as.numeric(inputfile$date)   # convert the date to a numeric variable
sub = which(is.na(inputfile$sales))           # row numbers of the missing values
inputfile1 = inputfile[-sub, ]                # split the data set into complete data and missing data
inputfile2 = inputfile[sub, ]

3. Noise data handling -- the binning method

After a continuous variable is graded, data falling in different quantiles becomes data of different grades; the continuous variable is discretized, and the influence of extreme values is eliminated.
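
A minimal sketch of grading by quantile, assuming simulated data and four illustrative grade labels. Equal-depth binning uses quantiles as cut points; equal-width binning is shown for comparison:

```r
set.seed(1)
x <- c(rnorm(99, mean = 50, sd = 10), 500)  # simulated data with one extreme value

# Equal-depth binning: quartile cut points, so each grade holds about 25% of the data
breaks <- quantile(x, probs = seq(0, 1, 0.25))
grade <- cut(x, breaks = breaks,
             labels = c("low", "mid-low", "mid-high", "high"),
             include.lowest = TRUE)
table(grade)

# Equal-width binning: 4 intervals of equal length over the range of x
width_grade <- cut(x, breaks = 4)
table(width_grade)
```

With equal-depth bins the extreme value simply lands in the top grade, whereas with equal-width bins it stretches the range and leaves most observations in the first interval.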

4. Abnormal value handling -- mean substitution

The data set is split into two blocks: missing values and non-missing values. For a continuous variable, missing values can be replaced with the mean; for a discrete variable, with the mode or the median.

Compute the mean of the non-missing data, then assign that value to the missing data.

# Mean substitution for missing values, storing the result
# Idea: split into two parts, assign the mean to the missing values, then recombine
avg_sales = mean(inputfile1$sales)                     # mean of the non-missing part of the variable
inputfile2$sales = rep(avg_sales, nrow(inputfile2))    # replace the missing values with the mean
result2 = rbind(inputfile1, inputfile2)                # merge in the imputed data
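
For a discrete variable the mode can be used instead of the mean. Base R has no function that returns the most frequent value (mode() reports the storage mode), so the sketch below defines a hypothetical helper, stat_mode, on made-up data:

```r
# Hypothetical helper: most frequent non-missing value, returned as a character string
stat_mode <- function(x) {
  counts <- table(x)                # table() drops NA by default
  names(counts)[which.max(counts)]  # first value with the highest count
}

grade <- c("A", "B", "B", NA, "C", "B", NA)  # made-up discrete variable
grade[is.na(grade)] <- stat_mode(grade)      # replace the missing values with the mode
grade
```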

5. Abnormal value handling -- regression imputation

# Regression imputation for missing values, storing the result
model = lm(sales ~ date, data = inputfile1)      # fit the regression model on the complete data
inputfile2$sales = predict(model, inputfile2)    # predict the missing values with the model
result3 = rbind(inputfile1, inputfile2)          # merge in the imputed data
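
The same approach on a self-contained toy example, as a sketch: the data is made up so that sales is exactly linear in date, and a logical mask replaces the split into two data frames.

```r
# Made-up data: sales grows linearly with date, two values are missing
df <- data.frame(date = 1:10,
                 sales = c(10, 20, 30, NA, 50, 60, NA, 80, 90, 100))

miss <- is.na(df$sales)                                 # which rows need imputing
model <- lm(sales ~ date, data = df[!miss, ])           # fit on the complete rows
df$sales[miss] <- predict(model, newdata = df[miss, ])  # fill in the predictions
df
```

Indexing with a logical mask also behaves sensibly when nothing is missing, whereas negative indexing with an empty which() result selects no rows at all.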

6. Abnormal value handling -- multiple imputation -- the mice package

Note: there are two key points in multiple imputation: first delete the observations whose Y variable is missing, then impute.
1. Observations in which the response (Y) variable is missing cannot be imputed; they can only be deleted, not filled in arbitrarily;
2. Only the explanatory variables that enter the model are imputed.

Here is a more detailed introduction to multiple imputation. The author has summarized the steps as follows:

Data set with missing values -- MCMC estimation imputes it into several data sets -- fit a model to each imputed data set (glm, lm) -- combine these models (pool) -- evaluate the imputation model (t statistics of the model coefficients) -- output a complete data set (complete)

The steps in detail:

The function mice() starts from a data frame that contains missing data and returns an object that contains multiple (5 by default) complete data sets.

Each complete data set is generated by imputing the missing data in the original data frame. Because imputation has a random component, each complete data set is slightly different.

When the decision tree method (cart) is used in mice, note the following: the method imputes only numeric variables, the missing values of categorical variables are retained, and cart imputation is generally used on data sets of no more than 5k records.

The with() function can then apply a statistical model (such as a linear model or a generalized linear model) to each complete data set in turn.

Finally, the pool() function combines these individual analysis results into one set of results. The standard errors and p-values of the final model reflect the uncertainty caused by the missing values and by multiple imputation.

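# Multiple imputation for missing values, storing the result
library(lattice)   # load the packages
library(MASS)
library(nnet)
library(mice)      # the first three packages are the basis of mice
imp = mice(inputfile, m = 4)           # 4-fold imputation, which generates 4 data sets with no missing values
# Analysis model, fitted on each imputed data set in turn.
# Do not pass data = inputfile here: that would refit the original data with its missing values.
fit = with(imp, lm(sales ~ date))
pooled = pool(fit)                     # combine the fitted models
summary(pooled)
result4 = complete(imp, action = 3)    # take the third imputed data set as the result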

Interpretation of results:

(1) The imp object contains: the number of missing values for each variable, the imputation method for each variable (pmm, predictive mean matching, is the common one), the variables that were imputed, and the predictor matrix (in the matrix, rows are the variables being imputed, columns are the variables that provide information for the imputation, and 1 and 0 mean used and not used respectively);

In addition, imp$imp$sales shows the exact value filled in at each missing position in each imputed data set.

> imp$imp$sales
        1      2      3      4
9  3614.7 3393.1 4060.3 3393.1
15 2332.1 3614.7 3295.5 3614.7

(2) The with object. The analysis model can vary; lm and glm, for example, can be applied directly. See Chapter 15 of "R in Action" for details;

(3) The pool object. After summary(), the coefficients of the lm model are shown; if the coefficients are not significant, the imputation model should be reconsidered.

(4) The complete object. Any of the m complete imputed data sets can be output with this function.

Other:

The mice package provides a handy function, md.pattern(), that gives a better understanding of the patterns of missing data. Missing values can also be displayed visually with the VIM package, box plots, and lattice. See the blog post: Populating missing data in R -- the mice package
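
A minimal sketch of md.pattern(), assuming the mice package is installed and using nhanes, a small example data set shipped with it (the guard simply skips the example when mice is unavailable):

```r
# Assumes the mice package is installed; nhanes is an example data set shipped with mice
if (requireNamespace("mice", quietly = TRUE)) {
  data(nhanes, package = "mice")
  # One row per missingness pattern (1 = observed, 0 = missing);
  # the last row counts the missing values in each variable
  pattern <- mice::md.pattern(nhanes, plot = FALSE)
  print(pattern)
}
```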

Third, outlier detection

The main difference between outlier detection and the abnormal values of the second section is that abnormal values concern a single variable, while outliers here are points that stand out once many variables are considered together. An outlier detection method based on clustering + Euclidean distance is described below.

The steps of cluster-based outlier detection are: data standardization -- clustering -- find the mean point of each indicator in each cluster -- build a matrix from the mean point of each cluster -- compute Euclidean distances -- plot and judge

data = read.csv("./data.csv", header = TRUE)[, 2:4]
data = scale(data)               # standardize the data
set.seed(12)
km = kmeans(data, centers = 3)   # cluster into 3 groups
print(km)
km$centers                       # mean point of each cluster
# Euclidean distance from each sample to each cluster center, row by row (940 samples)
x1 = matrix(km$centers[1, ], nrow = 940, ncol = 3, byrow = TRUE)
juli1 = sqrt(rowSums((data - x1)^2))
x2 = matrix(km$centers[2, ], nrow = 940, ncol = 3, byrow = TRUE)
juli2 = sqrt(rowSums((data - x2)^2))
x3 = matrix(km$centers[3, ], nrow = 940, ncol = 3, byrow = TRUE)
juli3 = sqrt(rowSums((data - x3)^2))
dist = data.frame(juli1, juli2, juli3)
## minimum Euclidean distance of each sample
y = apply(dist, 1, min)
plot(1:940, y, xlim = c(0, 940), xlab = "Sample Point", ylab = "Euclidean distance")
points(which(y > 2.5), y[which(y > 2.5)], pch = 19, col = "red")   # highlight samples farther than 2.5 from every center
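
The same steps can be sketched without a data file and without hard-coding the number of rows. Everything below is an assumption for illustration: the data is simulated (three tight groups plus one planted outlier in row 151), nstart = 50 is added so kmeans() tries several random starts, and the 2.5 threshold is simply reused from the code above.

```r
set.seed(12)
# Simulated data: three tight groups plus one planted outlier in row 151
raw <- rbind(
  cbind(rnorm(50, 0, 0.5),  rnorm(50, 0, 0.5)),
  cbind(rnorm(50, 10, 0.5), rnorm(50, 0, 0.5)),
  cbind(rnorm(50, 0, 0.5),  rnorm(50, 10, 0.5)),
  c(20, 20)
)
x <- scale(raw)                            # standardize each column
km <- kmeans(x, centers = 3, nstart = 50)  # cluster into 3 groups

# Distance from every sample to every cluster center (one column per center)
d <- sapply(seq_len(nrow(km$centers)), function(k) {
  sqrt(rowSums(sweep(x, 2, km$centers[k, ])^2))
})
y <- apply(d, 1, min)      # distance to the nearest center
flagged <- which(y > 2.5)  # samples farther than the threshold from every center
flagged
```

A suitable threshold depends on the data; inspecting the plot of y, as in the code above, is one way to choose it.
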
Original address: 51210793

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
