Outlier handling is an important step in data preprocessing, and it has become even more important in the era of big data. This article summarizes some common methods for identifying outliers.
1. 3σ rule
The data are assumed to follow a normal distribution. Values greater than μ+3σ or less than μ-3σ are treated as outliers, where μ is the mean of the data and σ is its standard deviation.
MATLAB code example
% Outlier handling
% using the 3-sigma rule
clc;
clear all;
data0 = xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx', 'b2:i266283'); % read raw data
% Outlier handling for variable 1
bl1 = data0(1:237, 1);
[m, n] = size(bl1);
ave = mean(bl1); % mean
sigma = sqrt((bl1' - ave) * (bl1 - ave) / m); % standard deviation (normalized by m)
fangcha = sigma^2; % variance
jicha = max(bl1) - min(bl1); % range
sx = ave + 3*sigma; % upper limit
xx = ave - 3*sigma; % lower limit
ycz = []; % outliers: [value, row index]
zcz = []; % normal values: [value, row index]
s = 1;
s1 = 1;
for i = 1:m
    if bl1(i,1) < xx || bl1(i,1) > sx
        ycz(s,1) = bl1(i,1);
        ycz(s,2) = i;
        s = s + 1;
    else
        % values inside [xx, sx] are kept as normal values
        zcz(s1,1) = bl1(i,1);
        zcz(s1,2) = i;
        s1 = s1 + 1;
    end
end
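The loop above can also be written with logical indexing. The following is only a minimal sketch on a made-up sample (the values in x are invented for illustration and are not from data1703.xlsx); it assumes the same normalization by the number of samples as the loop version, which std(x, 1) provides.

```matlab
% 3-sigma rule with logical indexing (sketch on made-up data)
x = [repmat([9; 10; 11], 6, 1); 10; 30]; % 19 ordinary values and one obvious outlier
ave = mean(x);                           % sample mean
sigma = std(x, 1);                       % second argument 1: normalize by n, not n-1
mask = abs(x - ave) > 3*sigma;           % true where a value lies outside mean +/- 3*sigma
outlierIdx = find(mask);                 % row indices of the outliers
outlierVal = x(mask);                    % outlier values
cleaned = x(~mask);                      % remaining values
```

Note that in small samples a single extreme value inflates σ, so the 3σ rule may fail to flag it; the rule is better suited to larger samples.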
2. Box plot method
A simple box plot consists of five parts: the minimum, the lower quartile, the median, the upper quartile, and the maximum. First quartile (Q1): also called the lower quartile, it is the value at the 25% position after all sample values are sorted in ascending order. Median (Q2): also called the second quartile, it is the value at the 50% position after sorting. Third quartile (Q3): also called the upper quartile, it is the value at the 75% position after sorting.
Outliers are defined as values less than Q1-1.5IQR or greater than Q3+1.5IQR, where IQR = Q3-Q1 is the interquartile range. Although this criterion is somewhat arbitrary, it comes from empirical judgment, and experience has shown that it works well for flagging data that need special attention.
MATLAB code example
% Outlier handling
% using the box plot method
clc;
clear all;
data0 = xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx', 'b2:i266283'); % read raw data
[m, n] = size(data0);
w1 = round(m/4); % position of the first quartile
% m1 = m/2; % position of the median
w3 = round(3*m/4); % position of the third quartile
% Outlier handling for variable 1
bl1 = data0(:, 1);
[a1, b1] = sort(bl1); % sort in ascending order: a1 holds the sorted values, b1 the original position of each element of a1
q11 = a1(w1, 1); % first quartile
q13 = a1(w3, 1); % third quartile
qr1 = q13 - q11; % interquartile range
sx1 = q13 + 1.5*qr1; % upper bound
xx1 = q11 - 1.5*qr1; % lower bound
ycz1 = []; % outlier matrix: [value, row index]
s1 = 1;
for i = 1:m
    if bl1(i,1) > sx1 || bl1(i,1) < xx1
        ycz1(s1,1) = bl1(i,1);
        ycz1(s1,2) = i;
        s1 = s1 + 1;
    end
end
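As a quick check of the rule, here is a small sketch on a made-up sample of 12 values (the data are invented for illustration). It takes the quartiles by rank position after sorting, as in the example above; other quartile conventions, such as interpolating between neighbors, can give slightly different bounds.

```matlab
% Box plot rule on a made-up sample (sketch)
x = [7; 3; 50; 1; 9; 5; 11; 2; 8; 4; 10; 6];
m = numel(x);
a = sort(x);                 % ascending order: 1, 2, ..., 11, 50
q1 = a(round(m/4));          % first quartile by rank position
q3 = a(round(3*m/4));        % third quartile by rank position
iqrVal = q3 - q1;            % interquartile range
lower = q1 - 1.5*iqrVal;     % lower bound
upper = q3 + 1.5*iqrVal;     % upper bound
mask = x < lower | x > upper;
outlierIdx = find(mask);     % row indices of the outliers in the original order
```

Here Q1 = 3 and Q3 = 9, so the bounds are -6 and 18, and only the value 50 (row 3) is flagged.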
3. Grubbs test method
If an outlier is detected, remove it and apply the Grubbs test again to the remaining values. Repeat until no further outlier is detected.
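The iteration described above might look like the following sketch. It assumes the usual two-sided form of the test: the statistic is G = max|x_i - mean| / s, with s the sample standard deviation, and a value is flagged when G exceeds the critical value (n-1)/sqrt(n) * sqrt(t^2/(n-2+t^2)), where t is the upper α/(2n) critical value of the t distribution with n-2 degrees of freedom. The sample values and the significance level 0.05 are assumptions made for illustration, and tinv requires the Statistics and Machine Learning Toolbox (or the statistics package in Octave).

```matlab
% Iterative two-sided Grubbs test (sketch on made-up data)
x = [9.8; 10.1; 10.0; 9.9; 10.2; 10.0; 9.7; 10.3; 10.1; 15.0];
alpha = 0.05;                 % significance level (assumed)
removed = [];                 % detected outliers, in order of removal
while numel(x) > 2
    n = numel(x);
    [dev, idx] = max(abs(x - mean(x)));  % largest absolute deviation from the mean
    G = dev / std(x);                    % Grubbs statistic (std normalizes by n-1)
    t = tinv(1 - alpha/(2*n), n - 2);    % upper critical value of the t distribution
    Gcrit = (n - 1)/sqrt(n) * sqrt(t^2/(n - 2 + t^2));
    if G <= Gcrit
        break;                           % no further outlier can be detected
    end
    removed(end+1, 1) = x(idx);          % record the outlier
    x(idx) = [];                         % remove it and test the rest again
end
```

For this sample the first pass flags 15.0; the second pass finds no further outlier, so the loop stops with nine values remaining.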
4. Mahalanobis distance method
Figure: steps of the Mahalanobis distance method
Figure: chi-square distribution table
MATLAB code example
% Outlier handling
% using the Mahalanobis distance method
clc;
clear all;
data0 = xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx', 'b2:i241'); % read raw data
ave = mean(data0); % for a matrix, mean treats each column as a vector and returns a row vector of the column means
[m, n] = size(data0);
% Compute the covariance matrix
xfc = cov(data0);
% xfcni = inv(xfc); % inverse of the covariance matrix
delta = zeros(m, n);
for i = 1:m
    delta(i,:) = data0(i,:) - ave(1,:); % difference between the sample and the mean
end
% deltazz = delta'; % n-by-m, transpose of the differences between the samples and the mean
% Compute the squared Mahalanobis distance
msjl = zeros(m, 1);
for i = 1:m
    msjl(i,1) = delta(i,:) / xfc * (delta(i,:)');
end
s = 0; % number of outliers
for i = 1:m
    if msjl(i,1) > 17.535 % chi-square critical value for 8 degrees of freedom at confidence level 0.975 (upper-tail probability 0.025)
        s = s + 1;
    end
end
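The same computation can be sketched without loops. The example below uses a made-up two-column sample (invented for illustration), so the threshold is the chi-square table value for 2 degrees of freedom rather than 8. With the toolbox function chi2inv available, the threshold could presumably be obtained as chi2inv(0.975, p) instead of being read from the table.

```matlab
% Squared Mahalanobis distance without loops (sketch on made-up data)
X = [repmat([1 1; 1 -1; -1 1; -1 -1], 5, 1); 0 10]; % 20 regular points and one far point
[m, p] = size(X);
delta = bsxfun(@minus, X, mean(X, 1)); % deviation of each row from the column means
S = cov(X);                            % p-by-p sample covariance matrix
d2 = sum((delta / S) .* delta, 2);     % squared Mahalanobis distance of each row
threshold = 7.378;                     % chi-square table value: 2 degrees of freedom, upper-tail probability 0.025
outlierRows = find(d2 > threshold);    % rows flagged as outliers
```

Only the last row, the point (0, 10), exceeds the threshold. A useful sanity check is that the squared distances computed with the sample covariance always sum to p*(m-1).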
5. Other methods
For other methods, refer to the standard GB/T 4883-2008, "Statistical interpretation of data: Detection and treatment of outliers in the normal sample".
Note: Parts of this article draw on posts by other bloggers, to whom I express my thanks.