Outlier Handling

Last Update: 2018-08-02 Source: Internet

Author: User

Outlier handling is an important step in data preprocessing, and it matters more as data sets grow larger. This article summarizes some common methods for identifying outliers.
1. 3σ rule
The data are assumed to follow a normal distribution. Values greater than μ+3σ or less than μ-3σ are treated as outliers, where μ is the mean of the data and σ is its standard deviation.
MATLAB code example

% Outlier handling
% using the 3-sigma method
clc;
clear all;
data0 = xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx', 'b2:i266283'); % read raw data

% Variable 1: outlier handling
bl1 = data0(1:237,1);
[m,n] = size(bl1);
ave = mean(bl1); % mean
sigma = sqrt((bl1'-ave)*(bl1-ave)/m); % standard deviation (population form, divides by m)
fangcha = sigma^2; % variance
jicha = max(bl1)-min(bl1); % range

sx = ave+3*sigma; % upper bound
xx = ave-3*sigma; % lower bound
ycz = []; % outliers: [value, row index]
zcz = []; % normal values: [value, row index]
s = 1;
s1 = 1;
for i = 1:m
    if bl1(i,1) < xx || bl1(i,1) > sx
        ycz(s,1) = bl1(i,1);
        ycz(s,2) = i;
        s = s+1;
    else % values on the bounds count as normal
        zcz(s1,1) = bl1(i,1);
        zcz(s1,2) = i;
        s1 = s1+1;
    end
end
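
The MATLAB listing above depends on a local spreadsheet, so it cannot be run as shown. As a self-contained illustration of the same rule, here is a minimal sketch in Python (standard library only). The function name and the sample values are hypothetical, and indices are 0-based, unlike MATLAB.

```python
import statistics

def three_sigma_outliers(values):
    """Return (index, value) pairs lying outside mean +/- 3 standard deviations."""
    mu = statistics.fmean(values)
    # Population standard deviation (divides by n), matching the MATLAB listing.
    sigma = statistics.pstdev(values)
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical sample: 19 readings near 10 and one gross error at index 7.
sample = [10.0, 10.2, 9.8, 10.1, 9.9] * 4
sample[7] = 25.0
print(three_sigma_outliers(sample))
```

Because the outlier itself inflates σ, the rule needs a reasonably large sample: with population σ, no value can lie more than sqrt(n-1) standard deviations from the mean, so nothing can be flagged when n is 10 or less.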

2. Box plot method
A simple box plot consists of five parts: the minimum, the median, the maximum, and the two quartiles. The first quartile (Q1), also called the lower quartile, is the value at the 25% position when all sample values are sorted in ascending order. The second quartile (Q2) is the median, the value at the 50% position. The third quartile (Q3), also called the upper quartile, is the value at the 75% position.

Outliers are defined as values less than Q1-1.5IQR or greater than Q3+1.5IQR, where IQR = Q3-Q1 is the interquartile range. Although this criterion is somewhat arbitrary, it comes from empirical judgment, and experience has shown that it works well for flagging data that need special attention.
MATLAB code example

% Outlier handling
% using the box plot method
clc;
clear all;
data0 = xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx', 'b2:i266283'); % read raw data
[m,n] = size(data0);
w1 = round(m/4);   % position of the first quartile
% m1 = m/2;        % position of the median
w3 = round(3*m/4); % position of the third quartile

% Variable 1: outlier handling
bl1 = data0(:,1);
[a1,b1] = sort(bl1); % [a,b] = sort(x) sorts in ascending order; a is the sorted result, b holds the original position of each element of a
q11 = a1(w1,1);    % first quartile
q13 = a1(w3,1);    % third quartile
qr1 = q13-q11;     % interquartile range
sx1 = q13+1.5*qr1; % upper bound
xx1 = q11-1.5*qr1; % lower bound
ycz1 = []; % outlier matrix: [value, row index]
s1 = 1;
for i = 1:m
    if bl1(i,1) > sx1 || bl1(i,1) < xx1
        ycz1(s1,1) = bl1(i,1);
        ycz1(s1,2) = i;
        s1 = s1+1;
    end
end
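
For readers without the spreadsheet, the same criterion can be sketched on inline data. This Python sketch (standard library only) uses the quartile positions from the listing above, round(m/4) and round(3m/4) in the sorted data. The function name, the parameter k, and the sample values are hypothetical, and indices are 0-based.

```python
def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs outside [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(values)
    m = len(ordered)
    # 1-based quartile positions, rounding halves up as MATLAB's round does
    # for positive numbers; clamped so very short inputs stay in range.
    w1 = max(1, int(m / 4 + 0.5))
    w3 = max(1, int(3 * m / 4 + 0.5))
    q1, q3 = ordered[w1 - 1], ordered[w3 - 1]
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical sample with one high and one low outlier.
sample = [12, 15, 14, 13, 16, 15, 14, 13, 40, 14, 15, 2]
print(iqr_outliers(sample))
```

Other quartile conventions (for example, interpolating between neighbouring values) can give slightly different bounds on small samples.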

3. Grubbs' test

If an outlier is detected, remove it and apply Grubbs' test again to the remaining values. Repeat until no further outlier is detected.
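
The iteration described above can be sketched as follows in Python (standard library only). The sketch uses the usual two-sided Grubbs statistic G = max|x_i - mean| / s, with s the sample standard deviation, and compares it with a tabulated critical value for the current sample size. The function name and sample data are hypothetical. The critical values are an assumption: they are two-sided, significance level 0.05 entries as given in commonly published Grubbs tables, and should be checked against the table you actually use.

```python
import statistics

# Assumed two-sided critical values at significance level 0.05, keyed by n.
# Verify against your own table before relying on them.
GRUBBS_CRIT_05 = {6: 1.887, 7: 2.020, 8: 2.126, 9: 2.215, 10: 2.290}

def grubbs_remove(values, crit=GRUBBS_CRIT_05):
    """Remove the most extreme value while its Grubbs statistic exceeds the
    critical value. Returns (kept, removed). Stops when no outlier is detected
    or when the sample size has no entry in the critical-value table."""
    kept, removed = list(values), []
    while len(kept) in crit:
        mean = statistics.fmean(kept)
        s = statistics.stdev(kept)  # sample standard deviation (divides by n - 1)
        if s == 0:
            break  # all values identical, nothing to test
        suspect = max(kept, key=lambda v: abs(v - mean))
        g = abs(suspect - mean) / s
        if g <= crit[len(kept)]:
            break  # most extreme value is not significant
        kept.remove(suspect)
        removed.append(suspect)
    return kept, removed

# Hypothetical sample: nine readings near 10 and one suspicious value.
sample = [9.9, 10.1, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 13.5]
kept, removed = grubbs_remove(sample)
print(removed)
```

Like the 3σ rule, the test assumes the data without the outlier are approximately normal.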
4. Mahalanobis distance method
[Figure: steps of the method]

[Figure: chi-square distribution table]

MATLAB code example

% Outlier handling
% using the Mahalanobis distance method
clc;
clear all;
data0 = xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx', 'b2:i241'); % read raw data
ave = mean(data0); % for a matrix, mean(A) treats each column as a vector and returns a row vector of the column means
[m,n] = size(data0);
% covariance matrix of the data
xfc = cov(data0);
% xfcni = inv(xfc); % inverse of the covariance matrix

delta = zeros(m,n);
for i = 1:m
    delta(i,:) = data0(i,:)-ave(1,:); % difference between each sample and the mean
end
% deltazz = delta'; % n-by-m, transpose of the differences

% Compute the squared Mahalanobis distance of each sample
msjl = zeros(m,1);
for i = 1:m
    msjl(i,1) = delta(i,:)/xfc*(delta(i,:)');
end
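s = 0; % number of samples flagged as outliers
for i = 1:m
    if msjl(i,1) > 17.535 % 0.975 quantile of the chi-square distribution with 8 degrees of freedom (2.18 is the 0.025 quantile, which about 97.5% of normal samples exceed)
        s = s+1;
    end
end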
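
To make the computation concrete without the spreadsheet, here is a minimal two-variable sketch in Python (standard library only). It computes the same quantity as the listing above, the squared Mahalanobis distance from the sample mean, with the 2x2 covariance inverse written out by hand. The function name and data are hypothetical. The threshold relies on the fact that the chi-square distribution with 2 degrees of freedom has the closed-form quantile -2 ln(1 - p).

```python
import math
import statistics

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]], dividing by n - 1
    # as MATLAB's cov does by default.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # [dx dy] * inv(S) * [dx dy]' with the 2x2 inverse expanded.
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

# Hypothetical data: twelve points on the line y = x and one point off it.
points = [(float(i), float(i)) for i in range(1, 13)] + [(2.0, 11.0)]
d2 = mahalanobis_sq_2d(points)
threshold = -2 * math.log(1 - 0.975)  # 0.975 quantile, chi-square with 2 d.o.f.
flagged = [i for i, d in enumerate(d2) if d > threshold]
print(flagged)
```

The point (2, 11) lies inside the range of both variables taken separately, so a per-variable check would miss it. The Mahalanobis distance flags it because it breaks the correlation between the two variables.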

5. Other methods
For other methods, see the standard GB/T 4883-2008, "Statistical interpretation of data: Detection and treatment of outliers in the normal sample".

Note: Some of this content draws on articles by other bloggers, to whom I express my thanks.

