Outlier Handling


Outlier handling is an important step in data preprocessing, and it has become more important as data sets have grown larger. This article summarizes some common methods for identifying outliers.
1. The 3σ rule
This rule assumes that the data follow a normal distribution. Values greater than μ+3σ or less than μ-3σ are treated as outliers, where μ is the mean of the data and σ is its standard deviation. Under a normal distribution, about 99.7% of values fall within μ±3σ, so values outside that interval are rare.
MATLAB code example

% Outlier handling
% using the 3-sigma rule
clc;
clear all;
data0=xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx','b2:i266283'); % read raw data

% Variable 1 outlier handling
bl1=data0(1:237,1);
[m,n]=size(bl1);
ave=mean(bl1);                        % mean
sigma=sqrt((bl1'-ave)*(bl1-ave)/m);   % standard deviation (divides by m)
fangcha=sigma^2;                      % variance
jicha=max(bl1)-min(bl1);              % range

sx=ave+3*sigma;   % upper bound
xx=ave-3*sigma;   % lower bound
ycz=[];           % outliers: column 1 is the value, column 2 is the row index
zcz=[];           % normal values: column 1 is the value, column 2 is the row index
s=1;
s1=1;
for i=1:m
    if bl1(i,1)<xx || bl1(i,1)>sx
        ycz(s,1)=bl1(i,1);
        ycz(s,2)=i;
        s=s+1;
    end
    if bl1(i,1)<=sx && bl1(i,1)>=xx   % values on the bounds count as normal
        zcz(s1,1)=bl1(i,1);
        zcz(s1,2)=i;
        s1=s1+1;
    end
end
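
The same rule can also be written without an explicit loop. The following is a minimal sketch using logical indexing on a small synthetic sample; the data and the variable names (x, isOut, outliers, normals) are illustrative assumptions and do not come from the original data file.

```matlab
% Vectorized 3-sigma check on a small synthetic sample (illustrative data).
x = [10*ones(19,1); 100];           % 19 ordinary values and one extreme value
mu = mean(x);                       % mean of the sample
sigma = std(x, 1);                  % standard deviation, normalized by the sample size
isOut = abs(x - mu) > 3*sigma;      % true where the value lies outside mu +/- 3*sigma
outliers = [x(isOut), find(isOut)];     % column 1: value, column 2: row index
normals  = [x(~isOut), find(~isOut)];   % remaining values with their row indices
```

Here std(x, 1) divides by the number of samples, which matches the standard deviation computed in the loop version above.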

2. Box plot method
A simple box plot consists of five parts: the minimum, the median, the maximum, and the two quartiles. The first quartile (Q1), also called the lower quartile, is the value at the 25% position after all sample values are sorted in ascending order. The second quartile (Q2), also called the median, is the value at the 50% position. The third quartile (Q3), also called the upper quartile, is the value at the 75% position.

Outliers are defined as values less than Q1-1.5IQR or greater than Q3+1.5IQR, where IQR = Q3-Q1 is the interquartile range. Although this criterion is somewhat arbitrary, it comes from empirical judgment, and experience shows that it works well for flagging data that needs special attention.
MATLAB code example

% Outlier handling
% using the box plot method
clc;
clear all;
data0=xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx','b2:i266283'); % read raw data
[m,n]=size(data0);
w1=round(m/4);     % position of the first quartile
%m1=m/2;           % position of the median
w3=round(3*m/4);   % position of the third quartile

% Variable 1 outlier handling
bl1=data0(:,1);
[a1,b1]=sort(bl1); % sorts in ascending order; a1 is the sorted result, b1 is the original position of each element of a1
q11=a1(w1,1);      % first quartile (Q1)
q13=a1(w3,1);      % third quartile (Q3)
qr1=q13-q11;       % interquartile range (IQR)
sx1=q13+1.5*qr1;   % upper bound
xx1=q11-1.5*qr1;   % lower bound
ycz1=[];           % outlier matrix: column 1 is the value, column 2 is the row index
s1=1;
for i=1:m
    if bl1(i,1)>sx1 || bl1(i,1)<xx1
        ycz1(s1,1)=bl1(i,1);
        ycz1(s1,2)=i;
        s1=s1+1;
    end
end
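
To make the arithmetic of the box plot rule concrete, here is a small worked sketch on twelve made-up values. It uses the same quartile positions as the listing above, round(m/4) and round(3*m/4); the data and the names lowerFence and upperFence are assumptions for illustration only.

```matlab
% Box plot rule on a small synthetic sample (illustrative data).
x = [2; 4; 4; 5; 6; 7; 8; 9; 10; 11; 12; 50];
m = numel(x);
a = sort(x);                       % ascending order
q1 = a(round(m/4));                % first quartile: a(3) = 4
q3 = a(round(3*m/4));              % third quartile: a(9) = 10
iqrVal = q3 - q1;                  % interquartile range: 6
lowerFence = q1 - 1.5*iqrVal;      % 4 - 9 = -5
upperFence = q3 + 1.5*iqrVal;      % 10 + 9 = 19
isOut = x < lowerFence | x > upperFence;
outliers = [x(isOut), find(isOut)];    % column 1: value, column 2: row index
```

Only the value 50 in row 12 lies outside the interval from -5 to 19, so it is the single value flagged.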

3. Grubbs test method

If the Grubbs test detects an outlier, remove it and apply the test again to the remaining values, repeating until no further outlier is detected.
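
The repeated procedure described above can be sketched as follows. This is a minimal illustration and not the original author's code. It assumes the common two-sided form of the test, with statistic G = max|x - mean| / s and a critical value derived from the t distribution, a significance level of 0.05, and the tinv function from the Statistics and Machine Learning Toolbox (or the Octave statistics package). The data are made up.

```matlab
% Iterative two-sided Grubbs test (sketch under the assumptions stated above).
x = [9.8; 10.1; 10.0; 9.9; 10.2; 10.0; 9.7; 10.3; 10.1; 15.0];  % illustrative data
alpha = 0.05;                  % assumed significance level
idx = (1:numel(x))';           % original positions of the values still under test
outlierIdx = [];               % original positions of the detected outliers

while numel(x) > 2
    n = numel(x);
    [dev, k] = max(abs(x - mean(x)));    % most extreme remaining value
    G = dev / std(x);                    % Grubbs statistic (std divides by n-1)
    t = tinv(1 - alpha/(2*n), n - 2);    % t quantile with n-2 degrees of freedom
    Gcrit = (n - 1)/sqrt(n) * sqrt(t^2 / (n - 2 + t^2));   % critical value
    if G <= Gcrit
        break;                           % no further outlier can be detected
    end
    outlierIdx(end+1, 1) = idx(k);       % record the outlier, then remove it
    x(k) = [];
    idx(k) = [];
end
```

With these values, the first pass flags 15.0 (row 10), and the second pass finds nothing more among the remaining nine values, so the loop stops.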
4. Mahalanobis distance method
[Figure: steps of the method]

[Figure: chi-square distribution table]
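
For reference, the usual textbook form of the squared Mahalanobis distance is given below. Its notation is an assumption and may differ from the original figure. Each sample x_i is a row vector of n variables, m is the number of samples, the mean vector corresponds to what mean returns, and S corresponds to what cov returns.

```latex
D_i^2 = (x_i - \bar{x})\, S^{-1}\, (x_i - \bar{x})^{\mathsf{T}},
\qquad
\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i,
\qquad
S = \frac{1}{m-1}\sum_{i=1}^{m} (x_i - \bar{x})^{\mathsf{T}} (x_i - \bar{x})
```

If the data are approximately multivariate normal, D_i^2 approximately follows a chi-square distribution with n degrees of freedom, so sample i is flagged as an outlier when D_i^2 exceeds the chi-square critical value for the chosen confidence level.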

MATLAB code example

% Outlier handling
% using the Mahalanobis distance method
clc;
clear all;
data0=xlsread('C:\Users\Administrator\Desktop\data preprocessing\data1703.xlsx','b2:i241'); % read raw data
ave=mean(data0); % for a matrix, mean treats each column as a vector and returns a row vector of the column means
[m,n]=size(data0);
% covariance matrix
xfc=cov(data0);
%xfcni=inv(xfc); % inverse of the covariance matrix
delta=zeros(m,n);
for i=1:m
    delta(i,:)=data0(i,:)-ave(1,:); % difference between the sample and the mean
end
%deltazz=delta'; % n*m, transpose of the differences between the samples and the mean

% Compute the Mahalanobis distance
msjl=zeros(m,1);
for i=1:m
    msjl(i,1)=delta(i,:)/xfc*(delta(i,:)'); % squared Mahalanobis distance of sample i
end
s=0; % number of outliers
for i=1:m
    % 17.535 is the chi-square value for 8 degrees of freedom at a confidence level of 0.975
    % (upper-tail probability 0.025); 2.18 is the lower-tail value and would flag almost every sample
    if msjl(i,1)>17.535
        s=s+1;
    end
end

5. Other methods
See the standard GB/T 4883-2008, "Statistical interpretation of data: Detection and treatment of outliers in the normal sample".

Note: Parts of this article draw on posts by other bloggers, to whom I am grateful.
