Missing Value Processing in Data Analysis with R

Source: Internet
Author: User

I recently received some real data containing many missing values. Handling them well makes it easier to do data analysis, model more efficiently, and reduce prediction bias on the test set; naturally, the smaller the bias, the happier we are.

Data preparation

I'm using a geographical sample dataset with coordinates and various chemical components (Ca, N, P, etc.).

There are several ways to check for missing data. The first:

library(VIM)

aggr(env, prop = TRUE, numbers = TRUE)

For function usage, after loading a package you can run help(function) or ?function at the console.
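Base R alone can produce the same per-variable counts that the VIM plot summarizes. A toy sketch, with made-up data standing in for the real env dataset:

```r
# Toy data frame standing in for the real `env` dataset (hypothetical values)
env <- data.frame(
  ca = c(1.2, NA, 3.4, 2.2, NA),
  n  = c(0.5, 0.7, NA, 0.6, 0.8),
  p  = c(NA, 2.1, 2.3, NA, 2.0)
)

# Count of missing values per variable
na_per_col <- colSums(is.na(env))

# Overall proportion of missing cells (what aggr's prop = TRUE reports)
na_prop <- mean(is.na(env))

print(na_per_col)
print(na_prop)
```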


A look at this data is not catastrophic, but the missingness is still fairly serious. The second method:

Use md.pattern(data) from the mice package.


How to interpret this: the last row returned gives the number of missing values per variable, and 98 is the total number of missing values.
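The structure of that output can be reproduced in base R. A sketch on made-up data: each row of the pattern matrix marks observed (1) versus missing (0) cells, mirroring what md.pattern() tabulates, and the "last row" is the per-variable and total NA counts:

```r
# Base-R sketch of what mice::md.pattern() summarizes (toy data)
env <- data.frame(
  ca = c(1.2, NA, 3.4, 2.2, NA),
  n  = c(0.5, 0.7, NA, 0.6, 0.8),
  p  = c(NA, 2.1, 2.3, NA, 2.0)
)

# 1 = observed, 0 = missing, as in md.pattern()'s body
pattern <- ifelse(is.na(env), 0, 1)

# How many rows share each missingness pattern
pattern_counts <- table(apply(pattern, 1, paste, collapse = ""))

# The "last row" of md.pattern(): missing count per variable, plus the total
missing_per_var <- colSums(is.na(env))
total_missing   <- sum(is.na(env))
```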

What are the methods for handling missing values? Here are my main notes.

1. Delete rows with missing values
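A minimal sketch of deletion on toy data, using base R's na.omit() and the equivalent complete.cases():

```r
# Toy data with NAs (hypothetical values)
env <- data.frame(
  ca = c(1.2, NA, 3.4, 2.2),
  n  = c(0.5, 0.7, NA, 0.6)
)

complete_env <- na.omit(env)            # drops rows 2 and 3
same_thing   <- env[complete.cases(env), ]  # equivalent row filter

# When modeling, the same effect via na.action, e.g.:
# lm(ca ~ n, data = env, na.action = na.omit)
nrow(complete_env)
```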

In fact, deletion is only a reasonable choice when you have a large amount of data for model training, e.g., using na.omit() or setting na.action = na.omit when modeling. In practice, the data in hand is often not large, or after deleting the missing values the model can no longer explain the business well, in which case re-imputing the missing values should be considered.

2. Delete individual variables

For variables whose missingness is really serious, e.g., where the number of missing values exceeds a percentage you have set based on the business, you can delete the variable. But in the data I recently received, such a variable was important to the model being built; whether to delete it depends on the variable's role in the model and on the amount of training and test data.

3. Simple interpolation methods

Why do I call these methods simple? Compared with the methods used later, they are somewhat rough, though I don't deny that each of them has its own application scenario.

library(Hmisc)

impute(env$ca, mean)      # mean
impute(env$k, median)     # median
impute(env$p, zs)         # mode; here zs <- ms(env$p)
impute(env$n, "random")   # random

The e1071 package also has an impute() function. The mode has to be calculated yourself:

ms <- function(x) {
  as.numeric(names(table(x))[table(x) == max(table(x))])
}

Of course, to actually impute the data with this method:

env$ca <- impute(env$ca, mean)       # mean; suitable for near-normal distributions
env$k  <- impute(env$k, median)      # median; when the skew is not severe
env$p  <- impute(env$p, zs)          # mode
env$n  <- impute(env$n, "random")    # random

This produces a complete dataset, and the analysis can go on. Good luck.
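If Hmisc is not at hand, the same simple imputations can be done in base R. A self-contained sketch on made-up data (the column names mirror the env dataset above):

```r
# Toy stand-in for the env dataset (hypothetical values)
env <- data.frame(
  ca = c(1.2, NA, 3.4, 2.2),
  k  = c(10, 12, NA, 11),
  p  = c(2, 2, NA, 3),
  n  = c(0.5, NA, 0.7, 0.6)
)

# Mode helper, same idea as the ms() function above
ms <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])
}

env$ca[is.na(env$ca)] <- mean(env$ca, na.rm = TRUE)    # mean
env$k[is.na(env$k)]   <- median(env$k, na.rm = TRUE)   # median
env$p[is.na(env$p)]   <- ms(env$p)                     # mode (table() ignores NA)

# Random draw from the observed values
set.seed(1)
obs <- env$n[!is.na(env$n)]
env$n[is.na(env$n)] <- sample(obs, sum(is.na(env$n)), replace = TRUE)

sum(is.na(env))
```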

Of course, is the data generated this way reasonable? How can I test whether my imputed data is objective, or at least roughly reasonable?

At this point we need to compute the imputation accuracy, which requires the DMwR package: install.packages("DMwR"); library(DMwR).

This package has the function manyNAs(data, 0.2), which returns the rows whose number of missing values exceeds 20% of the number of columns; the 0.2 is tunable.
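The logic behind manyNAs() is easy to see in base R. A sketch of an equivalent function on toy data (this is an illustration, not DMwR's implementation):

```r
# Toy data with a couple of NA-heavy rows (hypothetical values)
env <- data.frame(
  ca = c(1.2, NA, 3.4, NA, 2.0),
  k  = c(10, NA, 12, 11, NA),
  p  = c(2, 2, NA, 3, 2),
  n  = c(0.5, 0.6, 0.7, NA, 0.8),
  s  = c(1, 1, 2, 2, 1)
)

# Rows whose NA count exceeds `frac` of the number of columns,
# the same idea as DMwR's manyNAs(data, frac)
many_nas <- function(data, frac = 0.2) {
  which(rowSums(is.na(data)) > frac * ncol(data))
}

many_nas(env, 0.2)   # rows 2 and 4 each have 2 NAs out of 5 columns
```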

Computing the imputation effect requires the regr.eval() function from the DMwR package.
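regr.eval() reports error metrics between the true and imputed values, including the MAPE used below. The MAPE itself is simple to compute by hand; a sketch with made-up numbers:

```r
# Mean absolute percentage error: mean(|true - predicted| / |true|)
mape <- function(trues, preds) {
  mean(abs(trues - preds) / abs(trues))
}

trues <- c(2.0, 4.0, 5.0)   # hypothetical held-out true values
preds <- c(2.2, 3.8, 5.0)   # hypothetical imputed values
mape(trues, preds)
```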


4. The centralImputation() function in the DMwR package fills missing values with the data's central tendency (the median for numeric columns and the mode for nominal ones).
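The central-tendency idea can be sketched in base R: median for numeric columns, mode for factors. This is an illustration of the idea, not DMwR's implementation:

```r
# Toy data mixing a numeric and a factor column (hypothetical values)
env <- data.frame(
  ca   = c(1.2, NA, 3.4, 2.2),
  soil = factor(c("a", "a", NA, "b"))
)

# Fill each column's NAs with its central tendency,
# the same idea as DMwR's centralImputation()
central_impute <- function(data) {
  for (col in names(data)) {
    x <- data[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      tab <- table(x)
      x[is.na(x)] <- names(tab)[which.max(tab)]   # mode
    }
    data[[col]] <- x
  }
  data
}

filled <- central_impute(env)
```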

To see the gap from the original data, here I use the chart.Correlation() function from the PerformanceAnalytics package.


The figure above shows the distributions after centralImputation(); now look at the imputation effect.



The MAPE value is larger than with the earlier simple methods, so it seems the effect is not good.

Look at the source data again


In general there is little correlation, and the distributions are fairly dispersed.

5. K-nearest neighbors

The knnImputation() function in the DMwR package finds the k observations nearest to the one with the missing value, based on Euclidean distance, then uses inverse-distance weighting over those k neighbors to obtain the imputed value, which replaces the missing value in the source data: knnOutput <- knnImputation(env)
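The mechanics can be sketched in base R for a single missing cell (toy data; knnImputation() handles the general multi-column case):

```r
# Inverse-distance-weighted kNN imputation for one missing cell, base-R sketch
env <- data.frame(
  ca = c(1.0, 1.1, 3.0, 3.1, NA),
  n  = c(0.5, 0.6, 0.9, 1.0, 0.55)
)

k <- 2
target <- which(is.na(env$ca))      # the row needing imputation
donors <- which(!is.na(env$ca))     # rows with ca observed

# Euclidean distance on the fully observed column
d <- abs(env$n[donors] - env$n[target])

# k nearest donors and inverse-distance weights
ord     <- order(d)
nearest <- donors[ord][1:k]
w       <- 1 / d[ord][1:k]

# Weighted average of the neighbors' observed values
env$ca[target] <- sum(w * env$ca[nearest]) / sum(w)
```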

The slightly scary part is that the correlation coefficients barely change; some are unchanged, and the correlations between variables differ only slightly.


We see that the MAPE value drops, so the imputation effect really is improving.

6. rpart

Using a decision tree to predict missing values. Its advantage over the previous methods is that it can impute factor variables (something centralImputation() can also do), but for nominal variables it needs a fairly large amount of data. Use the rpart() function with method = "anova" for numeric variables and method = "class" for factor variables; pay attention to how method is set.
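A sketch of the numeric case on simulated data: fit rpart() with method = "anova" on the rows where the target is observed, then predict the missing rows. The data here is made up for illustration:

```r
library(rpart)

# Simulated stand-in for env: ca is missing in the last 5 rows
set.seed(42)
env <- data.frame(
  ca = c(rnorm(45, 10), rep(NA, 5)),
  n  = rnorm(50, 2),
  p  = rnorm(50, 5)
)
env$ca[1:45] <- env$ca[1:45] + env$n[1:45]   # give the tree something to learn

miss <- is.na(env$ca)

# Train on complete rows, predict the missing ones
fit <- rpart(ca ~ n + p, data = env[!miss, ], method = "anova")
env$ca[miss] <- predict(fit, newdata = env[miss, ])

sum(is.na(env$ca))
```

For a factor target, the same pattern applies with method = "class" and type = "class" in predict().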


Looking at it, the effect is not as good as the K-nearest-neighbors method.

7. mice

This mainly uses the mice() function to build the models and the complete() function to generate the complete data.



Looking at the code, mice is using random forests for the modeling. Finally, computing the imputation effect: no NAs remain.

Then I change the method parameter to method = "norm" (normal distribution).


Then I calculated the imputation effect, and it is quite unsatisfactory. Since the source data distributions show that the variables with missing values are not normally distributed, this is understandable.

Continuing with mice():


To examine the imputation effect, at this point I prefer the densityplot() function from the lattice package.


Look: the prediction for n is not very good, but the trend is about right. The missing-value predictions for the other variables are still good.

8. Other packages and methods

Package: Description
Hmisc: many functions; supports simple imputation, multiple imputation, and canonical-variable imputation
longitudinalData: a series of functions for imputing missing values in time series
pan: multiple imputation of multi-panel or clustered data
kmi: Kaplan-Meier multiple imputation for missing values in survival analysis
cat: multiple imputation of multivariate categorical variables under log-linear models
mvnmle: maximum likelihood estimation of missing values in multivariate normal data
...

These are the ones I have encountered so far, some of which I have used; I will continue to add more.

9. Back to the problem itself

For missing-value handling, different data sources and different business requirements certainly call for different methods; throughout the process we need to maintain objective judgment. Keep at it.

