Advanced methods for handling missing data in R

Last Update:2018-08-23 Source: Internet

Author: User

This article mainly uses the VIM and mice packages:

install.packages(c("VIM", "mice"))

1. Procedures for handling missing values

Steps:

(1) Identify the missing data;

(2) Examine the causes of the missing data;

(3) Delete the cases that contain missing values, or replace the missing values with reasonable substitutes (imputation).

Classification of missing data:

(1) Missing completely at random (MCAR): if the missingness of a variable is unrelated to any other observed or unobserved variable, the data are missing completely at random.

(2) Missing at random (MAR): if the missingness of a variable is related to other observed variables but not to its own unobserved value, the data are missing at random.

(3) Not missing at random (NMAR): if the missing data are neither MCAR nor MAR, the data are not missing at random.

2. Identify missing values

NA: represents a missing value;

NaN: represents an impossible value (not a number);

Inf: represents positive infinity;

-Inf: represents negative infinity.

is.na(): identifies missing values;

is.nan(): identifies impossible values;

is.infinite(): identifies infinite values.

Return values of the is.na(), is.nan(), and is.infinite() functions:

x           is.na(x)   is.nan(x)   is.infinite(x)
x <- NA     TRUE       FALSE       FALSE
x <- 0/0    TRUE       TRUE        FALSE
x <- 1/0    FALSE      FALSE       TRUE
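The return values above can be checked directly. The following is a minimal sketch in base R; the vector name `values` and the extra entries (-1/0 and 42) are only illustrative.

```r
# Base R only: apply the three predicates to each special value.
values <- c(NA, 0/0, 1/0, -1/0, 42)

result <- data.frame(
  x           = format(values),       # printable label for each value
  is.na       = is.na(values),        # TRUE for NA and also for NaN
  is.nan      = is.nan(values),       # TRUE only for NaN
  is.infinite = is.infinite(values)   # TRUE for Inf and -Inf
)
print(result)
```

Note that is.na() also returns TRUE for NaN, so a test for "truly missing" values needs both is.na() and is.nan().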

complete.cases() identifies the rows of a matrix or data frame that have no missing values. It returns a logical vector with TRUE for every row that is a complete case and FALSE for every row that has one or more missing values.

3. Explore missing value patterns

(1) Tabulate missing values

The md.pattern() function in the mice package produces a table of the missing value patterns in a matrix or data frame:

library(mice)
data(sleep, package = "VIM")
md.pattern(sleep)
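To see what such a pattern table summarizes, a similar tabulation can be sketched in base R. This is not how md.pattern() is implemented; it only mimics the idea, and it uses the built-in airquality dataset (an assumption made here so that the sketch runs without VIM or mice installed).

```r
# airquality ships with base R; Ozone and Solar.R contain NAs.
data(airquality)

# 1 = observed, 0 = missing (the same coding md.pattern() uses).
observed <- 1L - is.na(airquality)

# Collapse each row into a pattern string and count identical patterns.
pattern <- apply(observed, 1, paste, collapse = "")
pattern_counts <- sort(table(pattern), decreasing = TRUE)

print(colnames(observed))   # column order of the pattern strings
print(pattern_counts)       # how many rows share each pattern

# Number of missing values per variable.
missing_per_variable <- colSums(is.na(airquality))
print(missing_per_variable)
```

Each pattern string has one digit per column, so a string of all 1s is a complete case.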

(2) Graphic exploration of missing data

The VIM package provides many functions for visualizing missing value patterns in a dataset, including aggr(), matrixplot(), and scattMiss().

library("VIM")
aggr(sleep, prop = FALSE, numbers = TRUE)

library("VIM")
aggr(sleep, prop = TRUE, numbers = TRUE)  # show proportions instead of counts

The matrixplot() function produces a plot that displays the data for every case:

matrixplot(sleep)

A light color indicates a small value, a dark color indicates a large value, and the default missing value is red.

The marginplot() function produces a scatter plot of two variables and shows information about their missing values in the plot margins.

library("VIM")
marginplot(sleep[c("Gest", "Dream")], pch = c(20), col = c("darkgray", "red", "blue"))

(3) Exploring missing values with correlations

Shadow matrix: replace the data in a dataset with indicator variables (1 for missing, 0 for present). The resulting matrix is sometimes called a shadow matrix.

Correlating these indicator variables with each other shows which variables tend to be missing together. Correlating them with the original (observed) variables shows how the missingness of one variable relates to the values of the others.

head(sleep)
str(sleep)
x <- as.data.frame(abs(is.na(sleep)))     # shadow matrix
head(sleep, n = 5)
head(x, n = 5)
y <- x[which(sapply(x, sd) > 0)]          # keep variables that have some missing values
cor(y)
cor(sleep, y, use = "pairwise.complete.obs")
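The same shadow-matrix steps can be tried without installing VIM. The sketch below swaps in the built-in airquality dataset for sleep (an assumption for the sake of a self-contained example); the variable names `shadow` and `has_missing` are illustrative.

```r
data(airquality)

# Shadow matrix: 1 where a value is missing, 0 where it is observed.
shadow <- as.data.frame(abs(is.na(airquality)))

# Keep only the indicators that vary, i.e. variables with some missing values.
has_missing <- shadow[, sapply(shadow, sd) > 0, drop = FALSE]

# Do variables tend to be missing together?
print(cor(has_missing))

# Is missingness related to the observed values of the variables?
# A variable paired with its own indicator gives NA (with a warning), because
# the indicator is constant on the rows where that variable is observed.
shadow_cor <- suppressWarnings(
  cor(airquality, has_missing, use = "pairwise.complete.obs")
)
print(round(shadow_cor, 2))
```

Rows of the second matrix are the observed variables, and columns are the missingness indicators.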

4. Understanding the sources and impact of missing data

The number, distribution, and pattern of missing data are identified for two purposes:

(1) Analyzing the potential mechanism of generating missing data;

(2) Evaluate the impact of missing data on answering substantive questions.

That is, we need to answer:

(1) What proportion of the data is missing?

(2) Are the missing data concentrated in a few variables or spread widely?

(3) Do the missing values appear to be random?

(4) Do the correlations among the missing data, or between the missing and the observed data, suggest a mechanism that produced the missing values?

If the missing data are concentrated in a few relatively unimportant variables, you can delete those variables and then carry out the data analysis as usual.

If a small amount of data is missing at random across the whole dataset (MCAR), you can analyze only the complete cases and still obtain reliable and valid results.

If you can assume that the data are MCAR or MAR, multiple imputation can be used to obtain valid conclusions.

If the data are NMAR, you need specialized methods, new data collection, or a move to an easier and more rewarding profession.

5. Rational approaches to incomplete data

6. Complete-case analysis (listwise deletion)

The functions complete.cases() and na.omit() can be used to keep only the cases (rows) of a data frame or matrix that have no missing values:

newdata <- mydata[complete.cases(mydata), ]
newdata <- na.omit(mydata)

options(digits = 1)
cor(na.omit(sleep))
cor(sleep, use = "complete.obs")

fit <- lm(Dream ~ Span + Gest, data = na.omit(sleep))
summary(fit)
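As a quick check that the two approaches to listwise deletion keep the same rows, here is a minimal sketch on the built-in airquality dataset (used instead of sleep so that no extra package is needed; the model formula is only an example).

```r
data(airquality)

# Two equivalent ways to drop every row that has at least one NA.
complete_rows <- airquality[complete.cases(airquality), ]
omitted       <- na.omit(airquality)

cat("rows in original data:", nrow(airquality), "\n")
cat("rows kept:            ", nrow(complete_rows), "\n")

# Both approaches keep exactly the same rows.
same_rows <- identical(rownames(complete_rows), rownames(omitted))
print(same_rows)

# Fit a model on the complete cases only.
fit <- lm(Ozone ~ Wind + Temp, data = omitted)
print(coef(fit))
```

The row counts show how much data listwise deletion discards, which is worth checking before relying on it.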

7. Multiple imputation

Multiple imputation (MI) is a method for handling missing values that is based on repeated simulation.

MI generates a set of complete datasets from a dataset that contains missing values. In each simulated dataset, the missing data are filled in using Monte Carlo methods.

Standard statistical methods can then be applied to each simulated dataset, and the outputs are combined to give estimates and confidence intervals that take into account the uncertainty introduced by the missing values.

Available packages: Amelia, mice, and mi.

The mice() function starts with a data frame that contains missing data and returns an object containing several complete datasets. Each complete dataset is created by imputing the missing data in the original data frame.

The with() function can then apply a statistical model to each complete dataset in turn.

The pool() function combines these separate analyses into a single set of results.

The standard errors and p-values of the final model will reflect the uncertainty caused by both the missing values and the multiple imputations.

An analysis based on the mice package usually follows this workflow:

library(mice)
imp <- mice(mydata, m)
fit <- with(imp, analysis)
pooled <- pool(fit)
summary(pooled)

mydata is a matrix or data frame that contains missing values;

imp is a list object that contains the m imputed datasets, along with information on how the imputations were carried out; by default, m = 5;

analysis is a formula object that specifies the statistical analysis to apply to each of the m imputed datasets. Examples include lm() for linear regression models, glm() for generalized linear models, gam() for generalized additive models, and nbrm() for negative binomial models;

fit is a list object that contains the results of the m separate statistical analyses;

pooled is a list object that contains the averaged results of these m statistical analyses.
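To make the impute, analyze, and pool steps concrete, the following base R sketch walks through them by hand. It is a simplified illustration, not what mice() does internally: the imputation step simply draws from the observed values of each column (the helper `impute_once` is hypothetical), the data are the built-in airquality dataset, and the pooling uses Rubin's rules for the point estimates and standard errors.

```r
set.seed(1234)            # arbitrary seed so the sketch is reproducible
data(airquality)
m <- 5                    # number of imputed datasets

# Toy imputation step (NOT the method mice() uses): fill each missing
# value with a random draw from the observed values of the same column.
impute_once <- function(df) {
  for (col in names(df)) {
    miss <- is.na(df[[col]])
    if (any(miss)) {
      df[[col]][miss] <- sample(df[[col]][!miss], sum(miss), replace = TRUE)
    }
  }
  df
}

# Analysis step: fit the same model on each completed dataset.
fits <- lapply(seq_len(m), function(i) {
  lm(Ozone ~ Wind + Temp, data = impute_once(airquality))
})

est  <- sapply(fits, coef)                        # one column of coefficients per fit
vars <- sapply(fits, function(f) diag(vcov(f)))   # squared standard errors per fit

# Pooling step (Rubin's rules).
pooled_est <- rowMeans(est)              # average of the m estimates
within     <- rowMeans(vars)             # average within-imputation variance
between    <- apply(est, 1, var)         # variance of the estimates across imputations
total_var  <- within + (1 + 1 / m) * between

pooled <- data.frame(estimate = pooled_est, std.error = sqrt(total_var))
print(pooled)
```

The between-imputation term is what makes the pooled standard errors larger than those from any single completed dataset, which is how the uncertainty due to the missing values enters the final result.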

library(mice)
data(sleep, package = "VIM")
imp <- mice(sleep, seed = 1234)

fit <- with(imp, lm(Dream ~ Span + Gest))
pooled <- pool(fit)
summary(pooled)

imp

imp$imp$Dream

Use the complete() function to view any one of the m imputed datasets, in the form complete(imp, action = #):

dataset3 <- complete(imp, action = 3)
dataset3

8. Other ways to deal with missing values

Specialized packages for dealing with missing data:

Package                                  Description
Hmisc                                    Many functions supporting simple imputation, multiple imputation, and imputation using canonical variates
mvnmle                                   Maximum likelihood estimation for multivariate normal data with missing values
cat                                      Multiple imputation of multivariate categorical data under log-linear models
arrayImpute, arrayMissPattern, SeqKnn    Utility functions for dealing with missing microarray data
longitudinalData                         Utility functions, including routines for imputing missing time series values
kmi                                      Kaplan-Meier multiple imputation for survival analysis with missing data
mix                                      Multiple imputation for mixed categorical and continuous data under the general location model
pan                                      Multiple imputation for multivariate panel or clustered data

(1) Pairwise deletion. Pairwise deletion is often considered as an alternative to listwise deletion when working with datasets that have missing values. With pairwise deletion, an observation is dropped only from those analyses that involve a variable for which it has a missing value.

cor(sleep, use = "pairwise.complete.obs")
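As a rough illustration of what pairwise deletion does, the sketch below counts how many rows contribute to each pairwise correlation. It uses three columns of the built-in airquality dataset in place of sleep (an assumption, to keep the example free of extra packages).

```r
data(airquality)
vars <- airquality[c("Ozone", "Solar.R", "Wind")]

# 1 where a value is observed, 0 where it is missing.
observed <- 1 - is.na(vars)

# Entry [i, j] = number of rows in which variables i and j are both observed.
pair_n <- crossprod(observed)
print(pair_n)

# Listwise deletion, by contrast, uses one common set of rows.
n_complete <- sum(complete.cases(vars))
cat("complete cases:", n_complete, "\n")

print(round(cor(vars, use = "pairwise.complete.obs"), 2))
print(round(cor(vars, use = "complete.obs"), 2))
```

Every entry of pair_n is at least as large as the number of complete cases, and the entries differ from pair to pair, so each correlation is computed on a different subset of the data.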

Although pairwise deletion appears to use all of the available data, it is not recommended: each calculation is in fact based on a different subset of the data, which can lead to distorted results.

(2) Simple (nonstochastic) imputation. In simple imputation, the missing values of a variable are replaced with a single value (such as the mean, median, or mode). The substitution is not random, so no random error is introduced (unlike multiple imputation). One advantage of simple imputation is that it solves the "missing value problem" without reducing the sample size available for analysis. Although simple imputation is easy to use, it can produce biased results for data that are not MCAR. If a large amount of data is missing, simple imputation is likely to underestimate standard deviations, distort the correlations among variables, and produce incorrect p-values in statistical tests. This method should be avoided wherever possible.
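The effect of mean substitution on the standard deviation is easy to demonstrate. This is a minimal sketch on the Ozone column of the built-in airquality dataset (chosen here only because it ships with R and has many missing values).

```r
data(airquality)
ozone <- airquality$Ozone

# Simple imputation: replace every NA with the mean of the observed values.
ozone_mean_imputed <- ifelse(is.na(ozone), mean(ozone, na.rm = TRUE), ozone)

sd_observed <- sd(ozone, na.rm = TRUE)
sd_imputed  <- sd(ozone_mean_imputed)

cat("missing values:          ", sum(is.na(ozone)), "\n")
cat("sd of observed values:   ", sd_observed, "\n")
cat("sd after mean imputation:", sd_imputed, "\n")
```

The mean is unchanged, but every imputed value sits exactly at the mean, so the spread shrinks and any standard error computed from the filled-in column is too small.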

9. Producing publication-quality output in R

Common methods: Sweave and odfWeave.

Sweave embeds R code and output in LaTeX documents, producing high-quality typeset reports in PDF, PostScript, and DVI formats.

The odfWeave package embeds R code and output in ODF (Open Document Format) documents.

This article is an English version of an article originally published in Chinese on aliyun.com.
