Handling missing values (NA) in R

Source: Internet
Author: User

In a typical project, data may be incomplete because of device failures, unanswered questions, or incorrectly encoded data. In R, NA (not available) represents a missing value.

The function is.na() detects missing values. It takes an object and returns an object of the same size in which each position holding a missing value is TRUE and every other position is FALSE.

> which(is.na(nhanes2)) # returns the positions of the missing values

> sum(is.na(nhanes2)) # counts the total number of missing values in the dataset nhanes2

> sum(complete.cases(nhanes2)) # counts the number of complete samples in the dataset
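The same three checks can be tried without the nhanes2 data. The following is a minimal sketch on a small made-up data frame (the column names and values are illustrative only, not taken from nhanes2):

```r
# Toy data frame with hypothetical values (not nhanes2)
df <- data.frame(
  age = c(25, NA, 40, 31),
  bmi = c(22.1, 27.3, NA, NA)
)

na_mask      <- is.na(df)               # logical matrix, same shape as df
na_positions <- which(na_mask)          # column-major positions of the NAs
na_total     <- sum(na_mask)            # total number of missing values
n_complete   <- sum(complete.cases(df)) # number of rows without any NA

print(na_mask)
print(na_positions)
print(na_total)
print(n_complete)
```

Note that which() on the logical matrix counts positions column by column, so a position in the second column is offset by the number of rows.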

The function md.pattern() from the mice package shows the pattern of missing values, where 1 indicates that a value is present and 0 indicates that it is missing:


The 13 in the first column of the first row indicates that 13 samples are complete, the 7 in the second-to-last row indicates that 7 samples are missing the three variables hyp, bmi, and chl, and the last row gives the total number of missing values for each variable.

1. Remove missing values

Once you have located the missing values, remove them before analyzing the data, because arithmetic expressions and functions applied to data containing missing values evaluate to NA.

Some functions accept na.rm = TRUE, which removes the missing values before the calculation. The function complete.cases() identifies the observations that contain no missing data. It returns a logical vector with one element per row of the data frame; an element is TRUE if the corresponding row contains no NA values, so the number of FALSE elements is the number of observations with at least one missing value.
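As a small illustration of both points, here is a sketch on made-up values (the vector and data frame are hypothetical):

```r
# A missing value propagates through arithmetic unless it is removed first
x <- c(4, NA, 10)
mean_with_na    <- mean(x)               # NA, because x contains a missing value
mean_without_na <- mean(x, na.rm = TRUE) # mean of 4 and 10

# complete.cases() gives one logical value per row
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
ok           <- complete.cases(df) # TRUE only for rows without any NA
n_incomplete <- sum(!ok)           # rows with at least one NA

print(mean_with_na)
print(mean_without_na)
print(ok)
print(n_incomplete)
```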

The function na.omit() removes every observation (row) that contains missing data. Discarding all records with any missing value is drastic, however; it is usually more sensible to remove only the samples that have too many missing values.
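The difference between the two strategies can be sketched in base R. The data and the 50% threshold below are assumptions chosen only for illustration:

```r
# Toy data frame with hypothetical values
df <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(2, 5, NA, NA),
  c = c(7, 8, 9, NA),
  d = c(1, 1, 1, 1)
)

# Strategy 1: drop every row that has any NA
all_complete <- na.omit(df)

# Strategy 2: drop only rows where more than half of the columns are NA
na_frac         <- rowSums(is.na(df)) / ncol(df)
mostly_complete <- df[na_frac <= 0.5, ]

print(nrow(all_complete))
print(nrow(mostly_complete))
```

The first strategy keeps a single row here, while the second keeps three and leaves their remaining gaps to be filled by one of the methods described later.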

manyNAs(algae, 0.2), from the DMwR package, finds the rows of the dataset algae in which the number of missing values exceeds 20% of the columns; the second argument can also be given as an exact number of columns to use as the threshold.

2. Fill missing values with the most frequent value

Another way to handle records with missing values is to fill them with the most likely value. For variables whose distribution is approximately normal, the mean is a suitable choice; otherwise the median is generally used to represent the central tendency of the data.

Row 48 of algae has a missing value for the variable mxPH. The variable is approximately normally distributed, so the mean is used to fill the missing value:

> algae[48, "mxPH"] <- mean(algae$mxPH, na.rm = TRUE)

where mean() calculates the average of a numeric vector.

For the variable Chla, the median is used to fill all missing values in the column at once, rather than one row at a time as above:

> algae[is.na(algae$Chla), "Chla"] <- median(algae$Chla, na.rm = TRUE)
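Both fills can be checked on a small made-up data frame. In this sketch the columns ph and chla and their values are hypothetical stand-ins for mxPH and Chla:

```r
# Toy data frame with hypothetical values
df <- data.frame(
  ph   = c(7.0, 8.0, NA, 9.0),
  chla = c(1, NA, 3, 100)
)

# Fill a single known cell with the column mean
df[3, "ph"] <- mean(df$ph, na.rm = TRUE)

# Fill every missing cell of a column with the column median
df[is.na(df$chla), "chla"] <- median(df$chla, na.rm = TRUE)

print(df)
```

The chla column contains one very large value, which is why the median is the safer fill there: the mean of the observed values would be pulled far above the typical sample.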

The function centralImputation() fills all missing values in a dataset with a measure of central tendency:

> algae <- centralImputation(algae)
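The idea can be sketched in base R for readers who do not have the package installed. This is not the package's implementation: central_fill() and stat_mode() are hypothetical helpers, and the sketch assumes the median for numeric columns and the most frequent value for factor or character columns:

```r
# Most frequent non-missing value of a vector (table() ignores NA by default)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Fill each column's NAs with its median (numeric) or mode (factor/character)
central_fill <- function(data) {
  for (col in names(data)) {
    missing <- is.na(data[[col]])
    if (!any(missing)) next
    if (is.numeric(data[[col]])) {
      data[missing, col] <- median(data[[col]], na.rm = TRUE)
    } else {
      data[missing, col] <- stat_mode(data[[col]])
    }
  }
  data
}

# Toy data frame with hypothetical values
df <- data.frame(
  size = factor(c("small", "large", NA, "small")),
  cl   = c(10, NA, 30, 50)
)
filled <- central_fill(df)
print(filled)
```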

centralImputation() uses the median for numeric variables and the mode for nominal variables. This method is simple and fast and works on large datasets, but it may introduce substantial bias into the data.

3. Fill missing values using the correlations between variables

Exploring the correlations between variables gives estimates of the missing values with less bias. The function cor() produces the matrix of correlations between variables:

> cor(algae[, 4:18], use = "complete.obs")
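A small sketch of the same call on made-up data, followed by symnum() to show the matrix in symbolic form (the columns x, y, z and their values are hypothetical):

```r
# Toy data frame with hypothetical values; rows 3 and 5 contain an NA
df <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, NA, 1, 0)
)

# Only the rows without any NA (rows 1, 2 and 4) enter the calculation
cors <- cor(df, use = "complete.obs")

print(cors)
print(symnum(cors)) # symbols stand for ranges of absolute correlation
```

Without the use argument, cor() would return NA for every pair that involves a column with missing values.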

The argument use = "complete.obs" leaves out records that contain NA, and the function symnum() displays the correlation matrix in symbolic form. When two variables in the dataset have a correlation greater than 0.9, each can be used to fill the missing values of the other. Taking PO4 and oPO4 as an example, first find the linear relationship between the two variables:

> lm(PO4 ~ oPO4, data = algae)

The function lm() fits a linear model; here it gives PO4 = 42.897 + 1.293 * oPO4.

This linear relationship can be used to fill the missing value of sample 28 on the variable PO4:

> algae[28, "PO4"] <- 42.897 + 1.293 * algae[28, "oPO4"]

You can wrap this relationship in a function and apply it to all of the missing values.
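One way such a function might look is sketched below. The data are made up, the helper fill_po4() is hypothetical, and the coefficients are taken from a model fitted to the toy data rather than the 42.897 and 1.293 obtained from algae:

```r
# Toy data frame with hypothetical values standing in for algae
df <- data.frame(
  oPO4 = c(10, 20, 30, 40, 50),
  PO4  = c(25, 45, NA, 85, NA)
)

# lm() drops the rows with a missing PO4 when fitting
fit <- lm(PO4 ~ oPO4, data = df)
b   <- coef(fit) # b[[1]] is the intercept, b[[2]] the slope

# Predict PO4 from oPO4; stays NA when oPO4 itself is missing
fill_po4 <- function(opo4) {
  ifelse(is.na(opo4), NA_real_, b[[1]] + b[[2]] * opo4)
}

missing <- is.na(df$PO4)
df$PO4[missing] <- fill_po4(df$oPO4[missing])
print(df)
```

Returning NA when the predictor is also missing keeps the function from failing on rows where neither variable was observed.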
4. Fill missing values by exploring the similarity between cases

Another approach uses the similarity between rows (observations) to fill missing values. Similarity is usually defined through a metric on the multivariate space of the variables that describe the observations; the Euclidean distance between two cases can be informally defined as the square root of the sum of the squared differences between their values. You can use this metric to find the 10 water samples most similar to any case with missing values and use them to fill in those values.
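The procedure can be sketched in base R on made-up data. This is a simplified illustration, not a package implementation: it uses k = 3 instead of 10 because the toy data has only six rows, it does not scale the variables before measuring distance, and the helper euclid() is hypothetical:

```r
# Euclidean distance between two numeric vectors
euclid <- function(a, b) sqrt(sum((a - b)^2))

# Toy data frame with hypothetical values; row 6 is missing z
df <- data.frame(
  x = c(1.0, 1.1, 0.9, 5.0, 5.2, 1.05),
  y = c(2.0, 2.1, 1.9, 9.0, 9.1, 2.05),
  z = c(10, 12, 11, 50, 52, NA)
)

target     <- 6           # row to fill
k          <- 3           # number of neighbours to use
known_cols <- c("x", "y") # columns observed in the target row

# Distance from the target to every row that has z observed
candidates <- which(!is.na(df$z))
dists <- sapply(candidates, function(i) {
  euclid(unlist(df[i, known_cols]), unlist(df[target, known_cols]))
})

# Take the k closest rows and fill with the median of their z values
nearest <- candidates[order(dists)[seq_len(k)]]
df$z[target] <- median(df$z[nearest])
print(df)
```

In practice the variables are usually standardized first, since otherwise the variable with the largest numeric range dominates the distance.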

Apply the function knnImputation() to fill the missing values with the median of the nearest neighbors:

> algae <- knnImputation(algae, k = 10, meth = "median")

The argument meth is optional. Filling missing values through the similarity between cases has its own problems: irrelevant variables may distort the similarity, and the computation becomes very expensive on large datasets.

5. Replace missing values with the values of a similar sample

> donate <- nhanes2[which(apply(is.na(nhanes2), 1, sum) == 0), ]

# samples with no missing values

> accept <- nhanes2[which(apply(is.na(nhanes2), 1, sum) != 0), ]

# samples with missing values


To process the missing values of the second sample in accept, accept[2, ], look for complete samples similar to it and replace its missing values with the corresponding values of the sample found:

> sa <- donate[which(donate[, 1] == accept[2, 1] & donate[, 3] == accept[2, 3] & donate[, 4] == accept[2, 4]), ]
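The whole sequence, including the final replacement step, can be sketched on made-up data. The values are hypothetical, and taking the first matching donor is an assumption of this sketch; the text does not say how to choose among several matches:

```r
# Toy data frame with hypothetical values; row 2 is missing bmi
df <- data.frame(
  age = c("20-39", "20-39", "40-59", "20-39"),
  bmi = c(22.5, NA, 27.0, 30.1),
  hyp = c("no", "no", "yes", "yes")
)

donate <- df[complete.cases(df), ]  # samples with no missing values
accept <- df[!complete.cases(df), ] # samples with missing values

# Donors that agree with the first incomplete sample on its observed variables
sa <- donate[donate$age == accept$age[1] & donate$hyp == accept$hyp[1], ]

# Assumption: use the first matching donor, if there is one
if (nrow(sa) > 0) {
  accept$bmi[1] <- sa$bmi[1]
}
print(accept)
```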


If the dataset is very large, the data can be stratified by some variables and the missing values within each stratum filled with the stratum mean; that is, cold-deck imputation is used.
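Mean imputation within strata can be sketched with the base R function ave(), which applies a function within each group and returns a vector aligned with the original rows. The grouping variable and values below are hypothetical:

```r
# Toy data frame with hypothetical values; group is the stratifying variable
df <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, NA, 3, 10, 20, NA)
)

# Mean of the observed values within each stratum, repeated for every row
stratum_mean <- ave(df$value, df$group,
                    FUN = function(v) mean(v, na.rm = TRUE))

# Replace each missing value with the mean of its own stratum
missing <- is.na(df$value)
df$value[missing] <- stratum_mean[missing]
print(df)
```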


