RStudio: methods for dealing with missing values (R language)

Source: Internet
Author: User

1. Excluding cases with missing values (rows)

algae[!complete.cases(algae), ]  # find all cases (rows) with missing values in the algae data set
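As a small illustration of the row filter above, the sketch below applies complete.cases() to a made-up data frame. The name toy and its values are hypothetical and are not part of the algae data set.

```r
# Hypothetical toy data; not the algae data set.
toy <- data.frame(
  mxPH = c(8.0, NA, 7.9, 8.1),
  Chla = c(1.2, 3.4, NA, 0.5)
)

# complete.cases() returns TRUE for rows without any NA,
# so negating it selects the rows that contain a missing value.
incomplete <- toy[!complete.cases(toy), ]

print(incomplete)        # the rows that have at least one NA
print(nrow(incomplete))  # how many such rows there are
```

Counting the rows this way before deleting anything is a cheap check on how much data an exclusion step would cost.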

There are two ways to exclude cases: remove every case that has any missing value, or remove only the cases that have many missing values.

(1) Delete all cases with missing values from the algae data set: algae <- na.omit(algae)

(2) Delete only the cases with many missing values from the algae data set

Step 1: manynas <- manyNAs(algae, 0.2)

# returns the row numbers of the cases in the algae data set that have many missing values (manyNAs() is provided by the DMwR package). The 0.2 means that a case is flagged when its missing attributes make up more than 20% of all attributes; this is the default value, and users can set it according to their own needs.

Step 2: algae1 <- algae[-manynas, ]

# deletes the cases with many missing values from the algae data set and stores the result in algae1.
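The two exclusion strategies above can be sketched in base R on a made-up data frame. The names toy, na_share, many_na and toy1 are hypothetical, and the strict "more than 20%" comparison is an assumption about how the threshold is applied; this is not the DMwR implementation of manyNAs().

```r
# Hypothetical toy data with 5 attributes; not the algae data set.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(2, NA, NA, 5),
  c = c(3, 6, 7, 8),
  d = c(4, 7, 8, NA),
  e = c(5, 8, 9, 10)
)

# Option 1: drop every row that has at least one NA.
no_na <- na.omit(toy)

# Option 2: drop only rows whose share of missing attributes exceeds 20%
# (assumed strict inequality).
na_share <- rowSums(is.na(toy)) / ncol(toy)
many_na  <- which(na_share > 0.2)

# Guard against an empty index: toy[-integer(0), ] would return zero rows.
toy1 <- if (length(many_na) > 0) toy[-many_na, ] else toy

print(nrow(no_na))
print(many_na)
print(nrow(toy1))
```

The guard on the last assignment matters in practice: negative indexing with an empty vector silently drops every row instead of none.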

2. Fill in cases with missing values

(1) Fill with a measure of central tendency

The simplest way to fill in missing data is to use a value that represents the central tendency of the attribute, such as the mean or the median; which one to choose depends on the circumstances. For data that follow a normal distribution, the mean is the best choice. For skewed distributions or distributions with outliers, the median is a better indicator of the center of the data.

algae[48, "mxPH"] <- mean(algae$mxPH, na.rm = TRUE)  # fills the mxPH value of the 48th case in the algae data set with the mean of the mxPH attribute, computed with missing values ignored (fills a single missing value)

algae[is.na(algae$Chla), "Chla"] <- median(algae$Chla, na.rm = TRUE)  # fills every missing value of the Chla attribute in the algae data set with the median of Chla (fills a whole column of missing values)
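The two fills above can be tried on a made-up data frame. The name toy and its values are hypothetical; the Chla column deliberately contains an outlier (100) to show why the median is the safer choice there.

```r
# Hypothetical toy data; not the algae data set.
toy <- data.frame(
  mxPH = c(8, NA, 7, 9, 8),
  Chla = c(1, 2, NA, 100, NA)
)

# Fill one known missing cell with the column mean (NAs ignored).
toy[2, "mxPH"] <- mean(toy$mxPH, na.rm = TRUE)

# Fill every missing cell of a column with the column median (NAs ignored).
# The median of 1, 2, 100 is 2; the mean would be pulled up by the outlier.
toy[is.na(toy$Chla), "Chla"] <- median(toy$Chla, na.rm = TRUE)

print(toy)
```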

Note: For ways to test in R whether data follow a normal distribution, see "Use R to detect whether the data conforms to the normal distribution" (http://blog.csdn.net/S_gy_Zetrov/article/details/69488483). This article gives only the Shapiro-Wilk test (also known as the W test). The steps are:

Step 1: algae <- na.omit(algae)  # deletes the cases with missing values from the algae data set

Step 2: shapiro.test(algae$mxPH)  # tests whether the mxPH attribute of the algae data set follows a normal distribution. The result is shown in Figure 1. Compare the p-value in the result with α (typically 0.05): if p-value > α, normality is not rejected and the data can be treated as normally distributed; otherwise they cannot.

Figure 1 Shapiro-Wilk test result
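The decision rule above can be sketched on two deterministic, made-up samples. The names x_normal, x_skewed and looks_normal are hypothetical; the samples are built from normal quantiles so that the example gives the same result on every run.

```r
alpha <- 0.05

# Hypothetical samples: evenly spaced normal quantiles, and a
# right-skewed (log-normal shaped) transformation of them.
x_normal <- qnorm(ppoints(50))
x_skewed <- exp(x_normal)

res_normal <- shapiro.test(x_normal)
res_skewed <- shapiro.test(x_skewed)

# p-value > alpha: normality is not rejected.
# p-value <= alpha: normality is rejected.
looks_normal <- function(res, alpha) res$p.value > alpha

print(looks_normal(res_normal, alpha))
print(looks_normal(res_skewed, alpha))
```

A p-value above α only means the test found no evidence against normality; with small samples it is worth pairing the test with a histogram or a Q-Q plot.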

(2) Fill in according to the correlation between variables

Step 1: cor(algae[, 4:18], use = "complete.obs")

# calculates the correlations among columns 4 through 18 of the algae data set; the argument use = "complete.obs" tells R to ignore records containing NA when calculating the correlations.

Step 2: symnum(cor(algae[, 4:18], use = "complete.obs"))

# converts the numeric result into the symbolic form shown in Figure 2. Figure 2 shows that PO4 and oPO4 are strongly correlated, with a correlation coefficient between 0.9 and 0.95.

Figure 2 Correlations among columns 4 to 18 of the algae data set
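Steps 1 and 2 can be tried on a made-up data frame with a few missing cells. The name toy and its values are hypothetical; only the column names oPO4, PO4 and Cl are borrowed from the algae data set.

```r
# Hypothetical toy data; not the algae data set.
toy <- data.frame(
  oPO4 = c(10, 20, 30, 40, NA, 60),
  PO4  = c(25, 41, 62, 79, 100, NA),
  Cl   = c(5, 3, 8, 1, 7, 2)
)

# use = "complete.obs" drops every row that contains an NA
# before the correlations are computed (rows 5 and 6 here).
cm <- cor(toy, use = "complete.obs")

print(round(cm, 2))  # numeric correlation matrix
print(symnum(cm))    # symbolic version; the legend is printed below it
```

On wide data sets the symbolic matrix makes strongly correlated pairs easier to spot than a table of decimals.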

Step 3: lm(PO4 ~ oPO4, data = algae)

# fits a linear model of PO4 on oPO4. As Figure 3 shows, the fitted relationship is: PO4 = 45.602 + 1.278 × oPO4

Figure 3 Linear relationship between PO4 and oPO4

Step 4: Using the linear relationship between PO4 and oPO4, fill in the missing PO4 values from oPO4.

> algae <- algae[-manyNAs(algae), ]  # delete cases with many missing values from the algae data set
> fillPO4 <- function(oP) {
+   if (is.na(oP))
+     return(NA)
+   else return(45.602 + 1.278 * oP)  # coefficients taken from the fitted model in Step 3
+ }
> algae[is.na(algae$PO4), "PO4"] <- sapply(algae[is.na(algae$PO4), "oPO4"], fillPO4)
> algae
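As an alternative sketch of Step 4, the coefficients can be taken from the fitted model with predict() instead of being typed in by hand. The data frame toy, its values and the name missing_po4 are hypothetical; the toy PO4 values were constructed to lie exactly on PO4 = 45 + 1.3 × oPO4 so the result is easy to check.

```r
# Hypothetical toy data; not the algae data set.
toy <- data.frame(
  oPO4 = c(10, 20, 30, 40, 50, NA),
  PO4  = c(58, 71, 84, 97, NA, NA)
)

# lm() drops rows with NA by default, so the fit uses rows 1 to 4.
fit <- lm(PO4 ~ oPO4, data = toy)

# Predict PO4 only where it is missing. predict() returns NA
# when oPO4 is missing too, so such rows stay unfilled.
missing_po4 <- is.na(toy$PO4)
toy$PO4[missing_po4] <- predict(fit, newdata = toy[missing_po4, ])

print(coef(fit))
print(toy)
```

Reading the coefficients from the model object avoids copy errors and keeps the fill in sync if the model is refitted on different rows.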

(3) Fill in based on the similarity between cases

The function knnImputation() (provided by the DMwR package) uses the similarity between cases (rows) to fill in missing values. For each case with missing values it finds the k nearest neighbors, following the kNN algorithm, and fills each missing value with a statistic (such as the mean, median, or mode) computed from the neighbors' values. It is used as follows:

algae <- knnImputation(algae, k = 10, meth = "median")  # selects the 10 nearest-neighbor cases and fills each missing value with the median of those 10 cases
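To make the idea concrete, here is a minimal base-R sketch of nearest-neighbor median filling for numeric columns. It is not the DMwR implementation of knnImputation(): the function knn_median_fill, the toy data, the use of Euclidean distance on standardized columns, and the restriction to complete rows as neighbors are all assumptions made for illustration.

```r
# Hypothetical numeric toy data; not the algae data set.
toy <- data.frame(
  x = c(1.0, 1.1, 0.9, 5.0, 5.2, 1.05),
  y = c(2.0, 2.1, 1.9, 9.0, 9.1, 2.05),
  z = c(10, 11, 12, 50, 52, NA)
)

# Fill each NA with the median of that column among the k nearest rows.
# Distances are Euclidean, computed on standardized columns that are
# observed in the target row; only complete rows serve as neighbors.
knn_median_fill <- function(df, k) {
  scaled <- scale(df)                     # center and scale; NAs are skipped
  donors <- which(complete.cases(df))     # candidate neighbor rows
  stopifnot(length(donors) >= k)
  out <- df
  for (i in which(!complete.cases(df))) {
    obs <- which(!is.na(df[i, ]))         # columns observed in row i
    diffs <- sweep(scaled[donors, obs, drop = FALSE], 2, scaled[i, obs])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- donors[order(dists)[seq_len(k)]]
    for (j in which(is.na(df[i, ]))) {
      out[i, j] <- median(df[nearest, j]) # median of the neighbors' values
    }
  }
  out
}

filled <- knn_median_fill(toy, k = 3)
print(filled)
```

In the toy data, row 6 sits next to rows 1 to 3 in x and y, so its missing z is filled from those three rows rather than from the distant rows 4 and 5.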

Note: I compiled these notes first to understand the problem in depth, and second to make the material easier to look up later. I would be glad if they help answer your questions.
This article references http://www.cnblogs.com/cloudtj/articles/5508367.html


