RStudio: methods for dealing with missing values

Last Update: 2018-08-23 Source: Internet

Author: User

1. Excluding cases with missing values (rows)

algae[!complete.cases(algae), ]  # find all cases with missing values in the algae data set
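As a minimal sketch of the same lookup, the snippet below runs on a small made-up data frame. The name toy and its values are hypothetical stand-ins for the algae data set, chosen only so the example runs without it.

```r
# Hypothetical toy data frame standing in for the algae data set.
toy <- data.frame(
  mxPH = c(8.0, NA, 7.9, 8.1),
  Chla = c(1.2, 3.4, NA, 0.5),
  PO4  = c(170, 558, 187, 138)
)

# complete.cases() returns TRUE for rows without any NA,
# so negating it selects the rows that have at least one NA.
incomplete <- toy[!complete.cases(toy), ]
print(incomplete)
```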

There are two ways to remove cases: drop every case that has any missing value, or drop only the cases that have many missing values.

(1) Delete all cases with missing values in the algae data set: algae <- na.omit(algae)

(2) Delete only the cases in the algae data set that have many missing values

Step 1: manynas <- manyNAs(algae, 0.2)

# Returns the row numbers of the cases in the algae data set that have many missing values (manyNAs() comes from the DMwR package, not base R). Here 0.2 means a case is flagged when its missing attributes exceed 20% of all attributes; this is the default, and you can change it to suit your needs.

Step 2: algae1 <- algae[-manynas, ]

# Deletes the cases with many missing values from the algae data set and stores the result in algae1.
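The two removal strategies can be compared side by side in base R. This is only a sketch: toy is made-up data, and the rowMeans() threshold is a stand-in for the idea behind manyNAs(), not that function's actual implementation.

```r
# Hypothetical toy data with six attributes (not the real algae data).
toy <- data.frame(
  v1 = c(1, NA, 3, 4),
  v2 = c(5, NA, 7, 8),
  v3 = c(9, 10, NA, 12),
  v4 = c(1, 2, 3, 4),
  v5 = c(5, 6, 7, 8),
  v6 = c(9, 10, 11, 12)
)

# Strategy 1: drop every row that has at least one NA.
strict <- na.omit(toy)

# Strategy 2: drop only rows whose share of missing attributes exceeds 20%.
threshold <- 0.2
na_share <- rowMeans(is.na(toy))         # fraction of NA values per row
many_nas <- which(na_share > threshold)  # row 2 has 2 of 6 values missing
lenient <- toy[na_share <= threshold, ]  # keep rows at or below the threshold

print(nrow(strict))
print(many_nas)
print(nrow(lenient))
```

The sketch keeps rows with a logical index rather than a negative one because indexing with an empty negative vector, as in toy[-integer(0), ], returns zero rows instead of all rows.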

2. Fill in cases with missing values

(1) Fill with a measure of central tendency

The simplest way to fill in missing data is to use a value that represents the central tendency of the variable, such as the mean or the median; which one to choose depends on the distribution. For data that are approximately normally distributed, the mean is the best choice. For skewed distributions or distributions with outliers, the median is a better indicator of the centre of the data.

algae[48, "mxPH"] <- mean(algae$mxPH, na.rm = TRUE)  # fills the mxPH value of the 48th case in the algae data set with the mean of mxPH, computed with missing records dropped (fills a single missing value)

algae[is.na(algae$Chla), "Chla"] <- median(algae$Chla, na.rm = TRUE)  # fills every missing Chla value in the algae data set with the median of Chla (fills a whole column of missing values)
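The same two fills can be tried on a small made-up data frame. The values in toy are hypothetical; the Chla column includes one large value to show why the median suits skewed data.

```r
# Hypothetical toy data: mxPH is roughly symmetric, Chla has an outlier.
toy <- data.frame(
  mxPH = c(8.0, 7.8, NA, 8.2),
  Chla = c(1.0, NA, 3.0, 50.0)
)

# Fill one known missing cell with the column mean (NAs dropped).
toy[3, "mxPH"] <- mean(toy$mxPH, na.rm = TRUE)

# Fill every missing Chla value with the column median (NAs dropped).
toy[is.na(toy$Chla), "Chla"] <- median(toy$Chla, na.rm = TRUE)

print(toy)
```

Here the median of Chla is 3, whereas the mean of the observed values would be 18, pulled up by the single value of 50.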

Note: For ways to test in R whether data follow a normal distribution, see "Use R to detect whether the data conforms to the normal distribution" (http://blog.csdn.net/S_gy_Zetrov/article/details/69488483). This article covers only the Shapiro-Wilk test (also known as the W test). The steps are:

Step 1: algae <- na.omit(algae)  # deletes the cases with missing values from the algae data set

Step 2: shapiro.test(algae$mxPH)  # tests whether the mxPH attribute of the algae data set follows a normal distribution. The result is shown in Fig. 1. Compare the p-value in the result with α (typically 0.05): if p-value > α, normality is not rejected and the data are treated as normally distributed; otherwise the data are treated as not normally distributed.

Fig. 1 Shapiro-Wilk test
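A short sketch of the decision rule above, using simulated data instead of the algae data set. The sample x, the seed, and the variable names are assumptions made for illustration.

```r
# Simulated sample (hypothetical), drawn from a normal distribution.
set.seed(1)
x <- rnorm(100, mean = 8, sd = 0.5)

# shapiro.test() returns a list; the p-value is in the p.value element.
result <- shapiro.test(x)
alpha <- 0.05

# TRUE means normality is not rejected at level alpha.
normality_not_rejected <- result$p.value > alpha

print(result$p.value)
print(normality_not_rejected)
```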

(2) Fill in according to the correlation between variables

Step 1: cor(algae[, 4:18], use = "complete.obs")

# Calculates the correlations among columns 4 to 18 of the algae data set. The argument use = "complete.obs" makes R ignore records that contain NA when computing the correlations.

Step 2: symnum(cor(algae[, 4:18], use = "complete.obs"))

# Converts the numeric result into the symbolic form shown in Fig. 2. Fig. 2 shows that PO4 and oPO4 are strongly correlated, with a correlation coefficient between 0.9 and 0.95.

Fig. 2 Correlations among columns 4 to 18 of the algae data set

Step 3: lm(PO4 ~ oPO4, data = algae)

# Fits a linear relationship between PO4 and oPO4. As Fig. 3 shows, the fitted line is PO4 = 45.602 + 1.278 × oPO4

Fig. 3 Linear relationship between PO4 and oPO4

Step 4: Use the linear relationship between PO4 and oPO4 to fill in missing PO4 values from oPO4.

> algae <- algae[-manyNAs(algae), ]  # delete the cases with many missing values from the algae data set

> fillPO4 <- function(oP) {

+   if (is.na(oP))

+     return(NA)

+   else return(45.602 + 1.278 * oP)

+ }

> algae[is.na(algae$PO4), "PO4"] <- sapply(algae[is.na(algae$PO4), "oPO4"], fillPO4)

> algae
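The four steps can also be sketched end to end on made-up data. The toy values are hypothetical, and the sketch uses predict() on the fitted model rather than hard-coded coefficients; that is a substitution for illustration, not the approach shown in the console session above.

```r
# Hypothetical toy data: PO4 is missing in rows 3 and 6.
toy <- data.frame(
  oPO4 = c(10, 20, 30, 40, 50, 60),
  PO4  = c(62, 69, NA, 91, 99, NA)
)

# Steps 1-2: correlation on complete rows only, then the symbolic view.
cm <- cor(toy, use = "complete.obs")
print(symnum(cm))

# Step 3: fit the line; lm() drops rows with NA in PO4 or oPO4 by default.
fit <- lm(PO4 ~ oPO4, data = toy)

# Step 4: predict PO4 only where it is missing and oPO4 is known.
filled <- toy
to_fill <- is.na(filled$PO4) & !is.na(filled$oPO4)
filled$PO4[to_fill] <- predict(fit, newdata = filled[to_fill, , drop = FALSE])

print(filled)
```

Reading the coefficients from the fitted model keeps the fill consistent with the data if rows are later added or removed.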

(3) Fill in based on the similarity between cases

The commonly used function knnImputation() (from the DMwR package) uses the similarity between cases (rows) to fill in missing values. For each case with missing values it finds the k nearest neighbours according to the KNN algorithm, then fills each missing value with a summary (usually the mean, median, or mode) of that attribute in the neighbouring cases. Use it as follows:

algae <- knnImputation(algae, k = 10, meth = "median")  # selects the 10 nearest-neighbour cases and fills the missing values with the median of those 10 cases
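To make the idea concrete, here is a simplified base R sketch of nearest-neighbour filling. It is not the implementation of knnImputation(): the helper knn_median_fill() is hypothetical, handles numeric columns only, and takes neighbours only from rows that have no missing values.

```r
# Hypothetical helper: fill NAs in numeric columns with the median of the
# k nearest complete rows (Euclidean distance on scaled, observed columns).
knn_median_fill <- function(df, k = 3) {
  scaled <- scale(df)                      # centre and scale; NAs are skipped
  complete <- which(complete.cases(df))    # candidate neighbours
  out <- df
  for (i in which(!complete.cases(df))) {
    known <- as.vector(!is.na(df[i, ]))    # columns observed in row i
    # Distance from row i to each complete row, using observed columns only.
    diffs <- sweep(scaled[complete, known, drop = FALSE], 2, scaled[i, known])
    d <- sqrt(rowSums(diffs^2))
    nearest <- complete[order(d)[seq_len(min(k, length(complete)))]]
    for (j in which(!known)) {
      out[i, j] <- median(df[nearest, j])  # median of the neighbours' values
    }
  }
  out
}

# Hypothetical toy data: row 6 is missing y and sits close to rows 1 to 3.
toy <- data.frame(
  x = c(1, 2, 3, 10, 11, 2),
  y = c(1, 2, 3, 10, 11, NA)
)
filled <- knn_median_fill(toy, k = 3)
print(filled)
```

In this toy data the three nearest neighbours of row 6 are rows 1 to 3, so the missing y is filled with their median rather than being pulled towards the distant rows 4 and 5.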

Note: I compiled these points first to understand the problem in depth, and second to make them easy to look up later. I hope they help resolve your questions.
This article references http://www.cnblogs.com/cloudtj/articles/5508367.html
