"Data Analysis R Language Practice" study notes: Chapter 3, data preprocessing (part 2)

Last Update:2015-05-15 Source: Internet

Author: User

3.3 Missing value processing

Missing values in R are represented by NA. Two functions test whether data contain missing values. The most basic is is.na(), which can be applied to many kinds of objects, such as vectors and data frames, and returns logical values.

> attach(data)
The following objects are masked from data (pos = 3):

    city, price, salary

> data$salary = replace(salary, salary > 5, NA)
> is.na(salary)
 [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> sum(is.na(salary))
[1] 4

The other function is complete.cases(). It also returns a logical vector, but with the opposite meaning to is.na(): FALSE for a missing value and TRUE for normal data, which makes it convenient for selecting the rows that have no missing data.

> complete.cases(data$salary)
 [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
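As a small self-contained illustration of the two functions, here is a sketch on a toy data frame. The data frame `df` and its values are made up for this example; only the column names follow the notes.

```r
# Hypothetical toy data; the values are invented for illustration.
df <- data.frame(
  city   = c("BJ", "SH", "GZ", "AB"),
  price  = c(2, NA, 5, 1),
  salary = c(4, 3, NA, NA),
  stringsAsFactors = FALSE
)

na_salary <- is.na(df$salary)            # TRUE where salary is missing
ok_salary <- complete.cases(df$salary)   # TRUE where salary is present
n_missing <- sum(na_salary)              # TRUE counts as 1, so this counts the NAs

# Applied to a whole data frame, complete.cases() flags rows with no NA at all
ok_rows <- complete.cases(df)
df[ok_rows, ]
```

Because TRUE counts as 1, sum(is.na(x)) is a quick way to count missing values in a vector.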

3.3.2 Identifying the missing-data pattern

When data are missing, we need to examine the pattern of missingness and judge whether the values are missing at random before choosing how to handle them.

The package mice performs multivariate imputation by chained equations. It can handle missing data in variables of mixed types and automatically selects predictor variables for each variable being imputed, which makes it an important tool for dealing with missing values.

> library(mice)
> data$price = replace(price, price > 5, NA)
> md.pattern(data)
  price salary city
5     1      1    0  1
3     0      1    0  2
4     1      0    0  2
      3      4   12 19

In the output, "1" means the value is present and "0" means it is missing. The first column gives the number of samples that share each pattern: "5" means 5 samples have both price and salary observed, "3" means 3 samples are missing price, and "4" means 4 samples are missing salary. The last column counts the variables missing in each pattern, and the last row gives the total number of missing values for each variable. The city column shows 0 in every row of this output, so it is included in those counts.
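To make the pattern table less of a black box, the same kind of summary can be approximated in base R. This is only a rough sketch on made-up data, not a reimplementation of md.pattern(); the data frame `df` is hypothetical.

```r
# Hypothetical toy data; values are made up for illustration.
df <- data.frame(
  price  = c(2, NA, 5, 1, 3),
  salary = c(4, 3, NA, NA, 2)
)

observed <- !is.na(df)   # logical matrix: TRUE = observed, FALSE = missing

# Encode each row as a string of 1s and 0s, one digit per variable
pattern <- apply(observed, 1, function(r) paste(as.integer(r), collapse = ""))

pattern_counts  <- table(pattern)       # how many rows share each pattern
missing_per_var <- colSums(is.na(df))   # number of NAs in each variable

pattern_counts
missing_per_var
```

The counts per pattern correspond to the first column of the md.pattern() table, and the per-variable totals correspond to its last row.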

The package VIM provides tools for exploring missing data in R, including visualization of missing patterns.

> library(VIM)
> aggr(data)

The first plot shows the amount of missing data for each variable as the height of a bar.

The second plot shows the combined missing pattern, which can be compared with the output of md.pattern(): light boxes represent observed data and dark boxes represent missing values. The height of the colored box at the bottom reflects the frequency of the corresponding combination.

3.3.3 Processing missing data

(1) Delete missing samples

Deleting samples with missing values is the simplest approach. It is appropriate when the proportion of missing data is small and the data are missing at random, so that deletion has little effect on the results. In R, complete.cases() selects the complete records, and rows containing missing values are dropped.

> data1 = data[complete.cases(data$salary), ]
> dim(data1)
[1] 8 3

> data2 = data[!is.na(salary), ]
> dim(data2)
[1] 8 3

When several variables have missing values and you want to drop every row that contains any of them, use the na.omit() function.

> data3 = na.omit(data)
> dim(data3)
[1] 5 3
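The difference between deleting on one variable and deleting on all variables can be seen in a short sketch. The data frame `df` below is hypothetical and its values are invented.

```r
# Hypothetical toy data; values are made up for illustration.
df <- data.frame(
  price  = c(2, NA, 5, 1, 3),
  salary = c(4, 3, NA, NA, 2)
)

by_salary <- df[!is.na(df$salary), ]   # drop rows only where salary is missing
by_all    <- na.omit(df)               # drop rows with an NA in any column

dim(by_salary)
dim(by_all)
```

Deleting on all variables at once can discard far more rows than deleting on a single variable, so it is worth comparing the two dimensions before committing to either.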

(2) Replace missing values

> data[is.na(data)] = mean(salary[!is.na(salary)])
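Note that the one-liner above fills every NA in the data frame with the mean of salary, including NAs in other columns. If each variable should instead be filled with its own mean, one possible sketch is the loop below. The data frame `df` is hypothetical, and this is just one simple way to do it.

```r
# Hypothetical toy data; values are made up for illustration.
df <- data.frame(
  price  = c(2, NA, 5, 1),
  salary = c(4, 3, NA, NA)
)

imputed <- df
for (col in names(imputed)) {
  # Mean of the observed values in this column only
  col_mean <- mean(imputed[[col]], na.rm = TRUE)
  imputed[[col]][is.na(imputed[[col]])] <- col_mean
}

imputed
```

Mean replacement keeps the column mean unchanged but shrinks its variance, which is one reason the multiple imputation approach described next is often preferred.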

(3) Multiple imputation

Multiple imputation is a method for filling in missing values in complex data. It predicts the missing data from the relationships between variables, uses Monte Carlo simulation to generate several complete datasets, analyzes each of these datasets separately, and finally pools the results of those analyses. FCS (fully conditional specification) is an imputation method based on chained equations and is therefore also called MICE (multiple imputation by chained equations). Its essential difference from other multiple imputation algorithms is that it does not need to consider the joint distribution of the imputed variables and the covariates; instead it imputes each variable in turn from its own conditional distribution. In R this method is implemented by the function mice() in the package mice, which simulates multiple complete datasets and stores them in imp; a linear regression is then fitted on imp, and finally the regression results are combined with the pool() function.
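The impute, analyze, pool workflow described here might look like the following sketch. It assumes the mice package is installed and uses R's built-in airquality data rather than the notes' own data set; the regression formula, the number of imputations, and the seed are arbitrary choices for illustration.

```r
# Sketch only: runs when the mice package is available.
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Step 1: generate m = 5 completed data sets (airquality has NAs in Ozone and Solar.R)
  imp <- mice(airquality, m = 5, seed = 123, printFlag = FALSE)

  # Step 2: fit the same linear regression on each completed data set
  fit <- with(imp, lm(Ozone ~ Wind + Temp))

  # Step 3: pool the five sets of regression results into one
  pooled <- pool(fit)
  print(summary(pooled))
}
```

A single completed data set can be extracted with complete(imp, 1) if you want to inspect the imputed values directly.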

3.4 Data Collation

3.4.1 Data Merge

(1) Functions cbind() and rbind()

> a = c("HK", 12, 10)
> data1 = rbind(data, a)
> data1
   city price salary
...
12   QA     6      5
13   HK    12     10
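One thing to watch for with this pattern: c("HK", 12, 10) is a character vector, so binding it onto a data frame can turn numeric columns into character columns. A sketch of a safer variant, binding a one-row data frame instead, is shown below. The data frame `df` and its values are hypothetical.

```r
# Hypothetical toy data; values are made up for illustration.
df <- data.frame(
  city   = c("BJ", "SH"),
  price  = c(2, 3),
  salary = c(4, 3),
  stringsAsFactors = FALSE
)

# Binding a character vector as a row coerces the numeric columns
coerced <- rbind(df, c("HK", 12, 10))
is.character(coerced$price)

# Binding a one-row data frame keeps the column types
new_row <- data.frame(city = "HK", price = 12, salary = 10,
                      stringsAsFactors = FALSE)
df_rows <- rbind(df, new_row)
is.numeric(df_rows$price)

# cbind() adds columns instead of rows
df_cols <- cbind(df, index = 1:2)
names(df_cols)
```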

(2) Constructing a data frame with data.frame()

The simplest way to reshape data is to break it down into vectors and then build other types of objects from those vectors as needed. Objects with a similar structure, such as vectors (numeric, character, logical), factors, numeric matrices, lists, or other data frames, can be combined into a single data frame.

> weight = c(150, 135, 210, 140)
> height = c(65, 61, 70, 65)
> gender = c("F", "F", "M", "F")
> stu = data.frame(weight, height, gender)
> stu
  weight height gender
1    150     65      F
2    135     61      F
3    210     70      M
4    140     65      F

When the vectors are combined, the variable names become the column names of the new data frame; they can be reassigned with names(). Row names can be set with row.names().

> row.names(stu) = c("Alice", "Bob", "Cal", "David")
> stu
      weight height gender
Alice    150     65      F
Bob      135     61      F
Cal      210     70      M
David    140     65      F

(3) Function merge()

Two datasets in R can be merged with the dedicated function merge(). It matches on shared columns or row names and merges two data frames or lists. Its call format is:

merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x", ".y"),
      incomparables = NULL, ...)

x, y: the data sets to merge

by: the columns (or row names) to merge on

by.x, by.y: the columns to merge on in the first and second data frame, respectively

all, all.x, all.y: logical values, FALSE by default; when TRUE, rows without a match in the other data set are kept and filled with NA

> index = list("city" = data$city, "index" = 1:12)
> index
$city
 [1] "BJ" "SH" "GZ" "AB" "CD" "AS" "AC" "FA" "FF" "EE" "ER" "QA"

$index
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

> data.index = merge(data, index, by = "city")
> data.index
   city price salary index
...
6    EE     3      4    10
7    ER     5      3    11
8    FA     6      1     8
9    FF     1      2     9
10   GZ     5     NA     3
11   QA     6      5    12
12   SH     3      4     2
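The effect of the all, all.x, and all.y arguments is easiest to see when the two data sets do not contain exactly the same keys. The sketch below uses two small hypothetical data frames, `prices` and `ranks`, with invented values.

```r
# Hypothetical toy data; values are made up for illustration.
prices <- data.frame(city = c("BJ", "SH", "GZ"), price = c(2, 3, 5),
                     stringsAsFactors = FALSE)
ranks  <- data.frame(city = c("SH", "GZ", "HK"), index = c(2, 3, 13),
                     stringsAsFactors = FALSE)

# Default (all = FALSE): keep only cities present in both data frames
inner <- merge(prices, ranks, by = "city")

# all.x = TRUE: keep every row of prices; unmatched index values become NA
left <- merge(prices, ranks, by = "city", all.x = TRUE)

# all = TRUE: keep every city from either data frame
full <- merge(prices, ranks, by = "city", all = TRUE)

inner
left
full
```

With the default sort = TRUE the result is ordered by the key column, which is why the merged rows above no longer follow the original row order.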

3.4.2 Selecting a subset of data

In R, select a subset of data with square brackets []. To select rows of a data frame, put the condition before the comma.

> data[data$salary > 6, ]
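When the column used in the condition contains NA, bracket subsetting returns an extra all-NA row for each missing value. The sketch below, on a hypothetical data frame `df` with invented values, shows this and two common ways around it.

```r
# Hypothetical toy data; values are made up for illustration.
df <- data.frame(
  city   = c("BJ", "SH", "GZ", "AB"),
  salary = c(4, 8, NA, 7),
  stringsAsFactors = FALSE
)

# The condition is FALSE, TRUE, NA, TRUE: the NA produces a row of NAs
high <- df[df$salary > 6, ]

# which() keeps only the positions where the condition is TRUE
high_clean <- df[which(df$salary > 6), ]

# subset() also drops rows where the condition is NA
high_subset <- subset(df, salary > 6, select = c(city, salary))

nrow(high)
high_clean
```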

3.4.3 Data Sorting

The function sort() in R can only sort vectors. For a dataset with several variables, use order() instead, which is called as follows:

order(..., na.last = TRUE, decreasing = FALSE)

> order.price = order(data$price)

> sort.list(data$price)

order() returns the original positions of the elements in sorted order. A closely related function is rank(), which returns the rank of each element within the whole vector, that is, its position in the ordering by size.

> rank(data$price)
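The difference between sort(), order(), and rank() is easiest to see on a short vector. The vector and data frame below are hypothetical, with invented values.

```r
# Hypothetical toy data; values are made up for illustration.
price <- c(3, 1, 5, 2)

sorted <- sort(price)    # the values themselves, in increasing order
ord    <- order(price)   # positions of the values in sorted order
rk     <- rank(price)    # rank of each value in its original position

# order() is what you use to sort a whole data frame by one column
df <- data.frame(city = c("BJ", "SH", "GZ", "AB"), price = price,
                 stringsAsFactors = FALSE)
by_price      <- df[order(df$price), ]
by_price_desc <- df[order(df$price, decreasing = TRUE), ]

ord
rk
by_price
```

Here price[ord] gives the same result as sort(price), which is a handy way to remember what order() returns.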

3.5 Converting between long and wide formats

> t(data)

3.5.1 Functions for reshaping data

R has two functions for reshaping data, stack() and unstack(), which convert between long and wide formats.

stack() converts a data frame into two columns: one holding the values and one holding the name of the column each value came from.

unstack() is the inverse of stack(): it takes an object with two columns and splits the data column into separate columns according to the levels of the factor column.
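A minimal round trip through stack() and unstack() looks like this. The data frame `wide` is hypothetical, with invented values.

```r
# Hypothetical toy data; values are made up for illustration.
wide <- data.frame(a = c(1, 2), b = c(3, 4))

# Wide to long: column "values" holds the data, "ind" holds the source column name
long <- stack(wide)

# Long to wide: split "values" into one column per level of "ind"
back <- unstack(long)

long
back
```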

3.5.2 The best companion for reshaping data

The package reshape2 is a rewrite of reshape, designed for changing the shape of datasets. Users generally need only melt(), acast(), and dcast() to reshape data into all kinds of forms.

melt literally means to dissolve or break down. Applied to a dataset, it splits the data apart; its input can be an array, a data frame, or a list.

> library(reshape2)
> data(airquality)
> str(airquality)
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
> longdata = melt(airquality, id.vars = c("Ozone", "Month", "Day"), measure.vars = 2:4)
> str(longdata)
'data.frame':	459 obs. of  5 variables:
 $ Ozone   : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Month   : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ variable: Factor w/ 3 levels "Solar.R","Wind",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  190 118 149 313 NA NA 299 99 19 194 ...

Use ggplot2 to display the values across several dimensions in a single plot.

> library(ggplot2)
> p = ggplot(data = longdata, aes(x = Ozone, y = value, color = factor(Month)))
> p + geom_point(shape = 20, size = 4) + facet_wrap(~variable, scales = "free_y") + geom_smooth(aes(group = 1), fill = "gray80")

Like stack(), melt() has counterparts that restore the data: acast() returns arrays and dcast() returns data frames. Their formula argument is a formula: each variable on the left-hand side becomes a column in the new dataset, while each level of the variable on the right-hand side becomes its own column, converting the long-format data to wide format.
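A small melt() and dcast() round trip shows how the formula maps back to columns. This sketch assumes the reshape2 package is installed; the data frame `wide` is hypothetical, with invented values.

```r
# Sketch only: runs when the reshape2 package is available.
if (requireNamespace("reshape2", quietly = TRUE)) {
  library(reshape2)

  # Hypothetical toy data; values are made up for illustration.
  wide <- data.frame(id = 1:3,
                     height = c(65, 61, 70),
                     weight = c(150, 135, 210))

  # Wide to long: one row per (id, variable) pair
  long <- melt(wide, id.vars = "id")

  # Long to wide: id stays a column, each level of variable becomes a column
  back <- dcast(long, id ~ variable, value.var = "value")

  print(long)
  print(back)
}
```

Passing value.var explicitly avoids relying on dcast() to guess which column holds the values.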
