Transferred from: http://shujuren.org/article/45.html
In data analysis, one of the biggest headaches is dealing with dirty data. Dirty data causes serious errors in later modeling and mining work, so it must be handled with care.
Dirty data mainly takes the following forms:
1) Missing values
2) Outliers
3) Inconsistent data
Below we walk through how to deal with each kind.
First, missing values
A missing value, as the name implies, is an omission in the data. The common missing values in CRM data can be summarized as:
1) Missing member information, such as ID number, mobile phone number, gender, age, etc.
2) Missing consumption data, such as number of purchases, consumption amount, average transaction value, card balance, etc.
3) Missing product information, such as batch, price, discount, category, etc.
Missing values can be handled in different ways according to actual business needs. For example, if you need to push SMS messages to members and some members' phone numbers are missing, you can consider removing those records; if gender is unknown, you can substitute the mode; if age is unknown, you can consider substituting the mean. There are of course other methods as well, such as multiple imputation. A simple example of handling missing values follows.
# Simulate a batch of data that will receive missing values
set.seed(1234)
tel <- 13812341000:13812341999
sex <- sample(c('F', 'M'), size = 1000, replace = TRUE, prob = c(0.4, 0.6))
age <- round(runif(n = 1000, min = 18, max = 60))  # lower bound garbled in the source; 18 assumed
freq <- round(runif(n = 1000, min = 1, max = 368))
amount <- rnorm(n = 1000, mean = 134, sd = 10)
atv <- runif(n = 1000, min = 23, max = 138)  # lower bound garbled in the source; 23 assumed
df <- data.frame(tel = tel, sex = sex, age = age, freq = freq, amount = amount, atv = atv)
The data frame above contains no missing values. Now I want to randomly plant 100 missing values in it, as follows:
# View a summary of the original data set
summary(df)
# Randomly generate row and column subscripts for the missing values
set.seed(1234)
i <- sample(1:6, size = 100, replace = TRUE)
j <- sample(1:1000, size = 100)
# Combine the subscripts into a matrix
index <- as.matrix(data.frame(j, i))
# Convert the original data frame to a matrix
df <- as.matrix(df)
# Assign NA to the randomly indexed positions
df[index] <- NA
# Convert the matrix back to a data frame
df2 <- as.data.frame(df)
# Restore the variable types (the matrix conversion coerced everything to character)
df2$age <- as.integer(df2$age)
df2$freq <- as.integer(df2$freq)
df2$amount <- as.numeric(df2$amount)
df2$atv <- as.numeric(df2$atv)
# View the data frame summary again after inserting the missing values
summary(df2)
It is obvious that 100 missing values have now been randomly generated. Let's look at how these 100 missing values are distributed, using the aggr() function in the VIM package to plot the distribution:
library(VIM)
aggr(df2, prop = FALSE, numbers = TRUE)
The figure shows that the tel variable has 21 missing values, sex has 28, age has 6, freq has 20, amount has 13, and atv has 12.
For demonstration purposes, the observations with a missing tel are removed; the missing values of sex are replaced with the mode; the missing values of age are replaced with the mean; and the freq, amount and atv variables are filled by multiple imputation.
# Remove observations with a missing tel
df3 <- df2[!is.na(df2$tel), ]
# Replace missing sex with the mode and missing age with the mean
sex_mode <- names(which.max(table(df3$sex)))
age_mean <- mean(df3$age, na.rm = TRUE)
library(tidyr)
df3 <- replace_na(df3, replace = list(sex = sex_mode, age = age_mean))
summary(df3)
At this point the tel, sex and age variables no longer have missing values; next, the freq, amount and atv variables are filled by multiple imputation.
Multiple imputation can be carried out with the mice package, which can impute both numeric and factor data.
For numeric data, predictive mean matching (pmm) is used by default; for binary factors, logistic regression imputation (logreg) is the default; and for factors with more than two levels, polytomous regression imputation (polyreg) is the default.
For other imputation methods, consult the mice documentation.
library(mice)
# Multiple imputation with m = 5; numeric variables use predictive mean matching (pmm) by default
imp <- mice(data = df3, m = 5)
# Take a look at the imputation results
imp$imp
# Compute the mean of the 5 imputed values for each missing entry
freq_imp <- apply(imp$imp$freq, 1, mean)
amount_imp <- apply(imp$imp$amount, 1, mean)
atv_imp <- apply(imp$imp$atv, 1, mean)
# Replace the original missing values with these means
df3$freq[is.na(df3$freq)] <- freq_imp
df3$amount[is.na(df3$amount)] <- amount_imp
df3$atv[is.na(df3$atv)] <- atv_imp
# Compare the filled data set with the original data set
summary(df3)
summary(df2)
After handling the missing values with these different methods, the summary of the filled data set is close to that of the original data, so the overall characteristics of the data are basically preserved by the filling process.
Second, outliers
Outliers are another much-hated kind of dirty data: they tend to pull the overall picture of the data up or down, so we need to deal with them. First we identify which values are outliers, and then decide how to handle them. Again, a case is used to illustrate:
1. Identify outliers
In general, a boxplot is drawn to see which points are outliers; the judgement is based on the quartiles and the interquartile range (IQR).
That is, points above the upper quartile plus 1.5 times the IQR, or below the lower quartile minus 1.5 times the IQR, are treated as outliers.
Example:
# Randomly generate a set of data
set.seed(1234)
value <- c(rnorm(n = 100, mean = 10, sd = 3),
           runif(n = 20, min = 0.01, max = 30),
           rf(n = 10, df1 = 5, df2 = 20))  # sample sizes and one bound were garbled in the source; plausible values substituted
# Draw the boxplot and mark the outliers with red squares
library(ggplot2)
ggplot(data = NULL, mapping = aes(x = '', y = value)) +
  geom_boxplot(outlier.colour = 'red', outlier.shape = 15, width = 1.2)
The figure shows that some points fall above the upper quartile plus 1.5 times the IQR, i.e. they are outliers. The following code finds them:
# Compute the lower quartile, the upper quartile and the interquartile range
QL <- quantile(value, probs = 0.25)
QU <- quantile(value, probs = 0.75)
QU_QL <- QU - QL
QL; QU; QU_QL
2. Locate the outliers
which(value > QU + 1.5 * QU_QL)
value[which(value > QU + 1.5 * QU_QL)]
The results show that there are six outliers, at positions 104, 106, 110, 114, 116 and 120. To handle these outliers there are generally two approaches: removal or replacement. Removal is simple, but it can sometimes distort the results of subsequent analyses, so replacement is discussed next.
# Replace the outliers with the largest non-outlying value
test01 <- value
out_imp01 <- max(test01[which(test01 <= QU + 1.5 * QU_QL)])
test01[which(test01 > QU + 1.5 * QU_QL)] <- out_imp01
# Replace the outliers with the upper quartile plus 1.5 times the IQR
test02 <- value
out_imp02 <- QU + 1.5 * QU_QL
test02[which(test02 > QU + 1.5 * QU_QL)] <- out_imp02
# Compare the original and the two replaced data sets at a glance
summary(value)
summary(test01)
summary(test02)
Third, inconsistent data
Data inconsistency generally arises when data come from different sources: one source may record weight in jin while another uses kilograms; one may record length in meters while another uses centimeters; or two sources may not be updated at the same time. Such inconsistencies can easily be resolved by transforming the data to a common standard; only when the data from all sources are consistent can statistical analysis or data mining proceed. Since this kind of problem is relatively simple to handle, the specific treatment is not elaborated here.
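As a minimal sketch (the data frames, column names and values here are invented for illustration), harmonizing units before combining two sources can look like this in R:

# Two hypothetical sources recording member weight in different units
src_a <- data.frame(id = 1:3, weight = c(100, 120, 90))  # unit: jin (1 jin = 0.5 kg)
src_b <- data.frame(id = 4:6, weight = c(55, 62, 48))    # unit: kilograms
# Convert source A to kilograms so both sources share one unit
src_a$weight <- src_a$weight * 0.5
# Only after the units agree is it safe to combine and analyse
combined <- rbind(src_a, src_b)
summary(combined$weight)

The same pattern applies to any unit or coding mismatch: pick one standard, transform every source to it, and only then merge.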
(Original title: How to use R to deal with annoying dirty data)