How to use R to clean up nasty dirty data

Source: Internet
Author: User

Transferred from: http://shujuren.org/article/45.html

In data analysis, the biggest headache is usually how to deal with dirty data. Dirty data left in place causes serious errors in later modeling and mining work, so it must be handled with care.

Dirty data mainly takes the following forms:

1) Missing value

2) Outliers

3) Inconsistent data

The following sections walk through how to deal with each kind of dirty data.

I. Missing values

A missing value is, as the name implies, an omission in the data. Based on CRM data, common missing values can be summarized as follows:

1) Missing member information, such as ID number, mobile phone number, gender, age, etc.

2) Missing consumption data, such as number of purchases, purchase amount, average transaction value (ATV), card age, etc.

3) Product information is missing, such as batch, price, discount, category, etc.

Different processing methods can be applied to missing values according to the actual business need. For example, if you need to send text messages to members and some members' mobile phone numbers are missing, you can consider removing those records; if gender is unknown, you can substitute the mode; if age is unknown, you can substitute the mean. There are, of course, other ways to deal with missing values, such as multiple imputation. The following simple example illustrates these treatments.

# Simulate a data set that will receive missing values
# (several numeric arguments were garbled in the source; the sample size
# 1000 follows from the length of tel, the other bounds are illustrative)
set.seed(1234)
tel <- 13812341000:13812341999
sex <- sample(c("F", "M"), size = 1000, replace = TRUE, prob = c(0.4, 0.6))
age <- round(runif(n = 1000, min = 18, max = 60))
freq <- round(runif(n = 1000, min = 1, max = 368))
amount <- rnorm(n = 1000, mean = 134, sd = 10)
ATV <- runif(n = 1000, min = 23, max = 138)
df <- data.frame(tel = tel, sex = sex, age = age, freq = freq,
                 amount = amount, ATV = ATV)

The data frame above contains no missing values yet. Now I want to randomly generate 100 missing values, as follows:

# View a summary of the original data set
summary(df)

# Randomly generate the row and column subscripts of the cells to blank out
set.seed(1234)
i <- sample(1:6, size = 100, replace = TRUE)
j <- sample(1:1000, size = 100)

# Combine the subscripts into a two-column matrix (row index, column index)
index <- as.matrix(data.frame(j, i))

# Convert the data frame to a matrix so cells can be indexed by that matrix
df <- as.matrix(df)

# Assign NA to the selected cells
df[index] <- NA

# Convert the matrix back to a data frame
df2 <- as.data.frame(df)

# Restore the variable types (the matrix step coerced every column to character)
df2$age <- as.integer(df2$age)
df2$freq <- as.integer(df2$freq)
df2$amount <- as.numeric(df2$amount)
df2$ATV <- as.numeric(df2$ATV)

# View the summary again, now with the missing values in place
summary(df2)

Clearly, 100 missing values have been generated. Let's look at how these 100 missing values are distributed, using the aggr() function in the VIM package to plot them:

library(VIM)
aggr(df2, prop = FALSE, numbers = TRUE)

The figure shows that the tel variable has 21 missing values, sex has 28, age has 6, freq has 20, amount has 13, and ATV has 12.

For demonstration purposes, the observations with a missing tel are removed; missing values of sex are replaced with the mode; missing values of age are replaced with the mean; and the freq, amount, and ATV variables are filled in by multiple imputation.

# Delete the observations whose tel is missing
df3 <- df2[is.na(df2$tel) == FALSE, ]

# Mode of sex and mean of age among the remaining observations
sex_mode <- names(which.max(table(df3$sex)))
age_mean <- mean(df3$age, na.rm = TRUE)

# Fill the missing sex and age values (the mean is rounded so the
# replacement matches the integer type of the age column)
library(tidyr)
df3 <- replace_na(df3, replace = list(sex = sex_mode,
                                      age = as.integer(round(age_mean))))

summary(df3)

At this point the tel, sex, and age variables no longer have missing values; the freq, amount, and ATV variables will be handled by multiple imputation.

Multiple imputation can be carried out with the mice package, which can impute both numeric and factor data. For numeric data the default is predictive mean matching (pmm); for two-level factors the default is logistic regression imputation (logreg); for multi-level factors the default is polytomous regression imputation (polyreg). For other imputation methods, see the mice documentation.
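As a sketch of overriding those defaults (the toy data below is hypothetical, not the df3 data set used in this article), the method argument of mice() takes one method name per column:

# Tiny hypothetical data set with missing numeric and factor values
library(mice)

dat <- data.frame(
  x = c(1.2, 2.3, NA, 4.1, 5.0, 2.8, NA, 3.9, 4.4, 1.7),
  y = c(10, NA, 13, 15, 16, 12, 14, NA, 17, 11),
  g = factor(c("a", "b", "a", NA, "b", "a", "b", "a", NA, "b"))
)

# One method per column: pmm for the numeric x and y,
# logistic regression (logreg) for the two-level factor g
imp <- mice(dat, m = 5, method = c("pmm", "pmm", "logreg"),
            printFlag = FALSE)

# complete() returns the data with the first imputation filled in
complete(imp)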

library(mice)

# Run 5 multiple imputations on the remaining missing values; numeric
# columns use predictive mean matching (pmm) by default
imp <- mice(data = df3, m = 5)

# Take a look at the imputed values
imp$imp

# Average the 5 imputed values for each missing cell
freq_imp <- apply(imp$imp$freq, 1, mean)
amount_imp <- apply(imp$imp$amount, 1, mean)
ATV_imp <- apply(imp$imp$ATV, 1, mean)

# Replace the original missing values with those means
df3$freq[is.na(df3$freq)] <- freq_imp
df3$amount[is.na(df3$amount)] <- amount_imp
df3$ATV[is.na(df3$ATV)] <- ATV_imp

# Compare the filled data set with the original one
summary(df3)
summary(df2)

After the missing values are processed with these different methods, the summary of the filled data set is close to that of the original data, which shows that the overall characteristics of the data were preserved during filling.

II. Outliers

Outliers are another hateful kind of dirty data: they tend to pull the overall statistics of the data up or down, so to overcome their influence we need to deal with them. First we must identify which values are outliers, and then decide how to handle them. The following case walks through the treatment of outliers:

1. Identifying outliers

In general, a box plot is drawn to see which points are outliers; the judgment is based on the quartiles and the interquartile range (IQR). That is, a point is an outlier if it lies above the upper quartile by more than 1.5 times the IQR, or below the lower quartile by more than 1.5 times the IQR.

Example:

# Randomly generate a set of data (the sample sizes and the runif upper
# bound were garbled in the source; the values below are plausible
# reconstructions)
set.seed(1234)
value <- c(rnorm(n = 100, mean = 10, sd = 3),
           runif(n = 10, min = 0.01, max = 30),
           rf(n = 10, df1 = 5, df2 = 20))

# Draw the box plot, marking the outliers as red squares
library(ggplot2)
ggplot(data = NULL, mapping = aes(x = "", y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 15, width = 1.2)

The figure shows that some points fall beyond the upper quartile plus 1.5 times the IQR, i.e. they are outliers. The following code finds them:

# Calculate the lower quartile, the upper quartile, and the IQR
QL <- quantile(value, probs = 0.25)
QU <- quantile(value, probs = 0.75)
QU_QL <- QU - QL
QL; QU; QU_QL

2. Finding and handling the outliers

# Which points lie above the upper fence, and what are their values?
which(value > QU + 1.5 * QU_QL)
value[which(value > QU + 1.5 * QU_QL)]

The results show that the outliers are the 104th, 106th, 110th, 114th, 116th, and 120th points, 6 in total. There are generally two ways to deal with outliers: removal or substitution. Removal is simple, but it can sometimes lead to incorrect results in subsequent analyses, so next let's talk about substitution.

# Option 1: replace the outliers with the largest non-outlying value
test01 <- value
out_imp01 <- max(test01[which(test01 <= QU + 1.5 * QU_QL)])
test01[which(test01 > QU + 1.5 * QU_QL)] <- out_imp01

# Option 2: replace the outliers with the upper fence itself (QU + 1.5 * IQR)
test02 <- value
out_imp02 <- QU + 1.5 * QU_QL
test02[which(test02 > QU + 1.5 * QU_QL)] <- out_imp02

# Compare the original data with the two replacement versions
summary(value)
summary(test01)
summary(test02)

III. Inconsistent data

Inconsistent data usually arises when data come from different sources: one source records weight in jin while another uses kilograms, one records length in meters while another uses centimeters, or two sources are not updated at the same time. Such inconsistencies are easy to resolve with a data transformation, and only once the sources agree can statistical analysis or data mining proceed. Because this kind of problem is relatively simple, I won't dwell on the specific treatment.
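As a minimal sketch of such a transformation (the column names and values below are hypothetical, not from any data set in this article), one source's unit is converted before the sources are combined:

# Hypothetical sources: A records weight in jin (1 jin = 0.5 kg), B in kilograms
source_a <- data.frame(id = 1:3, weight_jin = c(100, 150, 120))
source_b <- data.frame(id = 4:6, weight_kg = c(55, 70, 62))

# Convert source A to kilograms so both sources use the same unit
source_a$weight_kg <- source_a$weight_jin * 0.5

# Now the two sources can be stacked consistently
combined <- rbind(source_a[, c("id", "weight_kg")], source_b)
combined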
