Using the mice package in R to fill missing values by linear regression

Source: Internet
Author: User

In data analysis we often run into missing values. The usual ways of handling them are deletion and filling (imputation). With deletion, we drop the samples or variables that contain missing data. Filling methods can be divided into single-variable and multivariable methods; single-variable methods include random filling, mean/median filling, regression filling, and so on. This article briefly describes how to use the mice package in R to fill missing values by regression.
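
As a minimal sketch of the deletion and single-variable filling methods mentioned above, the base R snippet below works on a small made-up data frame. The name `toy`, its values, and the helper `fill_with` are illustrative assumptions, not part of the article's data or of any package.

```r
# Hypothetical toy data, not the article's data set
toy <- data.frame(T = c(0.40, NA, 0.50, 0.60, NA),
                  P = c(2000, 2500, NA, 5000, 4000))

# Deletion method: keep only the rows with no missing value
deleted <- na.omit(toy)

# Single-variable filling: replace each NA with a summary of its own column
fill_with <- function(x, f) {
  x[is.na(x)] <- f(x, na.rm = TRUE)
  x
}
mean_filled   <- as.data.frame(lapply(toy, fill_with, f = mean))
median_filled <- as.data.frame(lapply(toy, fill_with, f = median))

print(deleted)
print(mean_filled)
print(median_filled)
```

Deletion is simple but discards every partly observed row; mean or median filling keeps all rows but ignores the relationship between the columns, which is what regression filling tries to use.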

Suppose the original data has only two columns, P (pressure) and T (temperature). The data are as follows:

orig_data <- data.frame(T = c(0.47, 0.45, 0.48, 0.47, 0.41, 0.56, 0.54, 0.51, 0.44, 0.56, NA, 0.62, 0.5, 0.) 0.69, NA, NA, 0.73, 0.45, 0.43, 0.38, 0.35, 0.5, 0.46, 0.41, 0.43, 0.41, NA, 0.8, 0.51, NA, 0.44, NA, 0.43, 0.45, 0.77, 0.41, 0.77, 0.47, 0.63, 0.43, NA, NA, 0.47, NA, 0.25, 0.48, 0.49, 0.46, 0.72, NA, 0. 0.45, NA, 0.41, 0.36, 0.48, 0.4, 0.44, 0.73, 0.8, 0.45, 0.47, 0.54, 0.5, 0.5, 0.48, 0.44, NA, 0.42, 0.34, 0.45, NA, 0.42, 0.42, 0.42, 0.42, 0.52, 0.44, 0.56, NA, 0.52, 0.44, 0.5, NA, 0.46, 0.42, 0.42, 0.35, 0.3, NA, 0.49, 0.53, 0.62, 0.48, 0.44, 0.48, 0.48, 0.45, 0.43, 0.43, 0.47, NA, 0.48, 0.69, 0.62, 0.45, 0.4, NA, 0.9, 0.7, 0.37, 0.66, 0.36, 0.76, 0.83, 0.44, 0.33, 0.46, 0.46, 0.43, 0.45, NA, 0.46, 0.43, 0
        0.52, 0.48, 0.44, 0.37, 0.47, 0.47),
        P = c(4650, 3720, 2050, 5600, 1420, 5299.6, 6714, 3858, 3731, 3331, NA, 3800, 2190, 2$, NA, NA, 7135, 6817, 2264, 4490, 2359, 889, 3572, 4978, 3800, 1735, 2092, 4200, 6840, 2381, 250, 6637, NA, 1434, 3122, 11542, 1075, 12075, 5027, 3640, 2026, 4551, NA, 4551, NA, 927, 2727, 4400, 925, 10800, NA, 1894, 1514, 1987, 2741, 2788, 4490, 2375, 4772, 5490, 3190, 4177, 3490, 5660, 5750, 6220, 4345, 39 850, 4300, 2459, 2074, 2450, 3350, 3002, 3350, 3002, 1263, 2969, 827, NA, 5613, 3272, 3360, 2600, 3599, 653, 2062, 1300, NA, 4439, 4218, 4057, 1242, 4722, 2731, 3100, 2245, 2340, 3387, 2367, NA, 6301,
        3565, 9500, 9137, 2282, 2521, 11600, 7134, 2684, 4254, 1628, 5400, 6550, 3692, 2200, 980, 980, 1162, 3145,
        NA, 2117, 3390, 4365, 800, 2250, 2915, 2929, 4229, 5830))

The md.pattern() function in the mice package shows the missing-value pattern of the original data. In the table below, the 1s and 0s are the pattern: 0 means the variable in that column is missing, and 1 means it is present. In the original data, column P is missing 11 values and column T is missing 19; 11 data points are missing both columns, and 113 data points have no missing value in either column. We can also use the scattMiss() function or the aggr() function in the VIM package to plot the missing data.

library(mice)
md.pattern(orig_data)
      P   T
113   1   1   0
  8   1   0   1
 11   0   0   2

library(VIM)
aggr(orig_data, prop = TRUE, numbers = TRUE)
The code above plots the distribution of missing values. The plot shows that about 14% (0.14) of the values in column T are missing, about 8.3% of the values in column P are missing, and in 8.3% of all rows both columns are missing.
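
The counts and proportions quoted above can also be checked without any extra package. The sketch below does this in base R on a small made-up data frame (`toy` and its values are illustrative assumptions); the same calls should apply unchanged to `orig_data`.

```r
# Hypothetical toy data, not the article's data set
toy <- data.frame(T = c(0.40, NA, 0.50, NA, NA),
                  P = c(2000, 2500, NA, NA, 4000))

miss <- is.na(toy)                               # TRUE where a value is missing
n_missing    <- colSums(miss)                    # missing count per column
prop_missing <- colMeans(miss)                   # missing proportion per column
n_both       <- sum(rowSums(miss) == ncol(toy))  # rows with every column missing
n_complete   <- sum(complete.cases(toy))         # rows with no missing value

print(n_missing)
print(prop_missing)
print(c(both_missing = n_both, complete = n_complete))
```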

For the 11 data points that are missing both values, regression cannot be used to fill them. For a data point that is missing only one value, however, we can fill it by regression, because P and T are linearly correlated to some degree. We can fit a linear regression between the two variables to see the relationship between them. The R code is as follows.

plot(orig_data)
linear_model <- lm(P ~ T, data = orig_data)
abline(linear_model, col = "red")
summary(linear_model)

The output of the code above is as follows.

Call:
lm(formula = P ~ T, data = orig_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4616.3 -1244.2    -2.6   766.6  5905.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2651.1      712.2  -3.722 0.000312 ***
T            13071.9     1411.4   9.262 1.79e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1762 on 111 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.4359,	Adjusted R-squared:  0.4308 
F-statistic: 85.78 on 1 and 111 DF,  p-value: 1.79e-15
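
To make the link between a fitted line and the filling step concrete, here is a minimal base R sketch of deterministic regression filling. Because T is the column to be filled, the sketch regresses T on P (the reverse of the P ~ T fit above) and predicts T only where T is missing and P is observed. The data frame `toy` and the helper `impute_t` are illustrative assumptions, and this is not the procedure that mice runs internally.

```r
# Hypothetical toy data, not the article's data set
toy <- data.frame(T = c(0.30, 0.40, NA, 0.60, NA, 0.80),
                  P = c(1000, 2000, 3000, 4000, NA, 6000))

impute_t <- function(df) {
  fit  <- lm(T ~ P, data = df)           # lm() drops incomplete rows by default
  fill <- is.na(df$T) & !is.na(df$P)     # rows where the regression can help
  df$T[fill] <- predict(fit, newdata = df[fill, , drop = FALSE])
  df
}

filled <- impute_t(toy)
print(filled)
```

The row in which both values are missing stays missing, which matches the earlier remark that such points cannot be filled by regression.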

The linear regression of P on T gives an R² of about 0.44 (adjusted R² 0.43). We can therefore run the following code, which uses the linear relationship between the two variables to fill the missing T values.

# First load the sqldf package and exclude the samples in which both values are missing
library(sqldf)
temp_data <- sqldf("SELECT T, P FROM orig_data
              WHERE T IS NOT NULL
              OR P IS NOT NULL", row.names = TRUE)

# Use the mice package to fill the missing values in column T
imp <- mice(temp_data, seed = 3231)
# Fit the regression on each imputed data set, then pool the fits
fit_new <- with(imp, lm(P ~ T))
pooled <- pool(fit_new)
# Get the newly generated data
new_data <- complete(imp, action = 3)
# Compare the original data and the new data side by side
total_data <- cbind(temp_data, new_data)
colnames(total_data) <- c("original_t", "original_p", "new_t", "new_p")
total_data

The new and original data are compared as follows:
original_t original_p new_t new_p
1 0.47 4650.0 0.47 4650.0
2 0.45 3720.0 0.45 3720.0
3 0.48 2050.0 0.48 2050.0
4 0.47 5600.0 0.47 5600.0
5 0.41 1420.0 0.41 1420.0
6 0.56 5299.6 0.56 5299.6
7 0.54 6714.0 0.54 6714.0
8 0.51 3858.0 0.51 3858.0
9 0.44 3731.0 0.44 3731.0
10 0.56 3331.0 0.56 3331.0
11 0.62 3800.0 0.62 3800.0
12 0.50 2190.0 0.50 2190.0
13 0.43 2800.0 0.43 2800.0
14 0.69 7135.0 0.69 7135.0
15 0.73 6817.0 0.73 6817.0
16 0.45 2264.0 0.45 2264.0
17 0.43 4490.0 0.43 4490.0
18 0.38 2359.0 0.38 2359.0
19 0.35 889.0 0.35 889.0
20 0.50 3572.0 0.50 3572.0
21 0.46 4978.0 0.46 4978.0
22 0.41 3800.0 0.41 3800.0
23 0.43 1735.0 0.43 1735.0
24 0.41 2092.0 0.41 2092.0
25 NA 4200.0 0.47 4200.0
26 0.80 6840.0 0.80 6840.0
27 0.51 2381.0 0.51 2381.0
28 NA 250.0 0.35 250.0
29 0.44 6637.0 0.44 6637.0
30 0.43 1434.0 0.43 1434.0
31 0.45 3122.0 0.45 3122.0
32 0.77 11542.0 0.77 11542.0
33 0.41 1075.0 0.41 1075.0
34 0.77 12075.0 0.77 12075.0
35 0.47 5027.0 0.47 5027.0
36 0.63 3640.0 0.63 3640.0
37 0.43 2026.0 0.43 2026.0
38 NA 4551.0 0.44 4551.0
39 0.47 4551.0 0.47 4551.0
40 0.25 927.0 0.25 927.0
41 0.48 2727.0 0.48 2727.0
42 0.49 4400.0 0.49 4400.0
43 0.46 925.0 0.46 925.0
44 0.72 10800.0 0.72 10800.0
45 0.36 1894.0 0.36 1894.0
46 NA 1514.0 0.43 1514.0
47 0.45 1987.0 0.45 1987.0
48 0.41 2741.0 0.41 2741.0
49 0.36 2788.0 0.36 2788.0
50 0.48 4490.0 0.48 4490.0
51 0.40 2375.0 0.40 2375.0
52 0.44 4772.0 0.44 4772.0
53 0.73 5490.0 0.73 5490.0
54 0.80 3190.0 0.80 3190.0
55 0.45 4177.0 0.45 4177.0
56 0.47 3490.0 0.47 3490.0
57 0.54 5660.0 0.54 5660.0
58 0.50 5750.0 0.50 5750.0
59 0.50 6220.0 0.50 6220.0
60 0.48 4345.0 0.48 4345.0
61 0.44 3983.0 0.44 3983.0
62 NA 850.0 0.25 850.0
63 0.42 4300.0 0.42 4300.0
64 0.34 2459.0 0.34 2459.0
65 0.45 2074.0 0.45 2074.0
66 NA 2450.0 0.48 2450.0
67 0.42 3350.0 0.42 3350.0
68 0.42 3002.0 0.42 3002.0
69 0.42 3350.0 0.42 3350.0
70 0.42 3002.0 0.42 3002.0
71 0.52 1263.0 0.52 1263.0
72 0.44 2969.0 0.44 2969.0
73 0.56 827.0 0.56 827.0
74 0.52 5613.0 0.52 5613.0
75 0.44 3272.0 0.44 3272.0
76 0.50 3360.0 0.50 3360.0
77 NA 2600.0 0.43 2600.0
78 0.46 3599.0 0.46 3599.0
79 0.42 288.0 0.42 288.0
80 0.42 653.0 0.42 653.0
81 0.35 2062.0 0.35 2062.0
82 0.30 1300.0 0.30 1300.0
83 0.49 4439.0 0.49 4439.0
84 0.53 4218.0 0.53 4218.0
85 0.62 4057.0 0.62 4057.0
86 0.48 1242.0 0.48 1242.0
87 0.44 4722.0 0.44 4722.0
88 0.48 2731.0 0.48 2731.0
89 0.48 3100.0 0.48 3100.0
90 0.45 2245.0 0.45 2245.0
91 0.43 2340.0 0.43 2340.0
92 0.43 3387.0 0.43 3387.0
93 0.47 2367.0 0.47 2367.0
94 0.48 6301.0 0.48 6301.0
95 0.69 3565.0 0.69 3565.0
96 0.62 9500.0 0.62 9500.0
97 0.45 9137.0 0.45 9137.0
98 0.40 2282.0 0.40 2282.0
99 NA 2521.0 0.36 2521.0
100 0.90 11600.0 0.90 11600.0
101 0.70 7134.0 0.70 7134.0
102 0.37 2684.0 0.37 2684.0
103 0.66 4254.0 0.66 4254.0
104 0.36 1628.0 0.36 1628.0
105 0.76 5400.0 0.76 5400.0
106 0.83 6550.0 0.83 6550.0
107 0.44 3692.0 0.44 3692.0
108 0.33 2200.0 0.33 2200.0
109 0.46 980.0 0.46 980.0
110 0.46 980.0 0.46 980.0
111 0.43 1162.0 0.43 1162.0
112 0.45 3145.0 0.45 3145.0
113 0.46 2117.0 0.46 2117.0
114 0.43 3390.0 0.43 3390.0
115 0.44 4365.0 0.44 4365.0
116 0.52 800.0 0.52 800.0
117 0.48 2250.0 0.48 2250.0
118 0.44 2915.0 0.44 2915.0
119 0.37 2929.0 0.37 2929.0
120 0.47 4229.0 0.47 4229.0
121 0.47 5830.0 0.47 5830.0
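
One point worth noting: unless told otherwise, `mice()` fills numeric columns by predictive mean matching (`pmm`), so each filled value is borrowed from an observed one. If values taken directly from a fitted regression line are wanted, one option is the package's built-in `norm.predict` method. The sketch below assumes the mice package is installed and uses a made-up data frame `toy`; a single imputed data set (`m = 1`) is requested because this method adds no random noise.

```r
library(mice)

# Hypothetical toy data, not the article's data set
toy <- data.frame(T = c(0.30, 0.42, NA, 0.61, 0.50, NA, 0.79, 0.35),
                  P = c(1000, 2100, 3000, 4000, 3100, 5000, 6000, 1500))

# method = "norm.predict": fill each NA with the value predicted by a
# linear regression on the other column
imp <- mice(toy, method = "norm.predict", m = 1, maxit = 1,
            seed = 3231, printFlag = FALSE)
filled <- complete(imp, 1)
print(filled)
```

Filling with predicted values alone understates the spread of the filled column, so for inference a stochastic method with several imputed data sets and `pool()` may be the safer choice.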
