In data analysis, we often encounter the problem of missing values. The two general approaches to handling them are deletion and filling (imputation). With deletion, we remove the samples or variables that contain missing data. Filling methods can be divided into single-variable and multivariable methods; single-variable methods include random filling, mean/median filling, regression filling, and so on. This article briefly describes how to use the mice package in R to fill missing values by regression.
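As a rough illustration of the difference between deletion and single-variable filling, the following base R sketch applies both to a small made-up data frame. The name toy and its values are hypothetical and are not the article's data.

```r
# Hypothetical toy data, not the article's data set
toy <- data.frame(
  T = c(0.47, NA, 0.48, 0.41, NA),
  P = c(4650, 3720, NA, 1420, 5600)
)

# Deletion method: drop every row that contains at least one NA
deleted <- na.omit(toy)

# Median filling: replace each NA with the median of the observed
# values in the same column
filled <- toy
for (col in names(filled)) {
  missing <- is.na(filled[[col]])
  filled[[col]][missing] <- median(filled[[col]], na.rm = TRUE)
}

print(nrow(deleted))
print(filled)
```

Deletion keeps only the fully observed rows, while median filling keeps every row but gives all filled cells in a column the same value.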
Assume that the original data has only two columns, P (pressure) and T (temperature). The data are as follows:
orig_data <- data.frame(T = c(0.47, 0.45, 0.48, 0.47, 0.41, 0.56, 0.54, 0.51, 0.44, 0.56, NA, 0.62, 0.5, 0.) 0.69, NA, NA, 0.73, 0.45, 0.43, 0.38, 0.35, 0.5, 0.46, 0.41, 0.43, 0.41, NA, 0.8, 0.51, NA, 0.44, NA, 0.43, 0.45, 0.77, 0.41, 0.77, 0.47, 0.63, 0.43, NA, NA, 0.47, NA, 0.25, 0.48, 0.49, 0.46, 0.72, NA, 0. 0.45, NA, 0.41, 0.36, 0.48, 0.4, 0.44, 0.73, 0.8, 0.45, 0.47, 0.54, 0.5, 0.5, 0.48, 0.44, NA, 0.42, 0.34, 0.45, NA, 0.42, 0.42, 0.42, 0.42, 0.52, 0.44, 0.56, NA, 0.52, 0.44, 0.5, NA, 0.46, 0.42, 0.42, 0.35, 0.3, NA, 0.49, 0.53, 0.62, 0.48, 0.44, 0.48, 0.48, 0.45, 0.43, 0.43, 0.47, NA, 0.48, 0.69, 0.62, 0.45, 0.4, NA, 0.9, 0.7, 0.37, 0.66, 0.36, 0.76, 0.83, 0.44, 0.33, 0.46, 0.46, 0.43, 0.45, NA, 0.46, 0.43, 0
0.52, 0.48, 0.44, 0.37, 0.47, 0.47), P = c(4650, 3720, 2050, 5600, 1420, 5299.6, 6714, 3858, 3731, 3331, NA, 3800, 2190, 2$, NA, NA, 7135, 6817, 2264, 4490, 2359, 889, 3572, 4978, 3800, 1735, 2092, 4200, 6840, 2381, 250, 6637, NA, 1434, 3122, 11542, 1075, 12075, 5027, 3640, 2026, 4551, NA, 4551, NA, 927, 2727, 4400, 925, 10800, NA, 1894, 1514, 1987, 2741, 2788, 4490, 2375, 4772, 5490, 3190, 4177, 3490, 5660, 5750, 6220, 4345, 39 850, 4300, 2459, 2074, 2450, 3350, 3002, 3350, 3002, 1263, 2969, 827, NA, 5613, 3272, 3360, 2600, 359 9, 653, 2062, 1300, NA, 4439, 4218, 4057, 1242, 4722, 2731, 3100, 2245, 2340, 3387, 2367, NA, 6301,
3565, 9500, 9137, 2282, 2521, 11600, 7134, 2684, 4254, 1628, 5400, 6550, 3692, 2200, 980, 980, 1162, 3145,
NA, 2117, 3390, 4365, 800, 2250, 2915, 2929, 4229, 5830))
Calling the md.pattern() function from the mice package shows the missing-value pattern in the original data. In the table below, the 1s and 0s encode the pattern: 0 indicates that the variable in that column is missing, and 1 indicates that it is observed. In the original data, the P column is missing 11 values and the T column is missing 19 values; 11 data points are missing in both columns, and 113 data points have no missing values in either column. We can also use the scattMiss() function or the aggr() function in the VIM package to plot the missing data.
library(mice)
md.pattern(orig_data)
    P T
113 1 1 0
  8 1 0 1
 11 0 0 2
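The same kind of pattern table can be approximated without any package. The sketch below is a minimal base R version on a made-up data frame (toy and its values are hypothetical); it uses the same coding, 1 for observed and 0 for missing, and is not a reimplementation of md.pattern().

```r
# Hypothetical toy data, not the article's data set
toy <- data.frame(
  P = c(4650, NA, 2050, 5600, NA, 3331),
  T = c(0.47, NA, NA, 0.47, NA, 0.56)
)

# 1 = observed, 0 = missing
observed <- as.data.frame(1L - is.na(toy))

# Count how many rows share each observed/missing pattern
pattern_counts <- as.data.frame(table(observed))
pattern_counts <- pattern_counts[pattern_counts$Freq > 0, ]

# Number of missing values per column
missing_per_column <- colSums(is.na(toy))

print(pattern_counts)
print(missing_per_column)
```

Each row of pattern_counts is one pattern with its row count in Freq, and missing_per_column gives the per-variable totals of missing values.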
library(VIM)
aggr(orig_data, prop = TRUE, numbers = TRUE)
The code above plots the distribution of missing values. The plot shows that about 14% (0.14) of the T column is missing, about 8.3% of the P column is missing, and about 8.3% of all rows are missing both columns.
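These proportions can be checked by hand from the counts reported earlier: 113 complete rows, 19 rows missing T, and 11 rows missing P, all of which also miss T. The variable names in this short sketch are illustrative only.

```r
# Counts taken from the missing-value pattern reported in the text
n_complete  <- 113  # rows with both P and T observed
n_t_missing <- 19   # rows with T missing (includes the rows missing both)
n_both      <- 11   # rows with both P and T missing

# Every row that misses P also misses T, so the total is complete + T-missing
n_total <- n_complete + n_t_missing

prop_t_missing <- n_t_missing / n_total
prop_p_missing <- n_both / n_total

print(n_total)
print(round(c(T = prop_t_missing, P = prop_p_missing), 3))
```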
The 11 data points that are missing both values cannot be filled by regression. For a data point that is missing only one value, however, we can use regression filling, because P and T are linearly correlated to some degree. We can fit a linear regression between the two variables to examine their relationship. The R code is as follows.
plot(orig_data)
linear_model <- lm(P ~ T, data = orig_data)
abline(linear_model, col = "red")
summary(linear_model)
The output of the code above is as follows.
Call:
lm(formula = P ~ T, data = orig_data)
Residuals:
    Min      1Q  Median      3Q     Max
-4616.3 -1244.2    -2.6   766.6  5905.8
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2651.1      712.2  -3.722 0.000312 ***
T            13071.9     1411.4   9.262 1.79e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1762 on 111 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.4359, Adjusted R-squared:  0.4308
F-statistic: 85.78 on 1 and 111 DF,  p-value: 1.79e-15
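To make the fitted equation concrete, the sketch below plugs the reported coefficients into P = intercept + slope * T and checks that the reported t value and F-statistic are consistent with each other. The temperature 0.5 is an arbitrary illustrative input, not a value singled out by the article.

```r
# Coefficients and standard error copied from the summary output
intercept <- -2651.1
slope     <- 13071.9
slope_se  <- 1411.4

# Predicted P for a hypothetical temperature value T = 0.5
predicted_p <- intercept + slope * 0.5

# The t value is the estimate divided by its standard error; with a single
# predictor, the F-statistic equals the squared t value of the slope
t_value <- slope / slope_se
f_value <- t_value^2

print(predicted_p)
print(round(t_value, 3))
print(round(f_value, 2))
```

Small differences from the printed F-statistic can arise because the inputs here are already rounded.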
The R² of the linear regression of P on T is about 0.44 (adjusted R² 0.43). Given this linear relationship, we can run the following code to fill the missing T values.
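For comparison, regression filling can also be done by hand in base R. To fill T from P, T has to be the response variable, so this minimal sketch fits T ~ P on the complete rows and predicts T where it is missing. The data frame toy is hypothetical, and this is not the procedure the mice code performs.

```r
# Hypothetical toy data, not the article's data set
toy <- data.frame(
  T = c(0.47, 0.45, NA, 0.41, 0.56, NA, 0.62, 0.50),
  P = c(4650, 3720, 4200, 1420, 5300, 2500, 6800, 3600)
)

# Fit T as a function of P; lm() drops rows containing NA by default
fit <- lm(T ~ P, data = toy)

# Predict T only for rows where T is missing and P is observed
to_fill <- is.na(toy$T) & !is.na(toy$P)
filled <- toy
filled$T[to_fill] <- predict(fit, newdata = toy[to_fill, , drop = FALSE])

print(filled)
```

Rows in which both values are missing would be left untouched by this approach, which matches the earlier remark that such rows cannot be filled by regression.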
# First load the sqldf package and exclude the samples in which all values are missing
library(sqldf)
temp_data <- sqldf("SELECT T, P FROM orig_data
WHERE T IS NOT NULL
OR P IS NOT NULL", row.names = TRUE)
# Fill the missing values in column T with the mice package
imp <- mice(temp_data, seed = 3231)
fit_new <- with(imp, lm(P ~ T))
pooled <- pool(fit_new)
# Get the newly generated data
new_data <- complete(imp, action = 3)
# Compare the original data and the new data side by side
total_data <- cbind(temp_data, new_data)
colnames(total_data) <- c("original_t", "original_p", "new_t", "new_p")
total_data
The comparison of the new and original data is as follows:
original_t original_p new_t new_p
1 0.47 4650.0 0.47 4650.0
2 0.45 3720.0 0.45 3720.0
3 0.48 2050.0 0.48 2050.0
4 0.47 5600.0 0.47 5600.0
5 0.41 1420.0 0.41 1420.0
6 0.56 5299.6 0.56 5299.6
7 0.54 6714.0 0.54 6714.0
8 0.51 3858.0 0.51 3858.0
9 0.44 3731.0 0.44 3731.0
10 0.56 3331.0 0.56 3331.0
11 0.62 3800.0 0.62 3800.0
12 0.50 2190.0 0.50 2190.0
13 0.43 2800.0 0.43 2800.0
14 0.69 7135.0 0.69 7135.0
15 0.73 6817.0 0.73 6817.0
16 0.45 2264.0 0.45 2264.0
17 0.43 4490.0 0.43 4490.0
18 0.38 2359.0 0.38 2359.0
19 0.35 889.0 0.35 889.0
20 0.50 3572.0 0.50 3572.0
21 0.46 4978.0 0.46 4978.0
22 0.41 3800.0 0.41 3800.0
23 0.43 1735.0 0.43 1735.0
24 0.41 2092.0 0.41 2092.0
25 NA 4200.0 0.47 4200.0
26 0.80 6840.0 0.80 6840.0
27 0.51 2381.0 0.51 2381.0
28 NA 250.0 0.35 250.0
29 0.44 6637.0 0.44 6637.0
30 0.43 1434.0 0.43 1434.0
31 0.45 3122.0 0.45 3122.0
32 0.77 11542.0 0.77 11542.0
33 0.41 1075.0 0.41 1075.0
34 0.77 12075.0 0.77 12075.0
35 0.47 5027.0 0.47 5027.0
36 0.63 3640.0 0.63 3640.0
37 0.43 2026.0 0.43 2026.0
38 NA 4551.0 0.44 4551.0
39 0.47 4551.0 0.47 4551.0
40 0.25 927.0 0.25 927.0
41 0.48 2727.0 0.48 2727.0
42 0.49 4400.0 0.49 4400.0
43 0.46 925.0 0.46 925.0
44 0.72 10800.0 0.72 10800.0
45 0.36 1894.0 0.36 1894.0
46 NA 1514.0 0.43 1514.0
47 0.45 1987.0 0.45 1987.0
48 0.41 2741.0 0.41 2741.0
49 0.36 2788.0 0.36 2788.0
50 0.48 4490.0 0.48 4490.0
51 0.40 2375.0 0.40 2375.0
52 0.44 4772.0 0.44 4772.0
53 0.73 5490.0 0.73 5490.0
54 0.80 3190.0 0.80 3190.0
55 0.45 4177.0 0.45 4177.0
56 0.47 3490.0 0.47 3490.0
57 0.54 5660.0 0.54 5660.0
58 0.50 5750.0 0.50 5750.0
59 0.50 6220.0 0.50 6220.0
60 0.48 4345.0 0.48 4345.0
61 0.44 3983.0 0.44 3983.0
62 NA 850.0 0.25 850.0
63 0.42 4300.0 0.42 4300.0
64 0.34 2459.0 0.34 2459.0
65 0.45 2074.0 0.45 2074.0
66 NA 2450.0 0.48 2450.0
67 0.42 3350.0 0.42 3350.0
68 0.42 3002.0 0.42 3002.0
69 0.42 3350.0 0.42 3350.0
70 0.42 3002.0 0.42 3002.0
71 0.52 1263.0 0.52 1263.0
72 0.44 2969.0 0.44 2969.0
73 0.56 827.0 0.56 827.0
74 0.52 5613.0 0.52 5613.0
75 0.44 3272.0 0.44 3272.0
76 0.50 3360.0 0.50 3360.0
77 NA 2600.0 0.43 2600.0
78 0.46 3599.0 0.46 3599.0
79 0.42 288.0 0.42 288.0
80 0.42 653.0 0.42 653.0
81 0.35 2062.0 0.35 2062.0
82 0.30 1300.0 0.30 1300.0
83 0.49 4439.0 0.49 4439.0
84 0.53 4218.0 0.53 4218.0
85 0.62 4057.0 0.62 4057.0
86 0.48 1242.0 0.48 1242.0
87 0.44 4722.0 0.44 4722.0
88 0.48 2731.0 0.48 2731.0
89 0.48 3100.0 0.48 3100.0
90 0.45 2245.0 0.45 2245.0
91 0.43 2340.0 0.43 2340.0
92 0.43 3387.0 0.43 3387.0
93 0.47 2367.0 0.47 2367.0
94 0.48 6301.0 0.48 6301.0
95 0.69 3565.0 0.69 3565.0
96 0.62 9500.0 0.62 9500.0
97 0.45 9137.0 0.45 9137.0
98 0.40 2282.0 0.40 2282.0
99 NA 2521.0 0.36 2521.0
100 0.90 11600.0 0.90 11600.0
101 0.70 7134.0 0.70 7134.0
102 0.37 2684.0 0.37 2684.0
103 0.66 4254.0 0.66 4254.0
104 0.36 1628.0 0.36 1628.0
105 0.76 5400.0 0.76 5400.0
106 0.83 6550.0 0.83 6550.0
107 0.44 3692.0 0.44 3692.0
108 0.33 2200.0 0.33 2200.0
109 0.46 980.0 0.46 980.0
110 0.46 980.0 0.46 980.0
111 0.43 1162.0 0.43 1162.0
112 0.45 3145.0 0.45 3145.0
113 0.46 2117.0 0.46 2117.0
114 0.43 3390.0 0.43 3390.0
115 0.44 4365.0 0.44 4365.0
116 0.52 800.0 0.52 800.0
117 0.48 2250.0 0.48 2250.0
118 0.44 2915.0 0.44 2915.0
119 0.37 2929.0 0.37 2929.0
120 0.47 4229.0 0.47 4229.0
121 0.47 5830.0 0.47 5830.0
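Note that when no method is given, mice() imputes numeric columns by predictive mean matching, which is why every filled T value above equals some observed T value. If deterministic regression predictions are wanted instead, one option is the built-in "norm.predict" method. The sketch below assumes the mice package is installed and uses a made-up data frame (toy is hypothetical, not the article's data).

```r
library(mice)

# Hypothetical toy data, not the article's data set
toy <- data.frame(
  T = c(0.47, 0.45, NA, 0.41, 0.56, NA, 0.62, 0.50, 0.44, 0.69,
        0.38, NA, 0.73, 0.43, 0.35, 0.80, 0.51, 0.46, NA, 0.77),
  P = c(4650, 3720, 4200, 1420, 5300, 2500, 3800, 2190, 3731, 7135,
        2359, 4551, 6817, 4490, 889, 6840, 2381, 4978, 1514, 11542)
)

# method = "norm.predict" requests linear-regression predictions without
# added noise; m = 1 because repeated imputations would not differ here
imp_reg <- mice(toy, method = "norm.predict", m = 1, maxit = 1,
                seed = 3231, printFlag = FALSE)

# Extract the single completed data set
filled <- complete(imp_reg, action = 1)

print(imp_reg$method)
print(filled[is.na(toy$T), ])
```

Because this method adds no random variation, it tends to understate the uncertainty of the filled values, so pooling several imputations with pool() is only meaningful with a stochastic method.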