Missing Values
1. is.na() locates missing values
Note: Missing values are considered non-comparable, even with other missing values. This means that comparison operators cannot be used to detect whether a missing value is present. For example, the logical test myvar == NA will never be TRUE. Instead, you can only use the functions that deal with missing values (such as those described in this section) to identify missing values in R data objects.
2. na.omit() deletes incomplete observations
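A minimal base-R sketch tying the two points above together: comparing against NA never yields TRUE, is.na() locates the gaps, and na.omit() drops the incomplete rows. The vector `v` and data frame `df` are made-up toy data, not taken from the original text.

```r
# Toy data, invented for illustration only
v <- c(1, NA, 3)

v == NA          # NA NA NA: the comparison is never TRUE
is.na(v)         # FALSE TRUE FALSE
which(is.na(v))  # 2: position of the missing value

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
clean <- na.omit(df)  # keeps only the rows with no NA in any column
nrow(clean)           # 1
```

Only the first row of `df` is complete, so it is the only row that survives na.omit().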
manyNAs
manyNAs(data, nORp = 0.2)
Arguments
data
A data frame with the data set.
nORp
A number controlling when a row is considered to have too many NA values (defaults to 0.2, i.e. 20% of the columns). If no rows satisfy the constraint indicated by the user, a warning is generated.
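The proportion rule can be sketched in base R as follows. `rows_with_many_nas` is a hypothetical helper written for this note, not the DMwR implementation; in particular, treating the threshold as a strict "greater than" is an assumption, and the toy data frame is invented.

```r
# Hypothetical helper, NOT the DMwR code: returns the indices of rows whose
# share of NA values exceeds `prop` of the columns (strict ">" is assumed).
rows_with_many_nas <- function(data, prop = 0.2) {
  na_share <- rowMeans(is.na(data))  # fraction of NA cells in each row
  which(na_share > prop)
}

toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9),
                  d = c(1, 2, 3), e = c(4, NA, 6), f = c(1, 1, 1))

bad <- rows_with_many_nas(toy)  # row 2 has 3 of its 6 values missing
kept <- toy[-bad, ]             # only safe when `bad` is non-empty
```

Row 1 has one NA in six columns (about 17%), which stays under the 20% threshold, so only row 2 is flagged. Note that negative indexing with an empty index vector selects no rows at all, so it is worth checking the length of the result before dropping rows.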
Rows are judged to have too many missing values by the proportion of NAs they contain.
3. knnImputation() k-nearest-neighbour filling
library(DMwR)
knnImputation(data, k = 10, scale = T, meth = "weighAvg", distData = NULL)
Arguments
data
A data frame with the data set.
k
The number of nearest neighbours to use (defaults to 10).
scale
Boolean setting whether the data should be scaled before finding the nearest neighbours (defaults to T).
meth
String indicating the method used to calculate the value to fill in each NA. Available values are 'median' or 'weighAvg' (the default).
distData
Optionally you may specify here a data frame containing the data set that should be used to find the neighbours. This is useful when filling in NA values on a test set, where you should use only information from the training set. This defaults to NULL, which means that the neighbours will be searched in data.
Details
This function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value it will search for its k most similar cases and use the values of these cases to fill in the unknowns.
If meth = 'median' the function will use either the median (in case of numeric variables) or the most frequent value (in case of factors) of the neighbours to fill in the NAs. If meth = 'weighAvg' the function will use a weighted average of the values of the neighbours. The weights are given by exp(-dist(k, x)), where dist(k, x) is the Euclidean distance between the case with NAs (x) and the neighbour k.
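To make the weighting concrete, here is a simplified base-R sketch of the weighted-average idea for a single missing value. It is not the DMwR code: it skips scaling, assumes every candidate neighbour is complete, and all names and numbers (`complete`, `target`, `known`) are invented for illustration.

```r
# Simplified illustration of the "weighAvg" idea; NOT the DMwR implementation.
complete <- data.frame(x1 = c(1, 2, 6), x2 = c(1, 3, 5), y = c(10, 20, 30))
target   <- c(x1 = 1, x2 = 2, y = NA)   # y is the value to fill in
k <- 2

known <- c("x1", "x2")                   # columns observed in the target
# Euclidean distance from the target to every complete case
diffs <- sweep(as.matrix(complete[, known]), 2, target[known])
d <- sqrt(rowSums(diffs^2))

nearest <- order(d)[1:k]                 # indices of the k closest cases
w <- exp(-d[nearest])                    # weights decay with distance
filled <- sum(w * complete$y[nearest]) / sum(w)
filled
```

The two closest cases are rows 1 and 2, so the filled value lies between their y values of 10 and 20, pulled towards the nearer row 1.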
Example:
# First load the package and clean the data
library(DMwR)
data(algae)
algae <- algae[-manyNAs(algae), ]
# Assumed step, absent from the original listing: fill the remaining NAs by kNN
clean.algae <- knnImputation(algae, k = 10)
> head(clean.algae)
  season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla   a1
1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0  0.0
2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3  1.4
3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6  3.3
4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4  3.1
5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5  9.2
6 winter small   high 8.25 13.1 65.750  9.248 430.000  18.250  56.667 28.4 15.1
4. centralImputation() central-value imputation
Fills in missing data with a central value of the non-missing samples: the median for numeric columns and the most frequent value for factor columns.
data(algae)
cleanAlgae <- centralImputation(algae)
summary(cleanAlgae)
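For readers without the DMwR package installed, the same idea can be sketched in base R. `central_fill` is a hypothetical helper written for this note and is not the package source; the toy data frame is invented.

```r
# Hypothetical helper, NOT the DMwR source: fill each column's NAs with a
# central value (median for numeric columns, most frequent value otherwise).
central_fill <- function(data) {
  for (col in names(data)) {
    x <- data[[col]]
    if (!anyNA(x)) next                       # nothing to fill in this column
    if (is.numeric(x)) {
      fill <- median(x, na.rm = TRUE)
    } else {
      counts <- table(x)                      # table() ignores NA by default
      fill <- names(counts)[which.max(counts)]
    }
    x[is.na(x)] <- fill
    data[[col]] <- x
  }
  data
}

toy <- data.frame(n = c(1, NA, 3, 10),
                  f = factor(c("a", "b", NA, "b")))
filled <- central_fill(toy)
filled
```

Here the missing number becomes 3 (the median of 1, 3 and 10) and the missing factor value becomes "b" (the most frequent level).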
5. complete.cases() finds the complete cases
x <- airquality[, -1] # x is a regression design matrix
y <- airquality[, 1]  # y is the corresponding response
# Check that complete.cases() gives the opposite of is.na()
stopifnot(complete.cases(y) != is.na(y))
# Logical vector marking the rows where both x and y are non-missing
ok <- complete.cases(x, y)
# Count the samples with missing values
sum(!ok) # how many are not "ok"?
# Keep only the non-missing samples
x <- x[ok, ]
y <- y[ok]
6. na.fail() checks whether there are missing values
DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
na.fail(DF)
Error in na.fail.default(DF) : missing values in object
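Because na.fail() stops with an error, one possible way to use it inside a script is to wrap it in tryCatch(). The sketch below uses only base R; `ok_df`, `bad_df` and `has_na` are illustrative names, not part of the original example.

```r
ok_df  <- data.frame(x = c(1, 2, 3), y = c(0, 10, 5))
bad_df <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

# With no NA present, na.fail() returns its argument unchanged
identical(na.fail(ok_df), ok_df)   # TRUE

# Turn the error into a logical flag instead of stopping the script
has_na <- tryCatch({
  na.fail(bad_df)
  FALSE
}, error = function(e) TRUE)
has_na                             # TRUE
```

This keeps the strict check while letting the caller decide how to react, for example by imputing or dropping rows.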