Four processing methods for missing data values

I. Causes of missing values

Missing values arise for many reasons, broadly divided into mechanical and human causes. Mechanical causes are failures in data collection or storage, such as a storage fault or equipment damage that prevents data from being collected for some period of a scheduled collection. Human causes are data missing through subjective error, historical limitations, or deliberate concealment; for example, a respondent in a market survey refuses to answer a question or gives an invalid answer, or the data-entry staff miss a record.

II. Types of missing values

By the distribution of the missingness, missing values can be divided into missing completely at random, missing at random, and missing not at random. Missing completely at random (MCAR) means the missingness is purely random and does not depend on any incomplete or complete variable. Missing at random (MAR) means the missingness is not completely random: it depends on other, fully observed variables. Missing not at random (MNAR) means the missingness depends on the incomplete variable itself.
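To make the three mechanisms concrete, here is a minimal sketch (not from the original article) that simulates MCAR, MAR, and MNAR missingness on synthetic data with NumPy; the variables income and age and the missingness rates are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50, 10, n)   # fully observed variable
age = rng.normal(40, 12, n)      # variable that will receive missing values

# MCAR: every value has the same probability of being missing,
# independent of both age and income.
mcar_mask = rng.random(n) < 0.2

# MAR: the probability of age being missing depends only on the
# observed variable income (higher income -> more likely missing).
p_mar = 1 / (1 + np.exp(-(income - 50) / 5))
mar_mask = rng.random(n) < p_mar * 0.4

# MNAR: the probability of age being missing depends on age itself
# (older respondents are more likely to withhold their age).
p_mnar = 1 / (1 + np.exp(-(age - 40) / 5))
mnar_mask = rng.random(n) < p_mnar * 0.4

age_mcar = np.where(mcar_mask, np.nan, age)
age_mar = np.where(mar_mask, np.nan, age)
age_mnar = np.where(mnar_mask, np.nan, age)
print(np.isnan(age_mcar).mean(), np.isnan(age_mar).mean(), np.isnan(age_mnar).mean())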

From the perspective of the attributes involved, if all missing values fall on the same attribute, they are called univariate missing values; if they fall on different attributes, they are called arbitrary missing values. In addition, in time-series data, values may be missing from some point in time onward, which is called monotone missingness.

III. Handling of missing values

Handling missing values generally falls into two approaches: deleting the cases with missing values, or imputing the missing values. For subjective data, the authenticity of the data is already compromised, and the true values of the other attributes of a case with missing values cannot be guaranteed, so imputation based on those attribute values is unreliable; imputation is therefore not recommended for subjective data. Imputation is mainly aimed at objective data, whose reliability can be assured.

1. Deleting cases with missing values

The main options are simple deletion and the weighting method. Simple deletion is the most primitive way to handle missing values: it discards every case that contains a missing value. When the missing data amount to only a small number of samples, simply deleting them achieves the goal and is very effective. When the missingness is not completely at random, the bias can be reduced by weighting the complete data: after the cases with incomplete data are marked, the complete cases are assigned different weights, which can be obtained from a logistic or probit regression. If the variables that determine the weights are among the explanatory variables, this method effectively reduces the bias; if the explanatory variables are unrelated to the weights, it cannot. When several attributes are missing, different weights must be assigned to the different combinations of missing attributes, which greatly increases the computational burden and lowers the accuracy of the estimates, so the weighting method is then not ideal. A sketch of both ideas is given below.
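The following minimal sketch assumes a single fully observed covariate x and a variable y with missing values; the logistic model for the probability of being complete and the inverse-probability weights are one common way to implement the weighting idea described above, not necessarily the exact procedure the article has in mind.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "x": rng.normal(size=n),               # fully observed covariate
    "y": rng.normal(size=n),               # variable that will get missing values
})
df.loc[rng.random(n) < 0.3, "y"] = np.nan   # inject missing values

# Simple deletion: drop every case with any missing value.
complete = df.dropna()

# Weighting: model the probability that a case is complete from the fully
# observed covariate, then weight complete cases by 1 / P(complete).
obs_mask = df["y"].notna().to_numpy()
model = LogisticRegression().fit(df[["x"]], obs_mask.astype(int))
p_complete = model.predict_proba(df[["x"]])[:, 1]
weights = 1.0 / p_complete[obs_mask]

# Weighted mean of y over the complete cases only.
print(len(complete), np.average(complete["y"], weights=weights))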

2. Imputing missing values with plausible values

The idea is that imputing a missing value with its most plausible value loses less information than deleting the whole incomplete sample. In data mining, a large database typically has dozens or even hundreds of attributes; discarding a large number of other attribute values because one attribute value is missing wastes a great deal of information, which is why imputation with plausible values was developed. The following methods are commonly used.

(1) Mean imputation. Data attributes are divided into numeric (interval-scaled) and non-numeric types. If the attribute with missing values is numeric, the missing value is imputed with the mean of the existing values of that attribute. If it is non-numeric, the statistical principle of the mode is applied and the missing value is filled with the mode of that attribute (the value with the highest frequency).
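A minimal pandas sketch of this rule, with purely illustrative column names: the numeric attribute is filled with its mean and the categorical attribute with its mode.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [1.70, 1.65, np.nan, 1.80, 1.75],   # numeric (interval-scaled) attribute
    "city":   ["A", "B", "A", None, "A"],          # categorical attribute
})

# Numeric attribute: fill with the mean of the observed values.
df["height"] = df["height"].fillna(df["height"].mean())

# Categorical attribute: fill with the mode (most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)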

(2) Imputation with the mean of similar cases. Like mean imputation, this is a single-value imputation method; the difference is that it uses a hierarchical clustering model to predict which class the case with the missing value belongs to, and then imputes with the mean of that class. Suppose X = (X1, X2, ..., Xp) are fully observed variables and Y is a variable with missing values. First, cluster the cases on X or a subset of it; then impute the missing cases with the mean of Y within their class. If the explanatory variables introduced here and Y are analysed together in later statistical work, this imputation introduces autocorrelation into the model and hampers the analysis.
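A minimal sketch of the idea, assuming scikit-learn's agglomerative (hierarchical) clustering as the clustering model and illustrative variable names: cases are clustered on the fully observed X, and each missing Y is replaced by the mean of Y within its cluster.

import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))                       # fully observed variables
y = X[:, 0] * 2 + rng.normal(scale=0.5, size=n)   # variable with missing values
y[rng.random(n) < 0.25] = np.nan

# Hierarchical clustering on the complete variables only.
labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

# Within each cluster, impute missing y with the cluster mean of the observed y.
df = pd.DataFrame({"y": y, "cluster": labels})
df["y"] = df.groupby("cluster")["y"].transform(lambda s: s.fillna(s.mean()))
print(df["y"].isna().sum())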

(3) Maximum likelihood estimation (ML). If the missingness is at random, then, assuming the model is correctly specified for the complete data, the unknown parameters can be estimated by maximum likelihood from the marginal distribution of the observed data (Little and Rubin). This approach is also called maximum likelihood estimation that ignores the missing values. In practice, the maximum likelihood estimates are usually computed with the expectation-maximization (EM) algorithm. The method is more attractive than case deletion and single-value imputation, but it has an important premise: it is suited to large samples, where the number of valid cases is sufficient for the ML estimates to be asymptotically unbiased and asymptotically normal. However, it can get stuck in local maxima, its convergence is not always fast, and the computation is quite involved.
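As an illustration of how EM can be applied here, the sketch below (an assumption of this rewrite, not code from the article) estimates the mean and covariance of a multivariate normal from incomplete data and then fills each missing entry with its conditional expectation under the fitted model.

import numpy as np

def em_mvn(X, n_iter=50):
    """EM for a multivariate normal with missing entries (NaN)."""
    X = X.copy()
    n, p = X.shape
    miss = np.isnan(X)
    # Initialise from column means and the covariance of mean-filled data.
    mu = np.nanmean(X, axis=0)
    X_fill = np.where(miss, mu, X)
    sigma = np.cov(X_fill, rowvar=False)

    for _ in range(n_iter):
        cond_cov_sum = np.zeros((p, p))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            if not o.any():
                # All entries missing: conditional moments are the marginals.
                X_fill[i, m] = mu[m]
                cond_cov_sum += sigma
                continue
            # E-step: conditional mean and covariance of the missing block.
            soo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
            smo = sigma[np.ix_(m, o)]
            X_fill[i, m] = mu[m] + smo @ soo_inv @ (X_fill[i, o] - mu[o])
            full = np.zeros((p, p))
            full[np.ix_(m, m)] = sigma[np.ix_(m, m)] - smo @ soo_inv @ smo.T
            cond_cov_sum += full
        # M-step: update the mean and covariance from the completed data.
        mu = X_fill.mean(axis=0)
        diff = X_fill - mu
        sigma = (diff.T @ diff + cond_cov_sum) / n
    return mu, sigma, X_fill

rng = np.random.default_rng(3)
data = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=400)
data[rng.random(data.shape) < 0.2] = np.nan
mu_hat, sigma_hat, completed = em_mvn(data)
print(mu_hat, sigma_hat, sep="\n")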

(4) Multiple imputation (MI). The idea of multiple imputation comes from Bayesian estimation: the value to be imputed is regarded as random, and its value is drawn from the observed data. In practice, one usually estimates the values to be imputed and adds different noise to form several candidate sets of imputed values, then chooses the most suitable ones according to some selection criterion.

Multiple imputation proceeds in three steps: ① for each missing value, generate a set of possible imputed values that reflects the uncertainty of the non-response model; each of these values is used to fill in the missing values, producing several complete data sets. ② Analyse each imputed data set with the statistical method that would be used on complete data. ③ Combine or select the results from the imputed data sets according to a scoring function to produce the final values. A sketch of this workflow is given below.
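A minimal sketch of the three-step workflow, using scikit-learn's IterativeImputer with posterior sampling as the imputation engine (an assumption of this rewrite): m completed data sets are created, each is analysed with an ordinary regression, and the estimates are pooled by simple averaging.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)
y = 2 * x1 - x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2, y])
X[rng.random(X.shape) < 0.15] = np.nan        # inject missing values

m = 5                                          # number of imputed data sets
coefs = []
for seed in range(m):
    # Step 1: generate one plausible completed data set; posterior sampling
    # adds the noise that reflects imputation uncertainty.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(X)
    # Step 2: run the ordinary complete-data analysis on each data set.
    model = LinearRegression().fit(completed[:, :2], completed[:, 2])
    coefs.append(model.coef_)

# Step 3: pool the m analyses (here, simply average the estimates).
print(np.mean(coefs, axis=0))

In full Rubin-style multiple imputation, the between-imputation variance would also be combined with the within-imputation variance so that the pooled result reflects the imputation uncertainty.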

Suppose a data set contains three variables Y1, Y2, and Y3 whose joint distribution is normal, and that it splits into three groups: group A has complete data, group B is missing only Y3, and group C is missing Y1 and Y2. Under multiple imputation, group A is left untouched; for group B a set of estimates of Y3 is generated (from the regression of Y3 on Y1 and Y2), and for group C a set of estimates of Y1 and Y2 is generated (from the regression of Y1 and Y2 on Y3).

When multiple imputation is carried out, group A is not processed, while for groups B and C, M sets of imputed values are drawn from the complete cases (M being the chosen number of imputations); it is only necessary to estimate the distribution of the attributes with missing values from the effectively observed cases. Then, based on the M sets of observations, M sets of parameter estimates and the corresponding predictions are produced from the M completed samples. The estimation method at this stage is maximum likelihood, and the algorithm used in practice is expectation maximization (EM). In this way a set of Y3 values is estimated for group B and a set of (Y1, Y2) values for group C, on the premise that the joint distribution of Y1, Y2, and Y3 is normal.

In the preceding example, the joint distribution of Y1, Y2, and Y3 was assumed to be normal. This assumption is artificial, but it has been verified (Graham and Schafer, 1999) that even for variables whose joint distribution is not normal, estimates obtained under this assumption can still come close to the true values.

Multiple imputation shares the same idea as Bayesian estimation, but it remedies several shortcomings of Bayesian estimation.

(1) Bayesian estimation relies on the maximum likelihood method, which requires the model form to be correctly specified; if the parametric form is wrong, the conclusions are wrong, that is, the prior distribution affects the accuracy of the posterior distribution. Multiple imputation rests on large-sample theory for approximately complete data, and since the data volume in data mining is large, the prior distribution has only a very small influence on the results.

(2) Bayesian estimation only requires the prior distribution of the unknown parameters to be known and does not exploit the relationships between parameters, whereas multiple imputation estimates the joint distribution of the parameters and makes use of those relationships.

All four imputation methods above work well when the values are missing at random. The two mean-based imputation methods are the easiest to implement and were often used in the past, but they distort the sample considerably; in particular, when imputed values are later used as explanatory variables in a regression, the parameter estimates deviate markedly from their true values. By comparison, maximum likelihood estimation and multiple imputation are the better methods; relative to multiple imputation, maximum likelihood lacks the uncertainty component, so more and more practitioners prefer multiple imputation.

IV. Summary

Imputation only replaces unknown values with our subjective estimates, which do not necessarily match the objective facts. The analysis above is theoretical: because the missing values themselves cannot be observed, their missingness type cannot be known, and the imputation effect of a given method cannot be evaluated against them. Moreover, these methods are generic and used across many fields, so the imputation quality for any particular domain may not be ideal. Precisely for this reason, many professional data-mining practitioners, drawing on their understanding of the industry, may impute missing values by hand with better results than these methods. In the data-mining process, imputing missing values avoids discarding a large amount of information, and manual intervention need not disturb the relationships between variables; nevertheless, in completing the incomplete information we have, to a greater or lesser extent, altered the original data and information system, which has a potential impact on later analysis. Missing values must therefore be handled with care.
