Real-world data is often incomplete, noisy, and inconsistent. Data cleaning therefore involves three main tasks: handling missing data, handling noisy data, and handling inconsistent data. This section introduces the main methods used for each.
Missing data processing
Suppose that while analyzing the sales data of a mall, you find that the attribute values in multiple records are empty, for example the customer income attribute. The following methods can be used to handle such missing data.
1) Ignore the record
If a record is missing an attribute value, the whole record is excluded. This is typically done when the class label itself is missing and the mining task involves classification.
Of course, this method is not very effective, especially when the proportion of missing values varies considerably from attribute to attribute.
2) Manually fill in missing values
This method is generally time-consuming, and it is clearly infeasible for large data sets with many missing values.
3) Use default values to fill in missing values
All missing values of an attribute are filled with the same predetermined constant, for example "OK". When an attribute has many missing values, however, this may mislead the mining process.
So although this method is simple, it is not recommended, or the filled data must be analyzed carefully afterwards to avoid large errors in the final mining results.
4) Use the mean to fill in missing values
Compute the mean of the attribute's observed values and use it to fill in every missing value of that attribute. For example, if the average customer income is 10,000 yuan, this value is used to fill in all missing values of the "customer income" attribute.
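As a minimal sketch of mean filling, assuming the data lives in a pandas DataFrame with a hypothetical customer_income column:

```python
import pandas as pd

# Hypothetical sales records; customer_income has missing values (NaN).
df = pd.DataFrame({
    "customer_id":     [1, 2, 3, 4],
    "customer_income": [12000.0, None, 8000.0, None],
})

# Replace every missing value with the mean of the observed values
# (here the mean of 12000 and 8000, i.e. 10000).
df["customer_income"] = df["customer_income"].fillna(df["customer_income"].mean())
```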
5) Use the mean value of the same category to fill in the missing value
This method is especially suitable for classification mining.
For example, if you want to classify the mall's customers by credit risk, you can fill in each missing value of the "customer income" attribute with the mean income of the customers in the same credit risk category (such as "good").
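Continuing the same hypothetical DataFrame, now with a credit_risk column as the class label, the per-category means can be applied with a groupby; this is one sketch, not the only way to express it:

```python
import pandas as pd

df = pd.DataFrame({
    "credit_risk":     ["good", "good", "poor", "poor"],
    "customer_income": [12000.0, None, 6000.0, None],
})

# Fill each missing income with the mean income of the records
# that share the same credit risk category.
df["customer_income"] = (
    df.groupby("credit_risk")["customer_income"]
      .transform(lambda s: s.fillna(s.mean()))
)
```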
6) Use the most probable value to fill in the missing value
Regression analysis, Bayesian inference, or a decision tree can be used to infer the most probable value of a particular attribute for the record.
For example, using the attribute values of the other customers in the data set, a decision tree can be constructed to predict the missing values of the "customer income" attribute.
This last method is the most commonly used. Compared with the other methods, it makes the most of the information contained in the existing data to predict the missing values.
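A sketch of the decision tree variant, assuming scikit-learn is available and that the other attributes (here the invented age and purchase_total columns) are already numeric:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "age":             [25, 40, 35, 50, 30],
    "purchase_total":  [300, 900, 650, 1200, 400],
    "customer_income": [8000.0, 15000.0, None, 20000.0, None],
})

known   = df[df["customer_income"].notna()]
missing = df[df["customer_income"].isna()]

# Train a regression tree on the records whose income is known, then
# predict the most probable income for the records where it is missing.
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(known[["age", "purchase_total"]], known["customer_income"])
df.loc[missing.index, "customer_income"] = tree.predict(
    missing[["age", "purchase_total"]]
)
```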
Noisy data processing
Noise is a random error or variance in a measured variable. The smoothing methods below are illustrated on a numeric attribute, such as price.
1. Bin method
The Bin method smooths a set of sorted data by consulting the values around each data point (its nearest neighbors). The sorted data is distributed into several buckets, called bins.
There are two common ways to divide the bins: the equal-height (equal-frequency) method, in which each bin contains the same number of values, and the equal-width method, in which each bin spans a value interval of the same width (the difference between its left and right boundaries).
First, sort the price data and divide it into equal-height bins, say three values per bin. Each bin can then be smoothed either by its mean or by its boundaries.
When smoothing by means, the values 4, 8, and 15 in the first bin are all replaced with the bin's mean, 9. When smoothing by boundaries, the minimum and maximum of a bin form its boundaries, and every value in the bin is replaced with the nearer boundary value.
In general, the wider each bin, the stronger the smoothing effect.
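A sketch of equal-height binning in plain Python, reusing the first bin 4, 8, 15 mentioned above and extending it with hypothetical prices to fill three bins:

```python
# Hypothetical sorted price data, three values per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
by_mean = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the nearer of
# the bin's minimum and maximum.
by_boundary = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
               for b in bins]

print(by_mean[0])      # [9.0, 9.0, 9.0]  -- the first bin 4, 8, 15
print(by_boundary[0])  # [4, 4, 15]
```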
2. Cluster analysis method
Cluster analysis can help detect abnormal data. Similar or adjacent data objects are grouped together into clusters, and data objects that fall outside all of these clusters can naturally be regarded as abnormal.
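One way to sketch this, assuming scikit-learn, is with DBSCAN, which assigns the label -1 to points that belong to no cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical one-dimensional price readings; 500 sits far from
# every cluster of nearby values.
prices = np.array([[21], [22], [24], [25], [98], [99], [101], [500]])

# DBSCAN groups nearby points into clusters and labels points outside
# every cluster as -1; we treat those as abnormal data.
labels = DBSCAN(eps=5, min_samples=2).fit_predict(prices)
outliers = prices[labels == -1]
print(outliers)  # [[500]]
```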
3. Man-machine combined inspection method
Combining human inspection with machine detection can also help find abnormal data.
For example, information-theory-based methods can help identify abnormal patterns in a handwritten symbol library. The detected abnormal patterns can be output to a list, which a person then reviews, pattern by pattern, to confirm which entries are genuinely useless (true abnormal patterns).
This man-machine approach is far more efficient than inspecting the handwritten symbol library entirely by hand.
4. Regression method
Data can also be smoothed by fitting it to a function.
For example, linear regression (including multiple regression) finds the fitting relationship between variables, so that the value of one variable (or group of variables) can be used to predict the value of another.
The fitted function obtained by regression analysis then smooths the data and removes the noise.
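A minimal sketch of regression smoothing with NumPy, on hypothetical observations that roughly follow y = 2x:

```python
import numpy as np

# Hypothetical noisy observations of price against quantity.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

# Fit a straight line and replace each observation with its fitted
# value, smoothing away the random noise.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept
```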
Many data smoothing methods are also data reduction methods. For example, the Bin method described above reduces the number of distinct values of an attribute, so it can also act as a form of data reduction for logic-based mining methods.
Inconsistent data processing
Real-world databases often contain inconsistent records, and some of these inconsistencies can be corrected manually by consulting external references.
For example, errors made at data entry can usually be corrected by checking against the original documents. There are also routines for correcting inconsistencies introduced by inconsistent coding, and knowledge engineering tools can help detect violations of data constraints.
Finally, because the same attribute is often named differently in different databases, inconsistencies frequently arise during data integration.
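As a small, hypothetical illustration of the naming problem, the sources can be mapped onto one shared schema before integration (the column names here are invented for the example):

```python
import pandas as pd

# Two sources that store the same attribute under different names.
db_a = pd.DataFrame({"customer_id": [1, 2], "income": [8000, 12000]})
db_b = pd.DataFrame({"cust_no":     [3, 4], "income": [9000, 15000]})

# Rename to a shared schema, then integrate the two sources.
db_b = db_b.rename(columns={"cust_no": "customer_id"})
merged = pd.concat([db_a, db_b], ignore_index=True)
```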