Data cleaning is the final pass over a data file to discover and correct identifiable errors, including checking data consistency and handling invalid and missing values. Let's have a look at the common methods of data cleaning.
1. Exploratory analysis
Exploratory analysis is carried out on the whole data set to gain a preliminary understanding of the data and to explore it in light of prior knowledge. In data mining work I mainly use Python's scientific computing libraries for this first pass, checking things such as data types, missing values, data set size, and the distribution of each feature, and I use a third-party plotting library to visualize the basic attributes and distributions of the data. In addition, univariate and multivariate analysis can give a first look at the relationships between the features in the data set and help verify the hypotheses raised in the business analysis stage.
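A minimal pandas sketch of this first pass, assuming the data sits in a CSV file (the file path and column name below are only placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the data set (placeholder path)
df = pd.read_csv("data.csv")

# Basic attributes: size, column types, non-null counts
print(df.shape)
df.info()

# Distribution of each numeric feature
df.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Univariate and multivariate exploration
print(df["some_category"].value_counts())   # placeholder categorical column
print(df.select_dtypes("number").corr())    # pairwise correlations between numeric features
```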
2. Missing values
Missing values in a data set can be located directly with the methods pandas provides. Most data sets contain some missing values, so how they are handled directly affects the final result of the model. The appropriate treatment mainly depends on the importance of the attribute that contains the missing values and on how the missing values are distributed.
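A small sketch of locating missing values directly with pandas (the file path is a placeholder):

```python
import pandas as pd

df = pd.read_csv("data.csv")          # placeholder path

# Count and rate of missing values per column
missing_count = df.isnull().sum()
missing_rate = df.isnull().mean()

report = pd.DataFrame({"count": missing_count, "rate": missing_rate})
print(report.sort_values("rate", ascending=False))
```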
①. When the missing rate is low and the attribute importance is low: if the attribute is numeric, it can be filled simply according to the data distribution. For example, if the data is roughly symmetric, fill with the mean; if the distribution is skewed, fill with the median. If the attribute is categorical, it can be filled with a global constant such as 'Unknown', but this is often ineffective, because the algorithm may treat it as a brand new category, so it is rarely used. (A code sketch covering these fills appears after the modeling method below.)
②. When the missing rate is high (> 95%) and the attribute importance is low, the attribute can simply be dropped. However, when the missing rate is high and the attribute importance is also high, dropping the attribute directly will have a very bad influence on the result of the algorithm.
③. When the missing rate is high and the attribute importance is high, the main methods used are interpolation and modeling.
(1) Interpolation methods mainly include random imputation, multiple imputation, hot-deck imputation, Lagrange interpolation and Newton interpolation.
1> Random imputation: randomly select samples from the observed data to stand in for the missing samples.
2> Multiple imputation: predict the missing data from the relationships between variables, use the Monte Carlo method to generate several complete data sets, analyze each of them, and finally pool the analysis results.
3> Hot-deck imputation: find a sample in the complete data that is similar to the sample containing the missing value (a matching sample), and use its observed value to impute the missing value.
Advantages: simple and easy to use, high accuracy
Disadvantages: when there are many variables, it is usually difficult to find a sample that exactly matches the one to be imputed. However, the data can be stratified by certain variables and mean imputation applied to the missing values within each stratum.
4> Lagrange interpolation and Newton interpolation
(2) Modeling method
Regression, Bayesian methods, random forests, decision trees and other models can be used to predict the missing data. For example, using the other attributes in the data set, a decision tree can be built to predict the values that are missing.
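A minimal sketch of both the simple fills from ① and a decision-tree prediction of a missing numeric column, assuming hypothetical columns "income", "height", "city" and "age" (all placeholders):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("data.csv")                                # placeholder path

# Simple fills (method ①): median for a skewed column, mean for a symmetric one,
# a global constant for a categorical attribute
df["income"] = df["income"].fillna(df["income"].median())
df["height"] = df["height"].fillna(df["height"].mean())
df["city"] = df["city"].fillna("Unknown")

# Modeling method (2): predict a missing numeric column from the other attributes
predictors = ["height", "income"]                           # placeholder predictor columns
known = df[df["age"].notnull()]
unknown = df[df["age"].isnull()]

model = DecisionTreeRegressor(max_depth=5)
model.fit(known[predictors], known["age"])
df.loc[df["age"].isnull(), "age"] = model.predict(unknown[predictors])
```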
Generally speaking, there is no single unified process for handling missing data; the method must be chosen according to the actual data distribution, its skewness, and the proportion of missing values. In data preprocessing, besides simple filling and deletion, model-based filling is often used, mainly because it predicts the unknown values from the known values and tends to be more accurate. However, the modeling method may also inflate the correlation between attributes, which can affect the training of the final model.
3. Outliers
Besides visual analysis (usually with box plots) for spotting outliers, there are many methods with a statistical foundation, and visual inspection is not suitable when the amount of data is large.
3.1 Simple statistical analysis
This step is completed during EDA: the describe method of pandas gives descriptive statistics of the data set, from which you can see whether unreasonable values, i.e. outliers, are present.
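For instance, an impossible value shows up immediately in the min or max row of describe; the column name and valid range below are placeholders:

```python
import pandas as pd

df = pd.read_csv("data.csv")                      # placeholder path

# Descriptive statistics: min/max quickly expose unreasonable values
print(df.describe())

# e.g. an age below 0 or above 150 is clearly not a valid observation
suspicious = df[(df["age"] < 0) | (df["age"] > 150)]
print(suspicious)
```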
3.2 3σ principle--outlier detection based on the normal distribution
If the data follow a normal distribution, then under the 3σ principle an outlier is a value in a set of measurements whose deviation from the mean exceeds three standard deviations. For a normal distribution, the probability of a value falling more than 3σ from the mean is P(|x - μ| > 3σ) <= 0.003, a very-small-probability event. If the data does not follow a normal distribution, outliers can still be described in terms of how many standard deviations they lie from the mean.
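A minimal sketch of the 3σ rule on a single numeric column (the column name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("data.csv")                   # placeholder path
col = df["amount"]                             # placeholder numeric column

mean, std = col.mean(), col.std()
outliers = df[(col - mean).abs() > 3 * std]    # points more than 3 standard deviations from the mean
print(outliers)
```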
3.3 Model-based detection
First fit a model to the data. Anomalies are the objects that the model cannot fit well: if the model is a set of clusters, anomalies are objects that do not clearly belong to any cluster; if a regression model is used, anomalies are objects that lie relatively far from the predicted values.
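As one illustration of the regression case, a sketch that flags points with large residuals; the feature and target columns are placeholders, and the 3σ cutoff on the residuals is an assumption:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                   # placeholder path
X = df[["feature"]]                            # placeholder predictor attribute
y = df["target"]                               # placeholder target attribute

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Objects far from the predicted value are treated as anomalies
threshold = 3 * residuals.std()
anomalies = df[residuals.abs() > threshold]
print(anomalies)
```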
3.4 Based on distance
Define a proximity measure between objects; anomalous objects are those that are far away from most other objects.
Advantages: simple and easy to operate
Disadvantages: the time complexity is O(m^2), so it is not suitable for large data sets; parameter selection is sensitive; and it cannot handle data sets with regions of different density, because it uses a global threshold and cannot account for such variations in density.
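A minimal sketch of a distance-based check using the average distance to the k nearest neighbors; the value of k and the cutoff quantile are arbitrary assumptions:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("data.csv")                       # placeholder path
X = df.select_dtypes("number").dropna()            # numeric features only

k = 5                                              # assumed neighborhood size
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
avg_dist = distances[:, 1:].mean(axis=1)           # average distance to the k neighbors

# Treat the most isolated points (top 1% here, an arbitrary cutoff) as outliers
cutoff = pd.Series(avg_dist).quantile(0.99)
outliers = X[avg_dist > cutoff]
print(outliers)
```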
3.5 Based on density
A point is classified as an outlier when its local density is significantly lower than that of most of its neighbors. This approach suits non-uniformly distributed data.
Advantages: gives a quantitative measure of how much of an outlier an object is, and handles data with regions of different density well.
Disadvantages: time complexity O(m^2); parameter selection is difficult. Although the algorithm addresses this by examining different values of k and taking the maximum outlier score, upper and lower bounds for those values still have to be chosen.
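A sketch using scikit-learn's LocalOutlierFactor as one common density-based implementation; the choice of n_neighbors is an assumption:

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv("data.csv")                   # placeholder path
X = df.select_dtypes("number").dropna()        # numeric features only

lof = LocalOutlierFactor(n_neighbors=20)       # assumed neighborhood size
labels = lof.fit_predict(X)                    # -1 marks points whose local density is abnormally low

outliers = X[labels == -1]
print(outliers)
print(-lof.negative_outlier_factor_[:10])      # LOF scores: larger means more anomalous
```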
3.6 Cluster-based
Cluster-based outliers: an object is a cluster-based outlier if it does not strongly belong to any cluster. Outliers also affect the initial clustering itself: if outliers are detected by clustering, there is a question of whether the structure is valid, because the outliers influence the clusters. To deal with this, the following procedure can be used: cluster the objects, delete the outliers, and then cluster the objects again. (A k-means sketch follows the disadvantages below.)
Advantages:
① Clustering techniques of linear or near-linear complexity (such as k-means) can be used to find outliers very efficiently
② The definition of clusters is usually the complement of that of outliers, so clusters and outliers may be found at the same time
Disadvantages:
③ The set of outliers and their scores may depend heavily on the number of clusters used and on the presence of outliers in the data
④ The quality of the clusters produced by the clustering algorithm has a great influence on the quality of the outliers it produces
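A minimal sketch of this cluster-then-flag idea with k-means, treating points far from their own cluster centroid as outliers; the number of clusters and the cutoff quantile are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("data.csv")                                       # placeholder path
X = df.select_dtypes("number").dropna()                            # numeric features only

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)    # assumed number of clusters
centers = kmeans.cluster_centers_[kmeans.labels_]                  # centroid assigned to each point
dist_to_center = np.linalg.norm(X.to_numpy() - centers, axis=1)

# Points far from their centroid (top 1% here, an arbitrary cutoff) are flagged;
# the data can then be re-clustered without them
cutoff = np.quantile(dist_to_center, 0.99)
outliers = X[dist_to_center > cutoff]
print(outliers)
```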
Ways to deal with outliers:
1> Delete outliers: if a value is obviously abnormal and there are only a few such values, they can be deleted directly.
2> Do not process: if the algorithm is not sensitive to outliers they can be left as they are, but if the algorithm is sensitive to outliers it is best not to choose this option, for example with distance-based algorithms such as k-means and kNN.
3> Mean value substitution: little information loss, simple and efficient.
4> Treat as a missing value: handle it with the missing-value methods described above.
4. Deduplication
For detecting duplicate records, the basic idea is "sort and merge": first sort the records in the data set by some rule, and then detect duplicates by checking whether adjacent records are similar. This actually involves two operations, sorting and computing similarity. In competitions, the duplicated method of pandas is mainly used to find repeats, after which the repeated samples are simply deleted.
The blogs and foreign competition solutions I have seen so far basically handle duplicates by direct deletion; I have not yet seen a more innovative method.
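A minimal pandas sketch of this direct approach; the key columns passed to subset are placeholders:

```python
import pandas as pd

df = pd.read_csv("data.csv")                                         # placeholder path

# Mark rows that repeat an earlier row in every column
print(df.duplicated().sum())

# Drop exact duplicates, or duplicates with respect to a few key columns
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["user_id", "date"], keep="first")    # placeholder key columns
```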
5. Noise processing
Noise is the random error or variance of a measured variable, and it should be distinguished from outliers. In formula terms: Observation = True Data + Noise. An outlier is an observation that differs significantly from most other observations; it may be caused by the true data or by noise. Noise includes erroneous values and values that deviate from expectations, but noise points are not necessarily outliers. Most data mining methods treat outliers as noise or anomalies and discard them; however, in some applications (for example, fraud detection) outlier analysis or anomaly mining is carried out on them specifically. Also, some points are outliers locally but normal from a global perspective.
Noise is mainly handled with the binning method and the regression method:
(1) Binning method:
The binning method smooths ordered data values by examining their "close neighbors". The ordered values are distributed into a number of "buckets" or bins. Because the binning method looks at neighboring values, it performs local smoothing.
- Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
- Smoothing by bin medians: each value in a bin is replaced by the median of the bin.
- Smoothing by bin boundaries: the minimum and maximum values in a bin are taken as the bin boundaries, and each value in the bin is replaced by the nearest boundary value.
Generally speaking, the wider the bins, the more obvious the smoothing effect. Bins can also be of equal width, where the value range of each bin is constant. Binning can also be used as a discretization technique.
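A minimal pandas sketch of smoothing by bin means with equal-width bins; the column name and the number of bins are assumptions:

```python
import pandas as pd

df = pd.read_csv("data.csv")                                   # placeholder path

# Equal-width bins over a noisy numeric column (placeholder name)
bins = pd.cut(df["price"], bins=10)

# Smoothing by bin means: replace each value with the mean of its bin
df["price_smoothed"] = df.groupby(bins)["price"].transform("mean")
```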
(2) Regression method
Data can be smoothed by fitting it with a function. Linear regression finds the "best" line that fits two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression that involves more than two attributes, where the data is fitted to a multidimensional surface. Using regression to find a mathematical equation that fits the data helps eliminate the noise.
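A minimal sketch of regression-based smoothing that replaces a noisy attribute with the fitted values of a linear regression; the column names are placeholders:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                   # placeholder path
X = df[["feature"]]                            # placeholder predictor attribute
y = df["target"]                               # placeholder noisy attribute

model = LinearRegression().fit(X, y)
df["target_smoothed"] = model.predict(X)       # fitted values serve as the smoothed version
```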