First, it is worth spelling out why the data preparation step matters so much.
AI has been in full swing since 2017, and because Python is easy to learn and its machine learning libraries and frameworks are so convenient, the barrier to entry for machine learning and deep learning is now close to zero; everyone seems to be chasing AI (a trend that genuinely annoys those of us who actually care about data and algorithms, but all we can do is shrug). The problem is that most of the attention goes to choosing algorithms and models, while the effect of data preparation on the algorithm and on the model's final results gets ignored. Without good data preparation, the algorithm probably cannot work at full strength, and the model is very likely to be garbage in, garbage out.

In fact, as far as model selection goes, as long as you avoid a theoretical mismatch (say, using regression to handle a classification problem), the final results of different models usually do not differ outrageously. In other words, the choice of model and algorithm mostly determines the upper limit of the result (for example, how high the accuracy can be pushed). But what sets the lower limit? That is the job of data preparation, which you can think of as data preprocessing: if the data is not handled well, how high can that upper limit really be?

And yet quite a few people ignore this critical step. Many of the people around me who are just starting out with algorithms take the Watermelon Book and the Flower Book as their reading list. Deriving models by hand is indeed important, and it really is a required skill for a data/algorithm engineer. But the reason we get to derive models at all is that we already have data backing us up. Without good data, what exactly is a hand-derived model going to accomplish? A real algorithm engineer is responsible for more than deriving algorithms by hand.
Okay, enough ranting; let's get to the point.
Broadly speaking, data preparation can be divided into two parts: data cleansing and feature engineering.
Data cleansing
This is what people usually mean by "washing the data." It is necessary because real-world data comes with all sorts of awkward problems: missing values, noise that needs smoothing, class imbalance, data that needs transforming or normalizing, data that needs reducing, the question of how to split the data into sets, and so on.
Missing data: Common ways to handle this are:
Discard the sample directly: fine when the data volume is very large; when data is scarce, it leaves too little data and can bias the data distribution.
Fill in manually: not worth discussing; who could do that for massive data?
Fill with the mean, median, or mode of the same attribute: a relatively simple method that strikes a reasonable balance between keeping the data complete and losing some of the information in it (see the sketch after this list).
Estimate the missing values with statistical methods: model the attribute from the rest of the data (regression, Bayesian methods, the EM algorithm, etc.) and estimate the most probable value; a sound, scientific approach.
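As a small illustration of the simple fill-in approach, here is a minimal pandas sketch; the DataFrame and its column names ("age", "city") are made up for the example, and a real pipeline would adapt this to its own schema.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [23, np.nan, 31, 45, np.nan, 27],   # toy data with missing values
    "city": ["SH", "BJ", None, "SH", "SH", "BJ"],
})

# Numeric attribute: fill with the column mean (median would work the same way).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill with the mode (the most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)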
Data smoothing: partly about denoising the data, partly about handling outliers. Common methods include:
Regression: a denoising method; fit a model to the data and use the fitted values as the smoothed data.
Binning: a denoising method and a local smoothing technique. Bins can be equal-width (equal value ranges) or equal-frequency (equal numbers of samples); within each bin the data can be smoothed by the bin mean, the bin boundaries, the bin median, and so on (see the sketch after this list).
Data visualization: plot the data to spot the range of the outliers; intuitive, but not very systematic, and hard to do once there are more than three dimensions.
Clustering: points that do not fall inside any cluster are treated as outliers; commonly used.
Mean ± 2 standard deviations: data whose deviation from the mean exceeds twice the standard deviation is treated as an outlier; commonly used.
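A minimal sketch of two of the methods above, binning and the two-standard-deviation rule, on toy normally distributed data (the numbers are arbitrary):

import numpy as np
import pandas as pd

x = pd.Series(np.random.normal(loc=50, scale=10, size=200))

# Equal-width binning: 4 bins covering equal value ranges,
# then smooth each value by its bin mean.
equal_width = pd.cut(x, bins=4)
smoothed = x.groupby(equal_width).transform("mean")

# Equal-frequency binning: 4 bins with (roughly) equal sample counts.
equal_freq = pd.qcut(x, q=4)

# Two-standard-deviation rule: flag points far from the mean as outliers.
outliers = x[(x - x.mean()).abs() > 2 * x.std()]
print(f"{len(outliers)} suspected outliers out of {len(x)}")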
Data imbalance: small data volumes plus skewed positive/negative sample ratios, in various combinations. Common methods include:
Sampling: oversampling (repeatedly sample the minority class; this can overfit, so random perturbations are usually added) and undersampling (discard part of the majority class; this can underfit, hence methods such as EasyEnsemble, which borrows the idea of random forests). Oversampling generally works better.
Weighting: weight the classes, similar in spirit to boosting; usually not as effective as sampling.
Data synthesis: for small data volumes; methods such as SMOTE take a sample and its nearest neighbors in the original data and add random interpolation/perturbation to synthesize new samples (see the sketch after this list).
One-class classification: when the positive/negative ratio is extremely skewed, the problem can be recast as a one-class classification (anomaly detection) problem, where anything that does not belong to the class is treated as an anomaly.
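To make the SMOTE idea concrete, here is a deliberately simplified sketch of SMOTE-style synthesis: for each new point, take a minority sample, pick one of its nearest minority neighbors, and interpolate between them with a random weight. This is not the reference implementation (for real work a library such as imbalanced-learn is the usual choice), and the function name and parameters are invented for the illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """X_min: minority-class samples, shape (n, d). Returns n_new synthetic rows."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: the nearest neighbor is the point itself
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # a random minority sample
        j = idx[i][rng.integers(1, k + 1)]      # one of its k nearest minority neighbors
        lam = rng.random()                      # random interpolation weight in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.rand(20, 3)                   # toy minority-class data
print(smote_like(X_min, n_new=10).shape)        # (10, 3)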
Data transformation: transforming the data into the desired form or range. Common methods include:
Min-max normalization: new value = (original value - min) / (max - min), which maps the data into the [0, 1] interval. The downside is that it is sensitive to outliers; the upside is that it preserves the relationships present in the original data (see the sketch after this list).
Z-score normalization: new value = (original value - mean) / standard deviation, which gives the data zero mean and unit standard deviation. Commonly used, and it puts attributes of different scales on an equal footing in distance calculations.
Decimal scaling normalization: new value = original value / 10^j, where j is the smallest integer such that all scaled absolute values are below 1; in effect, the decimal point is shifted until the values fall into range.
Discretization: essentially the same methods as the binning described under data smoothing.
Aggregation: as the name implies, combine several attributes into a single attribute.
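A small numpy sketch of the three normalizations above on a toy vector (the numbers are arbitrary):

import numpy as np

x = np.array([120.0, -300.0, 45.0, 980.0, 10.0])

# Min-max normalization: map values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, with j chosen so that all
# scaled absolute values fall below 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
x_decimal = x / (10 ** j)

print(x_minmax, x_zscore, x_decimal, sep="\n")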
Data reduction: using various techniques to obtain a smaller data set without losing its integrity. Common methods include:
Numerosity reduction: use regression, clustering, or sampling (random sampling with/without replacement, stratified sampling, etc.) to obtain a smaller but still representative data set.
Discretization and concept hierarchies: discretization as already described under data transformation; a concept hierarchy replaces low-level concepts with higher-level, more abstract ones, which reduces the number of distinct values a feature takes.
Data compression: use encoding transformations to obtain a smaller representation of the data. Lossless compression: the kind of lossless algorithms widely used for audio; lossy compression: wavelet transforms, PCA, and so on.
Dimensionality reduction: check the correlation between features and remove the weakly relevant and redundant ones; correlation can be measured with the correlation coefficient (for continuous features), the chi-square test (for discrete features), and so on (see the sketch below).
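A sketch of two of the reduction ideas above, correlation-based filtering and PCA as lossy compression, on synthetic data; the 0.9 correlation threshold and the 95% variance target are arbitrary choices for the example.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = 0.95 * X[:, 0] + rng.normal(scale=0.1, size=200)  # make feature 3 redundant

# Correlation-based filtering (continuous features): drop one feature of
# every pair whose absolute Pearson correlation exceeds 0.9.
corr = np.abs(np.corrcoef(X, rowvar=False))
to_drop = sorted({j for i in range(corr.shape[0])
                    for j in range(i + 1, corr.shape[1]) if corr[i, j] > 0.9})
X_filtered = np.delete(X, to_drop, axis=1)

# PCA as lossy compression: keep enough components to explain 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

print(X_filtered.shape, X_pca.shape)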
Data splitting: that is, allocating the data into training, validation, and test sets. Common methods include:
Hold-out method: split the data into parts used for training, validation, and testing respectively. Nothing wrong with it, except that it is not very useful when the amount of data is small.
Cross-validation: the well-known approach of splitting the data into k parts (hence "k-fold cross-validation") and cycling through them for training, validation, and testing; because the model is trained and tested several times, the final result is the average (see the sketch below).
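A sketch of both splitting schemes with scikit-learn, on a toy dataset and with an arbitrarily chosen classifier just to have something to score:

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)        # toy labels

# Hold-out: one fixed train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))

# K-fold cross-validation: train and evaluate k times, report the mean score.
scores = []
for tr_idx, te_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[tr_idx], y[tr_idx])
    scores.append(model.score(X[te_idx], y[te_idx]))
print("5-fold mean accuracy:", np.mean(scores))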
Feature Engineering
Feature engineering overlaps with the data cleaning above, for example feature discretization and compressing feature dimensions; the split between the two is not clean-cut, so some similar content will show up in this part as well.
In practice, the thing people care about most in feature engineering is probably dimensionality reduction, so let's start with that.
To be continued tomorrow ...