Common steps in data preprocessing:
1) Remove unique attributes
2) Handle missing values
3) Encode attributes
4) Standardize and regularize the data
5) Feature selection
6) Principal component analysis
(1) Remove unique attributes
You often encounter unique attributes in the data sets you receive. These are typically ID-like attributes, such as an auto-increment primary key stored in the database. Such attributes do not characterize the distribution of the samples themselves, so you can simply delete them.
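As a minimal sketch with pandas (the column names here are made up for illustration), dropping such an ID column looks like this:

```python
import pandas as pd

# Toy dataset: "id" is an auto-increment key that says nothing about the
# sample distribution, so we simply drop it before modeling.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "height": [1.70, 1.65, 1.80],
    "weight": [68, 55, 80],
})
df = df.drop(columns=["id"])
```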
(2) Three ways to handle missing values
1) Use features with missing values directly; 2) delete features with missing values; 3) complete the missing values.
1) Direct use: Some algorithms, such as decision trees, can work directly with features that contain missing values.
2) Feature deletion: The simplest approach is to delete features that contain missing values. This is a poor choice when only a small fraction of the values are missing, since it discards useful information.
3) Missing value completion: Interpolating missing values with the most probable values is the approach most widely used in practical engineering. The most common completion methods are: A) mean interpolation; B) same-class mean interpolation; C) modeling to predict; D) high-dimensional mapping; E) multiple imputation; F) maximum likelihood estimation; G) compressed sensing and matrix completion.
A. Mean interpolation
If the attribute is continuous, missing values are filled with the mean of the attribute's valid values; if the attribute is discrete, missing values are filled with the mode (the most frequent) of the valid values.
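This rule can be sketched with pandas as follows (column names and values are hypothetical):

```python
import pandas as pd

# Toy dataset with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, None, 35.0, 41.0],   # continuous -> fill with the mean
    "city": ["NY", "LA", None, "NY"],  # discrete   -> fill with the mode
})

df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```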
B. Same-class mean interpolation
The samples are first grouped into classes (for example by a label or by clustering), and each missing value is then filled with the mean of the samples in the same class.
C. Modeling predictions
The missing attribute is treated as a prediction target and predicted from the other attributes. This often works well, but it has a fundamental flaw: if the other attributes are unrelated to the missing attribute, the predicted values are meaningless; conversely, if the prediction is very accurate, the missing attribute carries little extra information and may not need to be included in the data set at all. In practice, the situation usually lies somewhere between these two extremes.
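A minimal sketch of this idea with scikit-learn, assuming a toy two-column matrix where the missing entry in the second column is predicted from the first (the data and the choice of a linear model are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Column 1 has a missing entry; columns 0 serves as the predictor.
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, np.nan],
              [4.0, 8.0]])

known = ~np.isnan(X[:, 1])
# Train on the rows where the target attribute is observed...
model = LinearRegression().fit(X[known, :1], X[known, 1])
# ...then predict the attribute for the rows where it is missing.
X[~known, 1] = model.predict(X[~known, :1])
```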
D. High-dimensional mapping
Map the attribute into a high-dimensional space, for example by one-hot encoding a discrete attribute with "value missing" treated as its own category. This is the most precise approach: it completely preserves all the information in the original data, including the fact that a value is missing, and adds no artificial information. For example, the CTR prediction models at Google and Baidu preprocess all variables this way, producing up to hundreds of millions of dimensions. The drawback is equally obvious: the computational cost rises sharply, and the approach only works well when the sample size is very large; otherwise the data become too sparse and the results are poor.
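A small sketch of this mapping with pandas, where the missing value gets its own indicator column instead of being filled in (the column name and values are made up):

```python
import pandas as pd

# A discrete attribute with one missing value.
df = pd.DataFrame({"color": ["red", None, "blue", "red"]})

# One-hot encode it; dummy_na=True adds an extra dimension for "missing",
# so no information about the original data is discarded.
encoded = pd.get_dummies(df["color"], dummy_na=True, prefix="color")
```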
E. Multiple imputation
Multiple imputation treats each value to be filled as a random quantity whose value is derived from the observed data. In practice, we usually estimate the value to be imputed and then add different noise terms to it, producing several candidate imputations and hence several completed copies of the data set.
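The idea can be sketched as follows (using the mean as the point estimate and Gaussian noise scaled by the observed spread; these specific choices are illustrative assumptions, not a full multiple-imputation procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# A column with one missing value.
values = np.array([3.0, 4.0, np.nan, 5.0])
known = values[~np.isnan(values)]

# Point estimate for the missing entry, then several noisy candidates
# around it; each candidate completes one copy of the data set.
estimate = known.mean()
candidates = estimate + rng.normal(0.0, known.std(), size=3)
imputed_sets = [np.where(np.isnan(values), c, values) for c in candidates]
```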
F. Compressed sensing and matrix completion