In data mining and machine learning, most of the effort in practice goes not into the algorithm but into the data; after all, the algorithms are usually off-the-shelf, with very little room for change.
The purpose of data preprocessing is to organize the data into a standard form.
1. Normalization
Normalization is usually done in one of two ways.
A. The simplest normalization: min-max mapping
p_new = (p - MI) / (MA - MI)
where p is the original value, MI is the minimum of the attribute, and MA is its maximum. After this transformation, all values fall in the range 0 to 1.
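A minimal sketch of this min-max mapping in Python (the numbers are made up for illustration):

import numpy as np

def min_max_normalize(p):
    # Map one attribute into the 0-1 range: (p - MI) / (MA - MI)
    p = np.asarray(p, dtype=float)
    mi, ma = p.min(), p.max()
    return (p - mi) / (ma - mi)  # assumes ma > mi, i.e. the attribute is not constant

ages = np.array([18, 25, 40, 60, 90])
print(min_max_normalize(ages))  # roughly [0.0, 0.10, 0.31, 0.58, 1.0]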
B. Standard deviation standardization (z-score)
p_new = (p - avg(p)) / sd(p)
where avg(p) is the mean of the variable and sd(p) is its standard deviation.
One benefit of this approach is that a value which lands far from zero after standardization is clearly something bizarre; you can treat it as an outlier and reject it directly.
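A minimal sketch of the standardization and the outlier-rejection idea in Python (the 3-standard-deviation cutoff and the planted error are illustrative assumptions, not part of the original text):

import numpy as np

def z_score_standardize(p):
    # (p - avg(p)) / sd(p): zero mean, unit standard deviation
    p = np.asarray(p, dtype=float)
    return (p - p.mean()) / p.std()

rng = np.random.default_rng(0)
values = np.append(rng.normal(10, 2, size=100), 250.0)  # 250 plays the role of a gross error
z = z_score_standardize(values)
print(values[np.abs(z) > 3])  # values far from the mean can be rejected as outliers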
2. Discretization
If a variable is continuous, it is sometimes awkward to handle directly; age is a typical example. It can be more meaningful to discretize the numbers into groups such as child, teenager, youth, and so on.
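A minimal sketch of such binning with pandas.cut; the bin edges and labels are arbitrary choices for illustration:

import pandas as pd

ages = pd.Series([3, 12, 17, 25, 34, 61])
groups = pd.cut(ages,
                bins=[0, 12, 18, 40, 120],              # cut points chosen by hand
                labels=["child", "teenager", "youth", "older"])
print(groups)  # each age is replaced by its category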
3. Missing value problem
The first thing to consider is how many values are missing. If an attribute is missing too many, it is better to delete the attribute outright; if the amount is within an acceptable range, fill the gaps with the mean, the maximum, or another suitable scheme.
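A minimal sketch of these two cases, assuming a pandas DataFrame with hypothetical column names and a hand-picked missing-rate cutoff:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [3000, np.nan, 4500, 5200, np.nan],
                   "notes":  [np.nan, np.nan, np.nan, np.nan, "ok"]})
# Drop attributes that are mostly missing (the 50% cutoff is an arbitrary choice)
df = df.drop(columns=[c for c in df.columns if df[c].isna().mean() > 0.5])
# Fill the rest with the mean (the maximum or another scheme would work the same way)
df["income"] = df["income"].fillna(df["income"].mean())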
There is also the option of building a model (Method 1) on the records that are not missing, using it to predict the missing values, and then building the final model (Method 2) on the completed data. Of course, this raises problems of its own, such as the accuracy of Method 1, and the information redundancy generated when Method 1 and Method 2 use the same technique.
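One possible sketch of this model-based idea, using a linear regression from scikit-learn in the role of Method 1 (the column names and the choice of regressor are assumptions for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age":    [23, 31, 45, 52, 38],
                   "income": [3000, 4200, np.nan, 7100, np.nan]})
known = df[df["income"].notna()]
missing = df[df["income"].isna()]
# "Method 1": fit on the records that are not missing ...
model = LinearRegression().fit(known[["age"]], known["income"])
# ... predict the gaps, then hand the completed data to "Method 2" for the final model
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])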
4. Abnormal data points
Real data sets often contain abnormal data, which may come from typing errors or from interference during acquisition. The two most commonly used methods for eliminating abnormal data are the following.
In the first, a point is considered an anomaly when the distance to its nearest neighbor is greater than a threshold. In the second, a point is considered an anomaly when fewer than a certain number of data points lie within a fixed distance of it.
The former is based on distance, the latter on density. Of course, you can also combine the two, specifying both the distance and the number of points, which is called COF.
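A rough sketch of the two rules with scikit-learn's nearest-neighbor utilities; the thresholds (distance 3.0, radius 1.0, minimum of 5 neighbors) are arbitrary and would need tuning on real data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])  # one planted anomaly

nn = NearestNeighbors(n_neighbors=2).fit(X)
dist, _ = nn.kneighbors(X)            # column 0 is the point itself, column 1 its nearest neighbor
distance_outliers = dist[:, 1] > 3.0  # distance-based rule: nearest neighbor too far away

neighbors = nn.radius_neighbors(X, radius=1.0, return_distance=False)
density_outliers = np.array([len(idx) - 1 < 5 for idx in neighbors])  # density-based rule: too few points nearby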
5. Data screening
After preprocessing, the dimensionality of the data is sometimes very high, and for reasons of cost we need dimensionality reduction or feature selection. Sometimes dimensionality reduction and feature selection also improve accuracy.
Dimensionality reduction usually uses PCA, principal component analysis. Intuitively, it forms linear combinations of several variables to produce new variables. Feature selection is simpler: it keeps the features that are strongly correlated with the target.
PCA involves the singular value decomposition of a matrix; the specific mathematical principles will not be expanded here.
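A minimal sketch of both options with scikit-learn, on made-up data (the number of components and the number of features to keep are arbitrary choices):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                             # 10 original variables
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

X_pca = PCA(n_components=3).fit_transform(X)               # new variables: linear combinations of the old ones
X_sel = SelectKBest(f_classif, k=3).fit_transform(X, y)    # keep the features most related to the target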