Feature Normalization
This technique goes by a number of different names, such as Feature Normalization and Feature Scaling (特征缩放).
Data standardization (normalization) is a basic step in data mining. Different evaluation indicators often have different dimensions and units, and this affects the results of data analysis. To eliminate these dimensional effects between indicators, the data must be standardized so that the indicators become comparable. After the raw data has been standardized, all indexes are on the same order of magnitude, which makes them suitable for comprehensive comparative evaluation.
The Significance of Feature Normalization
- It makes the value range of every feature consistent, which is required by algorithms that rely on distance measures (see the sketch after this list)
- It accelerates the convergence of gradient descent
- In SVM, features on a consistent scale speed up the search for support vectors
- Different machine learning algorithms accept different ranges of input values
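To make the first point concrete, here is a minimal sketch (using the four-sample data introduced later in this article) showing how an unscaled feature dominates a Euclidean distance; the variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: feature 1 is ~10^4, feature 2 is ~10^0
x = np.array([[10001, 2], [16020, 4], [12008, 6], [13131, 8]], dtype=float)

# Before scaling, the distance between samples 1 and 2 is driven
# almost entirely by feature 1
d_raw = np.linalg.norm(x[0] - x[1])

# After min-max scaling, both features contribute comparably
x_scaled = MinMaxScaler().fit_transform(x)
d_scaled = np.linalg.norm(x_scaled[0] - x_scaled[1])

print(d_raw)     # ~6019.0 -- feature 2's difference of 2 is invisible
print(d_scaled)  # ~1.05   -- both features now matter
```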
Here are two common normalization methods:
- Min-Max Normalization (linear normalization)
Also known as dispersion normalization, this is a linear transformation of the original data that maps values into the range [0, 1]. The conversion function is as follows:
x^* = \frac{x - \min}{\max - \min}
This method scales the original data proportionally: x^* is the normalized value, x is the original value, max is the maximum of the sample data, and min is the minimum of the sample data.
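As a sketch of the formula itself, a plain NumPy implementation might look like the following (min_max_normalize is a hypothetical helper, not a library API):

```python
import numpy as np

def min_max_normalize(x):
    """Map each column of x linearly into [0, 1]."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)  # per-column minimum
    col_max = x.max(axis=0)  # per-column maximum
    return (x - col_min) / (col_max - col_min)

x = [[10001, 2], [16020, 4], [12008, 6], [13131, 8]]
print(min_max_normalize(x))
# [[0.         0.        ]
#  [1.         0.33333333]
#  [0.33344409 0.66666667]
#  [0.52001994 1.        ]]
```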
Disadvantages:
- When new data is added, max and min may change and must be recomputed.
- It is sensitive to unstable data such as outliers and noise.
Advantages:
- When feature values need to be mapped into a given range [a, b], choose MinMaxScaler.
- Zero-Mean Standardization (Z-score standardization)
This method standardizes the data using the mean and standard deviation of the original data. The original data set is transformed into one with mean 0 and variance 1; the conversion function is:
x^* = \frac{x - \mu}{\sigma}
where μ and σ are the mean and standard deviation of the original data set, respectively. This standardization assumes the original data approximately follows a Gaussian distribution; otherwise the standardized result becomes worse.
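A corresponding NumPy sketch of this formula (z_score_standardize is a hypothetical helper, not a library API):

```python
import numpy as np

def z_score_standardize(x):
    """Shift each column to mean 0 and scale to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=0)    # per-column mean
    sigma = x.std(axis=0)  # per-column population std (ddof=0)
    return (x - mu) / sigma

x = [[10001, 2], [16020, 4], [12008, 6], [13131, 8]]
result = z_score_standardize(x)
print(result.mean(axis=0))  # ~[0, 0]
print(result.std(axis=0))   # [1, 1]
```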
Advantages:
- Suitable when the maximum and minimum values of the data are unknown, or when outliers are present.
Comparison
These are the two most commonly used normalization techniques. In which scenarios is each one preferable? A brief summary:
- In classification and clustering algorithms, when distance is used to measure similarity, or when PCA is used for dimensionality reduction, the second method (Z-score standardization) performs better.
- When distance metrics and covariance calculations are not involved, or the data does not follow a Gaussian distribution, the first method (or another normalization method) can be used. A typical example is image processing: after an RGB image is converted to grayscale, its values are limited to the range [0, 255] (see the sketch below).
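As a small illustration of mapping values into [0, 255], here is a sketch applying the min-max formula to a hypothetical grayscale array:

```python
import numpy as np

# Hypothetical pixel intensities on an arbitrary scale
img = np.array([[0.2, 1.4], [0.9, 2.0]])

# Min-max scale into the 8-bit display range [0, 255]
scaled = (img - img.min()) / (img.max() - img.min()) * 255
print(scaled.astype(np.uint8))
# [[  0 170]
#  [ 99 255]]
```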
Below, Python is used to implement the methods above. As an example, suppose there are 4 samples with the following features:
Sample | Feature 1 | Feature 2
---|---|---
1 | 10001 | 2
2 | 16020 | 4
3 | 12008 | 6
4 | 13131 | 8
Before normalization, feature 1 and feature 2 are clearly not on the same order of magnitude. After normalization, the features become:
Sample | Feature 1 | Feature 2
---|---|---
1 | 0 | 0
2 | 1 | 0.33
3 | 0.33 | 0.67
4 | 0.52 | 1
Min-Max Normalization (linear normalization)
sklearn.preprocessing.MinMaxScaler
In sklearn, sklearn.preprocessing.MinMaxScaler implements this kind of feature normalization. A usage example:
```python
from sklearn.preprocessing import MinMaxScaler

x = [[10001, 2], [16020, 4], [12008, 6], [13131, 8]]
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(x)
X_train_minmax  # the normalized result
array([[ 0.        ,  0.        ],
       [ 1.        ,  0.33333333],
       [ 0.33344409,  0.66666667],
       [ 0.52001994,  1.        ]])
```
By default each feature is normalized to [0, 1], and the target range is adjustable via the MinMaxScaler parameter feature_range. The following code normalizes the features into [-1, 1]:
```python
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_minmax = min_max_scaler.fit_transform(x)
X_train_minmax  # the normalized result
array([[-1.        , -1.        ],
       [ 1.        , -0.33333333],
       [-0.33311181,  0.33333333],
       [ 0.04003987,  1.        ]])
```
The implementation formula of MinMaxScaler is as follows:
```python
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```
This is a vectorized expression in which X is a matrix:
- X_std is X normalized to [0, 1]
- X.min(axis=0) is the column-wise minimum (and X.max(axis=0) the column-wise maximum)
- max and min are the bounds given by the MinMaxScaler parameter feature_range, i.e. the value range of the final result
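The formula can be sanity-checked against MinMaxScaler itself; the following sketch assumes the same four-sample data and a feature_range of (-1, 1):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10001, 2], [16020, 4], [12008, 6], [13131, 8]], dtype=float)
f_min, f_max = -1, 1  # the feature_range bounds (the "min"/"max" in the formula)

# Manual implementation of the documented formula
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (f_max - f_min) + f_min

# Compare against sklearn's own result
scaler = MinMaxScaler(feature_range=(f_min, f_max))
print(np.allclose(X_scaled, scaler.fit_transform(X)))  # True
```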
The following example shows the calculation process (max=1, min=0).
Sample | Feature 1 | Feature 2
---|---|---
1 | 10001 | 2
2 | 16020 | 4
3 | 12008 | 6
4 | 13131 | 8
X.max | 16020 | 8
X.min | 10001 | 2
The normalization process is as follows, assuming that the normalized matrix is s
- s11 = (10001-10001)/(16020-10001) = 0
- s21 = (16020-10001)/(16020-10001) = 1
- s31 = (12008-10001)/(16020-10001) = 0.333444
- s41 = (13131-10001)/(16020-10001) = 0.52002
- s12 = (2-2)/(8-2) = 0
- s22 = (4-2)/(8-2) = 0.3333
- s32 = (6-2)/(8-2) = 0.6667
- s42 = (8-2)/(8-2) = 1
The results are consistent with the output of the MinMaxScaler example above.
StandardScaler: Zero-Mean Standardization
sklearn.preprocessing.StandardScaler
sklearn.preprocessing.robust_scale
In sklearn, sklearn.preprocessing.StandardScaler implements this standardization. A usage example:
```python
from sklearn.preprocessing import StandardScaler

x = [[10001, 2], [16020, 4], [12008, 6], [13131, 8]]
X_scaler = StandardScaler()
X_train = X_scaler.fit_transform(x)
X_train
array([[-1.2817325 , -1.34164079],
       [ 1.48440157, -0.4472136 ],
       [-0.35938143,  0.4472136 ],
       [ 0.15671236,  1.34164079]])
```
After standardization, each column of the matrix has mean 0 and standard deviation 1. Note that the standard deviation here is computed with a delta degrees of freedom (ddof) of 0, i.e. the population standard deviation that divides by n, which differs from the sample standard deviation formula that divides by n-1. (In NumPy, np.std() computes this standard deviation by default.)
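This degrees-of-freedom point can be verified directly: StandardScaler's fitted scale_ attribute matches NumPy's default (ddof=0) standard deviation, as the following sketch shows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10001, 2], [16020, 4], [12008, 6], [13131, 8]], dtype=float)

print(np.std(x, axis=0))          # ddof=0 (divide by n):   [2175.96...  2.236...]
print(np.std(x, axis=0, ddof=1))  # ddof=1 (divide by n-1): larger values

scaler = StandardScaler().fit(x)
print(scaler.scale_)              # matches np.std(x, axis=0)
```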
StandardScaler standardizes by subtracting the column mean from each feature and dividing by the column standard deviation.
The following example shows the calculation process; note that the standard deviation is computed with np.std().
Sample | Feature 1 | Feature 2
---|---|---
1 | 10001 | 2
2 | 16020 | 4
3 | 12008 | 6
4 | 13131 | 8
Column mean | 12790 | 5
Column standard deviation | 2175.96 | 2.236
The normalization process is as follows, assuming that the normalized matrix is s
- s11 = (10001-12790)/2175.96 = -1.28173
- s21 = (16020-12790)/2175.96 = 1.48440
- s31 = (12008-12790)/2175.96 = -0.35938
- s41 = (13131-12790)/2175.96 = 0.15671
- s12 = (2-5)/2.236 = -1.34164
- s22 = (4-5)/2.236 = -0.44721
- s32 = (6-5)/2.236 = 0.44721
- s42 = (8-5)/2.236 = 1.34164
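These hand computations can be checked against the scaler's fitted attributes; a short verification sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10001, 2], [16020, 4], [12008, 6], [13131, 8]], dtype=float)
scaler = StandardScaler().fit(x)

print(scaler.mean_)   # [12790.  5.] -- the column means used above
print(scaler.scale_)  # [2175.96...  2.236...] -- the column stds used above

# Reproducing the hand calculation gives the same matrix as transform()
print(np.allclose((x - scaler.mean_) / scaler.scale_, scaler.transform(x)))  # True
```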
2.1 Some Thoughts on Feature Normalization