2.1 Some understandings of feature normalization


Feature normalization goes by a number of different names, such as feature scaling and feature standardization.

Data standardization (normalization) is a basic step in data mining. Different evaluation indicators often have different dimensions and units, and this affects the results of data analysis. To eliminate the dimensional differences between indicators, the data must be standardized so that the indicators become comparable. After the raw data has been standardized, all indicators are on the same order of magnitude, which makes them suitable for comprehensive comparison and evaluation.

The purpose of feature normalization
    • It puts all features on a consistent scale, which algorithms that rely on distance measures require (see the sketch after this list)
    • It speeds up the convergence of gradient descent
    • In the SVM algorithm, consistently scaled features speed up the search for support vectors
    • Different machine learning algorithms accept different ranges of input values
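To make the first point concrete, here is a minimal sketch (using two of the sample points from the example later in this article) of how an unscaled feature dominates the Euclidean distance:

```python
import numpy as np

# Two of the sample points used later in this article.
a = np.array([10001.0, 2.0])
b = np.array([16020.0, 8.0])

# Without scaling, feature 1 (range ~6000) dominates the distance;
# feature 2 (range ~6) contributes almost nothing.
print(np.linalg.norm(a - b))  # ~6019.0

# After min-max scaling both features to [0, 1], each feature
# contributes on a comparable scale.
lo = np.array([10001.0, 2.0])
hi = np.array([16020.0, 8.0])
a_scaled = (a - lo) / (hi - lo)
b_scaled = (b - lo) / (hi - lo)
print(np.linalg.norm(a_scaled - b_scaled))  # ~1.414
```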
Here are two common normalization methods:
    1. Min-max normalization (linear normalization)

Also known as dispersion normalization, this method performs a linear transformation of the original data, mapping the values into [0, 1]. The conversion function is as follows:

x^* = \frac{x - \min}{\max - \min}

This method rescales the original data proportionally. Here x is an original value, x^* is the normalized value, max is the maximum of the sample data, and min is the minimum of the sample data.
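A minimal NumPy sketch of this formula, applied column-wise to a 2-D array (the helper name min_max_normalize is ours, not sklearn's):

```python
import numpy as np

def min_max_normalize(x):
    """Linearly map each column of x onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    return (x - col_min) / (col_max - col_min)
```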

Disadvantages:

    • When new data is added, max and min may change and must be recomputed (illustrated in the sketch after this list)
    • It is unstable when the data contains outliers or heavy noise
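A small sketch of the first disadvantage: a scaler fitted on the original data maps a new, larger sample outside [0, 1] unless max and min are recomputed (the new sample below is invented for illustration):

```python
from sklearn.preprocessing import MinMaxScaler

x = [[10001, 2], [16020, 4], [12008, 6], [13131, 8]]
scaler = MinMaxScaler().fit(x)

# A hypothetical new sample beyond the fitted maximum:
print(scaler.transform([[20000, 10]]))  # [[1.661... 1.333...]] -- outside [0, 1]
```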

Advantages:

    • When feature values need to be mapped into a given range [a, b], choose MinMaxScaler
    2. Zero-mean standardization (z-score standardization)

This method standardizes the data using the mean and standard deviation of the original data. The original data set is transformed into one with mean 0 and variance 1. The conversion function is:

x^* = \frac{x - \mu}{\sigma}

where μ and σ are the mean and standard deviation of the original data set, respectively. This method works best when the original data approximately follows a Gaussian distribution; otherwise the normalization is less effective.
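The same transformation written directly in NumPy, as a column-wise sketch (the helper name z_score_normalize is ours):

```python
import numpy as np

def z_score_normalize(x):
    """Standardize each column of x to mean 0 and (population) std 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)  # np.std defaults to ddof=0
```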

Advantages:

    • Suitable when the maximum and minimum of the data are unknown, or when the data contains outliers.
    3. Comparison

These are two commonly used normalization techniques. In which scenarios does each apply? When is the first method better, and when is the second? Here is a brief summary:

    • In classification and clustering algorithms that use distance to measure similarity, and when using PCA for dimensionality reduction, the second method (z-score standardization) performs better (see the pipeline sketch after this list).
    • The first method, or another normalization method, can be used when distance metrics, covariance calculations, and distributional assumptions are not involved. For example, in image processing, converting an RGB image to grayscale confines the pixel values to the range [0, 255].
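As an illustration of the first point, a sketch of z-score standardization feeding into PCA via a scikit-learn Pipeline (the choice of two components here is arbitrary):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize first so the principal components are not dominated by
# whichever feature happens to have the largest numeric range.
pca_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
# reduced = pca_pipeline.fit_transform(X)  # X: array of shape (n_samples, n_features)
```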
Below, we implement the above in Python.

For example: suppose there are 4 samples with the following features:
Sample | Feature 1 | Feature 2
---|---|---
1 | 10001 | 2
2 | 16020 | 4
3 | 12008 | 6
4 | 13131 | 8

Before normalization, feature 1 and feature 2 are clearly not on the same order of magnitude. After min-max normalization, the features become:

Sample | Feature 1 | Feature 2
---|---|---
1 | 0 | 0
2 | 1 | 0.33
3 | 0.33 | 0.67
4 | 0.52 | 1
Min-max normalization (linear normalization)

sklearn.preprocessing.MinMaxScaler
In sklearn, sklearn.preprocessing.MinMaxScaler is the method for min-max feature normalization. Example usage:

```python
from sklearn.preprocessing import MinMaxScaler

x = [[10001, 2], [16020, 4], [12008, 6], [13131, 8]]
min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(x)
X_train_minmax  # normalized result
# array([[ 0.        ,  0.        ],
#        [ 1.        ,  0.33333333],
#        [ 0.33344409,  0.66666667],
#        [ 0.52001994,  1.        ]])
```

By default, each feature is normalized to [0, 1], but the output range is adjustable via the MinMaxScaler parameter feature_range. The following code normalizes the features to [-1, 1].

```python
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_minmax = min_max_scaler.fit_transform(x)
X_train_minmax  # normalized result
# array([[-1.        , -1.        ],
#        [ 1.        , -0.33333333],
#        [-0.33311181,  0.33333333],
#        [ 0.04003987,  1.        ]])
```

MinMaxScaler is implemented by the following formula:

```python
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```

This is a vectorized expression, where X is a matrix and:

    • X_std is X normalized to [0, 1]
    • X.min(axis=0) is the column-wise minimum (likewise, X.max(axis=0) is the column-wise maximum)
    • max and min come from the MinMaxScaler feature_range parameter, i.e. the range of the final result (see the NumPy sketch after this list)
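Applying this formula directly in NumPy to the sample data reproduces MinMaxScaler's output (here feature_range is the default (0, 1), so max=1 and min=0 and the second line is a no-op):

```python
import numpy as np

X = np.array([[10001, 2], [16020, 4], [12008, 6], [13131, 8]], dtype=float)

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (1 - 0) + 0  # feature_range = (0, 1)
print(X_scaled)
# [[0.         0.        ]
#  [1.         0.33333333]
#  [0.33344409 0.66666667]
#  [0.52001994 1.        ]]
```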

The following example shows the calculation process (max = 1, min = 0):
Sample | Feature 1 | Feature 2
---|---|---
1 | 10001 | 2
2 | 16020 | 4
3 | 12008 | 6
4 | 13131 | 8
X.max | 16020 | 8
X.min | 10001 | 2

The normalization process is as follows, with s denoting the normalized matrix:

    • s11 = (10001-10001)/(16020-10001) = 0
    • s21 = (16020-10001)/(16020-10001) = 1
    • s31 = (12008-10001)/(16020-10001) = 0.333444
    • s41 = (13131-10001)/(16020-10001) = 0.52002
    • s12 = (2-2)/(8-2) = 0
    • s22 = (4-2)/(8-2) = 0.3333
    • s32 = (6-2)/(8-2) = 0.6667
    • s42 = (8-2)/(8-2) = 1

These results match the MinMaxScaler output shown above.

StandardScaler (zero-mean standardization)

sklearn.preprocessing.StandardScaler
sklearn.preprocessing.robust_scale
In sklearn, sklearn.preprocessing.StandardScaler is the method for z-score feature normalization. Example usage:

```python
from sklearn.preprocessing import StandardScaler

x = [[10001, 2], [16020, 4], [12008, 6], [13131, 8]]
X_scaler = StandardScaler()
X_train = X_scaler.fit_transform(x)
X_train
# array([[-1.2817325 , -1.34164079],
#        [ 1.48440157, -0.4472136 ],
#        [-0.35938143,  0.4472136 ],
#        [ 0.15671236,  1.34164079]])
```

After normalization, each column of the matrix has mean 0 and standard deviation 1. Note that the standard deviation here is the population standard deviation (delta degrees of freedom ddof=0), not the sample standard deviation with Bessel's correction (ddof=1). In NumPy, np.std() computes the standard deviation and also defaults to ddof=0; a short check follows.
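A quick check of the ddof point on the sample data: np.std() with its default ddof=0 matches StandardScaler, while ddof=1 (the sample standard deviation) gives different values:

```python
import numpy as np

X = np.array([[10001, 2], [16020, 4], [12008, 6], [13131, 8]], dtype=float)

print(np.std(X, axis=0))           # ddof=0 (population): [2175.96...  2.236...]
print(np.std(X, axis=0, ddof=1))   # ddof=1 (sample):     [2512.5...   2.582...]
```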

StandardScaler normalizes each feature by subtracting the column mean and dividing by the column standard deviation.
The following example shows the calculation process; note that the standard deviation is computed with np.std() (ddof=0).

Sample | Feature 1 | Feature 2
---|---|---
1 | 10001 | 2
2 | 16020 | 4
3 | 12008 | 6
4 | 13131 | 8
Column mean | 12790 | 5
Column standard deviation | 2175.96 | 2.236

The normalization process is as follows, with s denoting the normalized matrix (a NumPy verification follows the list):

    • s11 = (10001-12790)/2175.96 = -1.28173
    • s21 = (16020-12790)/2175.96 = 1.484
    • s31 = (12008-12790)/2175.96 = -0.35938
    • s41 = (13131-12790)/2175.96 = 0.1567
    • s12 = (2-5)/2.236 = -1.342
    • s22 = (4-5)/2.236 = -0.447
    • s32 = (6-5)/2.236 = 0.447
    • s42 = (8-5)/2.236 = 1.3416
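These hand calculations can be verified in one step with NumPy; the output matches the StandardScaler result above:

```python
import numpy as np

X = np.array([[10001, 2], [16020, 4], [12008, 6], [13131, 8]], dtype=float)
S = (X - X.mean(axis=0)) / X.std(axis=0)  # ddof=0, matching StandardScaler
print(S)
# [[-1.2817325  -1.34164079]
#  [ 1.48440157 -0.4472136 ]
#  [-0.35938143  0.4472136 ]
#  [ 0.15671236  1.34164079]]
```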
