Nine Commonly Used Preprocessing Methods in Python

Source: Internet
Author: User

This article summarizes the data preprocessing methods commonly used in Python, introduced below through the sklearn.preprocessing module.

1. Standardization (mean removal and variance scaling)

After the transformation, each feature dimension has zero mean and unit variance. This is also called z-score normalization (zero-mean normalization). It is computed by subtracting the feature's mean from each feature value and then dividing by the feature's standard deviation.

import sklearn.preprocessing
sklearn.preprocessing.scale(X)
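As a small check of the calculation described above, the sketch below compares a hand-written z-score with the output of scale. The array X is made-up toy data, not taken from the article.

```python
import numpy as np
from sklearn.preprocessing import scale

# Toy data (an assumption for illustration): 3 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Manual z-score: subtract each column's mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0), as scale does.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_scaled = scale(X)

print(np.allclose(X_manual, X_scaled))  # both approaches should agree
print(X_scaled.mean(axis=0))            # close to 0 for every feature
print(X_scaled.std(axis=0))             # close to 1 for every feature
```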

The train and test sets can be standardized together, but the more common practice is to compute the mean and standard deviation on the train set and reuse them to standardize the test set. StandardScaler supports this:

scaler = sklearn.preprocessing.StandardScaler().fit(train)
scaler.transform(train)
scaler.transform(test)
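The fit-on-train, transform-on-test pattern can be sketched as follows. The train and test arrays are invented for illustration; the point is that the test set is scaled with the statistics learned from the train set, so its own mean is generally not zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split, made up for this sketch.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 150.0], [4.0, 400.0]])

# Learn the per-feature mean and standard deviation from the train set only.
scaler = StandardScaler().fit(train)

train_std = scaler.transform(train)
test_std = scaler.transform(test)  # reuses the train statistics

print(scaler.mean_)            # per-feature mean of the train set: [2., 200.]
print(train_std.mean(axis=0))  # close to 0
print(test_std.mean(axis=0))   # generally not 0
```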

In practice, a common case where feature standardization is needed is SVM.

2. Min-max normalization

Min-max normalization applies a linear transformation that maps the original data onto the interval [0, 1] (or onto another fixed minimum and maximum).

min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)
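The linear mapping can be illustrated with a short sketch that compares MinMaxScaler with the formula (x - min) / (max - min) applied per feature. X_train here is toy data chosen for the example, and the second scaler shows the feature_range parameter for a target interval other than [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data (an assumption for illustration).
X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])

min_max_scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_train_minmax = min_max_scaler.fit_transform(X_train)

# The same mapping written out by hand, column by column.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_manual = (X_train - col_min) / (col_max - col_min)

# Mapping onto another fixed interval, here [-1, 1].
X_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_train)

print(np.allclose(X_train_minmax, X_manual))
print(X_train_minmax.min(axis=0), X_train_minmax.max(axis=0))
```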

3. Normalization

In the broad sense, normalization maps values with different ranges onto the same fixed range, often [0, 1].

The normalize function transforms each sample to unit norm.

x = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]
sklearn.preprocessing.normalize(x, norm='l2')

This gives (rounded to three decimals):

array([[0.408, -0.408, 0.816], [1., 0., 0.], [0., 0.707, -0.707]])

For every sample, the squared features sum to 1 after the transformation; for the first sample, 0.408^2 + 0.408^2 + 0.816^2 ≈ 1. This is the L2 norm. Similarly, with the L1 norm the absolute values of each sample's features sum to 1 after the transformation, and with the max norm each feature of a sample is divided by the largest absolute feature value in that sample.
When measuring the similarity between samples with a quadratic form, such as the dot product or another kernel, normalization is needed.
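The three norms described above can be compared on the same data with a brief sketch. It reuses the article's x and only checks the property each norm is supposed to guarantee per sample (per row).

```python
import numpy as np
from sklearn.preprocessing import normalize

x = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

x_l2 = normalize(x, norm="l2")    # squared values of each row sum to 1
x_l1 = normalize(x, norm="l1")    # absolute values of each row sum to 1
x_max = normalize(x, norm="max")  # largest absolute value of each row is 1

print((x_l2 ** 2).sum(axis=1))
print(np.abs(x_l1).sum(axis=1))
print(np.abs(x_max).max(axis=1))
```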

4. Feature binarization

Converts numeric features to 0/1 according to a given threshold.

binarizer = sklearn.preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
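A minimal sketch of thresholding, using toy data that is not from the article: values strictly greater than the threshold become 1 and everything else becomes 0. fit_transform is used here for simplicity; Binarizer has no parameters to learn from the data.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy data (an assumption for illustration).
X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

binarizer = Binarizer(threshold=1.1)
X_bin = binarizer.fit_transform(X)

print(X_bin)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```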

5. Label binarization

lb = sklearn.preprocessing.LabelBinarizer()
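The article gives only the constructor, so the following is a sketch of how LabelBinarizer is typically used, with label values made up for the example: each distinct label gets its own indicator column, and inverse_transform maps the indicator rows back to labels.

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])  # example labels, chosen for illustration

print(lb.classes_)       # distinct labels in sorted order: [1 2 4 6]

Y = lb.transform([1, 6])
print(Y)
# [[1 0 0 0]
#  [0 0 0 1]]

labels = lb.inverse_transform(Y)
print(labels)            # [1 6]
```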

6. Categorical feature encoding

Sometimes features are categorical, while some algorithms require numeric input, so the features need to be encoded.

enc = sklearn.preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray()  # array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])

In the example above, the first feature takes two values, 0 and 1, so it is encoded with two bits. The second feature uses three bits and the third uses four bits.
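The bit counts can be confirmed with a short sketch that inspects the categories the encoder learned for each feature. It reuses the article's data; the categories_ attribute is available in reasonably recent scikit-learn releases, which is an assumption about the installed version.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of learned categories per input feature.
for i, cats in enumerate(enc.categories_):
    print(f"feature {i}: {len(cats)} categories -> {cats}")

encoded = enc.transform([[0, 1, 3]]).toarray()
print(encoded)  # [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
```

The nine output columns are the 2 + 3 + 4 indicator columns laid side by side, one group per input feature.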


7. Label encoding

le = sklearn.preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
le.transform([1, 1, 2, 6])  # array([0, 0, 1, 2])

# Convert non-numeric labels to numeric ones
le.fit(["Paris", "Paris", "Tokyo", "Amsterdam"])
le.transform(["Tokyo", "Tokyo", "Paris"])  # array([2, 2, 1])
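To show where the integer codes come from, the sketch below prints the learned classes and maps the codes back to the original strings. The codes are simply positions in the sorted list of distinct labels.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Paris", "Paris", "Tokyo", "Amsterdam"])

classes = list(le.classes_)
print(classes)  # sorted distinct labels: ['Amsterdam', 'Paris', 'Tokyo']

codes = le.transform(["Tokyo", "Tokyo", "Paris"])
print(codes)    # [2 2 1]

# inverse_transform recovers the original labels from the codes.
back = le.inverse_transform(codes)
print(back)     # ['Tokyo' 'Tokyo' 'Paris']
```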
   

8. Features containing outliers

sklearn.preprocessing.robust_scale
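The article only names the function, so here is a hedged sketch of what it does with its default settings: each feature is centered on its median and divided by its interquartile range, which keeps a single extreme value from dominating the scale. The data, including the outlier 100, is invented for the example, and RobustScaler is shown as the class-based counterpart.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, robust_scale

# One feature with an obvious outlier (toy data, an assumption).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Defaults: subtract the median, divide by the 25th-75th percentile range.
X_robust = robust_scale(X)
print(X_robust.ravel())  # [-1.  -0.5  0.   0.5 48.5]

# The class-based version exposes the learned statistics.
scaler = RobustScaler().fit(X)
print(scaler.center_)    # median: [3.]
print(scaler.scale_)     # interquartile range: [2.]
```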

9. Generating polynomial features

This is really part of feature engineering: polynomial features and interaction (cross) features.

poly = sklearn.preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)

Original features (for example, with two input features): (X1, X2)

After conversion: (1, X1, X2, X1^2, X1*X2, X2^2)
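A numeric sketch of the degree-2 expansion, using a single made-up sample with two features. With the default settings, the output columns are the bias term, the original features, and then all degree-2 products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, X1 = 2 and X2 = 3 (toy values).
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)

# Columns: 1, X1, X2, X1^2, X1*X2, X2^2
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```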

Summary

The above summarizes nine preprocessing methods commonly used in Python. I hope this article is of some help as you learn or use Python; if you have questions, feel free to leave a message.
