This article summarizes the data preprocessing methods commonly used in Python, introduced through the sklearn.preprocessing module.
1. Standardization (mean removal and variance scaling)
After this transformation, each feature dimension has zero mean and unit variance. This is also called Z-score normalization (zero-mean normalization). It is computed by subtracting the mean from each feature value and dividing by the standard deviation.
sklearn.preprocessing.scale(X)
Typically, the training and test sets are standardized together, or a scaler is fitted on the training set and the same transformation is then applied to the test set. For this you can use a Scaler:
scaler = sklearn.preprocessing.StandardScaler().fit(train)
scaler.transform(train)
scaler.transform(test)
A common situation in practice where feature standardization is needed: SVMs.
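As a minimal runnable sketch of the scaler workflow above (the sample matrix X_train is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative training data (any numeric feature matrix works)
X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])

scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)

# each column of the result has zero mean and unit variance
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

The same fitted scaler can then transform a test set with scaler.transform(X_test), reusing the training-set mean and standard deviation.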
2. Min-max normalization
Min-max normalization applies a linear transformation that rescales the original data into the [0, 1] interval (or to other fixed minimum and maximum values).
min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)
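A runnable sketch of min-max scaling (the sample matrix X_train is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# illustrative training data
X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])

min_max_scaler = MinMaxScaler()  # default feature_range=(0, 1)
X_minmax = min_max_scaler.fit_transform(X_train)

# every column now lies in [0, 1]
print(X_minmax)
```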
3. Normalization
Normalization maps values from varying ranges into a common fixed range, often [0, 1]; in this context it means transforming each sample so that it has unit norm.
x = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]
sklearn.preprocessing.normalize(x, norm='l2')
Get:
array([[0.408, -0.408, 0.816], [1, 0, 0], [0, 0.707, -0.707]])
You can verify that for each sample, 0.408^2 + 0.408^2 + 0.816^2 ≈ 1: under the L2 norm, the squared feature values of each transformed sample sum to 1. Similarly, under the L1 norm, the absolute feature values of each transformed sample sum to 1; the max norm divides each feature of a sample by the sample's maximum absolute value.
When measuring the similarity between samples, for example with a quadratic kernel, you need to normalize first.
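A runnable sketch verifying the norms described above, using the same sample matrix:

```python
import numpy as np
from sklearn.preprocessing import normalize

x = [[1, -1, 2], [2, 0, 0], [0, 1, -1]]

x_l2 = normalize(x, norm='l2')
# every row now has unit L2 norm
print(np.linalg.norm(x_l2, axis=1))

# 'l1' and 'max' work the same way
x_l1 = normalize(x, norm='l1')
print(np.abs(x_l1).sum(axis=1))
```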
4. Feature binarization
Converts features to 0/1 given a threshold.
binarizer = sklearn.preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
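A runnable sketch of binarization with the threshold used above (the matrix X is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# illustrative data
X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

binarizer = Binarizer(threshold=1.1)
X_bin = binarizer.transform(X)

# values greater than 1.1 become 1, all others become 0
print(X_bin)
```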
5. Label binarization
lb = sklearn.preprocessing.LabelBinarizer()
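LabelBinarizer turns a list of class labels into a one-vs-all 0/1 matrix; a small sketch (the label values are made up for illustration):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

print(lb.classes_)           # the sorted distinct labels
print(lb.transform([1, 6]))  # one row per input label, one column per class
```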
6. Categorical feature encoding
Sometimes features are categorical, while some algorithms only accept numeric input, so such features need to be encoded.
enc = sklearn.preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray()  # array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])
In the example above, the first feature column takes the two values 0 and 1, so it is encoded with two bits; the second column takes three values and uses three bits; the third takes four values and uses four bits.
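In recent scikit-learn versions, the per-column categories (and hence the bit counts above) can be inspected through the fitted encoder's categories_ attribute:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# one output bit per distinct value per column: 2 + 3 + 4 = 9 columns
print(enc.categories_)
print(enc.transform([[0, 1, 3]]).toarray())
```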
Another way to encode categories is as integer labels, described next.
7. Label encoding
le = sklearn.preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
le.transform([1, 1, 2, 6])  # array([0, 0, 1, 2])
# non-numeric labels can also be converted to numeric codes
le.fit(["Paris", "Paris", "Tokyo", "Amsterdam"])
le.transform(["Tokyo", "Tokyo", "Paris"])  # array([2, 2, 1])
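A runnable sketch of the string example above; the fitted encoder can also map the integer codes back to the original labels with inverse_transform:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Paris", "Paris", "Tokyo", "Amsterdam"])

codes = le.transform(["Tokyo", "Tokyo", "Paris"])
print(codes)                        # integer codes (classes sorted alphabetically)
print(le.inverse_transform(codes))  # back to the original labels
```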
8. When features contain outliers
Use robust scaling, which centers each feature on its median and scales by the interquartile range, so extreme values have less influence:
sklearn.preprocessing.robust_scale
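A runnable sketch of robust scaling (the data, including the deliberate outlier, is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import robust_scale

# illustrative data with one extreme outlier in the first column
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0],
              [100.0, 5.0]])

X_robust = robust_scale(X)

# each column is centered on its median, so the medians map to 0
print(np.median(X_robust, axis=0))
```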
9. Generating polynomial features
This is really part of feature engineering: polynomial features / interaction (cross) features.
poly = sklearn.preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)
For two original features (x1, x2), the degree-2 transformation produces (1, x1, x2, x1^2, x1*x2, x2^2).
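A runnable sketch of the degree-2 expansion (the matrix X is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # [[0 1], [2 3], [4 5]]

poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)

# columns are 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)
```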
Summary
These are the nine data preprocessing methods commonly used in Python. I hope this article helps you learn or use Python; if you have any questions, feel free to leave a comment.