In machine learning tasks, data is often preprocessed, for example by rescaling, standardization, binarization, or normalization. Which method is most effective depends on the distribution of the data and on the algorithm being used. Different algorithms make different assumptions about the data and may require different transformations; sometimes no transformation at all gives comparatively better results. It is therefore recommended to try a variety of data transformation methods with several different algorithms, train and test each combination, and choose the transformation and algorithm that perform relatively well.
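For example, one practical way to act on this advice is to cross-validate a few algorithms on several transformed versions of the data and compare scores. The sketch below is a minimal illustration of that idea and is not part of the original article; the choice of transforms, models, and 5-fold cross-validation is arbitrary, and it uses the same Iris dataset introduced later in this article.

# A minimal sketch (illustrative, not from the original article):
# compare several preprocessing methods across a couple of algorithms.
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

transforms = {"raw": None, "minmax": MinMaxScaler(),
              "standard": StandardScaler(), "l2-norm": Normalizer()}
models = {"logreg": LogisticRegression(max_iter=1000),
          "knn": KNeighborsClassifier()}

for t_name, transform in transforms.items():
    for m_name, model in models.items():
        # Keep the transform inside a pipeline so it is fit only on training folds
        estimator = model if transform is None else make_pipeline(transform, model)
        score = cross_val_score(estimator, X, y, cv=5).mean()
        print(f"{t_name:>8s} + {m_name:<6s}: {score:.3f}")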
The following introduces the preprocessing workflow using the Python Scikit-learn library (also known as the Sklearn library):
1. Load the dataset; 2. Split the dataset into the input variables and output variables used for machine learning; 3. Transform (or preprocess) the input variables; 4. Display the transformed result (optional).
This article uses the Iris dataset (Iris Plants Database) from the Scikit-learn library as an example.
First, load the dataset to get the input variable X and the output variable y. The sample code is as follows:
from sklearn import datasets
import numpy as np

data = datasets.load_iris()
X, y = data.data, data.target
np.set_printoptions(precision=3)

print("\nPreprocess input variables:\n")
print("Raw data:")
print(X[:5, :])
Then, transform the input variable X (of type numpy.ndarray). The specific transformations are as follows:
Scale transformation
Rescales an input variable into a given range, such as the interval from 0 to 1. In the Sklearn library, this is implemented by the MinMaxScaler class. It is commonly used with gradient-descent-based optimization algorithms, with algorithms that weight their inputs such as regression and neural networks, and with distance-based algorithms such as K-nearest neighbors. The sample code is as follows:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# Print the transformed data
print("Rescaled data (0 to 1):")
print(rescaledX[0:5, :])
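For intuition (this check is not part of the original article), MinMaxScaler with feature_range=(0, 1) is equivalent to the column-wise formula (x - min) / (max - min). The short sketch below verifies that on the X and rescaledX computed above.

import numpy as np

# Column-wise min-max formula; should match MinMaxScaler's output above
manualX = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(manualX, rescaledX))  # expected: True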
Standardization
Standardization usually applies to input variables that follow a Gaussian distribution. Specifically, each attribute value of the input variable has its mean subtracted and is then divided by its standard deviation, so that the attribute follows a standard normal distribution. In the Sklearn library, this is implemented by the StandardScaler class. It is often used for linear regression, logistic regression, and linear discriminant analysis, which assume that the input variables are Gaussian.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)
standardizedX = scaler.transform(X)
# Print the transformed data
print("Standardized data:")
print(standardizedX[0:5, :])
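As a sanity check (an illustration added here, not from the original article), the same result can be reproduced with the column-wise formula (x - mean) / std, since StandardScaler uses the population standard deviation:

import numpy as np

# Column-wise standardization formula; should match StandardScaler's output above
manualX = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(manualX, standardizedX))  # expected: True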
Normalization
Rescales each sample of the input variable so that it has unit norm. Common norms are L1 and L2; see my previous post "Data Normalization": Principle and Implementation (Python Sklearn). In the Sklearn library, this is implemented by the Normalizer class. It is often useful for sparse datasets with many zeros, for algorithms that weight their inputs such as neural networks, and for distance-based algorithms such as K-nearest neighbors.
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# Print the transformed data
print("Normalized data:")
print(normalizedX[0:5, :])
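To make the unit-norm idea concrete (a small check added here for illustration, not part of the original article), each row of normalizedX should have an L2 norm of 1, since Normalizer defaults to norm='l2':

import numpy as np

# Every sample (row) should now have unit L2 length
print(np.allclose(np.linalg.norm(normalizedX, axis=1), 1.0))  # expected: True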
Binarization
Uses a threshold value to binarize the input data: when an input value is greater than the threshold it is transformed to 1, and when it is less than or equal to the threshold it is transformed to 0. In the Sklearn library, this is implemented by the Binarizer class. It is often used in feature engineering, for example to turn probabilities into crisp values or to create new, meaningful features.
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# Print the transformed data
print("Binarized data:")
print(binaryX[0:5, :])
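As a quick check (added here for illustration, not part of the original article), the output should equal the element-wise rule X > threshold; note that all attribute values in the Iris data are positive, so with threshold=0.0 every entry becomes 1:

import numpy as np

# Element-wise threshold rule; should match Binarizer's output above
print(np.array_equal(binaryX, (X > 0.0).astype(X.dtype)))  # expected: True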
Resources
Jason Brownlee. How to Prepare Your Data for machine learning in Python with Scikit-learn.
https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/