Data preprocessing (Python Scikit-learn)

Source: Internet
Author: User

In machine learning tasks, data is usually preprocessed before training, for example by rescaling, standardization, normalization, or binarization. Which method works best depends on the distribution of the data and on the algorithm used. Different algorithms make different assumptions about the data and may require different transformations; sometimes untransformed data gives relatively better results. It is therefore advisable to try several transformation methods with several algorithms, and then choose the combination of transformation and algorithm that performs relatively well.
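The advice above can be sketched as a small experiment. The example below is only an illustration under assumptions that are not part of the original text: it uses the Iris dataset, two arbitrarily chosen classifiers (logistic regression and k-nearest neighbors), and mean 5-fold cross-validation accuracy as the comparison metric.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate transformations; None means "use the raw data"
transforms = {
    "none": None,
    "min-max": MinMaxScaler(),
    "standardize": StandardScaler(),
    "normalize": Normalizer(),
}

# Candidate algorithms (an arbitrary choice for illustration)
models = {
    "logistic regression": lambda: LogisticRegression(max_iter=1000),
    "k-nearest neighbors": lambda: KNeighborsClassifier(),
}

results = {}
for t_name, transform in transforms.items():
    for m_name, make_model in models.items():
        model = make_model()
        # Inside a pipeline, the transform is refit on each training fold
        estimator = model if transform is None else make_pipeline(transform, model)
        scores = cross_val_score(estimator, X, y, cv=5)
        results[(t_name, m_name)] = scores.mean()

# List the combinations from best to worst mean accuracy
for (t_name, m_name), score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{t_name:12s} + {m_name:20s}: {score:.3f}")
```

On a dataset this small the differences between combinations may be slight; the point is the procedure of comparing them, not the specific scores.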

The preprocessing workflow with the Python Scikit-learn library (also known as sklearn) is as follows:

1. Load the dataset.
2. Split the dataset into input variables and output variables for machine learning.
3. Transform (preprocess) the input variables.
4. Display the transformed result (optional).

This article uses the Iris dataset (Iris Plants Database) bundled with the Scikit-learn library as an example.

First, load the dataset and obtain the input variable X and the output variable y. The sample code is as follows:

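    from sklearn import datasets
    import numpy as np

    # Load the Iris dataset and split it into inputs X and outputs y
    data = datasets.load_iris()
    X, y = data.data, data.target

    # Print floating-point values with three decimal places
    np.set_printoptions(precision=3)
    print("\nPreprocess input variables:\n")
    print("Raw data:")
    print(X[:5, :])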

Then transform the input variable X (of type numpy.ndarray). The specific transformations are as follows:

Scale transformation

Transforms the input variables into a given range, such as the interval from 0 to 1. In the sklearn library, this is implemented by the MinMaxScaler class. It is commonly useful for optimization algorithms such as gradient descent, for algorithms that weight their inputs such as regression and neural networks, and for algorithms that use distance measures such as k-nearest neighbors. The sample code is as follows:

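    from sklearn.preprocessing import MinMaxScaler

    # Rescale every column of X to the range [0, 1]
    scaler = MinMaxScaler(feature_range=(0, 1))
    rescaledX = scaler.fit_transform(X)

    # Print the transformed data
    print("")
    print(rescaledX[0:5, :])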
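For reference, scaling to the 0 to 1 interval corresponds to computing (x - min) / (max - min) for each column separately. The sketch below checks this by hand against MinMaxScaler on the Iris data; the variable names are illustrative and not taken from the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Column-wise minimum and maximum of the raw data
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Manual min-max scaling to the [0, 1] interval
manual = (X - col_min) / (col_max - col_min)

# The same transformation using the library class
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Check that both results agree, then show the new column ranges
print(np.allclose(manual, rescaled))
print(rescaled.min(axis=0), rescaled.max(axis=0))
```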

Standardization

Usually applies to input variables with a Gaussian distribution. Specifically, the mean of each attribute is subtracted from its values and the result is divided by the attribute's standard deviation, so that each attribute ends up with a mean of 0 and a standard deviation of 1. In the sklearn library, this is implemented by the StandardScaler class. It is often used with techniques that assume Gaussian-distributed input variables, such as linear regression, logistic regression, and linear discriminant analysis.

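    from sklearn.preprocessing import StandardScaler

    # Learn the per-column mean and standard deviation, then transform X
    scaler = StandardScaler().fit(X)
    standardizedX = scaler.transform(X)

    # Print the transformed data
    print("")
    print(standardizedX[0:5, :])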
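The subtract-the-mean, divide-by-the-standard-deviation step can also be verified by hand. The following is a minimal sketch on the Iris data with illustrative variable names; it relies on the fact that StandardScaler uses the population standard deviation, which matches NumPy's default (ddof=0).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Column-wise mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)

# Manual standardization
manual = (X - mean) / std

# The same transformation using the library class
standardized = StandardScaler().fit_transform(X)

# Check that both results agree, then show the new column statistics
print(np.allclose(manual, standardized))
print(standardized.mean(axis=0).round(3), standardized.std(axis=0).round(3))
```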

Normalization

Transforms each sample (row) of the input variables so that it has unit norm length. The commonly used norms are L1 and L2; see my previous post on the principle and implementation of data normalization (Python sklearn). In the sklearn library, this is implemented by the Normalizer class. It is often used for sparse datasets with many zeros, with algorithms that weight their inputs such as neural networks, and with algorithms that use distance measures such as k-nearest neighbors.

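    from sklearn.preprocessing import Normalizer

    # Rescale each row of X to unit norm (L2 by default)
    scaler = Normalizer().fit(X)
    normalizedX = scaler.transform(X)

    # Print the transformed data
    print("")
    print(normalizedX[0:5, :])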
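To make "unit norm length" concrete, the sketch below normalizes the Iris data with both the L1 and the L2 norm and checks that every row then has length 1 under the chosen norm. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer

X = load_iris().data

# Normalize each row with the L2 norm (the default) and with the L1 norm
l2 = Normalizer(norm="l2").fit_transform(X)
l1 = Normalizer(norm="l1").fit_transform(X)

# Row lengths under each norm; all of them should be 1
l2_lengths = np.sqrt((l2 ** 2).sum(axis=1))
l1_lengths = np.abs(l1).sum(axis=1)

# Manual L2 normalization: divide each row by its Euclidean length
manual_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

print(np.allclose(l2_lengths, 1.0), np.allclose(l1_lengths, 1.0))
print(np.allclose(manual_l2, l2))
```

Unlike the two scalers above, this transformation works row by row, so it does not depend on statistics learned from other samples.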

Binarization

Uses a threshold to binarize the input data. Values greater than the threshold are transformed to 1, and values less than or equal to the threshold are transformed to 0. In the sklearn library, this is implemented by the Binarizer class. It is often used to turn probabilities into crisp values, and in feature engineering to generate new, meaningful features.

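    from sklearn.preprocessing import Binarizer

    # Map values greater than the threshold to 1 and all others to 0
    binarizer = Binarizer(threshold=0.0).fit(X)
    binaryX = binarizer.transform(X)

    # Print the transformed data
    print("")
    print(binaryX[0:5, :])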
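Because every measurement in the Iris data is positive, a threshold of 0.0 maps every value to 1. The sketch below therefore uses a threshold of 3.0, an arbitrary value chosen only for illustration, so that both outcomes appear, and compares the result with a plain NumPy comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer

X = load_iris().data

# With threshold 0.0, all (positive) Iris values become 1
all_ones = Binarizer(threshold=0.0).fit_transform(X)
print(all_ones.min(), all_ones.max())

# Hypothetical threshold chosen for illustration only
threshold = 3.0
binary = Binarizer(threshold=threshold).fit_transform(X)

# Manual equivalent: strictly greater than the threshold gives 1, otherwise 0
manual = (X > threshold).astype(X.dtype)

print(np.array_equal(binary, manual))
print(binary[:5, :])
```

In the second sample the sepal width is exactly 3.0, which is not greater than the threshold and is therefore mapped to 0.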

Resources

Jason Brownlee. How to Prepare Your Data for Machine Learning in Python with Scikit-learn.

https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/
