Data preprocessing (Python Scikit-learn)

Last Update:2018-05-19 Source: Internet

Author: User

In machine learning tasks, data often needs to be preprocessed, for example by rescaling, standardization, binarization, or normalization. Which method is most effective depends on the distribution of the data and on the algorithm used. Different algorithms make different assumptions about the data and may require different transformations; sometimes untransformed data gives relatively better results. It is therefore recommended to try several transformation methods with several algorithms, and to choose a combination of transformation and algorithm that performs relatively well.
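The recommendation above can be sketched as a small comparison loop. This is only an illustration under assumptions that are not in the original article: the two algorithms (k-nearest neighbors and logistic regression), the 5-fold cross-validation, and the variable names are chosen here for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate transformations; None means "use the raw data".
transforms = {
    "none": None,
    "min-max": MinMaxScaler(),
    "standardize": StandardScaler(),
    "normalize": Normalizer(),
}
# Candidate algorithms (illustrative choices, not from the article).
models = {
    "knn": lambda: KNeighborsClassifier(n_neighbors=5),
    "logreg": lambda: LogisticRegression(max_iter=1000),
}

results = {}
for t_name, transform in transforms.items():
    for m_name, make_model in models.items():
        # Build a pipeline so the transform is fitted on training folds only.
        steps = [make_model()] if transform is None else [transform, make_model()]
        pipeline = make_pipeline(*steps)
        scores = cross_val_score(pipeline, X, y, cv=5)
        results[(t_name, m_name)] = scores.mean()
        print(f"{t_name:12s} {m_name:7s} mean accuracy = {scores.mean():.3f}")
```

Wrapping the transformation and the model in one pipeline keeps the scaler from seeing the held-out fold, so the comparison between combinations stays fair.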

The preprocessing workflow in the Python scikit-learn library (also known as sklearn) consists of the following steps:

1. Load the dataset.
2. Divide the dataset into input variables and output variables for machine learning.
3. Transform (preprocess) the input variables.
4. Display the transformed result (optional).

This article uses the Iris dataset (Iris Plants Database) that ships with the scikit-learn library as an example.

First, load the dataset and get the input variable X and the output variable y. The sample code is as follows:

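    from sklearn import datasets
    import numpy as np

    # Load the Iris dataset and split it into inputs X and outputs y
    data = datasets.load_iris()
    X, y = data.data, data.target

    # Show three decimal places when printing arrays
    np.set_printoptions(precision=3)
    print("\nPreprocess input variables:\n")
    print("Raw data:")
    print(X[:5, :])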

Then transform the input variable X (of type numpy.ndarray). The specific transformations are as follows:

Scale transformation

Rescales each input variable into a given range, such as 0 to 1. In sklearn this is implemented by the MinMaxScaler class. Rescaling is commonly useful for optimization algorithms based on gradient descent, for algorithms that weight inputs such as regression and neural networks, and for algorithms that use distance measures such as k-nearest neighbors. The sample code is as follows:

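    from sklearn.preprocessing import MinMaxScaler

    # Rescale every column of X into the range [0, 1]
    scaler = MinMaxScaler(feature_range=(0, 1))
    rescaledX = scaler.fit_transform(X)

    # Print the transformed data
    print("")
    print(rescaledX[0:5, :])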
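To make the rescaling concrete, the following sketch compares MinMaxScaler with the formula (x - min) / (max - min) applied to each column. The variable names are illustrative and the block reloads the Iris data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Manual min-max rescaling, computed column by column.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation using the sklearn class.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.min(axis=0), scaled.max(axis=0))
```

After the transformation every column has a minimum of 0 and a maximum of 1, which is why features measured on different scales contribute more evenly to distance computations.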

Standardization

Usually applies to input variables with a Gaussian distribution. Specifically, subtract the mean of each attribute from its values and divide by the standard deviation, so that each attribute has a mean of 0 and a standard deviation of 1. In sklearn this is implemented by the StandardScaler class. It is often used with linear regression, logistic regression, and linear discriminant analysis, which assume that the input variables follow a Gaussian distribution.

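    from sklearn.preprocessing import StandardScaler

    # Learn the mean and standard deviation of each column, then apply them
    scaler = StandardScaler().fit(X)
    standardizedX = scaler.transform(X)

    # Print the transformed data
    print("")
    print(standardizedX[0:5, :])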
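As a quick check of the description above, this sketch reproduces StandardScaler by hand. It assumes the population standard deviation (NumPy's default, ddof=0), which is what StandardScaler uses; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual standardization: subtract the column mean, divide by the column std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

standardized = StandardScaler().fit_transform(X)

print(np.allclose(manual, standardized))
print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```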

Normalization

Transforms each sample (row) of the input so that it has unit norm. The commonly used norms are L1 and L2; see my previous post "data normalization" principle and implementation (Python Sklearn). In sklearn this is implemented by the Normalizer class. It is often useful for sparse datasets with many zeros, when using algorithms that weight inputs, such as neural networks, or algorithms that use distance measures, such as k-nearest neighbors.

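    from sklearn.preprocessing import Normalizer

    # Normalizer works row by row; fit() does nothing but keeps the API uniform
    scaler = Normalizer().fit(X)
    normalizedX = scaler.transform(X)

    # Print the transformed data
    print("")
    print(normalizedX[0:5, :])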
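The difference between the L1 and L2 norms can be seen by checking the length of each row after the transformation. This is a minimal sketch with illustrative variable names; it reloads the Iris data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer

X = load_iris().data

# L2 normalization: each row is divided by its Euclidean length.
l2_rows = Normalizer(norm="l2").fit_transform(X)
# L1 normalization: each row is divided by the sum of its absolute values.
l1_rows = Normalizer(norm="l1").fit_transform(X)

l2_lengths = np.linalg.norm(l2_rows, axis=1)
l1_sums = np.abs(l1_rows).sum(axis=1)

print(np.allclose(l2_lengths, 1.0))
print(np.allclose(l1_sums, 1.0))
```

Unlike rescaling and standardization, which work on columns, normalization works on rows, so each sample is transformed independently of the others.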

Binarization

Uses a threshold to binarize the input data. Values greater than the threshold are transformed to 1, and values less than or equal to the threshold are transformed to 0. In sklearn this is implemented by the Binarizer class. It is often used to turn probabilities into crisp values, and in feature engineering to create new features that indicate something meaningful.

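    from sklearn.preprocessing import Binarizer

    # Values greater than the threshold become 1, all others become 0
    binarizer = Binarizer(threshold=0.0).fit(X)
    binaryX = binarizer.transform(X)

    # Print the transformed data
    print("")
    print(binaryX[0:5, :])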
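The use case of turning probabilities into crisp values can be sketched as follows. The probability array is made-up illustration data, not output from any model in this article, and the threshold of 0.5 is an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer

# Hypothetical class probabilities, invented for illustration only.
probs = np.array([[0.1, 0.9],
                  [0.5, 0.5],
                  [0.73, 0.27]])

# Values strictly greater than 0.5 become 1; 0.5 itself becomes 0.
crisp = Binarizer(threshold=0.5).fit_transform(probs)
print(crisp)

# With threshold=0.0 every Iris measurement becomes 1,
# because all values in the dataset are positive.
iris_binary = Binarizer(threshold=0.0).fit_transform(load_iris().data)
print(np.unique(iris_binary))
```

The second part shows why the threshold should be chosen with the data in mind: on the Iris dataset a threshold of 0.0 maps every value to 1, so a different threshold would be needed to produce an informative feature.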

Resources

Jason Brownlee. How to Prepare Your Data for Machine Learning in Python with Scikit-learn.

https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/
