
Preprocessing data

In our daily lives, we deal with large amounts of data, but that data is raw. To provide it as input to a machine learning algorithm, it must first be converted into meaningful data. This is where data preprocessing comes into the picture. In other words, we must preprocess the data before feeding it to a machine learning algorithm.

Data preprocessing steps

Follow these steps to preprocess data in Python-

Step 1 - Import useful packages - If you are using Python, this is the first step for converting the data into a certain format, i.e., preprocessing. It can be done as follows -

import numpy as np
from sklearn import preprocessing


The following two packages are used here-

    • NumPy - NumPy is a general-purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays.
    • sklearn.preprocessing - This package provides many common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for machine learning algorithms.

Step 2 - Define sample data - After importing the packages, we need to define some sample data so that we can apply the preprocessing techniques to it. We will now define the following sample data -

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])


Step 3 - Apply preprocessing techniques - In this step, we need to apply the preprocessing techniques.

The following sections describe data preprocessing techniques.

Data preprocessing techniques

The data preprocessing techniques are described below -

Binarization

This preprocessing technique is used when we need to convert numerical values into Boolean values. We can use a built-in method to binarize the input data, using, say, 0.5 as the threshold value, as follows -

data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)


Now, after running the above code, we get the following output: all values above 0.5 (the threshold value) are converted to 1, and all values below it are converted to 0.

Binarized data:
[[ 1.  0.  1.]
 [ 0.  1.  1.]
 [ 0.  0.  1.]
 [ 1.  1.  0.]]
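If you want to verify what Binarizer is doing, a minimal NumPy sketch of the same thresholding rule (reusing the input_data defined above; the manual_binarized name is just for illustration) looks like this -

import numpy as np

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Values strictly greater than the 0.5 threshold become 1.0, everything else
# becomes 0.0, which matches the rule Binarizer(threshold=0.5) applies.
manual_binarized = (input_data > 0.5).astype(float)
print(manual_binarized)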
Mean removal

This is another very common preprocessing technique used in machine learning. Basically, it is used to eliminate the mean from the feature vector so that every feature is centered on zero. We can also eliminate the bias from the features in the feature vector. To apply the mean removal preprocessing technique to the sample data, we can write the Python code shown below. The code first displays the mean and standard deviation of the input data -

print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))

When you run the preceding line of code, you get the following output-

Mean = [ 1.75       -1.275      2.2       ]
Std deviation = [ 2.71431391  4.20022321  4.69414529]
Now, the following code will remove the mean and the standard deviation of the input data -
data_scaled = preprocessing.scale(input_data)
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))

When you run the preceding line of code, you get the following output-

Mean = [ 1.11022302e-16  0.00000000e+00  0.00000000e+00]
Std deviation = [ 1.  1.  1.]
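The same result can be reproduced by hand: preprocessing.scale subtracts the per-column mean and divides by the per-column standard deviation. A minimal sketch, assuming the same input_data as above -

import numpy as np

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Subtract each column's mean and divide by its standard deviation,
# leaving every feature with zero mean and unit standard deviation.
manual_scaled = (input_data - input_data.mean(axis=0)) / input_data.std(axis=0)
print("Mean =", manual_scaled.mean(axis=0))
print("Std deviation =", manual_scaled.std(axis=0))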


Scaling

This is another data preprocessing technique that is used to scale the feature vectors. Scaling of the feature vectors is needed because the value of every feature can vary between many random values. In other words, scaling is important because we do not want any feature to be synthetically large or small. The following Python code scales the input data, i.e. the feature vector -

Min-max scaling

data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)

When you run the preceding line of code, you get the following output-

Min max scaled data:
[[ 0.48648649  0.58252427  0.99122807]
 [ 0.          1.          0.81578947]
 [ 0.27027027  0.          1.        ]
 [ 1.          0.99029126  0.        ]]
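Under the hood, min-max scaling maps each column through (x - min) / (max - min). A small hand-rolled sketch, assuming the same input_data as above, that should reproduce the numbers shown -

import numpy as np

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# For each column, map the minimum value to 0 and the maximum value to 1.
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)
manual_minmax = (input_data - col_min) / (col_max - col_min)
print(manual_minmax)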
Normalization

This is another data preprocessing technique that is used to modify the feature vectors. Such modification is necessary to measure the feature vectors on a common scale. The following are two types of normalization that can be used in machine learning -

L1 normalization

It is also referred to as Least Absolute Deviations. This kind of normalization modifies the values so that the sum of the absolute values in each row is always 1. It can be implemented on the input data with the following Python code -

# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print("\nL1 normalized data:\n", data_normalized_l1)

The above line of code produces the following output:

L1 normalized data:
[[ 0.22105263 -0.2         0.57894737]
 [-0.2027027   0.32432432  0.47297297]
 [ 0.03571429 -0.56428571  0.4       ]
 [ 0.42142857  0.16428571 -0.41428571]]
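As a quick sanity check, the absolute values in every row of the L1-normalized matrix should sum to 1. A minimal sketch of that check, reusing the same input_data -

import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# After L1 normalization, the absolute values in every row should sum to 1.
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print(np.abs(data_normalized_l1).sum(axis=1))   # expected: [1. 1. 1. 1.]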

L2 normalization

It is also referred to as least squares. This kind of normalization modifies the values so that the sum of the squares in each row is always 1. It can be implemented on the input data with the following Python code -

# Normalize data
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL2 normalized data:\n", data_normalized_l2)

Executing the above code line produces the following output-

L2 normalized data:
[[ 0.33946114 -0.30713151  0.88906489]
 [-0.33325106  0.53320169  0.7775858 ]
 [ 0.05156558 -0.81473612  0.57753446]
 [ 0.68706914  0.26784051 -0.6754239 ]]
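Similarly, every row of the L2-normalized matrix should have a Euclidean norm of 1. A minimal sketch of that check, reusing the same input_data -

import numpy as np
from sklearn import preprocessing

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# After L2 normalization, every row should have an L2 (Euclidean) norm of 1.
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print(np.linalg.norm(data_normalized_l2, axis=1))   # expected: [1. 1. 1. 1.]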

