Python and machine learning: data preprocessing

Source: Internet
Author: User

Common processes for data preprocessing:

1) Remove unique attributes

2) Handle missing values

3) Attribute encoding

4) Standardization and regularization of data

5) Feature Selection

6) Principal component analysis

(1) Remove unique attributes

Data sets often contain unique attributes. These are typically ID-like attributes, such as an auto-incrementing primary key stored in a database. Such attributes do not describe the distribution of the samples themselves, so they can simply be deleted.
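A minimal sketch of this step, assuming the data set is held as a list of dictionaries and that "unique" means every sample has a different value in that column. The function name `drop_unique_attributes` and the sample data are made up for illustration. A continuous numeric feature can also have all-distinct values, so in practice the detected columns should be reviewed before they are dropped.

```python
def drop_unique_attributes(rows):
    """Return copies of the rows without columns whose values are all distinct."""
    if not rows:
        return []
    n = len(rows)
    # A column is treated as a unique attribute when it has as many
    # distinct values as there are samples (for example an ID column).
    unique_cols = {
        col for col in rows[0]
        if len({row[col] for row in rows}) == n
    }
    return [
        {k: v for k, v in row.items() if k not in unique_cols}
        for row in rows
    ]


# Hypothetical sample data: "id" is an auto-increment key.
rows = [
    {"id": 1, "color": "red", "size": 3},
    {"id": 2, "color": "blue", "size": 3},
    {"id": 3, "color": "red", "size": 5},
]
cleaned = drop_unique_attributes(rows)
```

Here only the `id` column is removed, because `color` and `size` both contain repeated values.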

(2) Three ways to handle missing values

1) Use the features with missing values directly; 2) delete the features with missing values; 3) complete the missing values.

1) Direct use: Some algorithms, such as decision trees, can work directly on data that contains missing values.

2) Feature deletion: The simplest approach is to delete the features that contain missing values. This is a poor choice when only a small number of values are missing, because it discards all the valid values of the feature along with them.
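One way to respect the caveat above is to delete a feature only when a large share of its values is missing. The sketch below assumes missing values are stored as `None`; the function name and the 0.5 threshold are illustrative assumptions, not values from the text.

```python
def drop_features_with_missing(rows, max_missing_ratio=0.5):
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / n <= max_missing_ratio
    ]
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical data: "age" is missing once, "income" is missing three times.
people = [
    {"age": 30, "income": None},
    {"age": None, "income": None},
    {"age": 41, "income": 5200},
    {"age": 25, "income": None},
]
reduced = drop_features_with_missing(people)
```

With these four samples, `age` (25% missing) is kept and `income` (75% missing) is dropped.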

3) Missing value completion: Filling in (imputing) missing values with their most probable values is the approach most widely used in engineering practice. Common methods include: (A) mean imputation, (B) same-class mean imputation, (C) modeling prediction, (D) high-dimensional mapping, (E) multiple imputation, (F) maximum likelihood estimation, (G) compressed sensing and matrix completion.

A. Mean imputation

If the missing value belongs to a continuous attribute, fill it in with the mean of the attribute's valid values; if it belongs to a discrete attribute, fill it in with the mode (the most frequent valid value) of the attribute.
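The rule above can be sketched with the standard `statistics` module. Missing values are assumed to be stored as `None`, and the helper names `fill_with_mean` and `fill_with_mode` are hypothetical.

```python
from statistics import mean, mode


def fill_with_mean(values):
    """Fill missing entries of a continuous attribute with the mean of the valid ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def fill_with_mode(values):
    """Fill missing entries of a discrete attribute with the most frequent valid value."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


heights = fill_with_mean([1.0, None, 3.0, 2.0])
colors = fill_with_mode(["red", None, "blue", "red"])
```

In this example the missing height becomes 2.0 (the mean of 1.0, 3.0 and 2.0) and the missing color becomes "red".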

B. Same-class mean imputation

The samples are first grouped into classes, and each missing value is then filled in with the mean of the samples in the same class.
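A minimal sketch of this idea, assuming each sample already carries a class label and missing values are stored as `None`. The function name and data are made up for illustration, and the sketch raises a `KeyError` if a class has no valid value at all.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(values, labels):
    """Fill each missing value with the mean of the valid values in its own class."""
    by_class = defaultdict(list)
    for value, label in zip(values, labels):
        if value is not None:
            by_class[label].append(value)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [
        class_means[label] if value is None else value
        for value, label in zip(values, labels)
    ]


values = [1.0, None, 3.0, 10.0, None, 14.0]
labels = ["a", "a", "a", "b", "b", "b"]
filled_by_class = fill_with_class_mean(values, labels)
```

The gap in class "a" is filled with 2.0 and the gap in class "b" with 12.0, whereas a single overall mean would have put 7.0 in both places.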

C. Modeling predictions

Treat the attribute with missing values as the prediction target and predict it from the other attributes. This method works well, but it has a fundamental flaw: if the other attributes are unrelated to the missing attribute, the predictions are meaningless. Conversely, if the predictions are quite accurate, the missing attribute adds little information and there is no need to include it in the data set. Most real cases fall somewhere between the two.
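As an illustration only, the sketch below predicts the missing attribute from a single other attribute with ordinary least squares. The choice of a straight-line model, the function names, and the data are assumptions; any regression or classification model could take its place. The fit fails with a division by zero if the predictor attribute is constant.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_by_regression(xs, ys):
    """Fill missing entries of ys (None) with predictions from the complete rows."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]


# Hypothetical data in which y is roughly twice x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, None, 8.0]
filled = fill_by_regression(xs, ys)
```

Because the complete rows lie on the line y = 2x, the missing entry is predicted as approximately 6.0.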

D. High-dimensional mapping

Map the attribute into a high-dimensional space. This is the most precise approach: it retains all the original information and adds none that was not there. For example, the CTR prediction models at Google and Baidu handle all variables this way during preprocessing, reaching up to hundreds of millions of dimensions. The benefit is that all the information in the original data is preserved intact, without having to guess the missing values. The drawbacks are just as clear: the amount of computation increases greatly, and the method works well only when the sample size is very large; otherwise the data become too sparse and the results are poor.
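The text does not spell out the mapping, so the following is an assumption: one common reading is a one-hot encoding in which "missing" gets its own extra dimension, so that a discrete attribute with K observed values becomes K + 1 binary attributes. The function name and data are hypothetical.

```python
def one_hot_with_missing(values):
    """One-hot encode a discrete attribute, with one extra slot that marks 'missing'."""
    categories = sorted({v for v in values if v is not None})
    index = {category: i for i, category in enumerate(categories)}
    missing_slot = len(categories)  # the last dimension flags a missing value
    encoded = []
    for value in values:
        vector = [0] * (len(categories) + 1)
        vector[missing_slot if value is None else index[value]] = 1
        encoded.append(vector)
    return categories, encoded


categories, encoded = one_hot_with_missing(["red", None, "blue"])
```

Nothing is guessed for the missing sample; it is simply encoded as its own category, which is also why the number of dimensions grows quickly when this is applied to every attribute.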

E. Multiple imputation

Multiple imputation treats the value to be imputed as random, with a value derived from the observed values. In practice, the value is usually estimated first, and different noise is then added to the estimate to form multiple sets of candidate imputed values.
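A simplified sketch of the "estimate, then add noise" step described above. It assumes the estimate is the mean of the observed values and the noise is Gaussian with the observed standard deviation; both choices, the function name, and the parameters `n_sets` and `seed` are illustrative assumptions rather than a full multiple-imputation procedure.

```python
import random
from statistics import mean, stdev


def multiple_fill(values, n_sets=3, seed=0):
    """Build several candidate completions: a common estimate plus random noise."""
    rng = random.Random(seed)  # seeded so the candidate sets are reproducible
    valid = [v for v in values if v is not None]
    estimate = mean(valid)
    noise_scale = stdev(valid)  # needs at least two observed values
    return [
        [estimate + rng.gauss(0, noise_scale) if v is None else v for v in values]
        for _ in range(n_sets)
    ]


observed = [4.0, None, 6.0, 5.0, None]
candidate_sets = multiple_fill(observed)
```

Each candidate set leaves the observed values untouched and differs from the others only in the filled-in entries, which reflects the uncertainty about the missing values.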

F. Compressed sensing and matrix completion

  
