In data mining and machine learning, most of the effort in practice goes not into the algorithm but into the data; after all, the algorithms are usually off-the-shelf, with very little room for change.
The purpose of data preprocessing is to organize the data into a standard form.
1. Normalization
Normalization is usually done in one of two ways.
A. The simplest normalization: min-max mapping
p_new = (p - MI) / (MA - MI)
where p is the original value, MI is the minimum of the attribute, and MA is its maximum. After this transformation, all values fall in the range 0 to 1.
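A minimal sketch of this min-max mapping in Python (the numbers are made up for illustration):

import numpy as np

def min_max_normalize(p):
    # Map one attribute into the 0-1 range: (p - MI) / (MA - MI)
    p = np.asarray(p, dtype=float)
    mi, ma = p.min(), p.max()
    return (p - mi) / (ma - mi)  # assumes ma > mi, i.e. the attribute is not constant

ages = np.array([18, 25, 40, 60, 90])
print(min_max_normalize(ages))  # roughly [0.0, 0.10, 0.31, 0.58, 1.0]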
B. Standard deviation standardization (z-score)
p_new = (p - avg(p)) / sd(p)
where avg(p) is the mean of the variable and sd(p) is its standard deviation.
One benefit of this approach is that a value which lands far from zero after standardization is clearly something bizarre; you can treat it as an outlier and reject it directly.
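A minimal sketch of the standardization and the outlier-rejection idea in Python (the 3-standard-deviation cutoff and the planted error are illustrative assumptions, not part of the original text):

import numpy as np

def z_score_standardize(p):
    # (p - avg(p)) / sd(p): zero mean, unit standard deviation
    p = np.asarray(p, dtype=float)
    return (p - p.mean()) / p.std()

rng = np.random.default_rng(0)
values = np.append(rng.normal(10, 2, size=100), 250.0)  # 250 plays the role of a gross error
z = z_score_standardize(values)
print(values[np.abs(z) > 3])  # values far from the mean can be rejected as outliers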
2. Discretization
If a variable is continuous, it is sometimes awkward to handle directly; age is a typical example. It can be more meaningful to discretize the numbers into groups such as child, teenager, youth, and so on.
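A minimal sketch of such binning with pandas.cut; the bin edges and labels are arbitrary choices for illustration:

import pandas as pd

ages = pd.Series([3, 12, 17, 25, 34, 61])
groups = pd.cut(ages,
                bins=[0, 12, 18, 40, 120],              # cut points chosen by hand
                labels=["child", "teenager", "youth", "older"])
print(groups)  # each age is replaced by its category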
3. Missing value problem
The first thing to consider is how many values are missing. If an attribute is missing too many, it is better to delete the attribute outright; if the amount is within an acceptable range, fill the gaps with the mean, the maximum, or another suitable scheme.
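A minimal sketch of these two cases, assuming a pandas DataFrame with hypothetical column names and a hand-picked missing-rate cutoff:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [3000, np.nan, 4500, 5200, np.nan],
                   "notes":  [np.nan, np.nan, np.nan, np.nan, "ok"]})
# Drop attributes that are mostly missing (the 50% cutoff is an arbitrary choice)
df = df.drop(columns=[c for c in df.columns if df[c].isna().mean() > 0.5])
# Fill the rest with the mean (the maximum or another scheme would work the same way)
df["income"] = df["income"].fillna(df["income"].mean())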
There is also the option of building a model (Method 1) on the records that are not missing, using it to predict the missing values, and then building the final model (Method 2) on the completed data. Of course, this raises problems of its own, such as the accuracy of Method 1, and the information redundancy generated when Method 1 and Method 2 use the same technique.
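One possible sketch of this model-based idea, using a linear regression from scikit-learn in the role of Method 1 (the column names and the choice of regressor are assumptions for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age":    [23, 31, 45, 52, 38],
                   "income": [3000, 4200, np.nan, 7100, np.nan]})
known = df[df["income"].notna()]
missing = df[df["income"].isna()]
# "Method 1": fit on the records that are not missing ...
model = LinearRegression().fit(known[["age"]], known["income"])
# ... predict the gaps, then hand the completed data to "Method 2" for the final model
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])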
4. Abnormal data points
Real data sets often contain abnormal data, which may come from typing errors or from interference during acquisition. The two most commonly used methods for eliminating abnormal data are the following.
In the first, a point is considered an anomaly when the distance to its nearest neighbor is greater than a threshold. In the second, a point is considered an anomaly when fewer than a certain number of data points lie within a fixed distance of it.
The former is based on distance, the latter on density. Of course, you can also combine the two, specifying both the distance and the number of points, which is called COF.
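A rough sketch of the two rules with scikit-learn's nearest-neighbor utilities; the thresholds (distance 3.0, radius 1.0, minimum of 5 neighbors) are arbitrary and would need tuning on real data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])  # one planted anomaly

nn = NearestNeighbors(n_neighbors=2).fit(X)
dist, _ = nn.kneighbors(X)            # column 0 is the point itself, column 1 its nearest neighbor
distance_outliers = dist[:, 1] > 3.0  # distance-based rule: nearest neighbor too far away

neighbors = nn.radius_neighbors(X, radius=1.0, return_distance=False)
density_outliers = np.array([len(idx) - 1 < 5 for idx in neighbors])  # density-based rule: too few points nearby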
5. Data screening
After preprocessing, the dimensionality of the data is sometimes very high, and for reasons of cost we need dimensionality reduction or feature selection. Sometimes dimensionality reduction and feature selection also improve accuracy.
Dimensionality reduction usually uses PCA, principal component analysis. Intuitively, it forms linear combinations of several variables to produce new variables. Feature selection is simpler: it keeps the features that are strongly correlated with the target.
PCA involves the singular value decomposition of a matrix; the specific mathematical principles will not be expanded here.
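A minimal sketch of both options with scikit-learn, on made-up data (the number of components and the number of features to keep are arbitrary choices):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                             # 10 original variables
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

X_pca = PCA(n_components=3).fit_transform(X)               # new variables: linear combinations of the old ones
X_sel = SelectKBest(f_classif, k=3).fit_transform(X, y)    # keep the features most related to the target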