Data preprocessing and data screening in machine learning

Source: Internet
Author: User

In data mining and machine learning, most of the effort is in fact spent on the data rather than on the algorithm; after all, algorithms are usually ready-made and leave little room for modification.

The purpose of data preprocessing is to organize the data into a standard form.

1. Normalization

Normalization is usually done in one of two ways.

A. The simplest normalization: min-max (maximum-minimum) mapping

P_new = (P - MI) / (MA - MI)

P is the original value, MI is the minimum of the attribute, and MA is its maximum. After this transformation, all values lie in the range 0-1.
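A minimal sketch of this min-max mapping in Python with NumPy; the sample matrix and its two columns (age, income) are made up for illustration.

    import numpy as np

    # Hypothetical samples: two attributes per row (age, income).
    X = np.array([[18.0, 50000.0],
                  [35.0, 82000.0],
                  [60.0, 61000.0]])

    mi = X.min(axis=0)            # per-attribute minimum (MI)
    ma = X.max(axis=0)            # per-attribute maximum (MA)
    X_new = (X - mi) / (ma - mi)  # every value now lies in [0, 1]
    print(X_new)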

B. Standardization by the standard deviation (z-score)

P_new = (P - avg(P)) / sd(P)
where avg(P) is the mean of the variable and sd(P) is its standard deviation.

One benefit of this approach is that values with a very large standardized magnitude stand out immediately, so they can be treated as outliers and rejected directly.
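A minimal sketch of z-score standardization with NumPy, reusing a hypothetical data matrix; the |z| > 3 outlier cutoff is an assumed convention, not something fixed by the text.

    import numpy as np

    X = np.array([[18.0, 50000.0],
                  [35.0, 82000.0],
                  [60.0, 61000.0]])

    # (P - avg(P)) / sd(P), applied column by column.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Entries with a very large standardized magnitude can be flagged as outliers.
    outlier_mask = np.abs(X_std) > 3
    print(X_std)
    print(outlier_mask)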

2. Discretization

If a variable takes continuous numeric values, it is sometimes awkward to handle directly; age is a typical example. It is often more meaningful to discretize the numbers into groups such as children, teenagers, and youth, as sketched below.
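A minimal sketch of discretizing age into named groups with NumPy; the cut points and group labels are assumptions chosen for illustration.

    import numpy as np

    ages = np.array([4, 13, 25, 41, 67])
    edges = [12, 17, 35, 60]                   # assumed upper cut points for each group
    labels = ["child", "teenager", "youth", "middle-aged", "senior"]

    # digitize returns, for each age, the index of the group it falls into.
    indices = np.digitize(ages, edges, right=True)
    groups = [labels[i] for i in indices]
    print(groups)   # ['child', 'teenager', 'youth', 'middle-aged', 'senior']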

3. Missing value problem

The first thing to consider is how many values are missing. If too many are missing, it is better to delete the attribute outright; if the amount is within an acceptable range, fill in the gaps with the mean, the maximum, or another suitable scheme.

There is, of course, another approach: build a model (method 1) on the records that are not missing, use it to predict the missing values, and then build the final model (method 2) on the completed data. This has its own problems, such as the accuracy of method 1, and the information redundancy created when method 1 and method 2 use the same technique.
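A minimal sketch of mean imputation with NumPy; the data matrix is hypothetical, and the model-based variant described above would replace the column means with predictions from a model fitted on the complete rows.

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [4.0, np.nan],
                  [7.0, 6.0]])

    col_means = np.nanmean(X, axis=0)   # mean of each attribute, ignoring missing values
    missing = np.isnan(X)
    # Fill each missing entry with the mean of its own column.
    X[missing] = np.take(col_means, np.where(missing)[1])
    print(X)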

4. Abnormal data points

Real data sets often contain abnormal points, which may be caused by typing errors or by interference during data acquisition. The two most commonly used methods for eliminating abnormal data are the following.

The first looks at neighboring points: a point is considered an anomaly when its distance to the nearest other point is greater than a threshold. Alternatively, a point can be considered an anomaly when fewer than a certain number of data points fall within a fixed distance of it.

The former is distance-based and the latter is density-based. The two can also be combined by specifying both a distance and a minimum count, which is referred to here as COF.
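A minimal sketch of the two rules above using a pairwise Euclidean distance matrix; the points, the distance threshold, and the minimum neighbor count are illustrative assumptions.

    import numpy as np

    # Three clustered points and one isolated point (hypothetical data).
    X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0]])

    # Pairwise distances; the diagonal is set to infinity so a point is not its own neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)

    # Distance-based rule: anomaly if the nearest point is farther than a threshold.
    dist_outliers = d.min(axis=1) > 1.0

    # Density-based rule: anomaly if fewer than k points lie within a fixed radius.
    density_outliers = (d <= 1.0).sum(axis=1) < 2

    print(dist_outliers)      # only the isolated point is flagged
    print(density_outliers)   # the same point is flagged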

5. Data screening

After the data has been preprocessed, its dimensionality is sometimes still very high, so for practical reasons dimensionality reduction or feature selection is needed. Dimensionality reduction and feature selection can sometimes also improve accuracy.

Dimensionality reduction is usually done with PCA (principal component analysis). Intuitively, PCA forms linear combinations of several variables to produce a smaller set of new variables. Feature selection is simpler: it keeps the features that are strongly correlated with the target.

PCA does involve the singular value decomposition of a matrix; the mathematical details will not be expanded here, but a small sketch is given below.
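A minimal sketch of PCA via singular value decomposition with NumPy; the data matrix and the choice of keeping one component are assumptions for illustration.

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

    Xc = X - X.mean(axis=0)                  # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    k = 1                                    # number of principal components to keep
    X_reduced = Xc @ Vt[:k].T                # project onto the top-k components
    print(X_reduced)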




