Data preprocessing: data normalization

Source: Internet
Author: User

I. What is data normalization?

Data normalization is a data transformation method used in data mining. Data transformation converts or consolidates data into a form suitable for mining. Data normalization scales the attribute values of the mined objects proportionally so that they fall into a small, specified range (such as [-1, 1] or [0, 1]).

II. Purpose of data normalization

Normalization of attribute values is often used with classification algorithms that involve neural networks or distance measurements, and with clustering algorithms. For example, when the neural network back-propagation algorithm is used for classification mining, normalizing the input values of each attribute measured in the training tuples helps speed up the learning phase. For distance-based methods, normalization gives all attributes the same weight, so that attributes with large value ranges do not outweigh attributes with small value ranges.
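The effect on distance measurements can be illustrated with a small sketch. The tuples below (age, income) are hypothetical and not taken from the article, and the rescaling uses min-max normalization to [0, 1], one of the methods described in the next section. Before rescaling, the Euclidean distance is decided almost entirely by income; after rescaling, a large difference in age counts about as much as a large difference in income.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_columns(rows):
    """Rescale every column of `rows` to [0, 1] (min-max normalization).

    Assumes each column contains at least two distinct values.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Hypothetical tuples: (age in years, income in USD)
people = [(25, 30000), (60, 31000), (26, 90000)]

# Raw distances: person 1 differs mostly in age, person 2 mostly in income
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# Distances after both attributes are rescaled to [0, 1]
scaled = min_max_columns(people)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    {raw_01:.1f} vs {raw_02:.1f}")
print(f"scaled: {scaled_01:.4f} vs {scaled_02:.4f}")
```

On the raw values the second distance is dozens of times larger than the first; on the rescaled values the two distances are nearly equal.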

III. Three methods of data normalization

There are three common methods of data normalization: normalization by decimal scaling, min-max normalization, and z-score normalization.

1. Normalization by decimal scaling

Normalization is performed by moving the decimal point of the attribute values, that is, each attribute value is divided by the $j$-th power of 10. Formula:

$$v_i' = \frac{v_i}{10^j}$$

where $j$ is the smallest integer such that $\max(|v_i'|) < 1$. For example, assume that the values of attribute A range over [-986, 917]. The maximum absolute value of A is 986, so dividing every value of A by 1000 gives $\max(|v_i'|) < 1$, and therefore $j = 3$. After normalization, -986 becomes -0.986 and 917 becomes 0.917. The attribute values are thus compressed into the small range [-1, 1].

Advantages: Intuitive and simple.

Disadvantages: The weight difference between attributes is not eliminated.
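A minimal sketch of decimal scaling, reproducing the [-986, 917] example. The function name `decimal_scaling` is illustrative, and the sketch assumes a non-empty list and restricts $j$ to non-negative integers, which is sufficient when the largest absolute value is at least 1.

```python
def decimal_scaling(values):
    """Divide every value by 10**j, where j is the smallest non-negative
    integer such that max(|v / 10**j|) < 1.

    Returns the scaled values together with j.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    # Increase j until the largest absolute value drops below 1
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scaling([-986, 917])
print(j, scaled)
```

For this input the loop stops at `j = 3`, and the two values become -0.986 and 0.917, as in the text.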

2. Min-max normalization

Min-max normalization performs a linear transformation on the original data. Suppose $min_A$ and $max_A$ are the minimum and maximum values of attribute A. The min-max normalization formula is as follows:

$$v_i' = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

In the formula, $v_i$ is the original attribute value of object $i$, $v_i'$ is the attribute value obtained after normalization, and $[new\_min_A, new\_max_A]$ is the range into which all values of attribute A fall after normalization.

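Example: assume that the minimum and maximum values of the attribute "income" (employee salaries in a company) are USD 12000 and USD 98000, and that income is to be mapped to the range [0, 1]. Normalize the income value USD 73600.

According to the description, $min_A = 12000$ and $max_A = 98000$. After normalization, the value range of the attribute is [0, 1], that is, $new\_max_A = 1$ and $new\_min_A = 0$. The formula gives:

$$v' = \frac{73600 - 12000}{98000 - 12000}\,(1 - 0) + 0 \approx 0.716$$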

Advantages: You can flexibly specify the value range after normalization, which eliminates the weight difference between different attributes.

Disadvantages: You need to know the maximum and minimum values of the attribute in advance. The method preserves the relationships between the original data values, but if a future input falls outside the original value range, it causes an "out-of-bounds" error: the normalized value lands outside the target range. The method is also sensitive to outliers, because a value that lies far from the bulk of the data becomes the minimum or maximum and compresses all other values into a narrow part of the target range.
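The income example and the out-of-bounds behavior can be checked with a short sketch. The function name `min_max` and the out-of-range input of USD 120000 are illustrative assumptions, not part of the original example.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("max_a must differ from min_a")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the text: range [12000, 98000] mapped to [0, 1]
income = min_max(73600, 12000, 98000)
print(round(income, 3))

# Hypothetical future input above the known maximum: the result leaves [0, 1]
out_of_range = min_max(120000, 12000, 98000)
print(out_of_range > 1.0)
```

The first call gives about 0.716. The second call returns a value greater than 1, which is the "out-of-bounds" situation described above; whether to clip such values or recompute the minimum and maximum is a design decision the article does not address.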

3. Z-score normalization

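This method normalizes values based on the mean and standard deviation of the attribute. Calculation formula:

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A}$$

where $v_i$ is the original attribute value of object $i$, $v_i'$ is the normalized attribute value, $\bar{A}$ is the mean of attribute A, and $\sigma_A$ is the standard deviation of attribute A. For $n$ objects, they are calculated as follows:

$$\bar{A} = \frac{1}{n}\sum_{i=1}^{n} v_i, \qquad \sigma_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(v_i - \bar{A}\right)^2}$$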

Example: assume that the mean and standard deviation of the attribute "income" are USD 54000 and USD 16000 respectively. Using z-score normalization, the value USD 73600 is converted to:

$$v' = \frac{73600 - 54000}{16000} = 1.225$$
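A minimal sketch of z-score normalization using Python's standard `statistics` module. The function names are illustrative, and `statistics.pstdev` (population standard deviation, dividing by $n$) is an assumption about which standard deviation is intended; `statistics.stdev` would give the sample version instead.

```python
import statistics

def z_score(v, mean_a, std_a):
    """Z-score of a single value, given the attribute's mean and standard deviation."""
    return (v - mean_a) / std_a

def z_score_all(values):
    """Z-score normalize a whole attribute using its own mean and
    population standard deviation."""
    mean_a = statistics.fmean(values)
    std_a = statistics.pstdev(values)
    return [z_score(v, mean_a, std_a) for v in values]

# Income example from the text: mean 54000, standard deviation 16000
print(z_score(73600, 54000, 16000))
```

The income example evaluates to 1.225. Applied to a whole attribute with `z_score_all`, the normalized values have mean 0 and standard deviation 1.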

When the data contain outliers, the mean absolute deviation can be used instead of the standard deviation to obtain better robustness. The mean absolute deviation $s_A$ is calculated as:

$$s_A = \frac{1}{n}\sum_{i=1}^{n}\left|v_i - \bar{A}\right|$$

and the normalized value becomes $v_i' = \dfrac{v_i - \bar{A}}{s_A}$.

Advantages: You do not need to know the maximum and minimum values of the dataset, and the method handles outliers well.

Disadvantages: Higher computational cost.
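The variant based on the mean absolute deviation can be sketched as follows. The income values are hypothetical, with one deliberate outlier, and the function names are illustrative. Because the deviations are not squared, the outlier inflates the mean absolute deviation less than it inflates the standard deviation.

```python
import statistics

def mean_abs_deviation(values):
    """Mean absolute deviation: the average of |v - mean| over all values."""
    mean_a = statistics.fmean(values)
    return sum(abs(v - mean_a) for v in values) / len(values)

def robust_z_scores(values):
    """Z-score variant that divides by the mean absolute deviation
    instead of the standard deviation."""
    mean_a = statistics.fmean(values)
    s_a = mean_abs_deviation(values)
    return [(v - mean_a) / s_a for v in values]

# Hypothetical incomes in USD with one outlier
incomes = [30000, 32000, 35000, 38000, 400000]

std_a = statistics.pstdev(incomes)   # population standard deviation
s_a = mean_abs_deviation(incomes)    # mean absolute deviation

print(f"standard deviation:      {std_a:.0f}")
print(f"mean absolute deviation: {s_a:.0f}")
print([round(z, 3) for z in robust_z_scores(incomes)])
```

For this data the mean absolute deviation is 117200, smaller than the standard deviation, so the normalized values are spread slightly wider than with the ordinary z-score.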
