Data mining process: Data preprocessing

Last Update:2015-07-25 Source: Internet

Author: User

Original: http://www.itongji.cn/article/0Q926052013.html

Before analyzing data, we usually need to standardize (normalize) it and then run the analysis on the standardized data. Data standardization is also a form of indexing statistical data. It mainly covers two aspects: same-direction processing and dimensionless processing. Same-direction processing addresses indicators of differing nature: directly summing indicators that point in different directions cannot correctly reflect their combined effect, so the inverse indicators must first be converted so that every indicator acts on the evaluation in the same direction, and only then can the indicators be summed to give a correct result. Dimensionless processing addresses the comparability of data. There are many standardization methods; common ones are min-max standardization, z-score standardization, and decimal scaling. After such processing, the raw data are converted into dimensionless index values on the same order of magnitude, so a comprehensive evaluation can be carried out.
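As a rough illustration of why inverse indicators must be aligned before summing, here is a minimal sketch. The supplier data and the function name `min_max` are hypothetical, and reversing an inverse indicator with (max - x) / (max - min) is one common convention assumed here; the source does not say which conversion it has in mind.

```python
def min_max(values, inverse=False):
    """Scale values to [0, 1]; for an inverse indicator, flip the direction.

    Assumption: an inverse indicator (lower is better) is converted with
    (max - x) / (max - min), so that after scaling larger always means better.
    """
    lo, hi = min(values), max(values)
    if inverse:
        return [(hi - v) / (hi - lo) for v in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data: three suppliers measured on two indicators.
quality = [80, 90, 70]             # positive indicator: higher is better
defect_rate = [0.02, 0.05, 0.01]   # inverse indicator: lower is better

q = min_max(quality)
d = min_max(defect_rate, inverse=True)

# Only after both indicators point the same way does a sum make sense.
scores = [a + b for a, b in zip(q, d)]
best = scores.index(max(scores))
print(scores, best)
```

Summing the raw columns instead would reward a high defect rate, which is exactly the problem same-direction processing is meant to avoid.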
Standardizing data in this sense is also called normalization: the data are scaled proportionally so that they fall into a small, specific interval. It is often used when comparing and evaluating indicators, to strip the units from the data and convert them into dimensionless pure numbers so that indicators with different units or magnitudes can be compared and weighted. The goal of normalization is to bring data from different sources into a single reference frame so that comparisons are meaningful.
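Decimal scaling (the "decimal calibration" method named earlier) is perhaps the simplest proportional scaling, and the text does not spell it out. The sketch below assumes the usual definition: divide every value by 10^j, where j is the smallest non-negative integer that brings all absolute values below 1. The function name and sample values are illustrative only.

```python
def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1.

    Returns the scaled values together with the exponent j that was used.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scaling([-986, 120, 917])
print(scaled, j)
```

Because every value is divided by the same constant, ratios between values are preserved; only the magnitude changes.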
1 Definitions
Normalization means restricting the data to be processed (by some algorithm) to a required range. It is done first to make later data processing more convenient, and second to help the program converge faster at run time.
2 Why should we use normalization?
First, a concept: singular sample data. These are sample vectors that are especially large or especially small relative to the other input samples. For example:
m=[0.11 0.15 0.32 0.45 30; 0.13 0.24 0.27 0.25 45];
Relative to the other four columns, the fifth column can be regarded as singular sample data (the networks discussed below are all BP networks). Singular sample data increase network training time and may prevent the network from converging. If the training set contains singular sample data, it is best to normalize before training; if it does not, normalizing beforehand is unnecessary.
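To make the idea of a singular sample concrete, the sketch below flags columns of the matrix m whose magnitude is far larger than that of a typical column. The detection rule (peak absolute value more than `factor` times the median column peak) and the threshold of 10 are assumptions made for illustration; the source gives no formal criterion.

```python
from statistics import median

# The matrix m from the text: each column is one sample vector.
m = [
    [0.11, 0.15, 0.32, 0.45, 30],
    [0.13, 0.24, 0.27, 0.25, 45],
]


def singular_columns(matrix, factor=10.0):
    """Return 0-based indices of columns whose peak |value| dwarfs the rest.

    Hypothetical rule: a column is flagged when its largest absolute value
    exceeds `factor` times the median of all column peaks.
    """
    n_cols = len(matrix[0])
    col_peaks = [max(abs(row[c]) for row in matrix) for c in range(n_cols)]
    typical = median(col_peaks)
    return [c for c, peak in enumerate(col_peaks) if peak > factor * typical]


flagged = singular_columns(m)
print(flagged)  # index 4 is the fifth column
```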
3 Normalization Method
The main methods are as follows, for reference (by James):
(1) Linear function conversion: y = (x - MinValue) / (MaxValue - MinValue), where x and y are the values before and after conversion, and MaxValue and MinValue are the maximum and minimum of the samples. In statistics, the specific role of normalization is to summarize and unify the statistical distribution of the samples. Normalization to the range 0 to 1 corresponds to a statistical probability distribution, and normalization to the range -1 to +1 corresponds to a statistical coordinate distribution.
(2) Logarithmic function conversion: y = log10(x), the base-10 logarithm. In log analysis, an original absolute time series is normalized to a reference moment to form a relative time series, which makes troubleshooting easier. The base-10 logarithm can also be used for normalization. Many introductions online give x' = log10(x), but this has a problem: the result does not necessarily fall in the interval [0, 1]. It should also be divided by log10(max), where max is the maximum of the sample data, and all data must be greater than or equal to 1.
(3) Arctangent function conversion: y = atan(x) * 2 / pi. The result falls in [0, 1) only for x >= 0; negative inputs map to (-1, 0). Normalization here serves to speed up convergence of network training, and it is not mandatory.
(4) z-score standardization (zero-mean normalization), also called standard deviation standardization: the processed data have mean 0 and standard deviation 1 (they follow the standard normal distribution only if the original data are normally distributed). The conversion function is x* = (x - μ) / σ, where μ is the mean of all sample data and σ is the standard deviation of all sample data.
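The four conversions listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the logarithmic version includes the division by log10(max) that the text recommends, and the z-score uses the population standard deviation (dividing by n), which is an assumption since the text does not say which form is intended.

```python
import math


def min_max(xs):
    """(1) Linear conversion: y = (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def log10_scaled(xs):
    """(2) Logarithmic conversion divided by log10(max).

    Requires every x >= 1 and max(xs) > 1 so the result lies in [0, 1].
    """
    top = math.log10(max(xs))
    return [math.log10(x) / top for x in xs]


def atan_scaled(xs):
    """(3) Arctangent conversion: y = atan(x) * 2 / pi."""
    return [math.atan(x) * 2 / math.pi for x in xs]


def z_score(xs):
    """(4) z-score: x* = (x - mu) / sigma, with population sigma (assumed)."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]


data = [1, 10, 100, 1000]
print(min_max(data))
print(log10_scaled(data))
print(atan_scaled(data))
print(z_score(data))
```

On data spanning several orders of magnitude, as here, the linear conversion squeezes most values close to 0, while the logarithmic conversion spreads them evenly; that difference is often the deciding factor between the two.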
4 Normalization in MATLAB
In MATLAB, there are three ways to normalize: (1) premnmx, postmnmx, tramnmx; (2) prestd, poststd, trastd; (3) writing your own code in the MATLAB language. premnmx normalizes to [-1, 1]; prestd normalizes to zero mean and unit variance; hand-written code generally normalizes to [0.1, 0.9].
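For readers without MATLAB, the following Python sketch shows the same idea of scaling to an arbitrary target interval and undoing the scaling afterwards. It is only loosely analogous to the MATLAB routines named above and does not reproduce their signatures; `fit_range`, `scale`, and `unscale` are hypothetical names.

```python
def fit_range(xs):
    """Record the minimum and maximum of the data to be scaled."""
    return min(xs), max(xs)


def scale(xs, lo, hi, a=-1.0, b=1.0):
    """Map values from [lo, hi] linearly onto the target interval [a, b]."""
    return [a + (b - a) * (x - lo) / (hi - lo) for x in xs]


def unscale(ys, lo, hi, a=-1.0, b=1.0):
    """Invert scale(): map values from [a, b] back onto [lo, hi]."""
    return [lo + (y - a) * (hi - lo) / (b - a) for y in ys]


p = [2.0, 4.0, 10.0]
lo, hi = fit_range(p)

pn = scale(p, lo, hi)                  # target interval [-1, 1]
pn_01_09 = scale(p, lo, hi, 0.1, 0.9)  # target interval [0.1, 0.9]
restored = unscale(pn, lo, hi)         # back to the original units

print(pn, pn_01_09, restored)
```

Keeping `lo` and `hi` around matters in practice: network outputs produced in the scaled space have to be mapped back with the same parameters to be interpreted in the original units.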
5 Note
It should be stressed that not every problem requires normalizing the raw data in advance. Normalization is not mandatory and depends on the specific problem: tests show that prediction accuracy with normalized data is sometimes much lower than without it. With the max-min method, normalizing the raw data implicitly assumes that the maximum (minimum) of each feature component over the test set is no greater (no smaller) than the maximum (minimum) of that component over the training set. This assumption is clearly too strong and does not necessarily hold in practice. The mean-variance method has a similar problem. Where normalization does help, it first prevents one dimension or a few dimensions from having too much influence when the number of dimensions is very large, and second lets the program run faster. There are many methods, such as min-max, z-score, and p-norm; which to use depends on the characteristics of the data set.
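The caveat about the max-min method is easy to reproduce. In the sketch below, the minimum and maximum are taken from a hypothetical training set and then applied to a hypothetical test set that contains values outside the training range. The optional clipping step is one possible workaround assumed here for illustration, not something the source prescribes.

```python
def fit_min_max(train):
    """Learn the scaling parameters from the training data only."""
    return min(train), max(train)


def apply_min_max(xs, lo, hi, clip=False):
    """Scale xs with previously learned lo/hi; optionally clip to [0, 1]."""
    out = [(x - lo) / (hi - lo) for x in xs]
    if clip:
        out = [min(1.0, max(0.0, y)) for y in out]
    return out


train = [10.0, 20.0, 30.0]
test = [5.0, 25.0, 40.0]   # 5.0 and 40.0 lie outside the training range

lo, hi = fit_min_max(train)
raw = apply_min_max(test, lo, hi)
clipped = apply_min_max(test, lo, hi, clip=True)

print(raw)      # values below 0 and above 1 appear
print(clipped)  # forced back into [0, 1], at the cost of losing information
```

Neither outcome is ideal: the unclipped values break the assumed [0, 1] range, while clipping makes distinct test values indistinguishable, which is why the choice of method should follow the characteristics of the data.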

(Editor: Heiyang)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
