Normalization and standardization in data preprocessing:
A. Normalization vs. standardization
Normalization: Rescale the data to be processed (by some algorithm) into a required range. It is done first to make later data processing more convenient, and second to help the program converge faster when it runs. It generally means limiting the data to [0 1].
Mapping the numbers into the interval (0,1) is proposed mainly to make data processing easier; data mapped to the 0-1 range is more convenient and faster to work with.
It also turns a dimensional expression into a dimensionless one, so that the value becomes a pure quantity.
In general, min-max normalization is used, which applies a linear transformation to the original data: x* = (x - xmin) / (xmax - xmin)
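The min-max formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a MATLAB routine; the function name min_max_normalize, the sample data, and the handling of the all-equal case are assumptions made for this example.

```python
def min_max_normalize(values):
    """Map values linearly onto [0, 1] using x* = (x - xmin) / (xmax - xmin)."""
    x_min = min(values)
    x_max = max(values)
    span = x_max - x_min
    if span == 0:
        # All values are equal, so the formula would divide by zero.
        # Returning zeros is a choice made for this sketch only.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


heights_cm = [150.0, 160.0, 175.0, 200.0]  # made-up sample data
normalized = min_max_normalize(heights_cm)
print(normalized)  # [0.0, 0.2, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, and the result has no unit.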
Standardization: Scale the raw data proportionally so that it falls into a small, specific interval. The usual approach transforms the data to have a mean of 0 and a variance of 1. This method can be used even if the data does not follow a normal distribution, and the standardized data can contain negative values.
Because the indicators in a credit indicator system are measured in different units, they must be standardized and mapped to a common numerical range by a function transformation before they can take part in the evaluation calculation.
Same-direction processing of data: This addresses indicators of a different nature. Directly adding up indicators of a different nature cannot correctly reflect their combined effect, so the inverse indicators (those where a smaller value is better) must first be transformed so that all indicators act on the evaluation in the same direction; only then does the sum give a correct result.
Dimensionless processing: This addresses the comparability of the data.
The z-score method is generally used: x* = (x - mean) / std, so that the result has a mean of 0 and a variance of 1 (the standard normal distribution if the original data is normally distributed).
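A minimal Python sketch of z-score standardization follows. It assumes the sample standard deviation (dividing by N - 1); the function name z_score and the sample data are made up for illustration.

```python
import statistics


def z_score(values):
    """Standardize values to mean 0 and unit variance: (x - mean) / std."""
    mean = statistics.fmean(values)
    # Assumption for this sketch: sample standard deviation (divides by N - 1).
    std = statistics.stdev(values)
    if std == 0:
        raise ValueError("all values are equal; z-score is undefined")
    return [(x - mean) / std for x in values]


scores = [2.0, 4.0, 4.0, 6.0, 9.0]  # made-up sample data
standardized = z_score(scores)
print(standardized)
```

Values below the mean come out negative, which is why standardized data, unlike data normalized to [0 1], can contain negative numbers.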
In MATLAB, there are three ways to normalize data:
(1) premnmx, postmnmx, tramnmx. premnmx normalizes the data to [-1 1], tramnmx applies the same transformation to the test set inputs, and postmnmx converts the test set outputs back to the original scale.
(2) prestd, poststd, trastd. prestd normalizes the data to zero mean and unit variance.
(3) Writing your own code. Self-written normalization generally maps the data to [0.1 0.9].
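For the third option, a hand-written mapping to [0.1 0.9] might look like the Python sketch below. The function name scale_to_range and its parameters low and high are hypothetical; the text only states the target range.

```python
def scale_to_range(values, low=0.1, high=0.9):
    """Linearly map values so the minimum becomes low and the maximum high."""
    x_min = min(values)
    x_max = max(values)
    if x_max == x_min:
        raise ValueError("all values are equal; cannot scale")
    return [low + (high - low) * (x - x_min) / (x_max - x_min) for x in values]


scaled = scale_to_range([10.0, 20.0, 30.0])  # made-up sample data
print(scaled)
```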
B. Why should I use normalization?
Singular sample data means sample vectors that are especially large or especially small relative to the other input samples. Singular samples increase the network training time and may keep the network from converging. A training set that contains singular samples should therefore be normalized before training; if there are no singular samples, prior normalization is not needed.
C. Normalization can also be done with mapminmax.
This function normalizes each row of a matrix to [a b]. The default is [-1 1].
[y1,ps] = mapminmax(x1,a,b), where x1 is the matrix to be normalized, y1 is the result, and ps stores the mapping settings.
When another set of data needs to be normalized in the same way (for example, the training data for an SVM was normalized with the call above and the test data must use the same mapping), use: y2 = mapminmax('apply',x2,ps)
To restore normalized data to its original scale, use: x1_again = mapminmax('reverse',y1,ps)
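The fit / apply / reverse workflow described above can be sketched in plain Python. This is not MATLAB's implementation: the function names fit_row_minmax, apply_row_minmax and reverse_row_minmax and the settings dictionary are hypothetical stand-ins for mapminmax and ps, and the sketch assumes every row contains at least two distinct values.

```python
def fit_row_minmax(rows, y_min=-1.0, y_max=1.0):
    """Normalize each row to [y_min, y_max]; return the result and the settings."""
    for row in rows:
        if min(row) == max(row):
            raise ValueError("constant rows are not handled in this sketch")
    settings = {
        "y_min": y_min,
        "y_max": y_max,
        "x_min": [min(row) for row in rows],  # per-row minimum of the fitted data
        "x_max": [max(row) for row in rows],  # per-row maximum of the fitted data
    }
    return apply_row_minmax(rows, settings), settings


def apply_row_minmax(rows, settings):
    """Apply a previously fitted mapping to new data with the same rows."""
    y_min, y_max = settings["y_min"], settings["y_max"]
    out = []
    for row, lo, hi in zip(rows, settings["x_min"], settings["x_max"]):
        out.append([(y_max - y_min) * (x - lo) / (hi - lo) + y_min for x in row])
    return out


def reverse_row_minmax(rows, settings):
    """Map normalized data back to the original scale."""
    y_min, y_max = settings["y_min"], settings["y_max"]
    out = []
    for row, lo, hi in zip(rows, settings["x_min"], settings["x_max"]):
        out.append([(y - y_min) * (hi - lo) / (y_max - y_min) + lo for y in row])
    return out


x1 = [[1.0, 2.0, 3.0], [10.0, 30.0, 20.0]]  # training data, one feature per row
x2 = [[2.0, 4.0], [15.0, 40.0]]             # test data with the same features

y1, ps = fit_row_minmax(x1)          # y1 == [[-1.0, 0.0, 1.0], [-1.0, 1.0, 0.0]]
y2 = apply_row_minmax(x2, ps)        # y2 == [[0.0, 2.0], [-0.5, 2.0]]
x1_again = reverse_row_minmax(y1, ps)
```

Because the test data reuses the minimum and maximum of the training data, test values outside the training range map outside [-1 1], as the 2.0 entries in y2 show.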
D. MATLAB command description
1. mean: computes the mean. mean(x,1) gives the mean of each column, mean(x,2) gives the mean of each row, and mean2(x) gives the mean of all elements of the matrix.
2. std: computes the standard deviation. std(x,0,1) gives the standard deviation of each column, std(x,0,2) gives the standard deviation of each row, and std2(x) gives the standard deviation of all elements of the matrix.
3. var: computes the variance, var(x).
4. SSE: sum of squared errors, sse(x). The closer it is to 0, the better the fit and the more successful the data prediction.
5. MSE: mean squared error, mse(x) = sse(x)/N. It is interpreted the same way as SSE.
6. R-square: coefficient of determination. It characterizes how well a fit explains the variation in the data. Its normal range is [0 1]; the closer it is to 1, the more of the variation in y the variables of the equation explain, and the better the model fits the data.
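The three fit measures at the end of the list can be illustrated with a short Python sketch. Two assumptions are made here: the functions take target and predicted values rather than a precomputed error vector, and R-square is computed with the common definition 1 - SSE/SST, where SST is the sum of squared deviations of the targets from their mean. The sample data is made up.

```python
def sse(targets, predictions):
    """Sum of squared errors between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))


def mse(targets, predictions):
    """Mean squared error: SSE divided by the number of samples N."""
    return sse(targets, predictions) / len(targets)


def r_squared(targets, predictions):
    """Coefficient of determination, assuming the definition 1 - SSE/SST."""
    mean_t = sum(targets) / len(targets)
    sst = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - sse(targets, predictions) / sst


y = [1.0, 2.0, 3.0, 4.0]      # made-up observed values
y_hat = [1.1, 1.9, 3.2, 3.8]  # made-up fitted values

print(sse(y, y_hat))        # approximately 0.10
print(mse(y, y_hat))        # approximately 0.025
print(r_squared(y, y_hat))  # approximately 0.98
```

A perfect fit gives SSE = 0, MSE = 0 and R-square = 1.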