Anyone who works with neural networks knows that the data needs to be normalized, but why normalization is needed has always been a somewhat vague question, and there is no clear answer online. I spent some time on it, did some research, and here is a careful analysis of why we normalize:
1. Numerical problems.
There is no doubt that normalization can indeed avoid some unnecessary numerical problems. The magnitude of the input variables may not look like something that would cause numerical trouble, but it is not that hard for it to happen. The reason is that the nonlinear interval of tansig is roughly [-1.7, 1.7], which means that for a neuron to be effective, the net input w1*x1 + w2*x2 + b inside tansig(w1*x1 + w2*x2 + b) should be of order 1 (1.7 is of that order of magnitude). When the inputs are large, the weights must be correspondingly small; one number is very large, the other very small, and multiplying them together is what triggers the numerical problems.
Suppose your input is 421. You may not feel this is a large number, but an effective weight would then have to be around 1/421, for example 0.00243. Now type the comparison 421*0.00243 == 0.421*2.43 into MATLAB.
Mathematically the two products are identical, yet MATLAB reports them as unequal, which shows that a numerical problem has already crept in.
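For the curious, here is a quick check of this effect (the exact outcome depends on floating-point rounding, but with ordinary IEEE double precision the two products are typically rounded differently):

    % Both products equal 1.02303 on paper, but they are rounded
    % differently in double precision, so the equality test can fail.
    a = 421   * 0.00243;
    b = 0.421 * 2.43;
    disp(a == b)   % typically 0 (false)
    disp(a - b)    % a tiny residual on the order of eps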
2. The needs of the solver
Once we have built the neural network model, the network can predict correctly as long as the solution we find is good enough. Normalizing the data before training means we want the data to be in a form that is more convenient to solve.
So where exactly does this convenience for the solver come from?
This question cannot be answered in general: different algorithms benefit from normalization in different ways. If there were some miraculously powerful solver, it might not need normalization at all, but the mainstream algorithms in use today mostly do, especially the commonly used gradient descent method (and methods derived from it); for gradient descent, normalizing or not makes a very large difference. Different algorithms also depend on normalization to different degrees: the Levenberg-Marquardt algorithm (trainlm in the MATLAB toolbox), for example, depends on normalization far less strongly than gradient descent (traingd in MATLAB) does.
Since different algorithms have different reasons for wanting normalized data, and space is limited, this article only takes gradient descent as an example.
To review the gradient method: it generally initializes an initial solution, then computes the gradient and iteratively updates via new solution = old solution - learning rate * gradient, exiting the loop once the termination condition is met.
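As a minimal sketch of that loop (the loss, learning rate and stopping rule below are illustrative placeholders, not the BP training code discussed in this article):

    % Gradient descent on a toy quadratic loss (w - 3)^2.
    lossGrad = @(w) 2*(w - 3);          % gradient of the toy loss
    w  = 10;                            % initial solution
    lr = 0.1;                           % learning rate
    for iter = 1:1000
        g = lossGrad(w);
        w = w - lr * g;                 % new solution = old solution - lr*gradient
        if abs(g) < 1e-6, break; end    % termination condition
    end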
First look at the benefits of normalization for initialization:
(1) Initialization
Anyone who has written initialization code will have noticed that the range of the input data affects the effect of our initialization. For example, a neuron's value is tansig(w1*x1 + w2*x2 + b). Because the tansig function only has good nonlinearity in the interval [-1.7, 1.7], we want the range of w1*x1 + w2*x2 + b to intersect [-1.7, 1.7] (in practice more delicate conditions are required), so that the neuron can make use of its nonlinear part.
We want to initialize every neuron into an effective state, so we need to know the range of w1*x1 + w2*x2 + b, which means we need to know the range of the input data (and of the output data).
The range of the input data therefore has an unavoidable influence on initialization. When initialization methods are discussed, the data range is usually assumed to be [0,1] or [-1,1], which makes the discussion much simpler. So if the data has already been normalized, the initialization module can be written with a simpler, clearer approach.
Note: the MATLAB toolbox takes the range of the data into account when initializing the weights and thresholds, so even if your data is not normalized, the MATLAB toolbox's initialization will not suffer.
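As an illustration of what "taking the range into account" can look like, here is a hypothetical range-aware initialization for a single tansig neuron. The scaling rule is a simple heuristic invented for this example; it is not the toolbox's actual algorithm, which is more involved.

    % Illustrative range-aware initialization for one tansig neuron.
    xRange = [-1 1; -100 100];        % assumed [min max] of each input variable
    n = size(xRange, 1);              % number of inputs
    w = 2*rand(1, n) - 1;             % raw weights in [-1, 1]
    w = w .* (1.7 ./ (n * max(abs(xRange), [], 2)'));  % so |w(i)*xi| <= 1.7/n
    b = 0.1*(2*rand - 1);             % keep the threshold small as well
    % For inputs inside xRange, w(1)*x1 + w(2)*x2 + b now stays roughly
    % within tansig's useful interval [-1.7, 1.7].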
(2) Gradient
Take a three-layer BP network (input - hidden layer - output) as an example. The gradient of an input-to-hidden weight has the form 2e*w*(1-a^2)*x (e is the error, w is the hidden-to-output weight, a is the hidden neuron's value, x is the input). If the output is of a large order of magnitude, e will be of a large order of magnitude; likewise, to bring the hidden layer (order of magnitude 1) up to the output's scale, w must be very large; and x is also very large. As the gradient formula shows, multiplying these three together makes the gradient enormous, and that causes numerical problems when the gradient update is applied.
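A rough numerical illustration of that blow-up (all numbers below are made up for illustration):

    % How raw-scale data inflates the input-to-hidden gradient 2*e*w*(1-a^2)*x.
    x_raw  = 400;   y_scale_raw  = 8000;   % unnormalized input / output scale
    x_norm = 0.4;   y_scale_norm = 0.8;    % the same data mapped to [-1, 1]
    a      = 0.5;                          % a hidden tansig value, always O(1)
    w_raw  = 1e4;   e_raw  = 0.1 * y_scale_raw;   % w and e follow the output scale
    w_norm = 1;     e_norm = 0.1 * y_scale_norm;
    g_raw  = 2*e_raw *w_raw *(1 - a^2)*x_raw      % about 4.8e9
    g_norm = 2*e_norm*w_norm*(1 - a^2)*x_norm     % about 0.048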
(3) Learning rate
From (2) we know that the gradient can be very large, so the learning rate must be very small. The choice of learning rate (its initial value) therefore has to refer to the range of the inputs. It is simpler to just normalize the data, so that the learning rate no longer has to be adjusted according to the data range.
The hidden-to-output weight gradient can be written as 2e*a, while the input-to-hidden weight gradient is 2e*w*(1-a^2)*x. Being affected by x and w, the two gradients have different orders of magnitude, so they need learning rates of different orders of magnitude. A learning rate suitable for one group of weights (call it w1) may be too small for the other (w2): using the rate suited to w1 makes progress in the w2 direction very slow and wastes a lot of time, while using the rate suited to w2 is too large for w1, so the search overshoots and cannot settle on a suitable w1.
If a fixed learning rate is used, and the data is not normalized, the consequences can be imagined.
However, if an adaptive learning rate is used, as in the MATLAB toolbox, the learning-rate problem is somewhat mitigated.
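To make the fixed-rate case concrete, here is a toy illustration on a quadratic bowl (purely illustrative, not a BP network; the numbers are made up): with one shared, fixed learning rate, the badly scaled problem needs orders of magnitude more iterations to reach the same tolerance.

    % Gradient descent on f(w) = (x1*w1 - 1)^2 + (x2*w2 - 1)^2 with one
    % shared, fixed learning rate.
    steps  = zeros(1, 2);
    scales = [1 1; 1 100];            % row 1: both inputs O(1); row 2: x2 is 100x larger
    for k = 1:2
        x1 = scales(k, 1);  x2 = scales(k, 2);
        w  = [0; 0];
        lr = 0.4 / max(x1, x2)^2;     % a safe rate for the steepest direction
        for iter = 1:300000
            g = [2*x1*(x1*w(1) - 1); 2*x2*(x2*w(2) - 1)];
            w = w - lr * g;
            if norm(g) < 1e-6, break; end
        end
        steps(k) = iter;
    end
    steps   % the badly scaled case needs orders of magnitude more iterations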
(4) Search trajectory
As mentioned earlier, different input ranges give different effective ranges for the corresponding weights. Suppose the effective range of w1 is [-10,10] while that of w2 is [-100,100], and the gradient step moves 1 unit at a time. Each step then covers 1/20 of w1's range but only 1/200 of w2's range. In this sense the step along w2 is relatively smaller, and w1 "walks" through its range faster than w2 during the search, so the search trajectory leans more towards the w1 direction.
Setting aside the question of which route finds a good solution more efficiently, the straight line is the shortest distance between two points; a trajectory that zig-zags instead is obviously more time consuming, so without normalization the training time increases noticeably.
From the analysis above, leaving the numerical problems aside, the main effect of not normalizing is that the partial derivatives along different dimensions have inconsistent orders of magnitude. Let's run a small test.
3. A small experiment
Suppose we have two input variables: x1 ranges over [-1,1] but x2 ranges over [-100,100], and the output ranges over [-1,1]. If the x2 input data is not normalized, how should the training process be modified so that the training result matches that of normalized data?
From the discussion above we know that leaving x2 enlarged makes the gradient of w2 very large, so when computing w2's gradient we need to divide it by 100 to bring its order of magnitude in line with w1's. Then, when updating w, the effective range of w1 (about 1/1) is 100 times the effective range of w2 (about 1/100), so w2 should move in steps 1/100 as large; hence w2's learning rate also needs to be divided by 100.
Modified in this way, and setting numerical issues aside, training gives the same result as training on normalized data. The full code for this experiment is not shown here, because it would require the whole BP implementation; interested readers can modify their own code and try it.
This is only a special case, illustrating that, numerical issues aside, non-normalization affects just these two places. If instead the range of x2 were [100,300], simply dividing by 100 would not do; a more complex transformation would be needed, which I won't go into here.
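For the curious, here is a minimal single-neuron sketch of the equivalence described above (not the full three-layer BP experiment; the data, targets and learning rate are made up, and tanh is used because it is numerically the same as tansig): with x2 scaled up 100 times, dividing w2's gradient by 100 and w2's learning rate by 100 reproduces the normalized run.

    % Run A trains on normalized data; run B trains on raw x2 with the
    % two modifications above.  At the end, 100*wB(2) matches wA(2).
    rng(0);
    N   = 50;
    x1  = 2*rand(1, N) - 1;                % already in [-1, 1]
    x2n = 2*rand(1, N) - 1;                % normalized version of x2
    x2  = 100 * x2n;                       % raw x2 in [-100, 100]
    t   = tanh(0.5*x1 - 0.3*x2n);          % a realizable target in [-1, 1]
    lr  = 0.1;
    wA  = [0.1; -0.2];      bA = 0;        % run A: normalized data
    wB  = [0.1; -0.2/100];  bB = 0;        % run B: raw x2, w2 scaled to match
    for iter = 1:300
        % run A: plain gradient descent on normalized inputs
        aA = tanh(wA(1)*x1 + wA(2)*x2n + bA);
        gA = -2*((t - aA).*(1 - aA.^2))/N;   % dL/d(net), L = mean squared error
        wA = wA - lr*[gA*x1'; gA*x2n'];
        bA = bA - lr*sum(gA);
        % run B: raw x2; w2's gradient AND learning rate are divided by 100
        aB = tanh(wB(1)*x1 + wB(2)*x2 + bB);
        gB = -2*((t - aB).*(1 - aB.^2))/N;
        wB = wB - [lr*(gB*x1'); (lr/100)*((gB*x2')/100)];
        bB = bB - lr*sum(gB);
    end
    disp(max(abs(wA - [wB(1); 100*wB(2)])))  % ~0: the two runs coincide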
These are the reasons for normalizing when training a three-layer BP neural network with gradient descent. Other neural network models have other reasons, which I will not analyze here.
4. Recommendations for using the MATLAB toolbox
A few points to note when using the MATLAB toolbox. matlab2012b (and later) automatically normalizes the input data, so you do not have to preprocess the inputs yourself; you can build the network directly from the raw data. The output, however, still needs to be normalized, because the toolbox computes the error on the raw data: the error magnitude can then be huge, making the gradient huge, and before the adaptive learning rate has time to shrink, the gradient blows the carefully initialized weights away, throwing the network weights to a place very far from the optimal solution. So if you use the MATLAB neural network toolbox with the gradient descent method, the output must be normalized.
But if we use the default trainlm method rather than the gradient descent method traingd, the effect is not as serious as with traingd. We can see why from trainlm's (Levenberg-Marquardt's) formula for the search direction h:
h = -(J'J + uI)^(-1) * J'f
Since the orders of magnitude of J'J and J'f do not differ too much from each other, and u is adjusted adaptively, an appropriate h is obtained in the end.
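In code, that direction amounts to solving a damped normal equation. A small sketch (J, f and mu stand for the Jacobian, the error vector and the damping factor; the values are made up):

    % Levenberg-Marquardt search direction, written out for illustration.
    J  = randn(50, 3);  f = randn(50, 1);  mu = 0.001;
    h  = -(J'*J + mu*eye(size(J, 2))) \ (J'*f);   % solve (J'J + u*I) h = -J'f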
Also note that the network weights obtained with the matlab2012b (or later) toolbox correspond to the normalized data. So when using the network, we need to normalize the new data first, feed the normalized input into the network to compute the network output, and then map the output back (reverse the normalization) to get the real prediction. If you want to fold the normalization into the network weights themselves, see the article <Extracting the weights and thresholds corresponding to the original data>.
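A sketch of that workflow using the toolbox's mapminmax (the data and the 5-neuron network here are made up, and feedforwardnet/train require the Neural Network Toolbox; in practice psIn and psOut come from your own training script):

    % Normalize with mapminmax, train, then reverse-map the predictions.
    x = rand(2, 100)*200 - 100;            % raw training inputs
    y = 0.01*sum(x, 1);                    % raw training targets (a made-up relation)
    [xn, psIn]  = mapminmax(x);            % normalize inputs,  keep the settings
    [yn, psOut] = mapminmax(y);            % normalize targets, keep the settings
    net = feedforwardnet(5);
    net = train(net, xn, yn);              % train on the normalized data
    xNewRaw = rand(2, 10)*200 - 100;       % new raw data to predict on
    xNew  = mapminmax('apply',   xNewRaw, psIn);
    yPred = mapminmax('reverse', net(xNew), psOut);  % back to the original scale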
5. Personal Insights
Here are some answers from netizens on why normalization is needed (additions welcome):
1. Avoid numerical problems.
2. Fast convergence of the network.
3. The sample features have different dimensions (units), so the data needs to be made dimensionless to unify the evaluation standard.
4. BP often uses the sigmoid function as the transfer function, and normalization can prevent neuron output saturation caused by net inputs with large absolute values.
5. Ensure that small values in the output data are not swallowed.
In fact, most of the analysis in this article starts from these netizens' views and explores further on their basis, but on some of the points my view differs slightly:
(1) Making the network converge faster: agree.
(2) Avoid numerical problems: agree.
(3) Unifying dimensions: I think this belongs to the business level and is irrelevant to the network training itself.
(4) Avoiding neuron saturation: what enters the sigmoid is the input multiplied by the weights plus the threshold, and with a good initialization the output will not saturate. It is only when an initialization such as "randomly initialize the weights and thresholds to [-1,1]" is used that non-normalization causes neuron output saturation (see the sketch after this list).
(5) Large numbers swallowing small ones: if we find the right weights, nothing is swallowed. For example, with x1=10000, x2=1, and w1=0.0001, w2=1, w1*x1 does not swallow w2*x2.
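A quick illustration of point (4), with made-up numbers: under a naive [-1,1] initialization an unnormalized input drives tansig straight into saturation, while a weight scaled to the input's magnitude does not.

    % tansig defined inline so the snippet needs no toolbox (it equals tanh).
    x      = 421;                 % unnormalized input
    wNaive = 0.6;                 % weight drawn from [-1, 1]
    wGood  = 0.002;               % weight scaled to the input's magnitude
    tansig = @(n) 2 ./ (1 + exp(-2*n)) - 1;
    aNaive = tansig(wNaive * x)   % ~1.0000, saturated: gradient factor 1-a^2 ~ 0
    aGood  = tansig(wGood  * x)   % ~0.69, still in the useful nonlinear region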
Afterword
Many details have not been discussed in depth in this article; opening up those discussions would make the article very long-winded and lose its theme. Secondly (and this is the main reason), studying normalization can only make its benefits clearer and reduce our doubts about it; it cannot by itself make our networks perform better. So this article only sketches, broadly and not very rigorously, the benefits normalization brings to each aspect of the training process.
Thank you for reading.
==============< Original article; when reprinting, please credit the Neural Network Home, www.nnetinfo.com > ================
==========< please keep the address of this article: http://www.nnetinfo.com/nninfo/editText.jsp?id=37>=============