I've been having some trouble with my CNN lately. I'm writing it from scratch in C++, organized as layered modules, with the goal of later doing large-scale dynamic incremental learning on the CNN. After writing and debugging the code, the results differed noticeably from what I expected. There are many parameters to keep track of, and a single missing sign on any one of them can break the whole network. So I went back and checked the BP network for errors from the very beginning.
1. Use a 1-dimensional input value and test a 3-layer BP network.
1> Binary classification with two output nodes, 2 samples. The network converges quickly. Sample format is <x, y>, where x is the input value and y is the class label.
<0, 0>; <1, 1>
Sigmoid activation: the network converges.
ReLU activation: the network converges quickly.
There is a trap here: the ReLU activation can only be used for layers other than the output layer. The last layer must use sigmoid; otherwise, on more complex data the network fails to converge, or even diverges. (With only 2 samples, using ReLU in the output layer actually converges faster.) It took many rounds of testing before I discovered this; searching online afterwards, I found a few articles that mention it too. Most people use softmax directly as the output layer, so they never run into the troublesome issue I'm describing. A sketch of the two activations is given below.
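For reference, here is a minimal sketch of the two activations and their derivatives as used in these tests (the function names are illustrative, not the ones in my actual code):

#include <cmath>

// Sigmoid: squashes to (0, 1); safe to use in the output layer.
double sigmoid(double x)       { return 1.0 / (1.0 + std::exp(-x)); }
double sigmoid_deriv(double y) { return y * (1.0 - y); }   // y = sigmoid(x)

// ReLU: fine for hidden layers, but in my tests the output layer
// has to stay sigmoid or the network may fail to converge.
double relu(double x)          { return x > 0.0 ? x : 0.0; }
double relu_deriv(double x)    { return x > 0.0 ? 1.0 : 0.0; }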
2> Binary classification with two output nodes, 3 samples.
<0, 0>;<0.5, 1>; <1, 0>
Sigmoid activation: the network converges.
ReLU activation: the network also converges. But with sigmoid the learning rate can be pushed up to 0.5 or even 0.9, while with ReLU, once the learning rate goes above 0.15 the network no longer converges, or it diverges.
3> Binary classification with two output nodes, 4 samples. This is where the strange phenomenon showed up.
<0, 0>;<0.3, 1>; <0.6, 0>; <1, 1>
Sigmoid activation: the network converges. ReLU activation: the network does not converge. No matter how I adjusted the learning rate or checked for bugs, the network still diverged after training.
To get to the bottom of this, I searched online for the simplest 3-layer BP network program, modified it to use ReLU activation, and it still converged. I compared the two code bases repeatedly for logic errors and could not find anything wrong. I varied the numerical ranges, the number of neurons, and so on, looking for possible causes. This dragged on for several days, nearly a week, with no result. The logic of my code turned out to be almost identical to the simple BP network downloaded online. In the end, the only difference I noticed was the order in which the network parameters are initialized.
My network uses dynamic initialization during use: each time a neuron is created, its weight vector W and its bias b are initialized together, one neuron at a time. The simple BP network is a fixed network that is initialized once at program start, and it initializes the weights horizontally across a layer rather than per neuron: for a layer of neurons, it first initializes the first weight of every neuron, then the second weight of every neuron, then the third, and so on. The biases b of the neurons in a layer are likewise initialized together as one vector.
So I modified the initialization order in the simple BP network to match mine, and sure enough it also started to diverge.... The two orderings are sketched below.
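Here is a minimal sketch of the two initialization orderings as I understand them; the rand-based initializer and the names are illustrative only, not my actual code. Both consume the same random sequence, just in a different order, so each individual weight ends up with a different value.

#include <cstdlib>
#include <vector>

// Illustrative uniform initializer in [-0.5, 0.5).
double rand_weight() { return std::rand() / (double)RAND_MAX - 0.5; }

// (a) Per-neuron order (my dynamic initialization):
// neuron 0 gets all of its weights plus its bias, then neuron 1, and so on.
void init_per_neuron(std::vector<std::vector<double>>& W, std::vector<double>& b,
                     int out_n, int in_n) {
    W.assign(out_n, std::vector<double>(in_n));
    b.assign(out_n, 0.0);
    for (int j = 0; j < out_n; ++j) {
        for (int i = 0; i < in_n; ++i) W[j][i] = rand_weight();
        b[j] = rand_weight();
    }
}

// (b) Horizontal, layer-wise order (the downloaded BP demo):
// the first weight of every neuron, then the second weight of every neuron, ...
// and finally the biases of the whole layer as one vector.
void init_per_layer(std::vector<std::vector<double>>& W, std::vector<double>& b,
                    int out_n, int in_n) {
    W.assign(out_n, std::vector<double>(in_n));
    b.assign(out_n, 0.0);
    for (int i = 0; i < in_n; ++i)
        for (int j = 0; j < out_n; ++j) W[j][i] = rand_weight();
    for (int j = 0; j < out_n; ++j) b[j] = rand_weight();
}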
I'm not from a mathematics background and am self-taught, and my math isn't great, so for the moment I haven't found the reason. Articles online say that with ReLU, initialization from Gaussian random numbers works better, so I want to try that and see whether it has anything to do with this horizontal-versus-per-neuron ordering. Both are random initializations, yet the difference is this big... And sigmoid is completely unaffected by it, while for ReLU it is absolutely fatal....
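A minimal sketch of the Gaussian initialization I want to try; the sqrt(2/fan_in) scale is the common He heuristic recommended for ReLU, which is an assumption on my part, not what the network currently does:

#include <cmath>
#include <random>
#include <vector>

// Sketch: Gaussian weight initialization for one fully connected layer.
void init_gaussian(std::vector<std::vector<double>>& W, std::vector<double>& b,
                   int out_n, int in_n, std::mt19937& rng) {
    std::normal_distribution<double> dist(0.0, std::sqrt(2.0 / in_n)); // He scale
    W.assign(out_n, std::vector<double>(in_n));
    b.assign(out_n, 0.0);                        // biases usually start at zero
    for (int j = 0; j < out_n; ++j)
        for (int i = 0; i < in_n; ++i)
            W[j][i] = dist(rng);
}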
2016-5-19:
I repeated the test on the problem above. With the initialization order only slightly different, ReLU may diverge while sigmoid does not. Training a neural network is far too fragile. After rethinking it, I also randomized the order in which samples are fed into the network. With that, it seemingly no longer matters how the weights and biases are initialized, and convergence is faster too. (A sketch of shuffling the samples is below.)
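A minimal sketch of shuffling the sample order each epoch; train_one stands in for whatever the per-sample update is, and the names here are illustrative:

#include <algorithm>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

// Visit the training samples in a freshly shuffled order every epoch.
void train_epochs(int n_samples, int n_epochs,
                  const std::function<void(int)>& train_one) {
    std::mt19937 rng(std::random_device{}());
    std::vector<int> order(n_samples);
    std::iota(order.begin(), order.end(), 0);   // 0, 1, ..., n_samples - 1
    for (int e = 0; e < n_epochs; ++e) {
        std::shuffle(order.begin(), order.end(), rng);
        for (int idx : order) train_one(idx);   // one update per sample
    }
}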
But testing again, there still seems to be a problem...
I wonder whether any expert out there can analyze this situation....
2016-5-20:
Searched online again, with no luck...
After a lot of thought, I decided to add one more layer to the network: the last two layers use sigmoid activation, and the other layers use ReLU. This seems to satisfy both the traditional BP network setup and ReLU's role in propagating activations through the deeper layers. Comparing against most CNNs, just by inspection, it does look like there is more than one sigmoid-style layer near the output. (The per-layer activation choice is sketched after the next line.)
After this modification, the network converges successfully.
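A minimal sketch of how this per-layer activation choice could be expressed (the enum and function names are illustrative, not my actual class layout):

#include <vector>

enum class Act { Relu, Sigmoid };

// ReLU for every layer except the last two, which stay sigmoid.
// n_layers counts the weight layers of the network.
std::vector<Act> pick_activations(int n_layers) {
    std::vector<Act> acts(n_layers, Act::Relu);
    for (int l = n_layers - 2; l < n_layers; ++l)
        if (l >= 0) acts[l] = Act::Sigmoid;
    return acts;
}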
2016-6-16:
1 ReLU layer + 1 sigmoid layer + 1 sigmoid layer, tested on 5 input values with the two classes interleaved. Still does not converge; neither the ReLU version nor the all-sigmoid version converges.
After adding batch normalization, ReLU still does not converge, but changing the ReLU layer to sigmoid makes it converge. And with BN it is fast: training that used to take tens of thousands of iterations now takes hundreds, sometimes only dozens. That's a speedup of ten times to more than a hundredfold. (A sketch of the BN forward pass is at the end.)
But without ReLU, won't multi-layer training run into the problem of training getting stuck and going nowhere?...
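For reference, a minimal sketch of the batch-normalization forward pass for one neuron over a mini-batch; gamma/beta are the standard learnable scale and shift, and the names are illustrative rather than my actual code:

#include <cmath>
#include <vector>

// x holds the pre-activation of the same neuron for every sample in the batch.
std::vector<double> batch_norm_forward(const std::vector<double>& x,
                                       double gamma, double beta,
                                       double eps = 1e-5) {
    const int n = (int)x.size();
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;

    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= n;

    std::vector<double> y(n);
    for (int i = 0; i < n; ++i)
        y[i] = gamma * (x[i] - mean) / std::sqrt(var + eps) + beta;
    return y;
}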