Summary:
We introduce Adam, a gradient-based algorithm for optimizing stochastic objective functions. "Stochastic" here means the objective function differs at each iteration of training. Because memory is limited (among other reasons), the algorithm does not read all records at once to compute the error; instead it splits the dataset and trains on only a subset of records in each iteration. This subset is called a minibatch. Since each iteration uses a different small batch of data, the loss function differs from iteration to iteration, hence the term "stochastic objective function". Another motivation is that small-batch training can reduce the risk of converging to a local optimum (imagine a small ball rolling on uneven ground: the ball easily falls into pits, and these pits are not the lowest point).
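The minibatch idea described above can be sketched as follows; this is a minimal illustration assuming NumPy arrays, and the function name is mine, not from the paper:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the dataset once, then yield successive mini-batches.

    Each epoch reshuffles, so every iteration sees a different subset
    and therefore a different (stochastic) loss function.
    """
    idx = rng.permutation(len(X))  # random order over all records
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]
```

Each yielded pair is one small batch; the training loop computes the loss and gradient on that batch only, instead of on the full dataset.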
Brief introduction:
Adam's name derives from "adaptive moment estimation". In probability theory, if a random variable X follows some distribution, the first moment of X is E(X), estimated by the sample mean, and the second moment of X is E(X^2), estimated by the mean of the squared samples. The Adam algorithm dynamically adjusts the learning rate of each parameter based on first-moment and second-moment estimates of that parameter's gradient. Adam is still based on gradient descent, but the learning step for each parameter in each iteration stays within a definite range: a large gradient does not produce a large learning step, so the parameter values remain stable. It does not require a stationary objective, works with sparse gradients, and naturally performs a form of step-size annealing. As I understand it, this helps reduce the risk of the model converging to a local optimum.
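The update rule behind this description can be sketched as a single Adam step; this is a minimal NumPy sketch of the published algorithm (default hyperparameters lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8), with the function name being my own:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter theta at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    # Effective step is roughly bounded by lr: a large gradient does not
    # produce a large step, because v_hat grows along with m_hat.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Note how the update divides the first-moment estimate by the square root of the second-moment estimate, which is what keeps the per-parameter step size in a definite range regardless of the gradient's scale.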
The following is based on a slide deck (PPT) presented in the lab, about Adam.
Appendix:
Dropout:
Look at an experimental result:
Without dropout:
Training sample error (mean square error): 0.032355; test sample error rate: 15.5%
With dropout:
Training sample error (mean square error): 0.075819; test sample error rate: 13%
It can be seen that with dropout, although the error on the training samples is higher, the error rate on the test samples is lower, which indicates that dropout improves generalization and can prevent overfitting [1]. The advantage of dropout generally shows up when training samples are scarce.
The dropout method randomly deletes some hidden nodes in each iteration. The nodes are not actually deleted: their outputs are set to 0, which is equivalent to deletion. During forward propagation, the output values of hidden-layer nodes are randomly zeroed at a certain fraction, and when the error of a node is computed during back-propagation, the error terms of the same dropped nodes are zeroed as well. The dropout fraction is often set to 0.5, meaning about half of the hidden nodes are dropped.

Why does this help prevent overfitting? A simple explanation: training with dropout is equivalent to training many small-scale neural networks, each using only a portion of the hidden layer. The different networks share weights, and each network is a classifier that gives its own classification result; some of these results are correct and some are wrong. As training proceeds, most of the networks give the correct classification result, so when these classifiers are put together, the few wrong ones have little impact, and a more reliable classifier is obtained.
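The forward and backward zeroing described above can be sketched as follows; this is a minimal NumPy illustration (function names are mine), using a plain 0/1 mask as in the text rather than the "inverted dropout" variant that also rescales the surviving units:

```python
import numpy as np

def dropout_forward(h, drop_fraction=0.5, rng=None, train=True):
    """Zero each hidden activation with probability drop_fraction during training.

    Returns the masked activations and the mask, which must be reused in
    back-propagation so the same nodes have their error terms zeroed.
    """
    if not train or drop_fraction == 0.0:
        return h, None  # at test time all nodes are kept
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= drop_fraction).astype(h.dtype)
    return h * mask, mask

def dropout_backward(delta, mask):
    """Zero the error terms of the dropped nodes using the same mask."""
    return delta if mask is None else delta * mask
```

With drop_fraction=0.5, roughly half of the activations (and the corresponding error terms in the backward pass) are zeroed in each iteration, so each iteration effectively trains a different sub-network that shares weights with the others.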
References:
[1] http://www.cnblogs.com/tornadomeet/p/3258122.html
Adam: A Method for Stochastic Optimization