Blog post reproduced from: http://blog.csdn.net/ybdesire/article/details/51792925
Optimization Algorithms
Many algorithms exist for solving optimization problems (the most common being gradient descent), and these algorithms can also be used to train neural networks. Every deep learning library ships a collection of optimizers that adapt the learning rate so that the network reaches a good solution in as few training iterations as possible, while also helping to prevent overfitting.
Keras provides the following optimizers [1]:
SGD: stochastic gradient descent
SGD + momentum: momentum-based SGD (improves on plain SGD)
SGD + Nesterov momentum: momentum-based SGD with a two-step, look-ahead update (improves on SGD + momentum)
Adagrad: adaptively assigns a different learning rate to each parameter
Adadelta: addresses Adagrad's problem of the learning rate shrinking toward zero (improves on Adagrad)
RMSprop: often the best choice for recurrent neural networks (RNNs) (improves on Adadelta)
Adam: computes an adaptive learning rate for each weight (improves on RMSprop)
Adamax: a variant of Adam based on the infinity norm (improves on Adam)
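The snippet below is a minimal sketch of instantiating these optimizers with the Keras API described in [1]; the learning-rate values shown are the library's documented defaults of the time and are only illustrative, not tuned settings.

```python
# Instantiating the optimizers listed above (Keras 1.x-era API, see [1]).
from keras.optimizers import SGD, Adagrad, Adadelta, RMSprop, Adam, Adamax

sgd          = SGD(lr=0.01)                               # plain stochastic gradient descent
sgd_momentum = SGD(lr=0.01, momentum=0.9)                 # SGD + momentum
sgd_nesterov = SGD(lr=0.01, momentum=0.9, nesterov=True)  # SGD + Nesterov momentum (look-ahead update)
adagrad      = Adagrad(lr=0.01)                           # per-parameter adaptive learning rates
adadelta     = Adadelta(lr=1.0)                           # fixes Adagrad's vanishing learning rate
rmsprop      = RMSprop(lr=0.001)                          # often recommended for RNNs
adam         = Adam(lr=0.001)                             # RMSprop + bias-corrected moment estimates
adamax       = Adamax(lr=0.002)                           # Adam variant based on the infinity norm
```

Any of these objects can then be passed as the `optimizer` argument of `model.compile(...)`.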
With so many optimization algorithms available, how do we choose? Experienced practitioners offer some advice [2][3]. If you have a small amount of input data, pick an adaptive learning rate method: you then do not have to tune the learning rate, and since the data set is small, training takes little time anyway; in this situation you should care more about the network's classification accuracy. RMSprop, Adadelta and Adam are very similar and perform well in similar situations; the bias correction makes Adam slightly better than RMSprop. A well-tuned SGD + momentum setup can outperform Adagrad/Adadelta. A code sketch of this advice follows below.
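As a minimal sketch of how this advice translates into code (the toy two-layer model and the hyperparameter values below are illustrative assumptions only), you can either pass the string `'adam'` to use the library's default settings, or plug in a hand-tuned SGD + Nesterov momentum optimizer:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# Toy model for illustration only.
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))
model.add(Dense(10, activation='softmax'))

# Default choice when unsure: Adam with the library's default hyperparameters.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Alternative if you are willing to tune: SGD + Nesterov momentum,
# which with good settings can beat Adagrad/Adadelta.
tuned_sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=tuned_sgd, metrics=['accuracy'])
```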
Conclusion: as of now (2016.04), if you do not know which optimization algorithm to choose for your neural network, simply choose Adam ("Insofar, Adam might be the best overall choice." [2]).
References
[1] Keras optimizers, http://keras.io/optimizers/
[2] Overview of gradient descent optimization, http://sebastianruder.com/optimizing-gradient-descent/
[3] Optimizer comparison on the MNIST dataset, http://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html
[4] http://blog.csdn.net/luo123n/article/details/48239963