From: http://blog.csdn.net/u014595019/article/details/52989301
Recently I have been reading Google's deep learning book, and I just finished the part on optimization methods. Since I only had a vague understanding of these methods from using TensorFlow before, I decided to write them up after reading, focusing on the first-order gradient methods: SGD, Momentum, Nesterov Momentum, Adagrad, RMSprop, and Adam. SGD, Momentum, and Nesterov Momentum require a manually specified learning rate, while Adagrad, RMSprop, and Adam adjust the learning rate automatically.
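For reference, here is a minimal sketch of how these optimizers are typically instantiated, assuming the TensorFlow 1.x `tf.train` API (the learning rates and momentum value below are illustrative choices, not values from this post):

```python
import tensorflow as tf

# Optimizers with a manually specified learning rate:
sgd      = tf.train.GradientDescentOptimizer(learning_rate=0.01)
momentum = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
nesterov = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9,
                                      use_nesterov=True)

# Optimizers that adapt the effective learning rate per parameter:
adagrad = tf.train.AdagradOptimizer(learning_rate=0.01)
rmsprop = tf.train.RMSPropOptimizer(learning_rate=0.001)
adam    = tf.train.AdamOptimizer(learning_rate=0.001)
```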
As for second-order methods, my current level is too poor to understand them, so I won't include them here.

BGD
That is, batch gradient descent. Every iteration of training uses the entire training set. In other words, we use the current parameters to produce an estimated output $\hat{y}_i$ for each input in the training set, compare it with the actual output $y_i$, accumulate the errors over all examples, and use the average error as the basis for updating the parameters.
Specific implementation:
Required: learning rate $\epsilon$, initial parameters $\theta$
Each step of the iteration process:
1. Take the entire training set $\{x_1, \dots, x_n\}$ and the corresponding outputs $y_i$
2. Compute the gradient and error, and update the parameters:
$$\hat{g} \leftarrow \frac{1}{n} \nabla_\theta \sum_{i} L\big(f(x_i; \theta), y_i\big)$$
$$\theta \leftarrow \theta - \epsilon \, \hat{g}$$
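Below is a minimal NumPy sketch of this update, using linear regression with a squared-error loss as an illustrative model (the model, synthetic data, and learning rate $\epsilon = 0.1$ are assumptions made for the example, not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # full training set {x1, ..., xn}
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)

theta = np.zeros(3)                        # initial parameters
epsilon = 0.1                              # learning rate

for step in range(200):
    y_hat = X @ theta                      # estimated output for every input
    error = y_hat - y                      # compare with the actual outputs
    g_hat = X.T @ error / len(X)           # g_hat = (1/n) * gradient of the mean squared error
    theta = theta - epsilon * g_hat        # theta <- theta - epsilon * g_hat

print(theta)  # converges toward true_theta
```

Note that every step touches all n examples, which is exactly what distinguishes BGD from the stochastic variants discussed next.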