Original address: http://sebastianruder.com/optimizing-gradient-descent/

If you read English comfortably, I strongly recommend reading the original post. My understanding is limited, so this translation may contain errors, and I hope readers will not hesitate to point them out. Because the original is long, the translation is split into two parts: this part summarizes gradient descent optimization algorithms, and the next part will cover parallel and distributed SGD and additional optimization strategies.

Gradient descent is one of the most popular optimization algorithms and by far the most common way to optimize neural networks. At the same time, every state-of-the-art deep learning library contains implementations of various algorithms for optimizing gradient descent (see, for example, the documentation of lasagne, Caffe, and Keras). These algorithms, however, are often used as black-box optimizers, because practical explanations of their strengths and weaknesses are hard to come by.

This post aims to give readers intuitions about the behavior of the different gradient descent optimization algorithms, so that they can put them to use. We first look at the different variants of gradient descent. We then briefly summarize the challenges that arise during training. Next, we introduce the most common optimization algorithms by showing their motivation for resolving these challenges and how this leads to their update rules. We will also take a brief look at algorithms and architectures for optimizing gradient descent in parallel and distributed settings. Finally, we consider additional strategies that help optimize gradient descent.

Gradient descent is a way to minimize an objective function $J(\theta)$ parameterized by a model's parameters $\theta \in \mathbb{R}^d$ by updating the parameters in the opposite direction of the gradient of the objective function, $\nabla_\theta J(\theta)$, with respect to the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the slope of the surface created by the objective function downhill until we reach a valley. If you are unfamiliar with gradient descent, you can refer to this introduction to optimizing neural networks.

Gradient descent variants

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Batch gradient descent

The most basic form of gradient descent, batch gradient descent, computes the gradient of the objective function with respect to the parameters $\theta$ over the entire training dataset:

$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$

Because we need to compute the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow, and it is intractable for datasets that do not fit in memory. Batch gradient descent also does not allow us to update the model online, that is, with new examples on the fly.

In code, batch gradient descent looks something like this:

    for i in range(nb_epochs):
        params_grad = evaluate_gradient(loss_function, data, params)
        params = params - learning_rate * params_grad

For a predefined number of epochs, we first compute the gradient vector `params_grad` of the loss function over the entire dataset with respect to our parameter vector `params`. Note that state-of-the-art deep learning libraries provide automatic differentiation, which efficiently computes the gradient with respect to some parameters. If you derive the gradients yourself, then gradient checking is a good idea. (See this article for some tips on how to check gradients properly.)
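As a concrete illustration of the loop described above, here is a minimal runnable sketch of batch gradient descent on a small least-squares problem. The loss (mean squared error of a linear model), the synthetic data, the hyperparameter values, and the helper names `loss_function`, `evaluate_gradient`, and `numerical_gradient` are all assumptions made for this example. The last helper shows one simple way to do a gradient check with finite differences.

```python
import numpy as np

def loss_function(params, X, y):
    """Mean squared error of a linear model (assumed loss for this sketch)."""
    residual = X @ params - y
    return 0.5 * np.mean(residual ** 2)

def evaluate_gradient(params, X, y):
    """Analytic gradient of the loss w.r.t. params over the entire dataset."""
    return X.T @ (X @ params - y) / len(y)

def numerical_gradient(params, X, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (loss_function(params + step, X, y)
                   - loss_function(params - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_params = np.array([1.5, -2.0, 0.5])
y = X @ true_params            # noise-free targets, so the minimum is known

params = np.zeros(3)
learning_rate = 0.1
nb_epochs = 500

# Gradient check at the starting point, before any training.
gradient_check_error = np.max(np.abs(
    evaluate_gradient(params, X, y) - numerical_gradient(params, X, y)))

for _ in range(nb_epochs):
    params_grad = evaluate_gradient(params, X, y)   # uses all 200 examples
    params = params - learning_rate * params_grad   # one update per epoch
```

Each epoch performs exactly one parameter update, which is what makes this variant slow on large datasets.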

The code for stochastic gradient descent (SGD), which performs a parameter update for each training example rather than for the whole dataset, simply adds a loop over the training examples and evaluates the gradient with respect to each example. Note that we shuffle the training data at every epoch, as explained later:

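    for i in range(nb_epochs):
        np.random.shuffle(data)
        for example in data:
            params_grad = evaluate_gradient(loss_function, example, params)
            params = params - learning_rate * params_grad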
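For readers who want something they can run, the following is a small self-contained sketch of SGD on a synthetic least-squares problem. The data, the per-example squared-error loss, the hyperparameter values, and the helper name `evaluate_gradient` are assumptions for illustration only. Instead of shuffling the data array in place, the sketch draws a fresh random index order each epoch so that inputs and targets stay aligned.

```python
import numpy as np

def evaluate_gradient(params, x, y):
    """Gradient of the squared error 0.5 * (x . params - y)**2 for one example."""
    return (x @ params - y) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_params = np.array([1.5, -2.0, 0.5])
y = X @ true_params                       # noise-free targets

params = np.zeros(3)
learning_rate = 0.05
nb_epochs = 50

for _ in range(nb_epochs):
    order = rng.permutation(len(X))       # new random order every epoch
    for i in order:
        params_grad = evaluate_gradient(params, X[i], y[i])
        params = params - learning_rate * params_grad   # one update per example
```

With 200 examples, each epoch here performs 200 updates, compared with a single update per epoch for batch gradient descent.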

Challenges

Vanilla mini-batch gradient descent, however, does not guarantee good convergence, and it raises a few challenges that need to be addressed:

- Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
- Learning rate schedules [11] try to adjust the learning rate during training, for example by annealing, that is, reducing the learning rate according to a predefined schedule or when the change in the objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are therefore unable to adapt to a dataset's characteristics [10].
- Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but rather perform a larger update for rarely occurring features.
- Another key challenge in minimizing the highly non-convex error functions common in neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. argue that the difficulty arises in fact not from local minima but from saddle points, that is, points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
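The first challenge in the list above, sensitivity to the learning rate, can be seen even on a toy problem. The sketch below minimizes the one-dimensional function $J(\theta) = \theta^2$; the function, the starting point, the step count, and the three learning rates are assumptions chosen purely for illustration.

```python
def run_gradient_descent(learning_rate, theta=1.0, steps=50):
    """Minimize J(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - learning_rate * 2 * theta
    return theta

too_small = run_gradient_descent(0.001)   # barely moves away from the start at 1.0
reasonable = run_gradient_descent(0.1)    # ends close to the minimum at 0
too_large = run_gradient_descent(1.1)     # overshoots further on every step
```

Each update multiplies $\theta$ by $(1 - 2\eta)$, so the iterates shrink toward zero only when $|1 - 2\eta| < 1$ and grow without bound otherwise.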

Gradient descent optimization algorithms

In the following, we outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges. We will not discuss algorithms that are infeasible to compute in practice for high-dimensional datasets, e.g. second-order methods such as Newton's method.

**II. Interpretation of relevant parameters in solver.prototxt:**

Epoch: one epoch is one full pass of all the training images through the network.

For example, if there are 1280000 images and batchsize=256, then one epoch takes 1280000/256 = 5000 iterations.

If max_iter=450000, training runs for 450000/5000 = 90 epochs.
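The arithmetic above can be written out as a short sketch. The variable names are chosen for this example and are not solver fields, except where they mirror the parameter names used in the text.

```python
num_train_images = 1_280_000
batch_size = 256
max_iter = 450_000

# Number of iterations needed to see every training image once.
iterations_per_epoch = num_train_images // batch_size

# Number of full passes over the training set within max_iter iterations.
num_epochs = max_iter // iterations_per_epoch
```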

When the learning rate (lr) decays is determined by stepsize, and by how much is determined by gamma. For example, with stepsize=500, base_lr=0.01, and gamma=0.1, the first decay happens at iteration 500, after which lr = lr*gamma = 0.01*0.1 = 0.001; the same process then repeats every 500 iterations. In short, stepsize is the decay interval of lr, and gamma is its decay factor.
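A minimal sketch of this decay rule follows. It assumes a step-style schedule in which the learning rate is multiplied by gamma once every stepsize iterations; the function name `step_learning_rate` is made up for this example and is not part of any library.

```python
def step_learning_rate(iteration, base_lr=0.01, gamma=0.1, stepsize=500):
    """Learning rate in effect at `iteration` under step decay (illustrative)."""
    num_decays = iteration // stepsize     # how many decays have happened so far
    return base_lr * gamma ** num_decays

lr_before_first_decay = step_learning_rate(499)
lr_after_first_decay = step_learning_rate(500)
lr_after_second_decay = step_learning_rate(1000)
```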

During training, the network is tested after every fixed number of iterations, which is set by test_interval. For example, with test_interval=1000, a test pass runs after every 1000 training iterations.

How a test pass runs is determined by test_size, test_iter, and the number of test images. test_size is the number of images fed in per test iteration (the test batch size), and test_iter is the number of iterations needed to go through all the test images. For example, with 500 test images and test_size=5, test_iter=100. In the solver file you only need to set test_iter according to the total number of test images, and set test_interval as needed.
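The relationship between these quantities can be sketched as follows. The variable names are chosen for this example, and rounding up with `math.ceil` is an assumption added here so that a final partial batch would still be covered when the numbers do not divide evenly.

```python
import math

num_test_images = 500
test_batch_size = 5     # referred to as test_size in the text

# Iterations needed for one test pass to cover every test image.
test_iter = math.ceil(num_test_images / test_batch_size)
```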

momentum: the momentum coefficient (also called the momentum decay factor)

weight_decay: the coefficient of the regularization penalty term
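To show how these two coefficients typically enter a parameter update, here is a hedged sketch of one SGD step with momentum and an L2 penalty. The function name, the default values, and the exact form of the update are assumptions for illustration; an actual solver may differ in details such as the type of regularization.

```python
import numpy as np

def sgd_momentum_step(weights, velocity, grad, lr=0.01, momentum=0.9,
                      weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay (illustrative sketch)."""
    grad = grad + weight_decay * weights          # add gradient of the L2 penalty
    velocity = momentum * velocity - lr * grad    # decaying sum of past steps
    return weights + velocity, velocity

weights = np.array([1.0, -2.0])
velocity = np.zeros(2)

# With a zero data gradient, only weight decay acts: weights shrink slightly.
w1, v1 = sgd_momentum_step(weights, velocity, np.zeros(2))

# A second step reuses the previous velocity, scaled by the momentum coefficient.
w2, v2 = sgd_momentum_step(w1, v1, np.array([0.5, 0.5]))
```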

Summary of gradient descent optimization algorithms (repost) and interpretation of relevant parameters in the solver