What is an optimization algorithm?
Given an objective function with parameter θ, we want to find the θ that maximizes or minimizes the objective. An optimization algorithm is a procedure for finding this θ.
In neural networks, the objective function f is the error between the predicted value and the label, and we want to find the θ that makes f smallest.

Types of optimization algorithms

First-order optimization algorithms
These minimize the cost function by computing the gradient (first derivative) of the objective function f with respect to the parameter θ. The commonly used optimizers SGD, Adam, RMSprop, and so on all belong to this class.
The difference between a gradient and a derivative is that the former applies to multivariable objective functions, while the latter applies to single-variable ones.
The gradient can be represented by the Jacobian matrix, each element of which is the first derivative of the function with respect to one parameter.

Second-order optimization algorithms
These minimize the cost function by computing the second derivatives of the objective function f with respect to the parameter θ. Newton's method is the usual example.
The second derivatives can be represented by the Hessian matrix, each element of which is a second derivative of the function with respect to the parameters.

Comparison
First-order methods only require first derivatives, so they are computationally cheaper.
Second-order methods are computationally more expensive, but they are less likely to get stuck at saddle points.

Gradient descent and its variants
The most commonly used optimizer in neural networks is gradient descent (a first-order method), which has the following basic form:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$

where $\eta$ is the learning rate and $\nabla_\theta J(\theta)$ is the gradient of the cost function $J(\theta)$ with respect to the parameter θ.
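As a minimal sketch of this update rule (assuming NumPy and a caller-supplied gradient function; gradient_descent_step and grad_fn are hypothetical names):

```python
import numpy as np

def gradient_descent_step(theta, grad_fn, lr=0.1):
    """One vanilla gradient descent update: theta <- theta - lr * grad."""
    return theta - lr * grad_fn(theta)

# Toy usage: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda t: 2.0 * t)
print(theta)  # converges toward [0, 0]
```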
Batch Gradient Descent
A single parameter update requires the entire training set as input. When the dataset is large, this easily exhausts memory, and it does not support online updates.

Stochastic Gradient Descent (SGD)
SGD performs one parameter update per training sample, which greatly reduces the computation per step and supports online updates. The disadvantage is that updates are very frequent, so the parameters fluctuate heavily; the updates have high variance (a concrete explanation is given at the end of the article). A per-sample sketch follows. (Figure: heavy fluctuation of SGD updates.)
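A minimal per-sample SGD loop, assuming NumPy arrays X and y and a per-example gradient function grad_fn (all hypothetical names):

```python
import numpy as np

def sgd(theta, X, y, grad_fn, lr=0.01, epochs=10):
    """Plain SGD: one parameter update per (shuffled) training sample."""
    n = X.shape[0]
    for _ in range(epochs):
        for i in np.random.permutation(n):
            theta = theta - lr * grad_fn(theta, X[i], y[i])
    return theta
```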
When the learning rate is too large, SGD easily over-adjusts the parameters, so in practice the learning rate must not be set too high.

Mini-Batch Gradient Descent
A tradeoff between the two methods above: each parameter update takes one mini-batch (usually 50–256 samples) as input. This reduces the high variance of the updates, and because a whole mini-batch is processed at once, vectorized code can be used to improve computational efficiency. A sketch follows.
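A minimal mini-batch loop under the same assumptions as above (X, y, and grad_fn are hypothetical; here grad_fn takes a whole batch):

```python
import numpy as np

def minibatch_sgd(theta, X, y, grad_fn, lr=0.01, batch_size=64, epochs=10):
    """Mini-batch gradient descent: shuffle each epoch, update per batch."""
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[batch], y[batch])
    return theta
```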
This is currently the most commonly used optimization method in neural networks.

Improved variants of gradient descent
The methods above share some common disadvantages:
1. They are sensitive to the learning rate setting: too small and training is slow; too large and the objective function may diverge.
2. All parameters share the same learning rate. This is especially inconvenient for sparse data, where we would like a smaller step size for features that occur frequently and a larger step size for rare features.
3. Gradient descent essentially looks for stationary points (points where the derivative of the objective with respect to the parameters is 0), and stationary points fall into three classes: maxima, minima, and saddle points. The loss surface of a high-dimensional non-convex function contains many saddle points, so gradient descent can fall into a saddle point and fail to escape for a long time, as the left panel of the figure below shows. (Figure: optimizer trajectories around a saddle point; left panel, stuck; right panel, escaping.)
Note: falling into a saddle point does not mean the optimizer stops moving for good; some gradient descent methods, such as SGD or NAG, can escape the saddle after a sufficiently long training time, as the right panel shows.
Momentum
This method addresses the high-amplitude oscillation of SGD: it accelerates parameter changes along the main descent direction and damps changes in non-main directions.
Parameter update:

$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$
$\theta = \theta - v_t$
Similar in spirit to the conjugate gradient method, the current gradient direction is corrected using the historical search direction, which cancels the back-and-forth oscillation in non-main directions.
(Figure: SGD without momentum.)
(Figure: SGD with momentum.)
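A minimal sketch of a single momentum step (momentum_step, grad_fn, and gamma are hypothetical names; gamma is the momentum coefficient γ):

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Momentum update: v_t = gamma * v_{t-1} + lr * grad; theta -= v_t."""
    v = gamma * v + lr * grad_fn(theta)
    return theta - v, v

# Usage: carry the velocity v across iterations, starting from zeros.
theta, v = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, v = momentum_step(theta, v, lambda t: 2.0 * t)
```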
The disadvantage of the momentum method is that momentum keeps building up on the way downhill, so the speed near the lowest point can be too high; the optimizer may run back uphill and miss the minimum.

Nesterov Accelerated Gradient (NAG)
NAG is an improvement of the momentum algorithm. It adds look-ahead: it estimates the gradient at the approximate next parameter position in advance and uses it to correct the currently computed update.
Parameter update:

$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$
$\theta = \theta - v_t$
NAG first moves along the previous direction $\gamma v_{t-1}$ (brown vector), then computes the gradient at that look-ahead point, $\nabla_\theta J(\theta - \gamma v_{t-1})$ (red vector); the corrected update direction is therefore $\gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$ (green vector).

By contrast, the momentum method first computes the gradient at the current position, $\eta \nabla_\theta J(\theta)$ (small blue vector), and then takes a big step along $\gamma v_{t-1} + \eta \nabla_\theta J(\theta)$ (big blue vector), combining it with the previous search direction $\gamma v_{t-1}$.
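A minimal NAG step under the same assumptions as the momentum sketch above; the only change is where the gradient is evaluated:

```python
def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    """Nesterov update: evaluate the gradient at the look-ahead point
    theta - gamma * v instead of at the current theta."""
    v = gamma * v + lr * grad_fn(theta - gamma * v)
    return theta - v, v
```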
Adagrad
Adagrad was the first method to adjust the learning rate adaptively, i.e., different parameters get different learning rates: parameters with large gradients take smaller steps, and parameters with small gradients take larger steps.
Parameter update:
The gradient of the objective function with respect to each parameter:
$g_{t,i} = \nabla_\theta J(\theta_i)$
For comparison, the plain SGD update is:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$
Adagrad achieves per-parameter learning rates through the denominator:
$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$
where $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose diagonal element $(i, i)$ is the sum of the squares of all gradients of parameter $i$ up to time $t$.
This can be written in the following vectorized form:
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$
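A minimal sketch of one Adagrad step, tracking the diagonal of $G_t$ as an accumulator array (adagrad_step, grad_fn, and g_accum are hypothetical names):

```python
import numpy as np

def adagrad_step(theta, g_accum, grad_fn, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients per parameter, then scale
    each coordinate's step by 1 / sqrt(accumulated + eps)."""
    g = grad_fn(theta)
    g_accum = g_accum + g ** 2                      # diagonal of G_t
    theta = theta - lr * g / np.sqrt(g_accum + eps)
    return theta, g_accum
```

Note the design consequence: because the accumulator only grows, each parameter's effective learning rate shrinks monotonically over training.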