On the gradient descent method
If the definitions of the directional derivative and the gradient are not yet clear to you, read the previous article, "Directional Derivatives and Gradients."
Before going further into machine learning, note that the gradient descent method is one of the most important methods in machine learning. The gradient descent algorithm proceeds as follows:
1) Choose a random initial value x_0;
2) Iterate x_{k+1} = x_k - α∇f(x_k) until convergence, where -∇f(x_k) denotes the negative gradient direction at x_k and α denotes the learning rate.
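To make these two steps concrete, here is a minimal sketch in Python; the function name gradient_descent, the tolerance tol, and the iteration cap max_iter are illustrative choices rather than part of the original algorithm statement.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_{k+1} = x_k - lr * grad(x_k) until the update becomes negligible.

    grad: a function returning the gradient at a point (array or scalar).
    lr:   the learning rate (step size), an illustrative default here.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = lr * np.asarray(grad(x), dtype=float)
        x = x - step                       # move along the negative gradient
        if np.linalg.norm(step) < tol:     # converged: the move is tiny
            break
    return x
```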
Here I will briefly share my own understanding of the gradient descent method.
First, be clear that the gradient is a vector: for an n-variable function f, it consists of the partial derivatives with respect to the n variables. For example, the gradient of a ternary function f is (f_x, f_y, f_z), the gradient of a binary function f is (f_x, f_y), and the gradient of a unary function f is simply f_x, its derivative. Next, understand that the gradient direction is the direction in which f increases fastest, and the opposite direction of the gradient is the direction in which f decreases fastest.
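If you want to check the gradient of a concrete function numerically, a small finite-difference sketch like the one below can help; the example function x^2 + 3y at the end is just a hypothetical illustration, not taken from the article.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient (f_x1, ..., f_xn) by central differences."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# gradient of the hypothetical binary function f(x, y) = x**2 + 3*y at (1, 2)
# is approximately (2, 3)
print(numerical_gradient(lambda p: p[0]**2 + 3*p[1], [1.0, 2.0]))
```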
We take a unary function as an example to introduce the gradient descent method.
Let f(x) = (x - 1)^2 + 1/2.
The figure above shows the graph of f together with the initial value x_0. We want to find the minimum of f. Because moving a small step in the negative gradient direction decreases the value of f, we simply move x_0 a small step in the negative gradient direction.
The derivative of f at x_0 is greater than 0, so the gradient of f at x_0 points in the positive direction; the gradient is f'(x_0). By the gradient descent iteration, the next iterate x_1 is obtained by moving a small step to the left from x_0. Likewise, the derivative at x_1 is still greater than 0, and so on; as long as the step size is not too large, the iterates converge to the solution x = 1.
This confirms the earlier analysis.
Similarly, if the initial value is chosen to the left of the minimum, as shown in the figure:
Because f'(x_0) < 0, the gradient direction is negative and the negative gradient direction is positive, so we move x_0 a small step along the negative gradient direction, that is, a small step to the right, which makes the value of f smaller. Applying the gradient descent iteration formula, we obtain x_1, x_2, ..., x_k as shown in the figure, and so on, until the iterates converge to the minimum.
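As a quick sketch of this unary example, assuming a learning rate of 0.1 and the starting points 3 and -2 (both arbitrary), the iteration converges to x = 1 from either side:

```python
def f(x):
    return (x - 1) ** 2 + 0.5       # the example function f(x) = (x - 1)^2 + 1/2

def f_prime(x):
    return 2 * (x - 1)              # its derivative (the gradient in one dimension)

def descend(x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - lr * f_prime(x)     # move a small step in the negative gradient direction
    return x

# starting to the right (x0 = 3) or to the left (x0 = -2) of the minimum,
# both runs converge to x = 1, where f attains its minimum value 0.5
print(descend(3.0), descend(-2.0))
```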
For a binary (two-variable) function, we can also verify the reasonableness of the gradient descent method with an example:
Each time we reach a point (x_k, y_k), we compute the gradient (f_x(x_k, y_k), f_y(x_k, y_k)), which is the direction in which f grows fastest; -(f_x(x_k, y_k), f_y(x_k, y_k)) is the direction in which f decreases fastest. So we just move (x_k, y_k) a small step in the direction -(f_x(x_k, y_k), f_y(x_k, y_k)), which decreases the value of f, and repeat until it converges to the minimum, as shown in the figure above.
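The concrete binary function used in the original figure is not reproduced here, so the sketch below uses a hypothetical f(x, y) = (x - 1)^2 + 2(y + 1)^2 with a made-up starting point, just to show the two-variable update:

```python
import numpy as np

def grad_f(p):
    x, y = p
    # gradient (f_x, f_y) of the hypothetical f(x, y) = (x - 1)**2 + 2 * (y + 1)**2
    return np.array([2 * (x - 1), 4 * (y + 1)])

p = np.array([4.0, 3.0])          # arbitrary starting point (x_0, y_0)
for _ in range(200):
    p = p - 0.1 * grad_f(p)       # step along -(f_x(x_k, y_k), f_y(x_k, y_k))

print(p)                          # close to the minimizer (1, -1)
```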
Some points to note about the gradient descent method, which also reflect how to understand it:
1) Gradient descent does not necessarily converge to the minimum value.
The gradient descent method converges to a local minimum, which is not necessarily the global minimum.
For example:
Suppose the initial value is chosen as x_0 in the figure. Since the derivative of f at x_0 is greater than 0, the gradient direction points to the right and the negative gradient direction points to the left, so x_0 moves to the left and gradually converges to a local minimum rather than to the global minimum.
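A small sketch of this behaviour, using the hypothetical function f(x) = x^4 - 2x^2 + 0.3x (not the function in the figure) and an arbitrary starting point to the right of both minima:

```python
def f_prime(x):
    # derivative of the hypothetical f(x) = x**4 - 2*x**2 + 0.3*x, which has a
    # local minimum near x ≈ 0.96 and its global minimum near x ≈ -1.03
    return 4 * x**3 - 4 * x + 0.3

x = 2.0                      # initial value to the right, where f'(x) > 0
for _ in range(2000):
    x = x - 0.01 * f_prime(x)

print(x)                     # ≈ 0.96: the nearby local minimum, not the global one
```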
2) The learning rate should be moderate.
If the learning rate is too small, each step is too small and convergence is too slow; this is relatively easy to understand.
If the learning rate is too large, each step is too large, which may prevent convergence; this is illustrated with a figure:
The farther the iterate is from the minimum point, the larger the derivative, which produces an even larger step, so the iteration does not converge.
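To see both failure modes on the earlier example f(x) = (x - 1)^2 + 1/2, here is a sketch comparing three illustrative learning rates:

```python
def f_prime(x):
    return 2 * (x - 1)          # derivative of f(x) = (x - 1)**2 + 1/2, as before

def run(lr, steps=50, x0=3.0):
    x = x0
    for _ in range(steps):
        x = x - lr * f_prime(x)
    return x

print(run(0.001))   # ≈ 2.8  : too small, barely moved after 50 steps
print(run(0.1))     # ≈ 1.0  : moderate, converges nicely to the minimum
print(run(1.1))     # huge   : too large, each step overshoots more and diverges
```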
3) It is not necessary to choose the negative gradient direction; any direction in which the function value decreases will do.
When choosing the direction for each iteration, we only need a direction whose angle with the negative gradient direction is less than 90 degrees; it does not have to be exactly the negative gradient direction. However, because such directions are not always convenient to construct, we usually choose the direction at 0 degrees to the negative gradient, i.e. the negative gradient direction itself; along this direction the function value decreases fastest and convergence is faster, so it is a good choice.
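A descent direction d is simply one whose inner product with the gradient is negative, i.e. whose angle with the negative gradient is less than 90 degrees; the check below is a small sketch with a made-up gradient vector:

```python
import numpy as np

def is_descent_direction(grad, d):
    """d decreases f (for a small enough step) iff the angle between d and the
    negative gradient is less than 90 degrees, i.e. d · grad < 0."""
    return float(np.dot(d, grad)) < 0

g = np.array([2.0, 4.0])                 # some gradient ∇f at the current point
print(is_descent_direction(g, -g))       # True : the negative gradient, fastest descent
print(is_descent_direction(g, [-1, 0]))  # True : within 90° of -∇f, still descends
print(is_descent_direction(g, [1, 1]))   # False: within 90° of +∇f, f increases
```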
4) The gradient ascent method finds a maximum.
The gradient direction of f is the direction in which the value of f grows fastest, so if we move a small step along the (positive) gradient direction each time, we gradually converge to a local maximum. In other words, moving along the gradient direction also lets us find a local maximum of f. The iteration formula is:
x_{k+1} = x_k + α∇f(x_k),
where ∇f(x_k) denotes the gradient direction at x_k; the only difference from the gradient descent method is the sign of the update.
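As a sketch of gradient ascent, assuming the hypothetical concave function g(x) = -(x - 3)^2 + 1 and a learning rate of 0.1, the iterates climb to the maximizer x = 3:

```python
def g_prime(x):
    return -2 * (x - 3)          # derivative of the hypothetical g(x) = -(x - 3)**2 + 1

x = 0.0                          # arbitrary starting point
for _ in range(100):
    x = x + 0.1 * g_prime(x)     # note the plus sign: move along the +gradient

print(x)                         # ≈ 3.0, where g attains its maximum value 1
```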