In machine learning optimization problems, gradient descent and Newton's method are two common ways to find the extremum of a convex function; both produce an approximate solution that optimizes the objective function. When fitting the parameters of a logistic regression model, gradient descent (or one of its variants) is generally used, but Newton's method can be used as well. Since the two methods are somewhat similar, I put together a simple comparison of them. The following assumes the reader is already familiar with both algorithms.
Gradient Descent Method
Gradient descent searches for the extremum of the objective function, that is, the point in parameter space that optimizes the objective once the model and the data are given. The iterative process is:
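In standard notation, with objective function J(θ) and learning rate α, the update is

$$\theta := \theta - \alpha \, \nabla_\theta J(\theta)$$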
As you can see, gradient descent updates the parameter using the gradient of the objective function at the current parameter value, scaled by a step-size parameter α. Gradient descent is usually illustrated with a three-dimensional surface, where the iteration looks like walking downhill until the bottom of the valley is reached. To make the picture easier to grasp, and easier to compare with Newton's method, here I describe it with a two-dimensional graph instead:
In this two-dimensional picture, the gradient is simply the slope of the tangent line of the convex function: the horizontal axis is the parameter value at each iteration and the vertical axis is the value of the objective function. Each iteration then proceeds as follows (a small code sketch follows the list):
- First, compute the slope (gradient) of the objective function at the current parameter value, multiply it by the step factor, and substitute it into the update formula. If the current point lies to the right of the extremum, the slope is positive, so the updated parameter becomes smaller and moves closer to the parameter corresponding to the minimum.
- If, after the update, the parameter value is still to the right of the extremum, the next update behaves in the same way as above.
- If, after the update, the parameter value has crossed over to the left of the extremum, the computed slope becomes negative, so the next update again moves the parameter back toward the extremum.
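As a minimal sketch of this loop (the toy objective J(θ) = (θ − 3)², the starting point, and the step size below are my own choices for illustration, not values from the article):

```python
def grad_J(theta):
    # Slope of the toy objective J(theta) = (theta - 3)^2,
    # whose minimum sits at theta = 3.
    return 2.0 * (theta - 3.0)

theta = 10.0   # start to the right of the minimum
alpha = 0.1    # fixed step-size parameter

for _ in range(50):
    slope = grad_J(theta)          # gradient at the current parameter value
    theta = theta - alpha * slope  # positive slope -> theta decreases toward 3

print(theta)   # ends up very close to 3.0
```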
From this process we can see that the step size matters a great deal near the extremum: if the steps are too large, the iterate easily oscillates around the extremum and fails to converge. A workaround is to make α a quantity that decreases with the number of iterations, while never letting it shrink all the way to zero.
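One simple way to sketch this workaround (the schedule and constants below are illustrative assumptions, not from the article) is to decay α with the iteration count while keeping a small floor:

```python
alpha0 = 0.5       # initial step size
alpha_min = 1e-3   # floor, so alpha is never reduced completely to zero
decay = 0.05       # decay rate per iteration

for t in range(100):
    alpha_t = max(alpha0 / (1.0 + decay * t), alpha_min)
    # alpha_t would replace the fixed alpha in the gradient update above
```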
Newton's Method
First, be clear that Newton's method solves for the value of the variable at which a function equals zero: specifically, to solve f(θ) = 0, if f is differentiable, then the iterative formula can be written as
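In the usual notation this is

$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}$$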
Iterating this update converges to a point where f(θ) = 0. The usual picture of this process shows each step following the tangent line of f at the current point down to where it crosses the horizontal axis.
When applied to maximum likelihood estimation, the task becomes solving ℓ′(θ) = 0, where ℓ(θ) is the log-likelihood objective. This differs from gradient descent: gradient descent solves for the extremum of the objective function directly, while Newton's method gets there indirectly, by solving for the parameter value at which the first derivative of the objective is zero, which in turn yields the extremum of the objective. The iterative formula is then written as:
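$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$$

with ℓ′ playing the role of f above.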
When θ is a vector, the Newton method can be expressed by the following formula:
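$$\theta := \theta - H^{-1} \, \nabla_\theta \ell(\theta)$$

where ∇ℓ(θ) is the gradient of the objective with respect to θ.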
Here H is called the Hessian matrix; it is simply the matrix of second derivatives of the objective function with respect to the parameter θ.
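Concretely, its entries are the second partial derivatives

$$H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \, \partial \theta_j}.$$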
Comparing the iterative formulas of Newton's method and gradient descent, we can see how similar the two are: the inverse of the Hessian matrix plays a role analogous to the learning rate α in gradient descent. Newton's method converges much faster than gradient descent, and because the inverse Hessian rescales the gradient at every iteration, the effective step size adapts automatically instead of staying fixed.
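To make the comparison concrete, here is a rough one-dimensional sketch of the two update rules on the same toy objective J(θ) = (θ − 3)² (the function, starting point, and step size are my own choices for illustration). Because this objective is quadratic, its second derivative is constant and the Newton step lands on the minimizer immediately, while gradient descent only approaches it step by step:

```python
def grad(theta):
    # First derivative of the toy objective J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def hess(theta):
    # Second derivative (the 1-D "Hessian"); constant for a quadratic
    return 2.0

theta_gd = theta_newton = 10.0
alpha = 0.1

for _ in range(10):
    theta_gd -= alpha * grad(theta_gd)                       # gradient descent step
    theta_newton -= grad(theta_newton) / hess(theta_newton)  # Newton step

print(theta_gd, theta_newton)  # ~3.75 vs exactly 3.0 after ten iterations
```

For a general convex objective Newton's method does not finish in one step, but near the extremum it typically needs far fewer iterations than gradient descent.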
The drawback of Newton's method is that computing the inverse of the Hessian matrix is expensive in both time and computing resources. This is what motivates the quasi-Newton methods.