Machine Learning Optimization Algorithms: L-BFGS

Many optimization algorithms have been proposed, such as gradient descent, coordinate descent, Newton's method, and quasi-Newton methods. Gradient descent uses only the gradient of the objective function; its convergence is linear, and when the problem is ill-conditioned or very large the convergence becomes so slow that the method is almost unusable. Coordinate descent does not require the gradient of the objective function, but its convergence is also slow, so its range of application is likewise limited.

Newton's method uses the second-order derivatives of the objective function (the Hessian matrix). It converges faster and needs fewer iterations; in particular, near the optimum its convergence rate is quadratic. The problem with Newton's method is that when the Hessian is dense, each iteration is expensive, because the inverse of the Hessian of the objective function must be computed (equivalently, a linear system must be solved) at every step. When the problem is large this is not only computationally expensive, sometimes prohibitively so, but also requires a great deal of storage, so Newton's method becomes inapplicable to massive data because of the huge cost of each iteration. Quasi-Newton methods build on Newton's method by maintaining an approximation to the Hessian (or to its inverse), which avoids inverting the Hessian at every iteration; their convergence rate is superlinear, between that of gradient descent and Newton's method. The problem with quasi-Newton methods is that when the problem is very large the approximate matrix becomes dense, so the computational and storage overhead is again huge and the method becomes impractical.
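To make the difference concrete, the following sketch (a purely illustrative example introduced here; the quadratic objective, the matrix A and the step size are assumptions, not part of the original article) compares one gradient-descent step with one Newton step on a simple ill-conditioned quadratic:

import numpy as np

# Illustrative ill-conditioned quadratic: f(x) = 0.5 * x^T A x - b^T x
A = np.diag([1.0, 100.0])              # Hessian of f; condition number 100
b = np.array([1.0, 1.0])
x = np.zeros(2)

grad = A @ x - b                       # gradient of f at x

# Gradient-descent step: first-order information only, linear convergence
x_gd = x - 0.009 * grad                # small fixed step size, chosen by hand

# Newton step: needs the Hessian and a linear solve at every iteration
x_newton = x - np.linalg.solve(A, grad)   # for a quadratic this lands exactly on the minimizer

For this toy problem the Newton step reaches the minimizer in one iteration, while gradient descent needs many small steps because of the ill-conditioning; the price is that Newton's method must form and solve with the Hessian at every iteration.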

In addition, it is important to note that Newton's method cannot guarantee that the Hessian is positive definite at every iteration; once the Hessian is not positive definite, the search direction can drift away from a descent direction and the method fails, which also shows that Newton's method is not very robust. Quasi-Newton methods replace the inverse of the Hessian with an approximate matrix; although the search direction at each iteration is not guaranteed to be the best possible one, the approximate matrix is kept positive definite, so the algorithm always searches along a descent direction toward the optimum.

As can be seen from the description above, many optimization algorithms work well in theory, and when the optimization problem is small, any of them can solve it well. In real projects, however, many of these algorithms break down. For example, many practical problems are ill-conditioned, so gradient-based methods struggle: even thousands of iterations may not converge to a good result. And when the data is large, the memory cost of storing the (approximate) Hessian and the cost of computing with it make Newton and quasi-Newton methods inapplicable as well.

In this article we introduce an optimization algorithm designed to handle large-scale optimization problems in practical engineering: the L-BFGS algorithm.

As mentioned above, for large-scale optimization problems the approximate matrix used by quasi-Newton methods is usually dense, so both its computation and its storage grow as O(n^2), and the quasi-Newton method becomes inapplicable.

The L-BFGS algorithm is an improvement on the quasi-Newton method; as its name (Limited-memory BFGS) suggests, it is based on the BFGS quasi-Newton algorithm. The basic idea of L-BFGS is to save and use only the curvature information from the most recent m iterations to construct the approximation to the Hessian matrix.

Before we introduce the L-BFGS algorithm, we first briefly review the BFGS algorithm.

Each iteration of the BFGS algorithm takes a step of the form

x_{k+1} = x_k - \alpha_k H_k \nabla f_k,    k = 0, 1, 2, ...    (1)

In formula (1), \alpha_k is the step size, \nabla f_k is the gradient of the objective function at x_k, and the matrix H_k is updated by the following equation:

H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T    (2)

where, in formula (2),

\rho_k = \frac{1}{y_k^T s_k}    (3)

V_k = I - \rho_k y_k s_k^T    (4)

s_k = x_{k+1} - x_k    (5)

y_k = \nabla f_{k+1} - \nabla f_k    (6)

From formulas (2) to (6) it can be seen that H_{k+1} is obtained by correcting H_k with the curvature pair {s_k, y_k}. It is important to note that here H_k denotes the approximation to the inverse of the Hessian matrix.
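To make the update concrete, here is a minimal NumPy sketch of one update of the inverse-Hessian approximation according to formulas (2)-(6); the function name and usage are illustrative assumptions made here, not code from the original article.

import numpy as np

def bfgs_update(H, s, y):
    # One BFGS update of the inverse-Hessian approximation H using the
    # curvature pair (s, y), where s = x_{k+1} - x_k (formula (5)) and
    # y = grad_{k+1} - grad_k (formula (6)).
    rho = 1.0 / (y @ s)                              # formula (3)
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(y, s)                     # formula (4)
    return V.T @ H @ V + rho * np.outer(s, s)        # formula (2)

Note that the full n-by-n matrix H is formed and stored explicitly here, which is exactly what becomes impractical for large n and what L-BFGS avoids.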

In the BFGS algorithm, the matrix H_k becomes denser and denser as the number of iterations grows, so when the optimization problem is very large, storing and computing with H_k becomes impractical.

To solve this problem, we do not store the matrix H_k itself; instead we store the curvature information of the most recent m iterations, i.e. the pairs {s_i, y_i}. Whenever an iteration completes, the oldest curvature pair {s_i, y_i} is discarded and the newest one is saved, so the stored curvature information always comes from the most recent m iterations. In practical projects, taking m between 3 and 20 gives very good results. Apart from the strategy for updating the matrix H_k and the way H_k is initialized, the L-BFGS algorithm is identical to the BFGS algorithm. A minimal sketch of this bounded history is given below.
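As one convenient way to keep such a bounded history (an implementation detail assumed here, not prescribed by the original article), a fixed-length container can be used so that the oldest pair is dropped automatically:

from collections import deque

m = 10                        # number of curvature pairs to keep; 3 to 20 works well in practice
history = deque(maxlen=m)     # appending when full silently discards the oldest pair

# after each iteration k, store the new curvature pair:
# history.append((s_k, y_k))  with s_k = x_{k+1} - x_k and y_k = grad_{k+1} - grad_k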

The procedure for updating the matrix H_k is described in detail below.

At the k-th iteration, the current iterate is x_k and the stored curvature pairs are {s_i, y_i}, i = k-m, ..., k-1. To obtain H_k, the algorithm first chooses an initial matrix H_k^0; unlike BFGS, the L-BFGS algorithm allows a different initial matrix to be selected at every iteration. It then corrects this initial matrix with the most recent m curvature pairs to obtain H_k.

By applying formula (2) recursively, we obtain the following expression:

H_k = (V_{k-1}^T \cdots V_{k-m}^T) H_k^0 (V_{k-m} \cdots V_{k-1})
    + \rho_{k-m} (V_{k-1}^T \cdots V_{k-m+1}^T) s_{k-m} s_{k-m}^T (V_{k-m+1} \cdots V_{k-1})
    + \rho_{k-m+1} (V_{k-1}^T \cdots V_{k-m+2}^T) s_{k-m+1} s_{k-m+1}^T (V_{k-m+2} \cdots V_{k-1})
    + \cdots
    + \rho_{k-1} s_{k-1} s_{k-1}^T    (7)

Regarding the choice of the initial matrix H_k^0 at each iteration, a method that works well in practice is:

H_k^0 = \gamma_k I    (8)

\gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}    (9)

In formula (9), \gamma_k is a scaling factor that uses the most recent curvature information to estimate the size of the true Hessian matrix. This keeps the search direction of the current step well scaled so that it does not point "too far off", with the result that the step size \alpha_k = 1 is accepted most of the time; this largely eliminates the line-search step and saves time.
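To show how formulas (7)-(9) are used in practice, here is a minimal sketch of the standard L-BFGS two-loop recursion, which evaluates the product H_k \nabla f_k defined by formula (7) directly from the stored pairs {s_i, y_i} and the scaling (8)-(9), without ever forming H_k. The function and variable names are illustrative assumptions made here.

import numpy as np

def lbfgs_direction(grad, history):
    # Compute H_k @ grad via the two-loop recursion.
    # history is a sequence of (s_i, y_i) pairs, oldest first.
    q = grad.copy()
    alphas = []
    # first loop: from the newest pair back to the oldest
    for s, y in reversed(history):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # initial matrix H_k^0 = gamma_k * I, formulas (8) and (9)
    s_last, y_last = history[-1]
    gamma = (s_last @ y_last) / (y_last @ y_last)
    r = gamma * q
    # second loop: from the oldest pair forward to the newest
    for (s, y), a in zip(history, reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return r   # r = H_k @ grad; the step is x_{k+1} = x_k - alpha_k * r, as in formula (1)

Each call costs O(m n) arithmetic and O(m n) storage for the history, which is why L-BFGS remains usable when n is very large.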

In the L-BFGS algorithm, updating the approximate matrix with only the curvature information of the most recent m iterations turns out to be very effective.

Although the L-BFGS algorithm converges only linearly, the cost of each iteration is very small, so the algorithm runs very fast; and because the approximate matrix is guaranteed to be positive definite at every iteration, the algorithm is also very robust.
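In practice one rarely implements L-BFGS by hand. As an illustration (the choice of SciPy and of the Rosenbrock test function are assumptions made here, not part of the original article), SciPy's limited-memory BFGS implementation can be called as follows, where the maxcor option plays the role of the history size m:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(1000)                     # a large-ish starting point
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                  options={"maxcor": 10, "maxiter": 500})
print(result.success, result.nit, result.fun)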

Baidu recently proposed a shooting algorithm that is reported to be about ten times faster than L-BFGS. Since the iteration direction of L-BFGS is not optimal, I suspect that the shooting algorithm improves on the direction of the iteration.
