Introduction to the OWL-QN Algorithm, with a Go Implementation for Logistic Regression Optimization


When logistic regression is used for CTR estimation or classification, we all know that gradient descent can in principle find the optimal solution of the target problem. In practice, however, gradient descent iterates too slowly and may fail to reach the optimum.

OWL-QN not only converges faster than gradient descent, it can also solve optimization problems with L1 regularization, which makes the objective function non-differentiable wherever a variable equals 0. OWL-QN is the L-BFGS-style solution for L1 problems (plain L-BFGS can only optimize objectives that are smooth and differentiable). Compared with L2 regularization, L1 has the advantage of performing feature selection. The GitHub link below is my standalone Go implementation of the OWL-QN algorithm.

https://github.com/qm1004/OWLQN

There are already good introductions to the theory online; the description of the algorithm below is adapted mainly from the following article:

http://www.cnblogs.com/downtjs/p/3222643.html

The OWL-QN Algorithm

1. The BFGS Algorithm

The idea of the algorithm is as follows:

STEP1: Choose an initial point x_0, an initial positive definite matrix H_0 (e.g. the identity), an allowable error ε > 0, and set k = 0;

STEP2: Compute the gradient g_k = ∇f(x_k);

STEP3: Compute the search direction d_k = −H_k g_k and find a step size λ_k such that f(x_k + λ_k d_k) = min over λ ≥ 0 of f(x_k + λ d_k);

STEP4: Set x_{k+1} = x_k + λ_k d_k;

STEP5: If ||g_{k+1}|| ≤ ε, take x_{k+1} as the approximate optimal solution; otherwise go to the next step;

STEP6: Compute

s_k = x_{k+1} − x_k,  y_k = g_{k+1} − g_k,
H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,  where ρ_k = 1 / (y_k^T s_k).

Set k = k + 1 and go to STEP2.
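The update formulas in STEP6 did not survive the original formatting; as an illustrative reconstruction (a minimal sketch, not code from the linked repository), the dense inverse-Hessian update could be written in Go as follows. Note that it stores and multiplies full n×n matrices, which is exactly the drawback discussed below:

```go
package sketch

// bfgsUpdate returns the updated inverse-Hessian approximation
//   H_{k+1} = (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T,
// where rho = 1/(y^T s), s = x_{k+1}-x_k, y = g_{k+1}-g_k.
func bfgsUpdate(h [][]float64, s, y []float64) [][]float64 {
	n := len(s)
	ys := 0.0
	for i := 0; i < n; i++ {
		ys += y[i] * s[i]
	}
	rho := 1.0 / ys

	// a = I - rho * s * y^T; note that (I - rho*y*s^T) is a^T.
	a := make([][]float64, n)
	for i := range a {
		a[i] = make([]float64, n)
		for j := 0; j < n; j++ {
			a[i][j] = -rho * s[i] * y[j]
			if i == j {
				a[i][j] += 1.0
			}
		}
	}

	// tmp = a * h, then newH = tmp * a^T + rho * s * s^T.
	tmp := matMul(a, h)
	newH := make([][]float64, n)
	for i := range newH {
		newH[i] = make([]float64, n)
		for j := 0; j < n; j++ {
			v := 0.0
			for k := 0; k < n; k++ {
				v += tmp[i][k] * a[j][k] // a^T[k][j] == a[j][k]
			}
			newH[i][j] = v + rho*s[i]*s[j]
		}
	}
	return newH
}

// matMul multiplies two square matrices of the same size.
func matMul(x, y [][]float64) [][]float64 {
	n := len(x)
	z := make([][]float64, n)
	for i := 0; i < n; i++ {
		z[i] = make([]float64, n)
		for j := 0; j < n; j++ {
			for k := 0; k < n; k++ {
				z[i][j] += x[i][k] * y[k][j]
			}
		}
	}
	return z
}
```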

Advantages:

1. There is no need to compute the Hessian matrix directly;

2. The inverse of the Hessian matrix is replaced by an approximation that is updated iteratively.

Disadvantages:

1. The storage cost of the approximation matrix is O(n²), so when the dimension n is very large the memory requirement becomes unacceptable;

2. The approximation matrix is dense (non-sparse), which slows down training.

2. The L-BFGS Algorithm

To address the drawbacks of BFGS, the key question is how to estimate the inverse of the Hessian matrix economically. The basic idea of L-BFGS is to keep only the information of the most recent m iterations, which greatly reduces the storage requirement. Starting from BFGS, the update formula can be reorganized as:

H_{k+1} = V_k^T H_k V_k + ρ_k s_k s_k^T,  where ρ_k = 1 / (y_k^T s_k) and V_k = I − ρ_k y_k s_k^T.

So the estimated inverse of the Hessian matrix can be expanded recursively. Substituting an initial matrix H_0 into the formula, and assuming the current iteration is k with only the most recent m iterations kept (i.e. the pairs from k−m to k−1), unrolling in turn gives:

Equation 1:

H_k = (V_{k−1}^T ⋯ V_{k−m}^T) H_0 (V_{k−m} ⋯ V_{k−1})
    + ρ_{k−m} (V_{k−1}^T ⋯ V_{k−m+1}^T) s_{k−m} s_{k−m}^T (V_{k−m+1} ⋯ V_{k−1})
    + ⋯
    + ρ_{k−1} s_{k−1} s_{k−1}^T

The ultimate goal of this derivation is to find the search direction of the k-th iteration, satisfying p_k = −H_k g_k.

To compute this direction p without ever forming H_k explicitly, there is the following:

Two-loop recursion algorithm
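The pseudocode image for the two-loop recursion did not survive; the following is a minimal Go sketch of the standard procedure (an illustration, not code from the linked repository), assuming s[i] and y[i] hold the most recent vector pairs ordered oldest to newest:

```go
// dot returns the inner product of two vectors of equal length.
func dot(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// twoLoop computes r = H_k * grad using the standard two-loop recursion.
// s[i] = x_{i+1} - x_i and y[i] = g_{i+1} - g_i are the stored vector pairs,
// ordered oldest to newest; h0 is the diagonal scaling of the initial matrix
// H_k^0 = h0 * I. The search direction is then -r.
func twoLoop(grad []float64, s, y [][]float64, h0 float64) []float64 {
	m := len(s)
	n := len(grad)

	q := make([]float64, n)
	copy(q, grad)

	alpha := make([]float64, m)
	rho := make([]float64, m)

	// Backward pass: newest pair first.
	for i := m - 1; i >= 0; i-- {
		rho[i] = 1.0 / dot(y[i], s[i])
		alpha[i] = rho[i] * dot(s[i], q)
		for j := 0; j < n; j++ {
			q[j] -= alpha[i] * y[i][j]
		}
	}

	// Apply the initial matrix H_k^0 = h0 * I.
	r := make([]float64, n)
	for j := 0; j < n; j++ {
		r[j] = h0 * q[j]
	}

	// Forward pass: oldest pair first.
	for i := 0; i < m; i++ {
		beta := rho[i] * dot(y[i], r)
		for j := 0; j < n; j++ {
			r[j] += s[i][j] * (alpha[i] - beta)
		}
	}
	return r
}
```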

The correctness of the algorithm can be derived as follows:

1. Let q_k = g_k and, for i = k−1, ..., k−m, define α_i = ρ_i s_i^T q_{i+1} and q_i = q_{i+1} − α_i y_i = V_i q_{i+1}. After the first (backward) loop, q therefore equals V_{k−m} ⋯ V_{k−1} g_k, and the corresponding α_i have all been recorded.

2. Let r_{k−m} = H_0 q_{k−m} and, for i = k−m, ..., k−1, let r_{i+1} = V_i^T r_i + α_i s_i (this is exactly the update r ← r + s_i (α_i − β_i) with β_i = ρ_i y_i^T r).

The final r_k then equals H_k g_k: the result of the two-loop recursion is exactly Equation 1 multiplied by the current gradient. The benefits of doing it this way are:

1. Only the vector pairs s_i and y_i (i = 1 ... m) need to be stored, instead of an n×n matrix;

2. The time complexity of computing the search direction drops from O(n²) to O(n·m), which is effectively linear in n when m is far smaller than n.

The steps of the L-BFGS algorithm can be summarized as follows:

STEP1: Choose an initial point x_0, an allowable error ε > 0, and the number of stored iterations m (usually 6);

STEP2: Set k = 0, H_0 = I, r = ∇f(x_0);

STEP3: If ||∇f(x_k)|| ≤ ε, return x_k as the approximate optimal solution; otherwise go to STEP4;

STEP4: Compute the search direction of this iteration: p_k = −r;

STEP5: Compute the step size λ_k by a one-dimensional search: f(x_k + λ_k p_k) = min over λ ≥ 0 of f(x_k + λ p_k);

STEP6: Update the weights x: x_{k+1} = x_k + λ_k p_k;

STEP7: If k > m, keep only the vector pairs of the most recent m iterations and delete (s_{k−m}, y_{k−m});

STEP8: Compute and save s_k = x_{k+1} − x_k, y_k = ∇f(x_{k+1}) − ∇f(x_k);

STEP9: Use the two-loop recursion algorithm to obtain r = H_{k+1} ∇f(x_{k+1});

Set k = k + 1 and go to STEP3.

One point deserves attention: each iteration needs its own initial matrix H_k^0 for the two-loop recursion, and in practice the scaled identity H_k^0 = (s_{k−1}^T y_{k−1}) / (y_{k−1}^T y_{k−1}) · I has proven more effective than a fixed H_0.
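Continuing the earlier two-loop sketch (it reuses the hypothetical dot helper), that per-iteration scaling might be computed as:

```go
// initialScaling returns gamma_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1}),
// the diagonal of the per-iteration initial matrix H_k^0 = gamma_k * I used
// by the two-loop recursion. It falls back to 1.0 before any pair is stored.
func initialScaling(s, y [][]float64) float64 {
	if len(s) == 0 {
		return 1.0
	}
	last := len(s) - 1
	return dot(s[last], y[last]) / dot(y[last], y[last])
}
```

The search direction in STEP4 is then the negation of twoLoop(grad, s, y, initialScaling(s, y)).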

3. The OWL-QN Algorithm

1. Problem description

For log-linear models such as logistic regression, training can generally be reduced to minimizing a problem of the form f(x) = loss(x) + r(x).

The first term is the loss function, which measures how far the model deviates from the training data and can be any differentiable convex function (for a non-convex loss the algorithm only guarantees a local optimum); the second term r(x) is the regularization term, which restricts the model space so as to obtain a "simpler" model.

Depending on the probability distribution the model parameters are assumed to follow, the regularization term is typically the L1-norm (parameters following a Laplace prior) or the L2-norm (parameters following a Gaussian prior), or some other distribution or combination of them.
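As a quick check of that correspondence (a standard derivation, not part of the original article): taking the negative logarithm of each prior, up to additive constants, a Laplace prior p(x_i) ∝ exp(−λ|x_i|) gives −log p(x_i) = λ|x_i| + const, i.e. an L1 penalty, while a Gaussian prior p(x_i) ∝ exp(−λ x_i²) gives −log p(x_i) = λ x_i² + const, i.e. an L2 penalty.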

The L2-norm regularizer takes a form similar to:

r(x) = C ||x||₂² = C Σ_i x_i²

The L1-norm regularizer takes a form similar to:

r(x) = C ||x||₁ = C Σ_i |x_i|

One of the biggest differences between the L1-norm and the L2-norm is that the former produces sparse solutions, which means it performs feature selection at the same time; in addition, sparse feature weights are easier to interpret.

The choice of loss function is not discussed further here; instead, look at two pictures:

Figure 1: red = Laplace prior, black = Gaussian prior

Figure 2: A visual explanation of how sparsity arises

For the LR model the loss function is chosen to be convex and differentiable, and the L2-norm regularizer is as well, so by optimization theory the optimal solution satisfies the first-order (KKT) condition ∇f(x*) = 0. The L1-norm regularization term, however, is clearly not differentiable everywhere; what can be done then?

2. Orthant-Wise Limited-memory Quasi-Newton

OWL-QN is designed specifically for the L1-norm. It is based on the fact that, restricted to any given orthant (a region in which the sign of each coordinate is fixed), the L1-norm is differentiable, because there it is simply a linear function:

Figure 3: The L1-norm restricted to a given orthant

OWL-QN uses a subgradient to determine the search direction. A convex function is not necessarily smooth and differentiable everywhere, but a quantity that behaves like the gradient for descent still exists; for multivariate functions this generalized gradient is called the subgradient. See Wikipedia: http://en.wikipedia.org/wiki/Subderivative

As an example:

Figure 4: The subderivative

For any x0 in the domain we can always draw a straight line that passes through the point (x0, f(x0)) and either touches the graph of f or lies below it. The slope of such a line is called a subderivative of the function, and its generalization to multivariate functions is called the subgradient.

Subderivative and subdifferential:

A subderivative of a convex function f: I → R at a point x0 is a real number c such that

f(x) − f(x0) ≥ c (x − x0)

for all x in I. It can be shown that the set of subderivatives at x0 is a non-empty closed interval [a, b], where a and b are the one-sided limits

a = lim_{x→x0⁻} (f(x) − f(x0)) / (x − x0),
b = lim_{x→x0⁺} (f(x) − f(x0)) / (x − x0).

They are guaranteed to exist and satisfy a ≤ b. The set [a, b] of all subderivatives is called the subdifferential of f at x0.
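As a concrete worked example (the standard one from the Wikipedia page cited above): for f(x) = |x| at x0 = 0, the one-sided limits are a = lim_{x→0⁻} |x|/x = −1 and b = lim_{x→0⁺} |x|/x = +1, so the subdifferential of |x| at 0 is the whole interval [−1, 1]; at any x0 ≠ 0 it is the single point {sign(x0)}.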

The differences between OWL-QN and the traditional L-BFGS are:

1) It generalizes the gradient using the concept of the subgradient.

A pseudo-gradient ("virtual gradient") consistent with the idea above is defined and used in place of the gradient in L-BFGS when computing the search direction for the one-dimensional search. For f(x) = loss(x) + C ||x||₁, its i-th component is the left partial derivative ∂ᵢ⁻f(x) if that is positive, the right partial derivative ∂ᵢ⁺f(x) if that is negative, and 0 otherwise, where ∂ᵢ±f(x) = ∂loss(x)/∂xᵢ + C·sign(xᵢ) when xᵢ ≠ 0, and ∂loss(x)/∂xᵢ ± C when xᵢ = 0.

How should this pseudo-gradient be understood? For a non-smooth convex function there are three cases, illustrated by the figures below: if the right partial derivative is negative, the function still decreases to the right, so that derivative is used; if the left partial derivative is positive, the function decreases to the left, so that one is used; otherwise zero lies between the two one-sided derivatives and the pseudo-gradient is 0.

Figure 5

Figure 6

Figure 7: otherwise
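Putting the definition into code, a minimal Go sketch of this pseudo-gradient (an illustration under the notation above, not code from the linked repository) could look like:

```go
// pseudoGradient computes the pseudo-gradient of f(x) = loss(x) + c*||x||_1
// used by OWL-QN, given the gradient of the smooth loss part alone.
// For each coordinate it returns the left partial derivative if that is
// positive, the right partial derivative if that is negative, and 0 otherwise.
func pseudoGradient(x, lossGrad []float64, c float64) []float64 {
	pg := make([]float64, len(x))
	for i := range x {
		var left, right float64
		switch {
		case x[i] < 0:
			left, right = lossGrad[i]-c, lossGrad[i]-c
		case x[i] > 0:
			left, right = lossGrad[i]+c, lossGrad[i]+c
		default: // x[i] == 0: the two one-sided derivatives differ
			left, right = lossGrad[i]-c, lossGrad[i]+c
		}
		switch {
		case left > 0:
			pg[i] = left
		case right < 0:
			pg[i] = right
		default:
			pg[i] = 0
		}
	}
	return pg
}
```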

2) The one-dimensional search is not allowed to cross orthant boundaries.

Each updated weight is required to stay in the same orthant as (keep the same sign as) the weight before the update; any coordinate whose sign would flip is projected back to 0: x_{k+1} = π(x_k + λ_k p_k; ξ_k), where π_i(x; ξ) = x_i if sign(x_i) = ξ_i and 0 otherwise.
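A minimal Go sketch of the orthant choice and projection just described (again an illustration with hypothetical helper names):

```go
// project returns x with every coordinate whose sign disagrees with the
// chosen orthant xi set to zero, so a line-search step never crosses an
// orthant boundary.
func project(x, xi []float64) []float64 {
	out := make([]float64, len(x))
	for i := range x {
		if x[i]*xi[i] > 0 {
			out[i] = x[i]
		} // otherwise out[i] stays 0
	}
	return out
}

// chooseOrthant picks the orthant for the next step: the sign of the current
// weight if it is nonzero, otherwise the sign of the steepest-descent
// direction (the negative pseudo-gradient) in that coordinate.
func chooseOrthant(x, pg []float64) []float64 {
	xi := make([]float64, len(x))
	for i := range x {
		switch {
		case x[i] > 0:
			xi[i] = 1
		case x[i] < 0:
			xi[i] = -1
		case pg[i] < 0: // x[i] == 0
			xi[i] = 1
		case pg[i] > 0:
			xi[i] = -1
		}
	}
	return xi
}
```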

Figure 8: One iteration of OWL-QN

To summarize one iteration of OWL-QN:

– Find the vector of steepest descent (the negative pseudo-gradient)

– Choose an orthant

– Find the L-BFGS quadratic approximation

– Jump to its minimum

– Project back onto the orthant

– Update the Hessian approximation using the gradient of the loss alone

The final OWL-QN algorithm framework is the L-BFGS procedure above with the following changes. Compared with L-BFGS: the first step replaces the gradient with the pseudo-gradient; the second and third steps require the one-dimensional search not to cross orthant boundaries, i.e. the point before and after the iteration must lie in the same orthant; and the fourth step requires that the Hessian approximation still be updated using the gradient of the loss function alone (the presence of the L1-norm term does not affect the estimation of the Hessian).
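Tying these pieces together, here is a sketch of one OWL-QN iteration in Go, reusing the hypothetical helpers from the earlier sketches (twoLoop, initialScaling, pseudoGradient, chooseOrthant, project). It is an illustrative outline of the four modifications, not the implementation in the linked repository:

```go
// abs avoids importing math for this fragment.
func abs(v float64) float64 {
	if v < 0 {
		return -v
	}
	return v
}

// regularizedObjective is loss(x) + c * ||x||_1.
func regularizedObjective(x []float64, c float64, loss func([]float64) float64) float64 {
	l1 := 0.0
	for _, v := range x {
		l1 += abs(v)
	}
	return loss(x) + c*l1
}

// owlqnStep sketches one OWL-QN iteration. lossGrad must return the gradient
// of the smooth loss part only; the caller appends the new (s, y) pair
// afterwards, with y also computed from the loss gradient alone.
func owlqnStep(x []float64, s, y [][]float64, c float64,
	loss func([]float64) float64, lossGrad func([]float64) []float64) []float64 {

	// Change 1: replace the gradient with the pseudo-gradient.
	pg := pseudoGradient(x, lossGrad(x), c)

	// L-BFGS direction from the pseudo-gradient, keeping only components
	// that agree in sign with the steepest-descent direction -pg.
	d := twoLoop(pg, s, y, initialScaling(s, y))
	for i := range d {
		d[i] = -d[i]
		if d[i]*pg[i] >= 0 {
			d[i] = 0
		}
	}

	// Changes 2-3: backtracking line search constrained to the chosen
	// orthant; every trial point is projected back, so the step never
	// crosses a coordinate axis.
	xi := chooseOrthant(x, pg)
	f0 := regularizedObjective(x, c, loss)
	step := 1.0
	xNew := x
	for t := 0; t < 30; t++ {
		trial := make([]float64, len(x))
		for i := range x {
			trial[i] = x[i] + step*d[i]
		}
		trial = project(trial, xi)
		if regularizedObjective(trial, c, loss) < f0 {
			xNew = trial
			break
		}
		step *= 0.5
	}
	return xNew
}
```

A full implementation would also append the new (s_k, y_k) pair, computed from the loss gradient alone (change 4), and drop the oldest pair once more than m are stored.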

4. References

1. Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of ICML, pages 33–40.

2. http://freemind.pluskid.org/machine-learning/sparsity-and-some-basics-of-l1-regularization/#d20da8b6b2900b1772cb16581253a77032cec97e

3. http://research.microsoft.com/en-us/downloads/b1eb1016-1738-4bd5-83a9-370c9d498a03/default.aspx
