Several variants of logistic regression

Source: Internet
Author: User

Original: http://blog.xlvector.net/2014-02/different-logistic-regression/

In recent years, advertising systems have become important at many companies. Targeted advertising is a key technology in these systems, click-through rate (CTR) estimation is a key part of targeted advertising, and logistic regression is the machine learning algorithm most commonly used for CTR estimation. This article therefore describes logistic regression (hereinafter referred to as LR).

The problem solved

LR is mainly used to solve binary (two-class) classification problems. The following are typical examples:

    1. Whether a user who sees an ad will click on it or not.
    2. Whether a person is male or female.
    3. Whether the image in a picture is a human face or not.
    4. Whether a person who borrows money will repay it.

Binary classification is a basic problem in machine learning, and every classification algorithm can at least solve binary problems, for example:

    1. Decision tree, random forest, GBDT
    2. SVM (support vector machine)
    3. Gaussian process
    4. Neural network

So why is LR chosen for the CTR estimation problem? Mainly because:

    1. The data set is large, and LR has very low computational cost for both training and prediction.
    2. There are many features, and after feature transformation the problem is basically linear, so a linear classifier can solve it.
    3. LR predicts not only which class a sample belongs to, but also the probability of each class.
    4. The LR model is simple, so its predictions are easy to explain.
    5. The LR model is simple, which makes parallelization relatively easy.
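
Points 3 and 4 above can be illustrated with a minimal sketch of LR prediction. The weights, bias, and sample values are hypothetical numbers chosen only for illustration; they do not come from any trained model.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, features):
    """Probability of the positive class for one dense sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical model with three features (illustrative values only).
weights = [0.8, -1.2, 0.3]
bias = -0.5
sample = [1.0, 0.0, 2.0]

p_click = predict_proba(weights, bias, sample)
# Each feature's share of the linear score is what makes LR easy to explain.
contributions = [w * x for w, x in zip(weights, sample)]
print(f"P(click) = {p_click:.3f}")
print("per-feature contributions to the score:", contributions)
```

The output is a probability rather than just a class label, and the per-feature contributions show which features pushed the score up or down.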

Different types of LR

Since LR was introduced, academic improvements to it have focused on two main aspects:

    1. Which regularization to use: L2 regularization was used early on, and L1 regularization has been used more in recent years.
    2. Which optimization algorithm to use, so as to converge to the optimal solution in the shortest time.

Regularization

Regularization is an important technique in machine learning whose main purpose is to prevent a model from overfitting. The most commonly used forms today are L1 and L2:

    1. L2 regularization assumes that the prior distribution of a feature's weight is a Gaussian distribution centered at 0.
    2. L1 regularization assumes that the prior distribution of a feature's weight is a Laplace distribution centered at 0.
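
The link between priors and penalties can be made concrete: the negative log of a zero-centered Gaussian prior is proportional to the sum of squared weights, and the negative log of a zero-centered Laplace prior is proportional to the sum of absolute weights. A minimal sketch of the resulting regularized objectives follows; the function names and the strength parameter `lam` are assumptions made for this example.

```python
import math

def log_loss(weights, samples):
    """Mean negative log-likelihood; samples are (features, label), label in {0, 1}."""
    total = 0.0
    for x, y in samples:
        z = sum(w * xi for w, xi in zip(weights, x))
        # log(1 + exp(z)) - y*z, with log(1 + exp(z)) computed stably
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total / len(samples)

def l2_penalty(weights, lam):
    """Corresponds to a zero-centered Gaussian prior on each weight."""
    return 0.5 * lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """Corresponds to a zero-centered Laplace prior on each weight."""
    return lam * sum(abs(w) for w in weights)

def objective(weights, samples, lam, kind="l2"):
    penalty = l2_penalty if kind == "l2" else l1_penalty
    return log_loss(weights, samples) + penalty(weights, lam)

# Toy data, illustrative only.
samples = [([1.0, 2.0], 1), ([1.0, -1.0], 0)]
print(objective([0.5, 0.5], samples, lam=0.1, kind="l2"))
print(objective([0.5, 0.5], samples, lam=0.1, kind="l1"))
```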

Compared with L2 regularization, L1 regularization has one advantage: after optimizing a loss function that includes L1 regularization, most feature weights are 0. This sparsity can significantly reduce the memory footprint of online prediction and increase prediction speed, because:

    • Online prediction mainly computes the dot product of the sample's feature vector x and the model's weight vector w.
    • The vector w is generally stored in a hash map, and a feature with weight 0 does not need to be stored, because a feature that is absent from the hash map is treated as having weight 0.
    • So L1 regularization reduces the memory consumed by w, and as w shrinks, computing the dot product of w and x also becomes faster.
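
The hash-map storage described above can be sketched as follows, using a Python dict in place of a HashMap. The feature names and weight values are hypothetical.

```python
import math

def sparse_dot(weights: dict, features: dict) -> float:
    """Dot product where a feature missing from `weights` counts as weight 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def predict(weights: dict, features: dict) -> float:
    z = sparse_dot(weights, features)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical weights after L1-regularized training: many are exactly 0.
dense_weights = {"ad:123": 1.5, "user:42": 0.0, "slot:top": -0.7, "hour:3": 0.0}
# Dropping the zeros changes no prediction but shrinks the stored model.
sparse_weights = {name: w for name, w in dense_weights.items() if w != 0.0}

sample = {"ad:123": 1.0, "user:42": 1.0, "hour:3": 1.0}
print(len(dense_weights), len(sparse_weights))
print(predict(dense_weights, sample) == predict(sparse_weights, sample))
```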

Optimization method

The loss function of L2-regularized LR is a differentiable convex function, so it can be optimized by the steepest descent method (gradient method). There are three common gradient methods:

    1. Batch
    2. Mini-batch
    3. SGD (stochastic gradient descent)
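
The three variants differ only in how many samples feed each gradient step. A minimal sketch for L2-regularized LR follows; the learning rate, epoch count, regularization strength, and toy data are assumptions chosen for illustration, not tuned values.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(w, batch, l2):
    """Gradient of the mean log loss over `batch` plus (l2/2) * ||w||^2."""
    g = [l2 * wi for wi in w]
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / len(batch)
    return g

def train(data, batch_size, epochs=200, lr=0.5, l2=0.01, seed=0):
    """batch_size=len(data): batch; 1: SGD; anything in between: mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            g = gradient(w, data[start:start + batch_size], l2)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Toy data: the first feature is a constant bias term; the label is 1 when
# the second feature is positive.
data = [([1.0, x], 1 if x > 0 else 0)
        for x in (-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0)]

w_batch = train(data, batch_size=len(data))
w_mini = train(data, batch_size=4)
w_sgd = train(data, batch_size=1)
print(w_batch, w_mini, w_sgd)
```

Batch uses the exact gradient once per pass, while SGD takes many noisy steps per pass; mini-batch sits between the two.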

These three methods were the earliest optimization methods proposed. Beyond the gradient method, Newton-type methods can achieve superlinear convergence, so the conjugate gradient method and L-BFGS are also used to optimize LR. L-BFGS applies to L2 regularization; for L1 regularization, Microsoft proposed the OWL-QN algorithm (http://blog.csdn.net/qm1004/article/details/18083637).

Both the gradient methods and the quasi-Newton methods belong to the frequentist school: they all perform maximum likelihood estimation and differ only in the optimization algorithm used. The Bayesian school has proposed its own optimization algorithm:

    • Ad Predictor: an algorithm proposed by Microsoft researchers. See the paper "Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine".

Ad Predictor has several attractive properties:

    1. It only needs to scan the data set once to converge to the optimal solution, instead of iterating over the data set repeatedly like the gradient methods or quasi-Newton methods.
    2. It not only predicts the probability that a sample is positive, but also gives the confidence of that predicted probability.

Ad Predictor is good, but it is based on L2 regularization, which is not entirely satisfying. In 2013 Google published a paper (Ad Click Prediction: a View from the Trenches) that introduced FTRL-Proximal, an optimization algorithm for LR based on L1 regularization, which also has the two advantages of Ad Predictor listed above.
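
A minimal sketch of a per-coordinate FTRL-Proximal update for LR, following the algorithm as described in the Google paper. The class name, the hyperparameter values, and the feature names are assumptions made for this example, and the sketch omits the confidence estimate and the engineering details discussed in the paper.

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated adjusted gradients
        self.n = {}  # per-feature accumulated squared gradients

    def weight(self, i):
        """Weights are derived lazily from z and n; small |z| gives exactly 0."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(i, 0.0)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        s = sum(self.weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # clamp so exp cannot overflow
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        """One online step on a sparse sample x (dict) with label y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of the log loss for coordinate i
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n + g * g
        return p

# Hypothetical click stream: one ad is always clicked, the other never.
model = FTRLProximal(alpha=0.1, beta=1.0, l1=0.1, l2=0.0)
stream = [({"ad:good": 1.0}, 1), ({"ad:bad": 1.0}, 0)] * 50
for x, y in stream:
    model.update(x, y)

print(model.predict({"ad:good": 1.0}), model.predict({"ad:bad": 1.0}))
```

Only z and n are stored per feature, and any feature whose accumulated z stays within the L1 threshold has a weight of exactly 0, which is where the sparsity comes from.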

Parallelization

There are two kinds of algorithm parallelization:

    1. Lossless parallelization: the algorithm is naturally parallel, so parallelization only speeds up the computation, and the result is the same as that of normal serial execution.
    2. Lossy parallelization: the algorithm is not naturally parallel, so it must be approximated to run in parallel, and the result after parallelization is similar to, but not identical with, that of normal execution.

Among the algorithms mentioned earlier, the batch-based ones (batch GD, L-BFGS, OWL-QN) can be parallelized losslessly. The SGD-style algorithms (Ad Predictor, FTRL-Proximal) can only be parallelized lossily.
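
Batch methods parallelize losslessly because the batch gradient is a sum over samples: each worker can compute the gradient of its own shard, and the partial results are simply added. A minimal sketch follows; the shard count, the toy data, and the use of a thread pool are assumptions for illustration (a real system would spread shards across processes or machines).

```python
import math
from concurrent.futures import ThreadPoolExecutor

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def shard_gradient(args):
    """Unnormalized log-loss gradient over one shard of the data."""
    w, shard = args
    g = [0.0] * len(w)
    for x, y in shard:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g

def parallel_gradient(w, data, num_shards=4):
    """Split the data, compute per-shard gradients concurrently, then sum them."""
    shards = [data[k::num_shards] for k in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = list(pool.map(shard_gradient, [(w, s) for s in shards]))
    return [sum(column) for column in zip(*partials)]

# Toy data and weights, illustrative only.
data = [([1.0, 0.1 * i], i % 2) for i in range(20)]
w = [0.2, -0.3]

serial = shard_gradient((w, data))
parallel = parallel_gradient(w, data)
print(all(abs(a - b) < 1e-9 for a, b in zip(serial, parallel)))
```

The two results agree up to floating-point summation order. An SGD-style method has no such decomposition, because each update depends on the weights produced by the previous sample.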
