Stochastic Gradient Descent

Document directory
  • 1. Multinomial Logistic
  • 2. Maximum Likelihood Estimate and Maximum a Posteriori Estimate
  • 3. L1-regularized model and L2-regularized model
  • 4. L1-regularized model? Or L2-regularized model?
  • 1. Naive Stochastic Gradient Descent
  • 2. Lazy Stochastic Gradient Descent
  • 3. Stochastic Gradient Descent with Cumulative Penalty
  • 4. Online Stochastic Gradient Descent
  • 5. Parallelized Stochastic Gradient Descent
1. Start from the Multinomial Logistic Model

1. Multinomial Logistic

x is a d-dimensional input vector;

c is the output label (a total of k classes);

\beta is the model parameter vector.

The Multinomial Logistic model refers to the following form:

p(c|x,\beta)=\begin{cases}
\frac{e^{\beta_c \cdot x}}{Z_x} & \text{if } c < k-1 \\
\frac{1}{Z_x} & \text{if } c = k-1
\end{cases}

where Z_x = 1 + \sum_{c'<k-1} e^{\beta_{c'} \cdot x} is the normalization term.

For example, with binary output labels 0 and 1 (k = 2), this reduces to:

p(c|x,\beta)=\begin{cases}
\frac{e^{\beta_0 \cdot x}}{1 + e^{\beta_0 \cdot x}} & \text{if } c = 0 \\
\frac{1}{1 + e^{\beta_0 \cdot x}} & \text{if } c = 1
\end{cases}
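As a sanity check, the distribution above is easy to compute directly. The following sketch (plain Python; the weight vectors are made up for illustration) evaluates p(c|x, β) for k = 3 classes, treating class k−1 as the reference class with unnormalized score 1:

```python
import math

def multinomial_logistic(x, betas):
    """p(c | x, beta) for k classes; betas holds the k-1 weight vectors
    for classes 0..k-2 (class k-1 is the reference class)."""
    scores = [math.exp(sum(b_i * x_i for b_i, x_i in zip(b, x))) for b in betas]
    z = 1.0 + sum(scores)                  # Z_x = 1 + sum_{c<k-1} e^{beta_c . x}
    return [s / z for s in scores] + [1.0 / z]

# k = 3 classes, 2-dimensional input (illustrative weights)
probs = multinomial_logistic([1.0, 2.0], [[0.5, -0.2], [0.1, 0.3]])
```

The returned list sums to 1 by construction, since the k probabilities share the common normalizer Z_x.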

 

2. Maximum Likelihood Estimate and Maximum a Posteriori Estimate

(1) Maximum Likelihood Estimate

Suppose there is a dataset \{(x_j, c_j)\}_{j<n}. To train a model, the maximum likelihood method is usually used to determine the model parameters:

\beta = \arg\max_\beta \sum_{j<n} \log p(c_j|x_j,\beta)

(2) Maximum a Posteriori Estimate

Assuming the model parameters follow a prior distribution p(\beta|\delta^2), the optimal parameters on a given dataset satisfy:

\beta = \arg\max_\beta \prod_{j<n} p(c_j|x_j,\beta)\, p(\beta|\delta^2)

Taking the negative logarithm turns this into the loss function to minimize:

\Leftrightarrow \beta = \arg\min_\beta -\left[\sum_{j<n} \log p(c_j|x_j,\beta) + \sum_{j<n} \log p(\beta_j|\delta^2)\right]

In my opinion, from the perspective of statistical learning, the first term of the formula describes the bias (empirical risk), and the second term describes the variance (confidence risk).

3. L1-regularized model and L2-regularized model

The following assumptions can be made about the distribution of the model parameters:

(1) Gaussian Prior

p(\beta_i|\delta^2) = \frac{1}{\sqrt{2\pi}\,\delta}\, e^{-\frac{\beta_i^2}{2\delta^2}}

(2) Laplace Prior

p(\beta_i|b) = \frac{1}{2b}\, e^{-\frac{|\beta_i|}{b}}

When \beta \sim \text{Gaussian Prior}, the model is called L2-regularized:

\beta = \arg\min_\beta -\sum_{j<n} \log p(c_j|x_j,\beta) + \lambda \|\beta\|_2^2

When \beta \sim \text{Laplace Prior}, the model is called L1-regularized:

\beta = \arg\min_\beta -\sum_{j<n} \log p(c_j|x_j,\beta) + \lambda \|\beta\|_1

Here, the constant \lambda is an adjustment factor that balances bias and variance:

● When \lambda is very small, the likelihood dominates and the model tends to overfit;

● When \lambda is very large, the regularization dominates and the model tends to underfit.

Under the same conditions, the comparison between Gaussian Prior and Laplace Prior is as follows:

Figure 1 - Laplace Prior (red) and Gaussian Prior (black)

 

4. L1-regularized model? Or L2-regularized model?

Currently, mainstream methods choose L1-regularization, including variants of L-BFGS (e.g., OWL-QN) and various SGD methods. The main reasons are as follows:

● Our goal is:

\beta = \arg\min_\beta -\left[\sum_{j<n} \log p(c_j|x_j,\beta) + \sum_{j<n} \log p(\beta_j|\delta^2)\right]

As shown in Figure 1, to make the prior term large, the weights must stay close to the prior's mean (i.e., 0). The Laplace Prior is more sharply peaked at 0 than the Gaussian Prior;

● Take the gradient descent algorithm as an example. Considering only the penalty term, the weight update is:

○ Gaussian Prior: \beta_i^{k+1} = \beta_i^k - 2\eta\lambda\beta_i^k

○ Laplace Prior: \beta_i^{k+1} = \beta_i^k - \eta\lambda\,\text{sign}(\beta_i^k)

When \beta_i^k > 0, the penalty subtracts \eta\lambda; when \beta_i^k < 0, it adds \eta\lambda.

When the likelihood gradient and the weight have the same sign (the sample is classified correctly), the absolute value of the weight is updated at a relatively small speed; when their signs differ (a misclassification), the absolute value of the weight is updated at a relatively large speed.

● Regard the weight update as two stages: likelihood + regularization. Ignoring the likelihood for the moment, after k iterations the following relations hold:

○ Gaussian Prior: \beta_i^k = \beta_i^0 (1 - 2\eta\lambda)^k

○ Laplace Prior: \beta_i^k = \beta_i^0 - k\eta\lambda\,\text{sign}(\beta_i^0)

When \beta_i^0 > 0, \beta_i^k = \beta_i^0 - k\eta\lambda; when \beta_i^0 < 0, \beta_i^k = \beta_i^0 + k\eta\lambda.

The former approaches 0 only in the limit and never reaches exactly 0, while the latter subtracts a constant at each step, which means the latter can update a weight to exactly 0.

● L1-regularization yields sparse feature weights, so feature selection is performed simultaneously during model training.

● If the input vectors are sparse, the Laplace Prior can keep the gradient sparse as well.
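The contrast between the two decay behaviors is easy to verify numerically. This sketch (with illustrative values for η and λ) applies only the penalty term for a number of steps:

```python
def l2_decay(beta0, eta, lam, steps):
    """Gaussian prior: beta <- beta * (1 - 2*eta*lam), geometric decay to 0."""
    beta = beta0
    for _ in range(steps):
        beta *= (1.0 - 2.0 * eta * lam)
    return beta

def l1_decay(beta0, eta, lam, steps):
    """Laplace prior: beta <- beta - eta*lam*sign(beta), clipped at exactly 0."""
    beta = beta0
    for _ in range(steps):
        step = eta * lam
        if abs(beta) <= step:
            return 0.0            # the constant-size update reaches exactly 0
        beta -= step if beta > 0 else -step
    return beta

b2 = l2_decay(1.0, 0.1, 0.5, 100)   # tiny, but never exactly 0
b1 = l1_decay(1.0, 0.1, 0.5, 100)   # exactly 0.0
```

With these values, the L2 weight is (0.9)^100 ≈ 2.7e-5 and stays positive forever, while the L1 weight hits exactly 0 after 20 steps.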

 

2. L1-regularized Stochastic Gradient Descent

1. Naive Stochastic Gradient Descent

The principle of stochastic gradient descent is to estimate the gradient of the objective function from a randomly selected subset of the training set. In the extreme case, the selected subset contains only one sample, and the weight update becomes:

\beta_i^{k+1} = \beta_i^k + \eta^k \frac{\partial}{\partial \beta_i} \log p(c_j|x_j,\beta^k) - \eta^k\lambda\,\text{sign}(\beta_i^k)

where

\text{sign}(x)=\begin{cases}
1 & x > 0 \\
0 & x = 0 \\
-1 & x < 0
\end{cases}

The disadvantages of this update method are as follows:

● Every iteration applies the L1 penalty to every feature, including unused features whose value is 0;

● In practice, the probability that an iteration updates a weight exactly to 0 is very small, which means many feature weights remain non-zero.
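A minimal sketch of the naive update for the binary logistic case (toy data and hyperparameters are made up for illustration; note that the inner loop penalizes every weight on every step, which is exactly the inefficiency noted above):

```python
import math, random

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def naive_sgd_l1(data, dim, eta=0.1, lam=0.01, epochs=20, seed=0):
    """Naive SGD with L1 penalty for binary logistic regression.
    data: list of (x, c) pairs, x a dense vector and c in {0, 1}."""
    rng = random.Random(seed)
    beta = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(data)
        for x, c in data:
            score = sum(b * xi for b, xi in zip(beta, x))
            p0 = 1.0 / (1.0 + math.exp(-score))   # p(c = 0 | x, beta)
            grad = (1.0 - c) - p0                 # d log p(c|x) / d score
            # L1 penalty is applied to EVERY feature, even when x[i] == 0
            for i in range(dim):
                beta[i] += eta * grad * x[i] - eta * lam * sign(beta[i])
    return beta

data = [([1.0], 0), ([2.0], 0), ([-1.0], 1), ([-2.0], 1)]
beta = naive_sgd_l1(data, dim=1)   # learns a positive weight on this toy set
```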

2. Lazy Stochastic Gradient Descent

To address the above problems, Carpenter proposed an effective improvement in his paper Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression (2008). The weight update method is as follows:

\beta_i^{k+\frac{1}{2}} = \beta_i^k + \eta^k \frac{\partial}{\partial \beta_i} \log p(c_j|x_j,\beta^k)

\beta_i^{k+1}=\begin{cases}
\max(0,\ \beta_i^{k+\frac{1}{2}} - \eta^k\lambda) & \text{if } \beta_i^{k+\frac{1}{2}} > 0 \\
\min(0,\ \beta_i^{k+\frac{1}{2}} + \eta^k\lambda) & \text{if } \beta_i^{k+\frac{1}{2}} < 0
\end{cases}

The advantages of this update method are as follows:

● With this truncation, the penalty term never flips the sign of a weight, and zero weights arise naturally;

● Updates are performed in a lazy fashion: features whose value is 0 in the current sample are not updated, which speeds up training.

Disadvantages of this method:

● Because a single sample gives a high-variance estimate of the true gradient, the weight updates may fluctuate around zero.
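The clipping step of Carpenter's method can be sketched as a single-weight helper (η and λ are illustrative). The max/min against 0 is what prevents the penalty from flipping the weight's sign:

```python
def truncated_update(beta_i, grad_i, eta, lam):
    """One lazy-SGD step for a single weight: gradient step, then a
    clipped L1 penalty that can set the weight exactly to 0."""
    half = beta_i + eta * grad_i              # beta^{k+1/2}: likelihood step
    if half > 0:
        return max(0.0, half - eta * lam)     # penalty cannot cross below 0
    if half < 0:
        return min(0.0, half + eta * lam)     # penalty cannot cross above 0
    return 0.0

# A small weight is truncated exactly to 0 instead of overshooting:
truncated_update(0.03, 0.0, 0.1, 0.5)   # -> 0.0
```

In the full lazy algorithm, the penalty for a feature is deferred until that feature next appears in a sample, scaled by the number of skipped steps.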

3. Stochastic Gradient Descent with Cumulative Penalty

This method comes from Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty (2009). The weight update method is as follows:

 

"Src =" http://chart.apis.google.com/chart? Cht = tx & chlorophyll = % 5cbeta _ % 7bi % 7d % 5e % 7bk % 2b1% 7d % 3d % 5 cmin (0% 2c % 5cbeta_ I % 5e % 7bk % 2b % 5 cfrac % 7b1% 7d % 7b2% 7d % 7d % 7d % 7d + % 2b + (u % 5ek-q_ I % 5e % 7bk-1% 7d )) + % 0a ">

Where:

Indicates the cumulative penalty value that can be obtained theoretically when each weight is in the k iteration;

Indicates the cumulative penalty value of the current weight.

The full algorithm is given in pseudocode in the paper (not reproduced here).
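The core bookkeeping of the cumulative-penalty update can be sketched as a helper for one weight (a minimal sketch; the likelihood gradient step is omitted and the values are illustrative):

```python
def apply_cumulative_penalty(beta_i, q_i, u):
    """Apply the clipped cumulative L1 penalty to one weight.
    u   : total penalty each weight could have received so far (u^k);
    q_i : penalty actually applied to this weight so far (q_i^{k-1})."""
    z = beta_i
    if beta_i > 0:
        beta_i = max(0.0, beta_i - (u + q_i))
    elif beta_i < 0:
        beta_i = min(0.0, beta_i + (u - q_i))
    return beta_i, q_i + (beta_i - z)   # record the penalty actually applied

beta, q = apply_cumulative_penalty(0.5, 0.0, 0.2)  # penalized by 0.2, q = -0.2
beta, q = apply_cumulative_penalty(beta, q, 0.4)   # only 0.2 more penalty due
```

Because q_i tracks what has already been charged, a weight that temporarily sits at 0 is not penalized twice, and an infrequent feature receives its full accumulated penalty the next time it is touched.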
The traditional method for setting the learning rate is:

\eta^k = \frac{\eta^0}{1 + k/N}

where k is the iteration number and N is the number of training samples.

The convergence speed of this schedule is not ideal in practice. The paper proposes exponential decay instead:

\eta^k = \eta^0\, \alpha^{k/N}, \quad \alpha < 1

This performs better in practice; in theory it can no longer guarantee final convergence, but in practice the maximum number of iterations is bounded, so this is not a big problem.
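The two schedules are easy to compare numerically. A small sketch (η0, α, and N are illustrative values, not necessarily the paper's settings):

```python
def eta_traditional(eta0, k, n):
    """eta^k = eta0 / (1 + k/N): slow polynomial decay."""
    return eta0 / (1.0 + k / n)

def eta_exponential(eta0, alpha, k, n):
    """eta^k = eta0 * alpha^(k/N): geometric decay per pass over the data."""
    return eta0 * alpha ** (k / n)

# After many passes over N samples, the exponential rate is far smaller:
n = 1000
rate_trad = eta_traditional(1.0, 50 * n, n)        # 1/51 ~ 0.0196
rate_expo = eta_exponential(1.0, 0.85, 50 * n, n)  # 0.85**50 ~ 3.0e-4
```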

Compared with the L1-regularized OWL-QN method proposed in Galen Andrew and Jianfeng Gao's Scalable Training of L1-regularized Log-linear Models (2007), the results are as follows (comparison figure not reproduced here):

4. Online Stochastic Gradient Descent

Since the L1-regularized weight update term is a constant independent of the weight value, applying the penalty once for a batch of N samples has the same effect as applying it N times, once per sample. Therefore, only one sample and the model parameters need to be kept in memory at a time.

5. Parallelized Stochastic Gradient Descent

In Parallelized Stochastic Gradient Descent, Martin A. Zinkevich, Markus Weimer, Alex Smola and Lihong Li describe a simple and intuitive parallelization method: partition the training data across k machines, run SGD independently on each partition, and then average the resulting weight vectors.
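The scheme can be sketched as follows (the SGD subroutine and toy objective are made up for illustration; the essential structure is shard, train independently, average):

```python
import random

def sgd_shard(samples, eta=0.1, epochs=50, seed=0):
    """Minimize (w - s)^2 over scalar samples s via SGD; returns w."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for s in rng.sample(samples, len(samples)):
            w -= eta * 2.0 * (w - s)          # gradient of (w - s)^2
    return w

def parallel_sgd(samples, k):
    # 1) partition the training data into k shards
    shards = [samples[i::k] for i in range(k)]
    # 2) run SGD independently on each shard (in parallel in practice)
    weights = [sgd_shard(shard, seed=i) for i, shard in enumerate(shards)]
    # 3) average the resulting weight vectors
    return sum(weights) / k

w = parallel_sgd([1.0, 2.0, 3.0, 4.0], k=2)   # lands near the overall mean
```

In a real system, step 2 runs on separate machines with no communication until the final averaging step, which is what makes the method attractive for large datasets.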

As a next step, we will try to implement this algorithm on Spark and test it in practice.

 

3. References

1. Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of ICML, pages 33-40.

2. Bob Carpenter. 2008. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, Alias-I.

3. Martin A. Zinkevich, Markus Weimer, Alex Smola and Lihong Li. Parallelized stochastic gradient descent. Yahoo! Labs.

4. John Langford, Lihong Li, and Tong Zhang. 2009. Sparse online learning via truncated gradient. The Journal of Machine Learning Research (JMLR), 10: 777-801.

5. Charles Elkan. Maximum likelihood, logistic regression, and stochastic gradient training.

6. Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia Ananiadou. 2009. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. In Proceedings of ACL-IJCNLP.

 

4. Related open-source software

1. wapiti: http://wapiti.limsi.fr/

2. sgd2.0: http://mloss.org/revision/view/842/

3. scikit-learn: http://scikit-learn.org/stable/

4. Vowpal Wabbit: http://hunch.net/~vw/

5. deeplearning: http://deeplearning.net/

6. LingPipe: http://alias-i.com/lingpipe/index.html
