Gradient descent (GD) is a common method for minimizing a risk (loss) function. Stochastic gradient descent and batch gradient descent are two iterative variants of it. Below, both are analyzed in terms of their formulas and their implementations; if anything here is wrong, corrections from readers are welcome.
In the following, h(x) is the function to be fitted and J(θ) is the loss function; θ is the parameter vector whose value is solved for iteratively, and the θ that is solved determines the function h_θ(x) that is finally fitted. Here m is the number of records in the training set and n is the number of parameters.
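The original formula images did not survive the conversion; as a sketch, the standard definitions for the linear regression setting described above would read (the notation x^{(i)}, y^{(i)} for the i-th training example is an assumption):

```latex
h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j
\qquad
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right)^2
```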
1. Batch gradient descent (BGD) is solved as follows:
(1) Take the partial derivative of J(θ) with respect to each parameter θ_j, obtaining the gradient component corresponding to each θ_j.
(2) Since the risk function is to be minimized, update each θ_j in the direction of its negative gradient.
(3) Note from the update rule that this yields the global optimal solution, but every iteration step uses all the data in the training set; if m is large, you can imagine how slow each iteration will be. This motivates the other method, stochastic gradient descent.
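The steps above can be sketched in standard notation (a reconstruction consistent with the definitions of h_θ and J(θ) above, not the original images):

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right) x_j^{(i)}
\qquad
\theta_j := \theta_j + \frac{\alpha}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right) x_j^{(i)}
```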
2. Stochastic gradient descent (SGD) is solved as follows:
(1) The risk function above can be rewritten as a sum of per-sample losses; the loss function is at the granularity of a single training sample, whereas batch gradient descent above operates on all training samples at once:
(2) Take the partial derivative of each sample's loss with respect to θ and use it to update θ.
(3) Stochastic gradient descent updates the parameters once for every sample it visits. If the sample size is very large (say, hundreds of thousands), then θ may already have been iterated to near the optimal solution after seeing only tens of thousands or even thousands of samples. By contrast, a single iteration of batch gradient descent above already needs all hundred-thousand-odd training samples, one iteration is unlikely to reach the optimum, and ten iterations require traversing the whole training set ten times. The problem with SGD, however, is that it is noisier than BGD, so not every SGD iteration moves toward the overall optimum.
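In the same assumed notation, the per-sample decomposition of the risk and the SGD update made after visiting sample i can be sketched as:

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right),
\qquad
\operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right)
  = \frac{1}{2} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right)^2
\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right) x_j^{(i)}
```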
3. For the linear regression problem above, will stochastic gradient descent find the optimal solution, compared with batch gradient descent?
(1) Batch gradient descent minimizes the loss over all training samples, so the final solution is the global optimum; that is, the solved parameters minimize the risk function.
(2) Stochastic gradient descent minimizes the loss of each sample individually. Although not every iteration moves the loss toward the global optimum, the overall direction does head toward it, and the final result is usually near the global optimum.
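To make the comparison concrete, here is a minimal, self-contained Java sketch of both update schemes for a one-feature linear model (the class and method names are illustrative assumptions, unrelated to the NMF code below): batch sums the error over all m samples before each parameter update, while stochastic updates after every single sample.

```java
import java.util.Arrays;

public class GradientDescentDemo {
    // Batch gradient descent for h(x) = theta0 + theta1 * x:
    // one update per pass over ALL m samples.
    static double[] batch(double[] x, double[] y, double alpha, int iters) {
        double t0 = 0, t1 = 0;
        int m = x.length;
        for (int it = 0; it < iters; it++) {
            double g0 = 0, g1 = 0;
            for (int i = 0; i < m; i++) {
                double err = t0 + t1 * x[i] - y[i];
                g0 += err;          // accumulate over the whole training set
                g1 += err * x[i];
            }
            t0 -= alpha * g0 / m;   // single update using the full gradient
            t1 -= alpha * g1 / m;
        }
        return new double[]{t0, t1};
    }

    // Stochastic gradient descent: update immediately after each sample.
    static double[] stochastic(double[] x, double[] y, double alpha, int epochs) {
        double t0 = 0, t1 = 0;
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                double err = t0 + t1 * x[i] - y[i];
                t0 -= alpha * err;          // per-sample update
                t1 -= alpha * err * x[i];
            }
        }
        return new double[]{t0, t1};
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3, 5, 7, 9, 11};  // generated by y = 2x + 1
        System.out.println(Arrays.toString(batch(x, y, 0.02, 5000)));
        System.out.println(Arrays.toString(stochastic(x, y, 0.02, 1000)));
    }
}
```

On this tiny noiseless dataset both schemes converge to θ ≈ (1, 2); the difference shows up in how many full passes over the data each one needs.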
4. When gradient descent is used to find an optimal solution, for which problems is the result globally optimal, and for which may it only be locally optimal?
For the linear regression problem above, the optimization problem in θ is unimodal, i.e. there is only a single extremum in the plot above, so the final result of gradient descent is the global optimum. For multimodal problems, however, there are multiple extrema, and the final result of gradient descent may be a local optimum.
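A minimal Java sketch of this effect (the function f(x) = x⁴ − 3x² + x and all names here are illustrative assumptions): this function has two basins, and plain gradient descent converges to whichever local minimum the starting point falls into.

```java
public class LocalMinimaDemo {
    // f(x) = x^4 - 3x^2 + x has two local minima (near x ≈ -1.30 and
    // x ≈ 1.13); gradient descent follows f'(x) = 4x^3 - 6x + 1 downhill
    // and ends up in the basin containing the starting point.
    static double descend(double x, double alpha, int iters) {
        for (int i = 0; i < iters; i++) {
            double grad = 4 * x * x * x - 6 * x + 1;  // f'(x)
            x -= alpha * grad;
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(descend(-2.0, 0.01, 1000));  // left (global) minimum
        System.out.println(descend( 2.0, 0.01, 1000));  // right (local) minimum
    }
}
```

Started from x = −2 it reaches the global minimum, while from x = 2 it gets stuck in the shallower local minimum, exactly the multimodal behavior described above.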
5. Differences between the stochastic and batch gradient implementations
The NMF implementation from the previous blog post is used as an example to illustrate the difference between the two implementations. (Note: the code would actually be more intuitive in Python; practice writing more Python!)
// Stochastic gradient descent: update the parameters after every sample.
public void updatePQ_stochastic(double alpha, double beta) {
    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            // eij = Rij.weight - PQ, used to update p and q
            double PQ = 0;
            for (int k = 0; k < K; k++) {
                PQ += p[i][k] * q[k][Rij.dim];
            }
            double eij = Rij.weight - PQ;

            // update p[i][k] and q[k][j]
            for (int k = 0; k < K; k++) {
                double oldPik = p[i][k];
                p[i][k] += alpha
                        * (2 * eij * q[k][Rij.dim] - beta * p[i][k]);
                q[k][Rij.dim] += alpha
                        * (2 * eij * oldPik - beta * q[k][Rij.dim]);
            }
        }
    }
}

// Batch gradient descent: compute the error over all samples first,
// then update the parameters.
public void updatePQ_batch(double alpha, double beta) {
    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            // Rij.error = Rij.weight - PQ, used to update p and q
            double PQ = 0;
            for (int k = 0; k < K; k++) {
                PQ += p[i][k] * q[k][Rij.dim];
            }
            Rij.error = Rij.weight - PQ;
        }
    }

    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            for (int k = 0; k < K; k++) {
                // accumulated terms for the parameter updates
                double eq_sum = 0;
                double ep_sum = 0;
                for (int ki = 0; ki < M; ki++) { // fix k and j, sum over all i
                    ArrayList<Feature> tmp = this.dataset.getDataAt(ki).getAllFeature();
                    for (Feature Rj : tmp) {
                        if (Rj.dim == Rij.dim)
                            ep_sum += p[ki][k] * Rj.error;
                    }
                }
                for (Feature Rj : Ri) { // fix k and i, sum over the j terms
                    eq_sum += Rj.error * q[k][Rj.dim];
                }

                // apply the parameter updates
                p[i][k] += alpha * (2 * eq_sum - beta * p[i][k]);
                q[k][Rij.dim] += alpha * (2 * ep_sum - beta * q[k][Rij.dim]);
            }
        }
    }
}
Source: http://blog.csdn.net/lilyth_lilyth/article/details/8973972
Formula and implementation comparison of stochastic gradient descent and batch gradient descent [repost]