Formula and implementation comparison of stochastic gradient descent (SGD) and batch gradient descent (BGD) [Repost]


Gradient descent (GD) is a common method for minimizing a risk function or loss function. Stochastic gradient descent and batch gradient descent are two iterative solution schemes; below, both are analyzed in terms of their formulas and their implementations. If anything here is wrong, corrections from readers are welcome.

In the following, h(x) is the function to be fitted, J(θ) is the loss function, and θ is the parameter to be solved for iteratively; once θ is solved, the function h_θ(x) is the final fit. Here m is the number of records in the training set and j indexes the parameters.
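The original post showed the concrete formulas as images, which are not reproduced here; for the standard linear-regression setting they are usually written as

    h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j,
    \qquad
    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2

where n is the number of features and (x^{(i)}, y^{(i)}) is the i-th training record.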

1. The solution of batch gradient descent is as follows:

(1) Take the partial derivative of J(θ) with respect to each θ_j, obtaining the gradient corresponding to each parameter:
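A reconstruction of the missing formula image, in the standard notation above:

    \frac{\partial J(\theta)}{\partial \theta_j}
      = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}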

(2) Since we are minimizing the risk function, each θ_j is updated along the negative gradient direction of that parameter:
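Reconstructed update rule; the learning rate α is written explicitly here as an assumption (the original image may fold it into the step size):

    \theta_j := \theta_j + \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}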

(3) From the above formula it can be seen that, although batch gradient descent reaches a global optimum, every iteration step uses all the data in the training set; if m is large, the iteration speed suffers accordingly. This motivates the other method, stochastic gradient descent.

2. The method of solving stochastic gradient descent is as follows:

(1) The risk function above can be rewritten as a sum of per-sample loss functions, where each loss term corresponds to a single sample in the training set, whereas batch gradient descent above works at the granularity of all training samples:
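A reconstruction of the per-sample decomposition (the original image is not reproduced):

    J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right),
    \qquad
    \operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2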

(2) For each sample's loss function, take the gradient with respect to θ and use it to update θ:
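Reconstructed per-sample update rule (again with an assumed explicit learning rate α):

    \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}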

(3) Stochastic gradient descent iterates through the samples and updates the parameters once per sample. If the sample size is very large (for example, hundreds of thousands), then θ may already be iterated to a near-optimal solution after using only tens of thousands, or even thousands, of the samples. Batch gradient descent, in contrast, needs the full hundred-thousand-sample training set for a single iteration, one iteration is unlikely to reach the optimum, and ten iterations require traversing the training set ten times. The problem that comes with SGD, however, is that it is noisier than BGD, so not every SGD iteration moves in the direction of the overall optimum.
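To make this loop-structure difference concrete before the NMF code in section 5, the sketch below shows one SGD pass and one batch iteration for the linear-regression setting above. It is only an illustrative sketch: the class, the method names, and the plain-array representation (x, y, theta, alpha) are hypothetical and not from the original post.

[Java]
// Minimal sketch (not from the original post): one SGD pass vs. one batch
// iteration for linear regression with hypothesis h(x) = theta . x.
class GradientDescentSketch {

    // SGD: the parameters are updated once per sample
    static void sgdPass(double[][] x, double[] y, double[] theta, double alpha) {
        for (int i = 0; i < x.length; i++) {
            double err = y[i] - dot(theta, x[i]);      // error of this single sample
            for (int j = 0; j < theta.length; j++)
                theta[j] += alpha * err * x[i][j];     // immediate update
        }
    }

    // Batch GD: the gradient is accumulated over all samples before one update
    static void batchIteration(double[][] x, double[] y, double[] theta, double alpha) {
        double[] grad = new double[theta.length];
        for (int i = 0; i < x.length; i++) {
            double err = y[i] - dot(theta, x[i]);
            for (int j = 0; j < theta.length; j++)
                grad[j] += err * x[i][j];              // accumulate only
        }
        for (int j = 0; j < theta.length; j++)
            theta[j] += alpha * grad[j] / x.length;    // one update per full pass
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += a[j] * b[j];
        return s;
    }
}

One SGD pass performs as many parameter updates as there are samples, while one batch iteration performs exactly one update from the accumulated gradient; this is the speed/noise trade-off described above.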

3. For the linear regression problem above, will stochastic gradient descent, compared with batch gradient descent, reach the optimal solution?

(1) Batch gradient descent minimizes the loss function over all training samples, so the final solution is the global optimum; that is, the solved parameters minimize the risk function.

(2) Stochastic gradient descent minimizes the loss function of one sample at a time. Although not every iteration moves the loss toward the global optimum, the overall direction does, and the final result is usually close to the global optimum.

4. When gradient descent is used to find the optimal solution, for which problems can the global optimum be obtained, and for which problems may only a local optimum be found?

For the linear regression problem above, the optimization problem over θ is unimodal; that is, its surface (shown as a figure in the original post) has only a single extremum, so the final result of gradient descent is the global optimum. For multimodal problems, however, because there are multiple extrema, the final result of gradient descent may be a local optimum.
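As a side note not in the original, the "single extremum" picture for linear regression follows from convexity. With X the matrix whose rows are the x^{(i)}, the Hessian of the least-squares objective above is

    \nabla^2 J(\theta) = \frac{1}{m} X^{\top} X \succeq 0,

so every local minimum of J(θ) is a global minimum and gradient descent cannot get trapped.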


5. Differences between the implementations of stochastic gradient descent and batch gradient descent

Taking the NMF implementation from a previous blog post as an example, the code below shows the difference between the two implementations (note: this would really be more intuitive in Python; more practice writing Python is needed!).

[Java]
// Stochastic gradient descent: update the parameters once per sample
public void updatePQ_stochastic(double alpha, double beta) {
    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            // eij = Rij.weight - PQ, used for updating P and Q
            double PQ = 0;
            for (int k = 0; k < K; k++) {
                PQ += P[i][k] * Q[k][Rij.dim];
            }
            double eij = Rij.weight - PQ;

            // update P[i][k] and Q[k][j] immediately from this single sample's error
            for (int k = 0; k < K; k++) {
                double oldPik = P[i][k];
                P[i][k] += alpha * (2 * eij * Q[k][Rij.dim] - beta * P[i][k]);
                Q[k][Rij.dim] += alpha * (2 * eij * oldPik - beta * Q[k][Rij.dim]);
            }
        }
    }
}
// Batch gradient descent: compute the errors of all samples first, then update the parameters
public void updatePQ_batch(double alpha, double beta) {
    // first pass: Rij.error = Rij.weight - PQ, for updating P and Q
    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            double PQ = 0;
            for (int k = 0; k < K; k++) {
                PQ += P[i][k] * Q[k][Rij.dim];
            }
            Rij.error = Rij.weight - PQ;
        }
    }

    // second pass: accumulate the gradient over all samples, then update
    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            for (int k = 0; k < K; k++) {
                // accumulators for the parameter updates
                double eq_sum = 0;
                double ep_sum = 0;
                for (int ki = 0; ki < M; ki++) { // with k and j fixed, sum over all rows ki
                    ArrayList<Feature> tmp = this.dataset.getDataAt(ki).getAllFeature();
                    for (Feature Rj : tmp) {
                        if (Rj.dim == Rij.dim)
                            ep_sum += P[ki][k] * Rj.error;
                    }
                }
                for (Feature Rj : Ri) { // with k and i fixed, sum over the j entries of row i
                    eq_sum += Rj.error * Q[k][Rj.dim];
                }
                // apply the accumulated updates
                P[i][k] += alpha * (2 * eq_sum - beta * P[i][k]);
                Q[k][Rij.dim] += alpha * (2 * ep_sum - beta * Q[k][Rij.dim]);
            }
        }
    }
}

Source: http://blog.csdn.net/lilyth_lilyth/article/details/8973972

