Gradient descent (GD) is a common method for minimizing a risk (loss) function. Stochastic gradient descent and batch gradient descent are two iterative variants of it. Below, both are analyzed in terms of their formulas and their implementations; if anything here is wrong, corrections from readers are welcome.
In the following, h(x) is the function to be fitted and J(θ) is the loss function; θ is the parameter vector whose value is solved for iteratively, and the θ that is solved determines the function h_θ(x) that is finally fitted. Here m is the number of records in the training set and n is the number of parameters.
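The original formula images did not survive the conversion; as a sketch, the standard definitions for the linear regression setting described above would read (the notation x^{(i)}, y^{(i)} for the i-th training example is an assumption):

```latex
h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j
\qquad
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right)^2
```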
1. Batch gradient descent (BGD) is solved as follows:
(1) Take the partial derivative of J(θ) with respect to each parameter θ_j, obtaining the gradient component corresponding to each θ_j.
(2) Since the risk function is to be minimized, update each θ_j in the direction of its negative gradient.
(3) Note from the update rule that this yields the global optimal solution, but every iteration step uses all the data in the training set; if m is large, you can imagine how slow each iteration will be. This motivates the other method, stochastic gradient descent.
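The steps above can be sketched in standard notation (a reconstruction consistent with the definitions of h_θ and J(θ) above, not the original images):

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right) x_j^{(i)}
\qquad
\theta_j := \theta_j + \frac{\alpha}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right) x_j^{(i)}
```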
2. Stochastic gradient descent (SGD) is solved as follows:
(1) The risk function above can be rewritten as a sum of per-sample losses; the loss function is at the granularity of a single training sample, whereas batch gradient descent above operates on all training samples at once:
(2) Take the partial derivative of each sample's loss with respect to θ and use it to update θ.
(3) Stochastic gradient descent updates the parameters once for every sample it visits. If the sample size is very large (say, hundreds of thousands), then θ may already have been iterated to near the optimal solution after seeing only tens of thousands or even thousands of samples. By contrast, a single iteration of batch gradient descent above already needs all hundred-thousand-odd training samples, one iteration is unlikely to reach the optimum, and ten iterations require traversing the whole training set ten times. The problem with SGD, however, is that it is noisier than BGD, so not every SGD iteration moves toward the overall optimum.
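In the same assumed notation, the per-sample decomposition of the risk and the SGD update made after visiting sample i can be sketched as:

```latex
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right),
\qquad
\operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right)
  = \frac{1}{2} \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right)^2
\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta\!\left(x^{(i)}\right) \right) x_j^{(i)}
```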
3. For the linear regression problem above, will stochastic gradient descent find the optimal solution, compared with batch gradient descent?
(1) Batch gradient descent minimizes the loss over all training samples, so the final solution is the global optimum; that is, the solved parameters minimize the risk function.
(2) Stochastic gradient descent minimizes the loss of each sample individually. Although not every iteration moves the loss toward the global optimum, the overall direction does head toward it, and the final result is usually near the global optimum.
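To make the comparison concrete, here is a minimal, self-contained Java sketch of both update schemes for a one-feature linear model (the class and method names are illustrative assumptions, unrelated to the NMF code below): batch sums the error over all m samples before each parameter update, while stochastic updates after every single sample.

```java
import java.util.Arrays;

public class GradientDescentDemo {
    // Batch gradient descent for h(x) = theta0 + theta1 * x:
    // one update per pass over ALL m samples.
    static double[] batch(double[] x, double[] y, double alpha, int iters) {
        double t0 = 0, t1 = 0;
        int m = x.length;
        for (int it = 0; it < iters; it++) {
            double g0 = 0, g1 = 0;
            for (int i = 0; i < m; i++) {
                double err = t0 + t1 * x[i] - y[i];
                g0 += err;          // accumulate over the whole training set
                g1 += err * x[i];
            }
            t0 -= alpha * g0 / m;   // single update using the full gradient
            t1 -= alpha * g1 / m;
        }
        return new double[]{t0, t1};
    }

    // Stochastic gradient descent: update immediately after each sample.
    static double[] stochastic(double[] x, double[] y, double alpha, int epochs) {
        double t0 = 0, t1 = 0;
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                double err = t0 + t1 * x[i] - y[i];
                t0 -= alpha * err;          // per-sample update
                t1 -= alpha * err * x[i];
            }
        }
        return new double[]{t0, t1};
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3, 5, 7, 9, 11};  // generated by y = 2x + 1
        System.out.println(Arrays.toString(batch(x, y, 0.02, 5000)));
        System.out.println(Arrays.toString(stochastic(x, y, 0.02, 1000)));
    }
}
```

On this tiny noiseless dataset both schemes converge to θ ≈ (1, 2); the difference shows up in how many full passes over the data each one needs.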
4. When gradient descent is used to find an optimal solution, for which problems is the result globally optimal, and for which may it only be locally optimal?
For the linear regression problem above, the optimization problem in θ is unimodal, i.e. there is only a single extremum in the plot above, so the final result of gradient descent is the global optimum. For multimodal problems, however, there are multiple extrema, and the final result of gradient descent may be a local optimum.
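A minimal Java sketch of this effect (the function f(x) = x⁴ − 3x² + x and all names here are illustrative assumptions): this function has two basins, and plain gradient descent converges to whichever local minimum the starting point falls into.

```java
public class LocalMinimaDemo {
    // f(x) = x^4 - 3x^2 + x has two local minima (near x ≈ -1.30 and
    // x ≈ 1.13); gradient descent follows f'(x) = 4x^3 - 6x + 1 downhill
    // and ends up in the basin containing the starting point.
    static double descend(double x, double alpha, int iters) {
        for (int i = 0; i < iters; i++) {
            double grad = 4 * x * x * x - 6 * x + 1;  // f'(x)
            x -= alpha * grad;
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(descend(-2.0, 0.01, 1000));  // left (global) minimum
        System.out.println(descend( 2.0, 0.01, 1000));  // right (local) minimum
    }
}
```

Started from x = −2 it reaches the global minimum, while from x = 2 it gets stuck in the shallower local minimum, exactly the multimodal behavior described above.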
5. Differences between the stochastic and batch gradient implementations
The NMF implementation from the previous blog post is used as an example to illustrate the difference between the two implementations. (Note: the code would actually be more intuitive in Python; practice writing more Python!)
// Stochastic gradient descent: update the parameters after every sample.
public void updatePQ_stochastic(double alpha, double beta) {
    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            // eij = Rij.weight - PQ, used to update p and q
            double PQ = 0;
            for (int k = 0; k < K; k++) {
                PQ += p[i][k] * q[k][Rij.dim];
            }
            double eij = Rij.weight - PQ;

            // update p[i][k] and q[k][j]
            for (int k = 0; k < K; k++) {
                double oldPik = p[i][k];
                p[i][k] += alpha
                        * (2 * eij * q[k][Rij.dim] - beta * p[i][k]);
                q[k][Rij.dim] += alpha
                        * (2 * eij * oldPik - beta * q[k][Rij.dim]);
            }
        }
    }
}

// Batch gradient descent: compute the error over all samples first,
// then update the parameters.
public void updatePQ_batch(double alpha, double beta) {
    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            // Rij.error = Rij.weight - PQ, used to update p and q
            double PQ = 0;
            for (int k = 0; k < K; k++) {
                PQ += p[i][k] * q[k][Rij.dim];
            }
            Rij.error = Rij.weight - PQ;
        }
    }

    for (int i = 0; i < M; i++) {
        ArrayList<Feature> Ri = this.dataset.getDataAt(i).getAllFeature();
        for (Feature Rij : Ri) {
            for (int k = 0; k < K; k++) {
                // accumulated terms for the parameter updates
                double eq_sum = 0;
                double ep_sum = 0;
                for (int ki = 0; ki < M; ki++) { // fix k and j, sum over all i
                    ArrayList<Feature> tmp = this.dataset.getDataAt(ki).getAllFeature();
                    for (Feature Rj : tmp) {
                        if (Rj.dim == Rij.dim)
                            ep_sum += p[ki][k] * Rj.error;
                    }
                }
                for (Feature Rj : Ri) { // fix k and i, sum over the j terms
                    eq_sum += Rj.error * q[k][Rj.dim];
                }

                // apply the parameter updates
                p[i][k] += alpha * (2 * eq_sum - beta * p[i][k]);
                q[k][Rij.dim] += alpha * (2 * ep_sum - beta * q[k][Rij.dim]);
            }
        }
    }
}
Source: http://blog.csdn.net/lilyth_lilyth/article/details/8973972
Formula and implementation comparison of stochastic gradient descent and batch gradient descent [repost]