Mini-batch Gradient Descent

Source: Internet
Author: User
http://m.blog.csdn.net/article/details?id=51188876

I. The Gradient Descent Method

In machine learning, many supervised models are trained by constructing a loss function for the model and then searching for the optimal parameters with an optimization algorithm. Among the optimization algorithms used to solve for machine learning parameters, those based on gradient descent (GD) are the most widely used.

The gradient descent method has many advantages. In particular, it requires only the first derivative of the loss function, so its computational cost is relatively small, which allows it to be applied to many large-scale data sets. The idea of the method is to find the next iterate by following the gradient direction at the current point.
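As a concrete illustration, here is a minimal one-dimensional sketch of the iteration just described; the quadratic function, step size, and starting points are assumptions for illustration, not from the article:

```python
# A minimal sketch of gradient descent in one dimension.
# The function f(x) = (x - 3)^2, the step size, and the starting
# points are illustrative assumptions, not from the article.

def gradient_descent(grad, start, alpha=0.1, iters=1000):
    """Repeatedly step opposite the gradient: x <- x - alpha * grad(x)."""
    x = start
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

# f is bowl-shaped (convex), so every starting point converges to x = 3.
grad_f = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad_f, start=-10.0))  # approximately 3.0
print(gradient_descent(grad_f, start=50.0))   # approximately 3.0
```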

The basic idea can be understood as follows: starting from some point on a mountain, we find the steepest slope and take one step down it (that is, we find the gradient direction) to reach a new point; there we again find the steepest slope and take another step, and so on, until we reach the "lowest" point (the point where the cost function converges to its minimum).


As shown in the figure above, a local optimal solution is obtained. The x and y axes represent theta0 and theta1, and the z axis represents the cost function. Clearly, with different starting points, the final convergence point may not be the same. Of course, if the surface is bowl-shaped (convex), the convergence point will be the same.

II. Variants of the Gradient Descent Method

In practice the gradient descent method has several variants, namely batch, mini-batch, and SGD. The main difference between them is how much of the training data each one uses per update.

1. Batch Gradient Descent (BGD)
The batch gradient descent method (Batch Gradient Descent) works on the entire data set: the direction of the gradient is computed from all of the samples.
The loss function of the batch gradient descent method is the squared error over all m training samples:

    J(theta) = (1/(2m)) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2

The corresponding iterative formula for batch gradient descent is:

    theta_j := theta_j - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x_j^(i)

Every iteration uses all of the data in the training set, so when the number of samples is large, the iteration speed suffers accordingly.
Advantages: reaches the global optimum (for a convex objective) and is easy to parallelize.
Disadvantages: the training process is slow when the number of samples is large.
In terms of iteration count, BGD needs relatively few iterations. Its convergence curve can be depicted as follows:


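The BGD update above can be sketched in Python/NumPy as follows; the synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
# Batch gradient descent for linear least squares: every iteration
# computes the gradient over ALL m samples. The synthetic data and
# hyperparameters below are illustrative assumptions.
import numpy as np

def bgd(X, y, alpha=0.1, iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # full-batch gradient
        theta -= alpha * grad
    return theta

# Noiseless data y = 1 + 2*x, with a bias column, so theta -> [1, 2].
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = bgd(X, y)
print(theta)
```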
2. Mini-batch Gradient Descent (MBGD)
In the batch gradient method above, every iteration uses all of the samples. For very large data sets, such as large-scale machine learning applications, solving over all the samples at every iteration carries a high computational cost. Could part of the sample set be used in place of all the samples in each iteration? From this idea comes the concept of the mini-batch.
Suppose the training set contains 1000 samples and each mini-batch is a subset containing 10 of them; the whole training set can then be divided into 100 mini-batches. The pseudo code is as follows:
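A hedged Python/NumPy sketch of the mini-batch loop just described, using the sizes from the example (1000 samples, mini-batches of 10); the learning rate, epoch count, and synthetic data are assumptions:

```python
# Mini-batch gradient descent: 1000 samples split into 100 mini-batches
# of 10, one parameter update per mini-batch. The learning rate, epoch
# count, and synthetic data are illustrative assumptions.
import numpy as np

def minibatch_gd(X, y, batch_size=10, alpha=0.1, epochs=50, seed=0):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]  # one mini-batch of 10
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad
    return theta

# Noiseless data y = 1 + 2*x over 1000 samples, so theta -> [1, 2].
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 1000)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = minibatch_gd(X, y)
print(theta)
```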
     

3. Stochastic Gradient Descent (SGD)

The stochastic gradient descent algorithm (Stochastic Gradient Descent) can be seen as a special case of mini-batch gradient descent: the model parameters are adjusted using only one sample at a time. This corresponds to the case b = 1 described above, in which each mini-batch contains just one training sample.
The optimization process of the stochastic gradient descent method is:

Stochastic gradient descent performs one update per sample. If the sample size is large (say, hundreds of thousands), it may take only tens of thousands, or even thousands, of samples before theta has been iterated to the optimal solution. Compare this with the batch gradient method above: there, a single iteration already requires all hundred thousand training samples, one iteration is unlikely to reach the optimum, and ten iterations require traversing the whole training set ten times. The problem with SGD, however, is that it is noisier than BGD, so not every SGD iteration moves toward the overall optimum.
Advantages: fast training speed.
Disadvantages: lower accuracy; the result is not the global optimum.
In terms of iteration count, SGD needs many more iterations, and its search path through the solution space looks rather blind. Its convergence curve can be depicted as follows:
III. An Intuitive Understanding of Gradient Descent

(1) Batch gradient descent minimizes the loss function over all training samples (the parameters are updated only after processing all the training data), so the final solution is the global optimum; that is, the parameters found minimize the risk function. Batch gradient descent is like standing at a point on the mountain, looking around, computing the fastest descent direction (in many dimensions), and then taking one step; that is one iteration. One iteration of batch gradient descent updates all components of theta, and every update moves in the steepest direction.

(2) Stochastic gradient descent minimizes the loss function of each individual sample. Although not every iteration moves the loss toward the global optimum, the overall direction does head toward the global optimum, and the final result is often near it. "Stochastic" means that a single sample is used to approximate all of the samples when adjusting theta; this does not compute the direction of maximum slope but instead takes a step based on one sample at a time. One SGD iteration updates theta from only that one sample, advancing in a somewhat unrigorous, walk-and-see manner.

IV. Stochastic Gradient Descent Code

load data;                          % import x, y, test_feature
epsilon = 0.0001;                   % convergence threshold
alpha = 0.001;                      % learning rate
k = 1;                              % iteration counter
n = size(x, 2);                     % number of features (+1 for the bias term)
m = size(x, 1);                     % number of training samples
theta = zeros(n, 1);
theta_new = zeros(n, 1);
converge = 0;
while (converge == 0)               % loop until convergence
    for i = 1:m                     % reuse the m training samples; one update per sample
        J(k) = 1/2 * (norm(x*theta - y))^2;   % record the cost at step k
        for j = 1:n
            theta_new(j) = theta(j) - alpha * (x(i,:)*theta - y(i)) * x(i,j);
        end
        if norm(theta_new - theta) < epsilon
            converge = 1;
            theta = theta_new;
            break;
        else
            theta = theta_new;
            k = k + 1;
        end
    end
end
       

Related literature:
http://www.zhizhihu.com/html/y2011/3632.html
http://www.th7.cn/system/win/201511/142910.shtml


=======================================================
Read the following link first:
http://www.cnblogs.com/python27/p/MachineLearningWeek10.html


Gradient Descent (BGD), Stochastic Gradient Descent (SGD), Mini-batch Gradient Descent, and SGD with Mini-batch
Tags: neural network, gradient descent, stochastic gradient descent. Published 2015-03-01 12:05.

Copyright notice: this is the blogger's original article; do not reproduce without the blogger's permission.

I. Regression Function and Objective Function

The mean square error is used as the objective (loss) function; training minimizes its value to optimize the regression function above.
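Assuming the standard linear-regression setup (a linear hypothesis h_theta over m training samples), the regression function and the mean-square-error objective take the form:

```latex
% Regression (hypothesis) function
h_\theta(x) = \theta^{T} x = \sum_{j=0}^{n} \theta_j x_j

% Mean-square-error objective over the m training samples
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^{2}
```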

II. The Optimization Method (Gradient Descent)

1. The Steepest Gradient Descent Method

Also called the batch gradient descent method (Batch Gradient Descent, BGD).
A. Take the derivative of the objective function with respect to theta:

    dJ/dtheta_j = (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x_j^(i)

B. Move theta in the direction opposite to the derivative (the gradient).

Reason: (1) For the objective function to decrease, theta should be moved as follows, where a is the step size and p is the direction vector.


(2) Take the first-order Taylor series expansion of J(theta):

(3) In the formula, a_k is the step size, a positive number. To make the objective function smaller, the term g^T p should be negative, and the larger its absolute value, the faster the descent. In the Taylor expansion, g represents the gradient of J(theta_k), so to make the term as negative as possible, theta should be moved in the direction opposite to the gradient g.
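The three steps above can be written out as follows; g_k and p_k denote the gradient and the search direction at step k (notation assumed to match the article's a, p, and g):

```latex
% (1) move theta with step a_k > 0 along direction p_k
\theta_{k+1} = \theta_k + a_k p_k

% (2) first-order Taylor expansion around \theta_k, with g_k = \nabla J(\theta_k)
J(\theta_{k+1}) \approx J(\theta_k) + a_k \, g_k^{T} p_k

% (3) the decrease is largest when g_k^T p_k is as negative as possible,
%     i.e. when p_k points opposite the gradient
p_k = -\frac{g_k}{\lVert g_k \rVert}
```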


2. Stochastic Gradient Descent (SGD)

SGD is a variant of the steepest gradient descent method.

With the steepest gradient descent method, n iterations are performed until the objective function converges or reaches a convergence threshold. Each iteration computes over all m samples, which is a large amount of computation.

To ease the computation, SGD computes the gradient from only one sample per iteration, until convergence. The pseudo code is as follows (only one loop is shown below; in practice there can be multiple loops, run until convergence):
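A hedged Python sketch of that loop, one parameter update per sample, for a least-squares model; the synthetic data and hyperparameters are assumptions:

```python
# Stochastic gradient descent for linear least squares: one parameter
# update per sample. Only one pass per call; repeat the call until
# convergence. Data and hyperparameters are illustrative assumptions.
import numpy as np

def sgd_epoch(X, y, theta, alpha=0.05):
    for i in range(X.shape[0]):
        residual = X[i] @ theta - y[i]       # error on sample i only
        theta = theta - alpha * residual * X[i]
    return theta

# Noiseless data y = 1 + 2*x, so theta approaches [1, 2].
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 500)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = np.zeros(2)
for _ in range(100):                         # multiple loops until convergence
    theta = sgd_epoch(X, y, theta)
print(theta)
```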


(1) Since SGD uses only one training sample per iteration, the method can also be used for online learning.

(2) Because only one sample is used per iteration, SGD easily falls into a local optimum when it encounters noise.
3. Mini-batch Gradient Descent

(1) This is an optimization algorithm between BGD and SGD: each iteration selects a certain number of training samples.

(2) From the formula, the following analysis can be made: per update it is faster than BGD and slower than SGD, and its accuracy is lower than BGD and higher than SGD.

4. SGD with Mini-batch

(1) Select n training samples (n < m, where m is the total number of samples in the training set).

(2) Perform n iterations over these n samples, using 1 sample per iteration.

(3) Take the weighted average of the n gradients obtained from the n iterations as this mini-batch's descent gradient.

(4) Keep repeating the steps above over the training set until convergence.
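The four steps above can be sketched as follows (Python/NumPy; the equal-weight average, synthetic data, and hyperparameters are assumptions):

```python
# SGD with mini-batch, following the four steps above: draw n of the m
# samples, compute n single-sample gradients, and apply their average
# (equal weights assumed here) as one descent step. The data and
# hyperparameters are illustrative assumptions.
import numpy as np

def sgd_with_minibatch(X, y, n=10, alpha=0.1, steps=3000, seed=3):
    m, d = X.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(m, size=n, replace=False)           # step (1)
        grads = [(X[i] @ theta - y[i]) * X[i] for i in idx]  # steps (2)-(3)
        theta -= alpha * np.mean(grads, axis=0)              # averaged gradient step
    return theta

# Noiseless data y = 1 + 2*x, so theta approaches [1, 2].
rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = sgd_with_minibatch(X, y)
print(theta)
```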
