In machine learning, the gradient descent (GD) algorithm needs only the first derivative of the loss function, so its computational cost is low, which makes it well suited to applications with very large training sets.
The physical meaning of gradient descent is easy to grasp: perform a line search along the gradient direction at the current point to find the next iterate. But how do the Batch, Minibatch, Stochastic, and Online variants of GD arise?
In fact, Batch, Minibatch, SGD, and Online differ in how the training data are chosen:

|  | Batch | Minibatch | Stochastic | Online |
| --- | --- | --- | --- | --- |
| Training set | Fixed | Fixed | Fixed | Updated in real time |
| Samples per iteration | Entire training set | Subset of the training set | Single sample | Depends on the specific algorithm |
| Algorithm complexity | High | Medium | Low | Low |
| Timeliness | Low | Moderate (delta model) | Moderate (delta model) | High |
1. Batch GD
In each iteration, the gradient direction is determined jointly by all of the training samples. The loss function of Batch GD is:
\[J(\theta) = \frac{1}{2m}\sum\limits_{i = 1}^m {\left( h_\theta (x^{(i)}) - y^{(i)} \right)^2}\]
The training algorithm is:
\[\begin{array}{l}
\text{repeat}\ \{\\
\quad \theta_j := \theta_j - \alpha \frac{1}{m}\sum\limits_{i = 1}^m \left( h_\theta (x^{(i)}) - y^{(i)} \right) x_j^{(i)}\\
\}
\end{array}\]
In other words, Batch GD computes the gradient of the loss function over the entire training set and searches for the next iterate along that direction. "Batch" means that every sample in the training set participates in each iteration.
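To make this concrete, here is a minimal Batch GD sketch for linear regression in Python/NumPy. The synthetic data, learning rate alpha, and iteration count are illustrative assumptions, not part of the original post.

```python
import numpy as np

def batch_gd(X, y, alpha=0.1, n_iters=1000):
    """Batch GD for linear regression: every sample joins every update."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = X @ theta - y     # h_theta(x^(i)) - y^(i), for all i at once
        grad = (X.T @ errors) / m  # gradient of J(theta) = (1/2m) * sum(errors^2)
        theta -= alpha * grad      # theta := theta - alpha * grad
    return theta

# Toy usage on synthetic data (placeholder values)
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]  # bias column + 2 features
y = X @ np.array([1.0, 2.0, -3.0]) + 0.01 * rng.normal(size=100)
print(batch_gd(X, y))  # should approach [1, 2, -3]
```

Note that each update touches all m rows of X, which is exactly what becomes prohibitive when m is in the billions.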
2. Minibatch GD
Every iteration of Batch GD requires all samples to participate, and large-scale machine learning applications often have training sets with billions of examples, so the computational cost is very high. Some researchers therefore asked: since the training set is merely a sample drawn from the underlying data distribution anyway, can each iteration use only part of it? This is the Minibatch algorithm.
Assume the training set has m samples and each minibatch (a subset of the training set) has b samples, so the training set can be divided into m/b minibatches. We use \(\omega_k\) to denote a single minibatch and \(\Omega\) to denote the collection of all minibatches, with:
\[\Omega = \{ \omega_k : k = 1, 2, \ldots, m/b \}\]
Then, the Minibatch GD algorithm flow is as follows:
\[\begin{array}{l}
\text{repeat}\ \{\\
\quad \text{for each } \omega_k \text{ in } \Omega\ (k = 1, 2, \ldots, m/b):\\
\qquad \theta_j := \theta_j - \alpha \frac{1}{b}\sum\limits_{i \in \omega_k} \left( h_\theta (x^{(i)}) - y^{(i)} \right) x_j^{(i)}\\
\}
\end{array}\]
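Under the same hypothetical linear-regression setup as the Batch GD sketch above, here is a minimal Minibatch GD sketch; the batch size b, learning rate, and per-epoch reshuffle are illustrative choices:

```python
import numpy as np

def minibatch_gd(X, y, b=32, alpha=0.1, n_epochs=50):
    """Minibatch GD: each epoch splits the m samples into m/b minibatches."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):           # outer repeat { ... }
        order = rng.permutation(m)      # reshuffle, then partition into omega_k
        for k in range(0, m, b):        # for each omega_k in Omega
            idx = order[k:k + b]
            errors = X[idx] @ theta - y[idx]
            theta -= alpha * (X[idx].T @ errors) / len(idx)  # average over the minibatch
    return theta
```

Each update now costs O(b) rather than O(m) gradient evaluations, at the price of a noisier gradient estimate.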
3. Stochastic GD (SGD)
The stochastic gradient descent (SGD) algorithm is a special case of Minibatch GD: SGD is Minibatch GD with b = 1, i.e., each minibatch contains exactly one training sample.
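With b = 1 the minibatch loop collapses to one sample per update; a minimal sketch under the same assumptions as above:

```python
import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=50):
    """SGD: Minibatch GD with b = 1, one training sample per update."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):      # visit samples in random order
            error = X[i] @ theta - y[i]   # error on the single sample i
            theta -= alpha * error * X[i] # single-sample gradient step
    return theta
```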
4. Online GD
As the Internet industry boomed, data became cheaper and cheaper, and many applications now generate real-time, uninterrupted streams of training data. Online learning algorithms are training algorithms designed to make full use of such real-time data.
The difference between Online GD and Minibatch GD/SGD is that each training example is used only once and then discarded. The benefit is that the model can follow shifting trends. For example, the click-through rate (CTR) model for search ads must track Internet users' behavior as it changes over time. A batch algorithm (say, retrained once a day) is costly on one hand (it must retrain on all historical data) and, on the other, cannot respond promptly to drift in users' click behavior. An online learning algorithm can track that drift in real time.
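A minimal Online GD sketch: the model updates on each fresh example exactly once and never stores it. The drift simulation below is a made-up stand-in for a real-time feed such as an ad-click log:

```python
import numpy as np

def online_gd_step(theta, x, y, alpha=0.01):
    """One Online GD update: use the fresh example once, then discard it."""
    error = x @ theta - y             # prediction error on the new example
    return theta - alpha * error * x  # single-sample gradient step

# Toy usage: a simulated stream whose true model drifts over time
# (hypothetical setup standing in for, e.g., live CTR training data)
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0])
theta = np.zeros(2)
for t in range(5000):
    true_theta += 0.001 * rng.normal(size=2)  # concept drift
    x = rng.normal(size=2)
    theta = online_gd_step(theta, x, x @ true_theta)
print(theta, true_theta)  # theta tracks the drifting true_theta
```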
"Original" BATCHGD, SGD, MINIBATCHGD, Stochastic GD, ONLINEGDgradient training algorithm under the background of big data