"Original" BATCH-GD, SGD, MINI-BATCH-GD, Stochastic GD, ONLINE-GD--gradient training algorithm under the background of big data

In machine learning, the gradient descent (GD) algorithm needs only the first derivative of the loss function, so its computational cost is low. This makes it well suited to applications with very large amounts of training data.

The geometric meaning of gradient descent is easy to understand: from the current point, search along the (negative) gradient direction to find the next iterate. So why does it give rise to the batch, mini-batch, stochastic, and online variants?

It turns out that the difference between batch, mini-batch, stochastic, and online GD lies in how the training data are chosen at each iteration:

                        Batch                 Mini-batch               Stochastic               Online
Training set            Fixed                 Fixed                    Fixed                    Updated in real time
Samples per iteration   Entire training set   Subset of training set   Single sample            Depends on the algorithm
Algorithm complexity    High                  Moderate                 Low                      Low
Timeliness              Low                   Moderate (delta model)   Moderate (delta model)   High

1. Batch GD

In each iteration, the gradient direction is determined jointly by all of the training samples.

The loss function of batch GD is:

\[J(\theta) = \frac{1}{{2m}}\sum\limits_{i = 1}^m {\left( {h_\theta}({x^{(i)}}) - {y^{(i)}} \right)^2} \]

The training algorithm is:

\[\begin{array}{l}
\text{repeat}\ \{\\
\quad {\theta _j} := {\theta _j} - \alpha \frac{1}{m}\sum\limits_{i = 1}^m {\left( {h_\theta}({x^{(i)}}) - {y^{(i)}} \right)x_j^{(i)}} \\
\}
\end{array}\]

In other words, batch GD computes the gradient of the loss function over the entire training set and searches for the next iterate along that direction. "Batch" means that every sample in the training set participates in each iteration.
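A minimal NumPy sketch of this update rule for linear regression is shown below; the function name, learning rate, and iteration count are illustrative, not from the original post.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    # Batch GD for linear regression: every one of the m samples
    # contributes to each parameter update, matching the rule above.
    # (Function name and default values are illustrative.)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # (1/m) * sum_i (h_theta(x_i) - y_i) * x_i, computed over all m samples
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta
```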

2. Mini-batch GD

Each iteration of batch GD requires all samples to participate. Large-scale machine learning applications often have training sets with billions of samples, so the computational cost is very high. Some researchers therefore asked: since the training set is only a sample of the underlying data distribution anyway, why not use just part of it in each iteration? This is the mini-batch GD algorithm.

Suppose the training set has m samples and each mini-batch (a subset of the training set) contains b samples, so the training set can be divided into m/b mini-batches. Let \({\omega _k}\) denote a single mini-batch and \(\omega\) denote the collection of all mini-batches, so that:

\[\omega = \{{\omega _k} : k = 1,2,\ldots,m/b\} \]

Then, the Mini-batch GD algorithm flow is as follows:

\[\begin{array}{l}
\text{repeat}\ \{\\
\quad \text{for each } {\omega _k} \text{ in } \omega\ (k = 1,2,\ldots,m/b):\\
\qquad {\theta _j} := {\theta _j} - \alpha \frac{1}{b}\sum\limits_{i \in {\omega _k}} {\left( {h_\theta}({x^{(i)}}) - {y^{(i)}} \right)x_j^{(i)}} \\
\}
\end{array}\]
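A minimal sketch of this loop, under the same linear-regression assumptions as the batch example above (names, batch size, and defaults are illustrative):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=32, n_epochs=10, seed=0):
    # Mini-batch GD: shuffle the m samples each epoch, then update theta
    # once per chunk of b samples. (Names and defaults are illustrative.)
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]          # one mini-batch omega_k
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= alpha * grad
    return theta
```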

3. Stochastic GD (SGD)

The stochastic gradient descent (SGD) algorithm is a special case of mini-batch GD with b = 1; that is, each mini-batch contains exactly one training sample.
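In terms of the illustrative mini-batch sketch above, SGD is simply the call with b = 1:

```python
# SGD as the b = 1 special case of the (illustrative) mini-batch sketch above
theta = minibatch_gradient_descent(X, y, alpha=0.01, b=1, n_epochs=10)
```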

4. Online GD

As the Internet industry boomed, data became more and more "inexpensive". Many applications generate training data continuously and in real time. Online learning (online GD) algorithms are training algorithms designed to make full use of this real-time data.

The difference between online GD and mini-batch GD/SGD is that each training sample is used only once and then discarded. The benefit is that the model can follow changes in the data over time. For example, in the click-through rate (CTR) model for search ads, the behavior of Internet users changes over time. A batch algorithm (say, retrained once a day) is slow on the one hand, because it must retrain on all historical data, and on the other hand it cannot respond promptly to shifts in users' click behavior. An online learning algorithm can track shifts in users' click behavior in real time.
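A minimal sketch of this one-pass, use-and-discard pattern, again assuming a linear model with squared-error loss (the function name, stream interface, and learning rate are illustrative):

```python
import numpy as np

def online_gradient_descent(stream, n_features, alpha=0.01):
    # Online GD: consume each (x, y) pair from the stream once, update theta,
    # then discard the sample. `stream` can be any iterable of (x, y) pairs.
    # (Names and defaults are illustrative.)
    theta = np.zeros(n_features)
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        theta -= alpha * (x @ theta - y) * x   # single-sample gradient step
    return theta
```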


"Original" BATCH-GD, SGD, MINI-BATCH-GD, Stochastic GD, ONLINE-GD--gradient training algorithm under the background of big data

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.