Fitting a Model via Closed-Form Equations vs. Gradient Descent vs. Stochastic Gradient Descent vs. Mini-Batch Learning: What's the Difference?


To explain the differences between alternative approaches to estimating the parameters of a model, let's take a look at a concrete example: Ordinary Least Squares (OLS) linear regression. The illustration below serves as a quick reminder of the different components of a simple linear regression model:

In Ordinary Least Squares (OLS) linear regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. In other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samples i in our dataset of size n.

Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:

    • Solving the model parameters analytically (closed-form equations)
    • Using an optimization algorithm (gradient descent, stochastic gradient descent, Newton's method, simplex method, etc.)
1) Normal Equations (Closed-Form Solution)

The closed-form solution may (should) be preferred for "smaller" datasets, where computing a ("costly") matrix inverse is not a concern. For very large datasets, or datasets where the inverse of XᵀX may not exist (the matrix is non-invertible or singular, e.g., in the case of perfect multicollinearity), the GD or SGD approaches are preferable. The linear function (linear regression model) is defined as:

y(x) = w₀x₀ + w₁x₁ + … + wₘxₘ = Σⱼ wⱼxⱼ = wᵀx

where y is the response variable, x is an m-dimensional sample vector, and w is the weight vector (the vector of coefficients). Note that w₀ represents the y-axis intercept of the model, and therefore x₀ = 1. Using the closed-form solution (normal equation), we compute the weights of the model as follows:

w = (XᵀX)⁻¹ Xᵀy
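As a concrete sketch, the normal equation can be evaluated directly with NumPy. The toy dataset below is made up purely for illustration:

```python
import numpy as np

# Toy data (illustrative values): y is roughly 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])

# Prepend a column of ones so that w[0] is the intercept (x0 = 1)
Xb = np.hstack([np.ones((len(X), 1)), X])

# Normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(w)
```

In practice, `np.linalg.lstsq` (or the pseudo-inverse `np.linalg.pinv`) is preferred over an explicit matrix inverse for numerical stability; it also handles the singular case mentioned above.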

2) Gradient Descent (GD)

Using the gradient descent (GD) optimization algorithm, the weights are updated incrementally after each epoch (i.e., each pass over the training dataset).

The cost function J, the sum of squared errors (SSE), can be written as:

J(w) = ½ Σᵢ (yᵢ − ŷᵢ)²

The magnitude and direction of the weight update are computed by taking a step in the opposite direction of the cost gradient:

Δw = −η ∇J(w)

where η is the learning rate. The weights are then updated after each epoch via the following update rule:

w := w + Δw

where Δw is a vector that contains the weight update for each weight coefficient wⱼ, computed as follows:

Δwⱼ = −η ∂J/∂wⱼ = η Σᵢ (yᵢ − ŷᵢ) xᵢⱼ

Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (the cost function) into a valley (the cost minimum), where each step is determined by the steepness of the slope (the gradient) and the leg length of the hiker (the learning rate). Considering a cost function with only a single weight coefficient, we can illustrate this concept as follows:
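The accumulated update rule above can be sketched in a few lines of NumPy. The toy data and the learning rate/epoch values are illustrative choices, not prescriptive ones:

```python
import numpy as np

def batch_gd(X, y, eta=0.01, epochs=1000):
    """Batch gradient descent for OLS; X must already contain the bias column."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = y - X @ w          # (y_i - yhat_i) for every sample
        w += eta * X.T @ errors     # Delta w_j = eta * sum_i (y_i - yhat_i) * x_ij
    return w

# Toy data: y is roughly 2x + 1, with a leading bias column x0 = 1
Xb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(batch_gd(Xb, y))  # approaches the closed-form solution
```

Note that all n samples contribute to a single update per epoch, which is exactly what makes batch GD slow on very large training sets.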

3) Stochastic Gradient Descent (SGD)

In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. For very large datasets, using GD can be quite costly, since we take only a single step per pass over the training set. Thus, the larger the training set, the slower our algorithm updates the weights and the longer it may take to converge to the global cost minimum (note that the SSE cost function is convex).

In stochastic gradient descent (SGD; sometimes also referred to as iterative or on-line GD), we don't accumulate the weight updates as we've seen above for GD:

Δwⱼ = η Σᵢ (yᵢ − ŷᵢ) xᵢⱼ

Instead, we update the weights after each individual training sample:

Δwⱼ = η (yᵢ − ŷᵢ) xᵢⱼ

Here, the term "stochastic" comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient. Due to its stochastic nature, the path towards the global cost minimum is not "direct" as in GD, and may go "zig-zag" if we visualize the cost surface in a 2D space. However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex (or pseudo-convex) [1]. Furthermore, there are different tricks to improve GD-based learning, for example:
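A per-sample SGD sketch of this update follows; the toy data, learning rate, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=500, seed=0):
    """SGD for OLS: one weight update per training sample."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # shuffle each epoch
            error = y[i] - X[i] @ w
            w += eta * error * X[i]         # Delta w = eta * (y_i - yhat_i) * x_i
    return w

Xb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(sgd(Xb, y))  # zig-zags towards the closed-form solution
```

With n samples per epoch, SGD performs n weight updates for every single update that batch GD makes.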

    • An adaptive learning rate η: choosing a decrease constant d shrinks the learning rate over time, e.g., η(t) = η₀ / (1 + t·d)

    • Momentum learning: adding a factor α of the previous weight update to the current update for faster convergence, e.g., Δw(t) = −η ∇J(w) + α Δw(t−1)
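Both tricks can be combined in one SGD loop. The following sketch uses hypothetical hyperparameter values (η₀, d, α, epochs) chosen only for the toy data:

```python
import numpy as np

def sgd_tricks(X, y, eta0=0.05, d=0.01, alpha=0.3, epochs=300, seed=0):
    """SGD with a decaying learning rate eta(t) = eta0 / (1 + t*d)
    and a momentum term (alpha times the previous update)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    prev_update = np.zeros_like(w)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            eta = eta0 / (1.0 + t * d)                          # adaptive learning rate
            update = eta * (y[i] - X[i] @ w) * X[i] + alpha * prev_update
            w += update                                         # momentum smooths the zig-zag
            prev_update = update
            t += 1
    return w

Xb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(sgd_tricks(Xb, y))
```

The decreasing learning rate damps the stochastic fluctuations near the minimum, while momentum averages out successive single-sample gradients.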

A Note About Shuffling

There are several different flavors of SGD, which can all be seen throughout the literature. Let's take a look at the three most common variants:

A
    • Randomly shuffle samples in the training set
      • For one or more epochs, or until the approximate cost minimum is reached
        • For each training sample i
          • Compute gradients and perform weight updates
B
    • For one or more epochs, or until the approximate cost minimum is reached
      • Randomly shuffle samples in the training set
        • For each training sample i
          • Compute gradients and perform weight updates
C
    • For iterations t, or until the approximate cost minimum is reached:
      • Draw a random sample from the training set
        • Compute gradients and perform weight updates
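Scenario C, for instance, can be sketched as drawing one sample with replacement per update (toy data as before; all values illustrative):

```python
import numpy as np

def sgd_with_replacement(X, y, eta=0.01, iterations=2000, seed=0):
    """Scenario C: every update uses one sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        i = rng.integers(len(y))            # draw with replacement
        error = y[i] - X[i] @ w
        w += eta * error * X[i]
    return w

Xb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(sgd_with_replacement(Xb, y))
```

Scenarios A and B differ only in replacing the `rng.integers` draw with one pass over a (re-)shuffled index array per epoch.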

In scenario A [3], we shuffle the training set only once in the beginning, whereas in scenario B we shuffle the training set after each epoch to prevent repeating update cycles. In both scenario A and scenario B, each training sample is used only once per epoch to update the model weights.

In scenario C, we draw the training samples randomly with replacement from the training set [2]. If the number of iterations t is equal to the number of training samples, we learn the model based on a bootstrap sample of the training set.

4) Mini-Batch Gradient Descent (MB-GD)

Mini-batch gradient descent (MB-GD) is a compromise between batch GD and SGD. In MB-GD, we update the model based on smaller groups of training samples; instead of computing the gradient from a single sample (SGD) or from all n training samples (GD), we compute the gradient from 1 < k < n training samples (a common mini-batch size is k = 50).

MB-GD converges in fewer iterations than GD because we update the weights more frequently; in addition, MB-GD lets us utilize vectorized operations, which typically results in a computational performance gain over SGD.
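A minimal MB-GD sketch follows. Since the toy dataset has only n = 4 samples, k = 2 is used here instead of the more typical k = 50; all hyperparameter values are illustrative:

```python
import numpy as np

def minibatch_gd(X, y, eta=0.01, k=2, epochs=500, seed=0):
    """Mini-batch GD: each update uses the gradient of k samples (1 < k < n)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, k):
            batch = idx[start:start + k]
            errors = y[batch] - X[batch] @ w
            w += eta * X[batch].T @ errors       # vectorized over the mini-batch
    return w

Xb = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(minibatch_gd(Xb, y))
```

The inner update is a single matrix-vector product over the mini-batch, which is exactly the vectorization advantage over per-sample SGD.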

REFERENCES
    • [1] Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations." Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
    • [2] Bottou, Léon (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent." Proceedings of COMPSTAT'2010. Physica-Verlag HD. 177-186.
    • [3] Bottou, Léon (2012). "Stochastic Gradient Descent Tricks." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg. 421-436.
