Variants of gradient descent: stochastic gradient descent, mini-batch gradient descent, and parallel stochastic gradient descent


Problem setup

Consider a typical supervised machine learning problem: given M training samples S = {(x^(i), y^(i))}, we want to learn a set of weights W by minimizing the empirical risk, so the objective function to be optimized over the entire training set is the average of the per-sample losses.

The loss of a single training sample (x^(i), y^(i)) is the per-sample loss term.

Introducing an L2 regularization term into the loss function gives the final (regularized) overall loss.

Note the loss contributed by a single sample in the regularized objective; it is not divided by M. These four quantities are written out in the sketch below.
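As a sketch of the standard forms these quantities take (assuming a generic per-sample loss \ell and a regularization weight \lambda, since the article does not fix a particular loss function):

J(W) = \frac{1}{M} \sum_{i=1}^{M} \ell(x^{(i)}, y^{(i)}; W)    % empirical risk over the M samples

\ell(x^{(i)}, y^{(i)}; W)    % loss of a single training sample

J_{\mathrm{reg}}(W) = \frac{1}{M} \sum_{i=1}^{M} \ell(x^{(i)}, y^{(i)}; W) + \frac{\lambda}{2} \lVert W \rVert_2^2    % overall loss with L2 regularization

\ell(x^{(i)}, y^{(i)}; W) + \frac{\lambda}{2} \lVert W \rVert_2^2    % loss contributed by one sample (not divided by M)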

An explanation of the regularization term

The regularization term here helps prevent overfitting. Note that a regularization term is added to the overall loss function; in general, regularization can equivalently be introduced as a constrained problem: minimize the overall loss L(W) subject to a constraint that keeps the weights small.

Here L(W) is the overall loss, which in this case is just the empirical risk defined above.

The constant C bounds the size of the constraint, and different choices of the constraint norm give different regularization methods, for example the following two:
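A sketch of this constrained form (\Omega denotes the penalty; the penalized form with weight \lambda and the constrained form with radius C are equivalent for matching values):

\min_{W} L(W) \quad \text{subject to} \quad \Omega(W) \le C

\text{L1:}\ \Omega(W) = \lVert W \rVert_1 = \sum_j \lvert w_j \rvert \qquad \text{L2:}\ \Omega(W) = \lVert W \rVert_2^2 = \sum_j w_j^2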

Here is a two-dimensional illustration: we restrict the model space to an L1-ball around W. To make visualization easy, consider the two-dimensional case, where the contour lines of the objective function can be drawn on the (w1, w2) plane while the constraint becomes a norm ball of radius C on the same plane. The first point where a contour line touches the norm ball is the optimal solution.

As you can see, the difference between the L1-ball and the L2-ball is that the L1-ball has "corners" where it meets the coordinate axes, and the contour lines of the objective function will most often first touch the ball at one of these corners unless they happen to be positioned very favorably. A corner is a sparse position: in the example the intersection point has w1 = 0, and in higher dimensions (imagine what a three-dimensional L1-ball looks like) there are, besides the corners, many edges and faces that the contours are also very likely to touch first, all of which produce sparsity. The L2-ball, by contrast, has no corners, so the probability that the first intersection lands at a sparse position is very small.

So, in one sentence: L1 tends to keep a small number of features and set the rest exactly to zero, while L2 keeps more features but shrinks them all toward zero. This is why lasso is very useful for feature selection, whereas ridge is just a form of regularization (shrinkage).
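As a small illustration of that summary (a sketch using scikit-learn with made-up data and illustrative penalty strengths), the L1-penalized lasso drives most coefficients to exactly zero while the L2-penalized ridge only shrinks them:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
true_w = np.zeros(20)
true_w[:3] = [2.0, -3.0, 1.5]              # only 3 of the 20 features are informative
y = X @ true_w + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)          # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)         # L2 penalty

print("lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))   # typically most of them
print("ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))   # typically none; all small but nonzero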

Batch Gradient Descent

With the optimization objective above, gradient descent can be used to solve for W. Assume W has dimension n. First, the standard batch gradient descent algorithm:

Repeat until convergence {
    for j = 1; j <= n; j++:
        w_j := w_j - alpha * ∂L(W)/∂w_j    (the partial derivative is computed over all M samples)
}

The batch gradient descent algorithm traverses all the samples in every iteration, so all samples jointly determine the update direction.
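A minimal runnable sketch of batch gradient descent (assuming a squared-error loss for a linear model, which the article never specifies; alpha, lam, and n_iters are illustrative parameters):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, lam=0.0, n_iters=1000):
    # X: (M, n) design matrix, y: (M,) targets; lam is the L2 regularization weight.
    M, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        residual = X @ w - y                 # predictions minus targets, over all M samples
        grad = X.T @ residual / M + lam * w  # full-batch gradient plus L2 term
        w -= alpha * grad
    return w

Every update touches all M samples, which is exactly what makes the direction reliable but each iteration expensive.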

Stochastic Gradient Descent

Stochastic gradient descent performs an update using a single sample drawn from the training set at a time, so each update does not traverse the whole dataset and individual iterations are fast; however, many more iterations are needed, because the direction chosen at each step is not necessarily the optimal direction.

Repeat until convergence {
    randomly choose a sample i from the M training examples:
        w_j := w_j - alpha * ∂loss(x^(i), y^(i); W)/∂w_j    (for every j)
}
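A corresponding sketch of stochastic gradient descent under the same squared-error assumption (one randomly chosen sample per update; seed and n_epochs are illustrative):

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, lam=0.0, n_epochs=10, seed=0):
    # Each weight update uses a single randomly chosen training sample.
    M, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(M):                     # visit samples in random order
            grad = (X[i] @ w - y[i]) * X[i] + lam * w    # gradient from one sample only
            w -= alpha * grad
    return w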

Mini-batch Gradient Descent

This is a trade-off between the two methods above: each iteration randomly selects a mini-batch of size b (b < M), with b usually around 10 (or anywhere from 2 to 100). This saves the time of computing the gradient over the entire batch, while the direction computed from a mini-batch is more accurate than the direction from a single sample.

Repeat until convergence {
    for i = 1; i <= M; i += b:
        w_j := w_j - alpha * (1/b) * sum_{k=i..i+b-1} ∂loss(x^(k), y^(k); W)/∂w_j    (for every j)
}
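And a sketch of the mini-batch version under the same assumptions (b samples per update; the defaults are illustrative):

import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, lam=0.0, n_epochs=10, seed=0):
    # Each update averages the gradient over a mini-batch of b samples.
    M, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        idx = rng.permutation(M)
        for start in range(0, M, b):
            batch = idx[start:start + b]
            residual = X[batch] @ w - y[batch]
            grad = X[batch].T @ residual / len(batch) + lam * w
            w -= alpha * grad
    return w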

Finally, consider parallelized SGD: the training data is split across several machines, each machine runs SGD independently on its own partition, and the resulting weight vectors are averaged into a final v.

If the final v satisfies the convergence condition, execution ends; otherwise, return to the first for loop and continue. The same approach applies to mini-batch gradient descent.
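The following is a sketch of one common scheme for this, parameter-averaging parallel SGD, reusing the stochastic_gradient_descent sketch above; the k-way partitioning and the averaging into v are assumptions rather than the article's exact algorithm:

import numpy as np

def parallel_sgd(X, y, k=4, alpha=0.01, n_epochs=10):
    # Split the data into k partitions, run SGD independently on each partition
    # (in practice each partition would live on its own machine or process),
    # then average the k weight vectors into v.
    parts = np.array_split(np.arange(X.shape[0]), k)
    workers = [stochastic_gradient_descent(X[p], y[p], alpha=alpha, n_epochs=n_epochs)
               for p in parts]
    v = np.mean(workers, axis=0)    # averaged model v
    return v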
