Variants of gradient descent: stochastic gradient descent, mini-batch gradient descent, and parallel stochastic gradient descent


Problem setup

Consider a typical supervised machine learning problem: given M training samples S = {(x^(i), y^(i))}, we want to learn a set of weights W by minimizing the empirical risk, so the objective function to be optimized over the entire training set is the average of the per-sample losses.

The loss of a single training sample (x^(i), y^(i)) is the per-sample loss term.

Introducing an L2 regularization term into the loss function gives the final (regularized) overall loss.

Note the loss contributed by a single sample in the regularized objective; it is not divided by M. These four quantities are written out in the sketch below.
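As a sketch of the standard forms these quantities take (assuming a generic per-sample loss \ell and a regularization weight \lambda, since the article does not fix a particular loss function):

J(W) = \frac{1}{M} \sum_{i=1}^{M} \ell(x^{(i)}, y^{(i)}; W)    % empirical risk over the M samples

\ell(x^{(i)}, y^{(i)}; W)    % loss of a single training sample

J_{\mathrm{reg}}(W) = \frac{1}{M} \sum_{i=1}^{M} \ell(x^{(i)}, y^{(i)}; W) + \frac{\lambda}{2} \lVert W \rVert_2^2    % overall loss with L2 regularization

\ell(x^{(i)}, y^{(i)}; W) + \frac{\lambda}{2} \lVert W \rVert_2^2    % loss contributed by one sample (not divided by M)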

An explanation of the regularization term

The regularization term here helps prevent overfitting. Note that a regularization term is added to the overall loss function; in general, regularization can equivalently be introduced as a constrained problem: minimize the overall loss L(W) subject to a constraint that keeps the weights small.

Here L(W) is the overall loss, which in this case is just the empirical risk defined above.

The constant C bounds the size of the constraint, and different choices of the constraint norm give different regularization methods, for example the following two:
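A sketch of this constrained form (\Omega denotes the penalty; the penalized form with weight \lambda and the constrained form with radius C are equivalent for matching values):

\min_{W} L(W) \quad \text{subject to} \quad \Omega(W) \le C

\text{L1:}\ \Omega(W) = \lVert W \rVert_1 = \sum_j \lvert w_j \rvert \qquad \text{L2:}\ \Omega(W) = \lVert W \rVert_2^2 = \sum_j w_j^2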

Here is a two-dimensional illustration: we restrict the model space to an L1-ball around W. To make visualization easy, consider the two-dimensional case, where the contour lines of the objective function can be drawn on the (w1, w2) plane while the constraint becomes a norm ball of radius C on the same plane. The first point where a contour line touches the norm ball is the optimal solution.

As you can see, the difference between the L1-ball and the L2-ball is that the L1-ball has "corners" where it meets the coordinate axes, and the contour lines of the objective function will most often first touch the ball at one of these corners unless they happen to be positioned very favorably. A corner is a sparse position: in the example the intersection point has w1 = 0, and in higher dimensions (imagine what a three-dimensional L1-ball looks like) there are, besides the corners, many edges and faces that the contours are also very likely to touch first, all of which produce sparsity. The L2-ball, by contrast, has no corners, so the probability that the first intersection lands at a sparse position is very small.

So, in one sentence: L1 tends to keep a small number of features and set the rest exactly to zero, while L2 keeps more features but shrinks them all toward zero. This is why lasso is very useful for feature selection, whereas ridge is just a form of regularization (shrinkage).
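As a small illustration of that summary (a sketch using scikit-learn with made-up data and illustrative penalty strengths), the L1-penalized lasso drives most coefficients to exactly zero while the L2-penalized ridge only shrinks them:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
true_w = np.zeros(20)
true_w[:3] = [2.0, -3.0, 1.5]              # only 3 of the 20 features are informative
y = X @ true_w + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)          # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)         # L2 penalty

print("lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))   # typically most of them
print("ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))   # typically none; all small but nonzero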

Batch Gradient Descent

With the optimization objective above, gradient descent can be used to solve for W. Assume W has dimension n. First, the standard batch gradient descent algorithm:

Repeat until convergence {
    for j = 1; j <= n; j++:
        w_j := w_j - alpha * ∂L(W)/∂w_j    (the partial derivative is computed over all M samples)
}

The batch gradient descent algorithm traverses all the samples in every iteration, so all samples jointly determine the update direction.
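A minimal runnable sketch of batch gradient descent (assuming a squared-error loss for a linear model, which the article never specifies; alpha, lam, and n_iters are illustrative parameters):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, lam=0.0, n_iters=1000):
    # X: (M, n) design matrix, y: (M,) targets; lam is the L2 regularization weight.
    M, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        residual = X @ w - y                 # predictions minus targets, over all M samples
        grad = X.T @ residual / M + lam * w  # full-batch gradient plus L2 term
        w -= alpha * grad
    return w

Every update touches all M samples, which is exactly what makes the direction reliable but each iteration expensive.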

Stochastic Gradient Descent

Stochastic gradient descent performs an update using a single sample drawn from the training set at a time, so each update does not traverse the whole dataset and individual iterations are fast; however, many more iterations are needed, because the direction chosen at each step is not necessarily the optimal direction.

Repeat until convergence {
    randomly choose a sample i from the M training examples:
        w_j := w_j - alpha * ∂loss(x^(i), y^(i); W)/∂w_j    (for every j)
}
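A corresponding sketch of stochastic gradient descent under the same squared-error assumption (one randomly chosen sample per update; seed and n_epochs are illustrative):

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, lam=0.0, n_epochs=10, seed=0):
    # Each weight update uses a single randomly chosen training sample.
    M, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(M):                     # visit samples in random order
            grad = (X[i] @ w - y[i]) * X[i] + lam * w    # gradient from one sample only
            w -= alpha * grad
    return w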

Mini-batch Gradient Descent

This is a trade-off between the two methods above: each iteration randomly selects a mini-batch of size b (b < M), with b usually around 10 (or anywhere from 2 to 100). This saves the time of computing the gradient over the entire batch, while the direction computed from a mini-batch is more accurate than the direction from a single sample.

Repeat until convergence {
    for i = 1; i <= M; i += b:
        w_j := w_j - alpha * (1/b) * sum_{k=i..i+b-1} ∂loss(x^(k), y^(k); W)/∂w_j    (for every j)
}
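And a sketch of the mini-batch version under the same assumptions (b samples per update; the defaults are illustrative):

import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, lam=0.0, n_epochs=10, seed=0):
    # Each update averages the gradient over a mini-batch of b samples.
    M, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        idx = rng.permutation(M)
        for start in range(0, M, b):
            batch = idx[start:start + b]
            residual = X[batch] @ w - y[batch]
            grad = X[batch].T @ residual / len(batch) + lam * w
            w -= alpha * grad
    return w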

Finally, consider parallelized SGD: the training data is split across several machines, each machine runs SGD independently on its own partition, and the resulting weight vectors are averaged into a final v.

If the final v satisfies the convergence condition, execution ends; otherwise, return to the first for loop and continue. The same approach applies to mini-batch gradient descent.
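The following is a sketch of one common scheme for this, parameter-averaging parallel SGD, reusing the stochastic_gradient_descent sketch above; the k-way partitioning and the averaging into v are assumptions rather than the article's exact algorithm:

import numpy as np

def parallel_sgd(X, y, k=4, alpha=0.01, n_epochs=10):
    # Split the data into k partitions, run SGD independently on each partition
    # (in practice each partition would live on its own machine or process),
    # then average the k weight vectors into v.
    parts = np.array_split(np.arange(X.shape[0]), k)
    workers = [stochastic_gradient_descent(X[p], y[p], alpha=alpha, n_epochs=n_epochs)
               for p in parts]
    v = np.mean(workers, axis=0)    # averaged model v
    return v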
