Machine Learning FAQ: Several Gradient Descent Methods

First, the gradient descent method

In machine learning, many supervised learning models require constructing a loss function for the model and then optimizing that loss function with an optimization algorithm in order to find the optimal parameters. Among the algorithms for optimizing machine learning parameters, gradient descent (GD) based optimization algorithms are widely used.

Gradient descent has many advantages. In particular, solving with gradient descent only requires the first derivative of the loss function, so the computational cost is relatively small, which allows gradient descent to be applied to many large-scale datasets. The idea of gradient descent is to find the next iterate by stepping from the current point in the direction given by the gradient (downhill).
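To make this concrete (the example below is not from the original article), here is a minimal MATLAB/Octave sketch of gradient descent on the one-dimensional function f(theta) = theta^2; the starting point, learning rate, and iteration count are illustrative assumptions:

% Minimal gradient descent sketch on f(theta) = theta^2 (illustrative only).
% The only quantity needed is the first derivative f'(theta) = 2*theta.
theta = 5;                 % assumed starting point
alpha = 0.1;               % assumed learning rate (step size)
for k = 1:100
    grad  = 2 * theta;             % first derivative at the current point
    theta = theta - alpha * grad;  % step in the negative gradient direction
end
disp(theta)                % ends up very close to 0, the minimizer of f

Each step moves downhill along the negative gradient, which is exactly the "find the steepest slope and take a step" picture described below.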

The basic idea can be understood as follows: we start from some point on a mountain, find the steepest downhill direction (that is, the gradient direction), take one step to reach a new point, then again find the steepest direction and take another step, and keep walking like this until we arrive at a "low" point (the convergence point where the cost function is minimized).

        
[Figure: surface plot of the cost function over (θ0, θ1), showing gradient descent paths from different starting points.]

As shown in the figure above, what is obtained is a local optimal solution. The x and y axes represent θ0 and θ1, and the z axis is the cost function. Clearly, with different starting points, the final convergence point may differ; of course, if the surface is bowl-shaped (convex), the convergence point is the same.

Second, variants of the gradient descent method

When gradient descent is used in practice, there are several different variants, namely Batch, Mini-batch, and SGD. The main difference between the variants lies in how the training data used for each update is selected.

1. Batch gradient descent (BGD)
Batch gradient descent works on the whole dataset: the gradient direction is computed using all of the training samples.
The loss function of batch gradient descent (for a linear-regression hypothesis h_\theta(x) = \theta^T x and m training samples) is:

    J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

The iterative update formula of batch gradient descent is then (updating every \theta_j simultaneously):

    \theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
Every iteration uses all of the data in the training set; if the number of samples is large, one can imagine how slow each iteration of this method becomes.
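For concreteness (this sketch is not from the original article), the batch update above might be written in MATLAB/Octave as follows; the variable names x and y, the learning rate, and the fixed iteration budget are illustrative assumptions:

% Batch gradient descent sketch for linear regression (illustrative only).
% Assumes x is an m-by-n design matrix (first column all ones) and y is m-by-1.
alpha    = 0.001;       % assumed learning rate
max_iter = 500;         % assumed fixed iteration budget
n     = size(x, 2);
theta = zeros(n, 1);
for k = 1:max_iter
    grad  = x' * (x * theta - y);   % gradient of J(theta), computed over ALL samples
    theta = theta - alpha * grad;   % one batch update of every theta_j
end

Every pass through the loop touches all m samples once, which is exactly why BGD becomes slow when m is very large.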
  Advantages: reaches the global optimal solution (for a convex loss); easy to implement in parallel.
  Disadvantages: when the number of samples is very large, the training process is very slow.
In terms of the number of iterations, BGD needs relatively few iterations. A schematic diagram of its iterative convergence curve is shown below:

[Figure: schematic diagram of the BGD iterative convergence curve.]
2. Mini-batch gradient descent (MBGD)
All of the samples are used in each iteration of the batch gradient method described above. For particularly large amounts of data, such as large-scale machine learning applications, processing all samples in every iteration carries a large computational cost. Is it possible to use only part of the samples in each iteration instead of all of them? Based on this idea, the concept of the mini-batch arises.
Suppose the training set contains 1000 samples and each mini-batch is a subset containing 10 samples; the whole training set can then be divided into 100 mini-batches. The pseudo-code is roughly as follows:
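(The original pseudo-code figure is not reproduced here; the MATLAB/Octave-style sketch below, which assumes a linear-regression update, an assumed learning rate, and the 1000-sample / 10-per-batch split described above, illustrates the idea.)

% Mini-batch gradient descent sketch: 1000 samples split into 100 mini-batches of 10.
% Assumes x is 1000-by-n (first column all ones) and y is 1000-by-1.
alpha = 0.001;                      % assumed learning rate
n     = size(x, 2);
theta = zeros(n, 1);
for epoch = 1:10                    % a few passes over the training set
    for b = 1:100                   % loop over the 100 mini-batches
        idx   = (b - 1) * 10 + (1:10);                       % the 10 sample indices in this mini-batch
        grad  = x(idx, :)' * (x(idx, :) * theta - y(idx));   % gradient over this mini-batch only
        theta = theta - alpha * grad;                        % one parameter update per mini-batch
    end
end

Each update touches only 10 samples, so the cost per update is a small fraction of a full batch update, at the price of a noisier gradient estimate.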
     

3. Stochastic gradient descent (SGD)

Stochastic gradient descent can be viewed as a special case of mini-batch gradient descent: the model parameters are updated using one training sample at a time, which is equivalent to mini-batch gradient descent with b = 1, i.e., each mini-batch contains only a single training sample.
The optimization process of stochastic gradient descent is to repeatedly loop over the training samples and, for each sample i, update every parameter as:

    \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
Stochastic gradient descent updates the parameters iteratively from one sample at a time. If the sample size is very large (for example, hundreds of thousands), θ may already be close to the optimal solution after using only tens of thousands or even thousands of samples. By contrast, a single iteration of the batch gradient descent above already needs all of the hundreds of thousands of training samples, one iteration is unlikely to reach the optimum, and 10 iterations require traversing the whole training set 10 times. However, a problem with SGD is that it is noisier than BGD, so an individual SGD iteration does not necessarily move in the overall optimal direction.
  Advantages: training is fast.
  Disadvantages: lower accuracy, the solution is not the global optimum, and it is not easy to implement in parallel.
In terms of the number of iterations, SGD needs many more iterations, and its search path in the solution space looks rather blind. A schematic diagram of its iterative convergence curve is shown below:

[Figure: schematic diagram of the SGD iterative convergence curve.]

Third, an intuitive understanding of the three gradient descent methods

(1) Batch gradient descent: minimize the loss function over all training samples (the parameters are updated only after all of the training data have been processed), so the final solution is the global optimal solution; that is, the solved parameters minimize the risk function. Batch gradient descent is like standing at some point on the mountain, looking around, computing the direction of fastest descent (in the multidimensional sense), and then taking one step, which counts as one iteration. One iteration of batch gradient descent updates all of the θ values, and every update moves in the steepest direction.

(2) Stochastic gradient descent: minimize the loss function of each individual sample. Although not every iteration moves the loss function toward the global optimum, the overall direction is toward the global optimal solution, and the final result is often near the global optimum. "Stochastic" means using one sample to approximate all of the samples when adjusting θ: instead of computing the exact steepest direction over the whole dataset, each iteration adjusts θ based on a single sample, moving forward with a rough, "take a step and see" attitude.

Fourth, stochastic gradient descent code

load data;                    % import x, y, test_feature
epsilon = 0.0001;             % convergence threshold
alpha   = 0.001;              % learning rate
k       = 1;                  % iteration counter
n = size(x, 2);               % number of features (including the intercept column)
m = size(x, 1);               % number of training samples
theta     = zeros(n, 1);
theta_new = zeros(n, 1);
converge  = 0;
while (converge == 0)                 % not yet converged
    for i = 1:m                       % use the m training samples one at a time; each sample updates the parameters once
        J(k) = 1/2 * (norm(x * theta - y))^2;          % record the current cost
        for j = 1:n
            theta_new(j) = theta(j) - alpha * (x(i, :) * theta - y(i)) * x(i, j);   % SGD update from sample i
        end
        if norm(theta_new - theta) < epsilon           % parameters barely changed: converged
            converge = 1;
            theta = theta_new;
            break;
        else
            theta = theta_new;
            k = k + 1;
        end
    end
end
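As a quick way to exercise the script above (this is not part of the original post), one could generate a small synthetic dataset and save it as data.mat; it is assumed here that x carries a leading column of ones for the intercept and that test_feature is only a placeholder:

% Build a toy dataset for the SGD script above (illustrative assumptions).
m = 200;                                   % number of samples
x = [ones(m, 1), rand(m, 1)];              % intercept column plus one random feature
true_theta   = [2; 3];                     % assumed ground-truth parameters
y = x * true_theta + 0.01 * randn(m, 1);   % targets with a little noise
test_feature = [1, 0.5];                   % placeholder for the unused test input
save data x y test_feature                 % the script starts with "load data"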

