Some Common Problems in Machine Learning: The Gradient Descent Method

One, the gradient descent method

In machine learning, for many supervised learning models we need to construct a loss function for the model and then find the optimal parameters by minimizing that loss function with an optimization algorithm. Among the optimization algorithms used to solve for machine learning parameters, methods based on gradient descent (GD) are the most commonly used.

Gradient descent has many advantages. In particular, each step only requires the first derivative (gradient) of the loss function, so the computational cost is relatively small, which allows gradient descent to be applied to many large-scale data sets. The essence of gradient descent is to find the next iteration point by moving from the current point along the (negative) gradient direction.
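Written as a formula (the standard generic update rule, given here for reference; it is not reproduced from the original article's images), one gradient descent step moves from the current parameters theta to

    \theta := \theta - \alpha \, \nabla J(\theta)

where \alpha > 0 is the learning rate (step size) and \nabla J(\theta) is the gradient of the loss function at the current point.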

The basic idea can be understood as follows: we start from some point on a mountain, find the steepest downhill slope and take one step (that is, move along the negative gradient direction) to reach a new point, then again find the steepest slope and take another step, and keep walking in this way until we reach the "lowest" point (the point where the cost function converges to its minimum).
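To make this concrete, here is a minimal sketch (not from the original article) of gradient descent on the one-variable function f(x) = x^2; the starting point, learning rate, and number of steps are illustrative assumptions:

    # Gradient descent on f(x) = x^2, whose gradient is f'(x) = 2x.
    # Each step moves against the gradient, i.e. downhill.
    def gradient_descent(x0, learning_rate=0.1, num_steps=100):
        x = x0
        for _ in range(num_steps):
            grad = 2 * x                      # gradient of f at the current point
            x = x - learning_rate * grad      # take one step downhill
        return x

    # Starting from x = 5.0, the iterates approach the minimum at x = 0.
    print(gradient_descent(5.0))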

[Figure: surface plot of the cost function over (θ0, θ1), showing descent paths from different starting points.]

As the figure shows, gradient descent obtains a local optimal solution. The x and y axes represent θ0 and θ1, and the z axis represents the cost function. Clearly, different starting points may lead to different convergence points. Of course, if the surface is bowl-shaped (i.e., the cost function is convex), the convergence point will be the same.

Two, variants of the gradient descent method

When using gradient descent there are several different variants, namely batch gradient descent (BGD), mini-batch gradient descent (MBGD), and stochastic gradient descent (SGD). The main difference between them is how much of the training data is used for each parameter update.

1. Batch gradient descent (BGD)
Batch gradient descent works on the entire data set: the gradient direction is computed from all of the samples at every iteration.
The loss function of batch gradient descent (written here as the squared-error cost over m training samples with hypothesis h_\theta) is:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
The iteration (parameter update) formula of batch gradient descent is then:

    \theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{(simultaneously for every } j\text{)}
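As a concrete illustration (a minimal sketch, not the original article's code), the update above can be implemented for linear regression as follows; the data shapes, learning rate, and iteration count are assumptions:

    import numpy as np

    def batch_gradient_descent(X, y, learning_rate=0.01, num_iters=1000):
        # X: (m, n) feature matrix, y: (m,) target vector.
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(num_iters):
            predictions = X @ theta              # h_theta(x) for all m samples
            errors = predictions - y             # h_theta(x^(i)) - y^(i)
            gradient = (X.T @ errors) / m        # average gradient over ALL samples
            theta -= learning_rate * gradient    # simultaneous update of every theta_j
        return theta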
Every iteration uses all of the data in the training set, so if the number of samples is very large, you can imagine how slow each iteration of this method becomes!
  Advantages: obtains the global optimal solution (for a convex cost function); easy to implement in parallel.
  Disadvantages: training is slow when the number of samples is large.
In terms of the number of iterations, BGD needs relatively few iterations. Its convergence curve can be illustrated as follows:

[Figure: convergence curve of BGD iterations.]
2. Mini-batch gradient descent (MBGD)
In the batch gradient descent method above, all samples are used in every iteration. For particularly large data sets, such as in large-scale machine learning applications, computing over all samples in every iteration carries a high computational cost. Is it possible to use only part of the samples in place of all of them in each iteration? Based on this idea, the concept of the mini-batch arises.
Suppose the number of samples in the training set is 1000 and each mini-batch is only a small subset, say 10 samples per mini-batch; then the whole training set can be divided into 100 mini-batches. The pseudo code is as follows:
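A minimal Python sketch of this procedure, assuming a plain linear-regression objective; the batch size of 10 comes from the example above, while the learning rate and epoch count are illustrative choices:

    import numpy as np

    def mini_batch_gradient_descent(X, y, batch_size=10, learning_rate=0.01, num_epochs=10):
        # Each parameter update uses only `batch_size` samples instead of the full set.
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(num_epochs):
            indices = np.random.permutation(m)            # shuffle the samples each epoch
            for start in range(0, m, batch_size):
                batch = indices[start:start + batch_size]
                errors = X[batch] @ theta - y[batch]
                gradient = (X[batch].T @ errors) / len(batch)
                theta -= learning_rate * gradient         # update from one mini-batch only
        return theta

    # With 1000 samples and batch_size = 10, each epoch performs 100 updates.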

3. Stochastic gradient descent (SGD)

Stochastic gradient descent (SGD) can be seen as a special case of mini-batch gradient descent: in SGD, the model parameters are adjusted using only one sample at a time. It is equivalent to the case B = 1 described above, i.e., each mini-batch contains only a single training sample.
The optimization process of the stochastic gradient descent method is:
    Repeat {
        for i = 1 to m:
            \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{(for every } j\text{)}
    }
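In code, a minimal sketch of this per-sample update (again for linear regression; the learning rate and epoch count are illustrative and not from the original article) could be:

    import numpy as np

    def stochastic_gradient_descent(X, y, learning_rate=0.01, num_epochs=10):
        # One parameter update per individual training sample.
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(num_epochs):
            for i in np.random.permutation(m):         # visit samples in random order
                error = X[i] @ theta - y[i]            # error on a single sample
                theta -= learning_rate * error * X[i]  # noisy, per-sample update
        return theta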
Stochastic gradient descent updates the parameters once per sample. If the sample size is large (for example, several hundred thousand), it may take only tens of thousands or even thousands of samples before θ has been iterated to (near) the optimal solution. In contrast, batch gradient descent needs all of the hundred-thousand-odd training samples for a single iteration, one iteration is unlikely to reach the optimum, and ten iterations require traversing the whole training set ten times. A problem with SGD, however, is that its updates are noisier than BGD's, so not every iteration moves toward the overall optimum.
  Advantages: fast training speed.
  Disadvantages: lower accuracy; the result is not necessarily the global optimum.
In terms of the number of iterations, SGD needs many more iterations, and its search through the solution space looks rather blind. Its convergence curve can be illustrated as follows:

[Figure: convergence curve of SGD iterations, a noisy, wandering path in the solution space.]


