Gradient descent method and its Python implementation

Source: Internet
Author: User

Gradient descent, also known as steepest descent, is the most common method for solving unconstrained optimization problems. It is an iterative method whose main operation at each step is computing the gradient vector of the objective function. The negative gradient direction at the current position is used as the search direction, because the objective function decreases fastest in that direction (which is also the origin of the name "steepest descent").
A characteristic of gradient descent: the closer the iterate gets to the target value, the smaller the step becomes and the slower the descent.
Visually, this is shown in the figure below (figure omitted in this copy):

Each circle represents a level set of the function, and the center represents the function's extreme point. Each iteration uses the gradient at the current position (which determines the search direction and, together with the step size, the pace) and the step size to find a new position, so that the iteration eventually reaches a local optimum of the objective function (if the objective function is convex, the global optimum is reached).


Below we specify the gradient descent method with formulas.
The following h_θ(x) is our fitting function:

    h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n
It can also be expressed in vector form:

    h_θ(x) = θᵀx
The following function is the risk function we need to optimize. Each term is the residual between our fitted function and y on the training set, and we take its squared loss as the risk function we construct (see the article on least squares and its Python implementation):

    J(θ) = (1/2m) · Σ_{i=1}^{m} (y^(i) − h_θ(x^(i)))²

We multiply by 1/2 so that the subsequent partial derivatives come out cleaner. We are free to include this factor because a constant coefficient has no effect on where the risk function attains its optimum.
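To make the risk function concrete, here is a minimal sketch in Python (the sample data and parameter values below are illustrative only, not from this article):

```python
# A minimal sketch of the squared-loss risk function J(theta).
# The sample data and parameter values here are illustrative only.

def h(theta, xvec):
    # Fitted function: dot product of theta and the feature vector
    return sum(t * xj for t, xj in zip(theta, xvec))

def risk(theta, xs, ys):
    # J(theta) = 1/(2m) * sum of squared residuals;
    # the 1/2 factor makes the derivative cleaner
    m = len(xs)
    return sum((h(theta, xv) - yv) ** 2 for xv, yv in zip(xs, ys)) / (2.0 * m)

xs = [(1, 0.), (1, 1.), (1, 2.)]   # each sample: (x0 = 1, x1)
ys = [1.0, 3.0, 5.0]               # exactly y = 1 + 2*x1

print(risk((1.0, 2.0), xs, ys))    # perfect fit, so the risk is 0.0
print(risk((0.0, 0.0), xs, ys))
```

Dropping the 1/2 (or the 1/m) rescales J(θ) everywhere by a constant, so the minimizing θ is unchanged.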
Our goal is to minimize the risk function so that our fitted function approximates the target function y as closely as possible:

    min_θ J(θ)
The specific gradient derivations below are carried out around this goal.


Batch gradient descent (BGD)
Following the conventional approach, we take the partial derivative of the risk function above with respect to each θ_j, obtaining the corresponding gradient component:

    ∂J(θ)/∂θ_j = −(1/m) · Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) · x_j^(i)

Here x_j^(i) denotes the j-th component of the i-th sample point.
Next, because we want to minimize the risk function, we update each parameter in its negative gradient direction:

    θ_j := θ_j + (α/m) · Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) · x_j^(i)

Here α denotes the step size.


From the above formula, note that although BGD moves toward a global optimal solution (for a convex objective), each iteration step uses all of the data in the training set; if m is large, you can imagine how slow this method's iterations become. This motivates another method: stochastic gradient descent.
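The full-batch update above can be sketched as follows (the data here is illustrative; the point is that the gradient is summed over every sample before θ is touched):

```python
# Sketch of batch gradient descent: each update accumulates the
# gradient over ALL m samples before changing the parameters.

def bgd_step(theta, xs, ys, alpha):
    m = len(xs)
    grad = [0.0] * len(theta)
    for xv, yv in zip(xs, ys):
        residual = sum(t * xj for t, xj in zip(theta, xv)) - yv
        for j, xj in enumerate(xv):
            grad[j] += residual * xj
    # theta_j := theta_j - alpha * (1/m) * sum_i residual_i * x_j^(i)
    return [t - alpha * g / m for t, g in zip(theta, grad)]

xs = [(1, 0.), (1, 1.), (1, 2.)]   # illustrative data: y = 1 + 2*x1
ys = [1.0, 3.0, 5.0]
theta = [0.0, 0.0]
for _ in range(2000):
    theta = bgd_step(theta, xs, ys, alpha=0.1)
print(theta)  # approaches (1, 2), the exact fit for this data
```

Each `bgd_step` call costs O(m) gradient evaluations, which is exactly the expense the paragraph above warns about when m is large.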


Stochastic gradient descent (SGD)
Because batch gradient descent is very slow when the training set is large, it is impractical for solving the risk-minimization problem in that setting; stochastic gradient descent was proposed for this case.
We rewrite the risk function described above in the following form:

    J(θ) = (1/m) · Σ_{i=1}^{m} (1/2)·(y^(i) − h_θ(x^(i)))² = (1/m) · Σ_{i=1}^{m} cost(θ, (x^(i), y^(i)))

where

    cost(θ, (x^(i), y^(i))) = (1/2)·(y^(i) − h_θ(x^(i)))²

is called the loss function of a single sample point.


Next, for the loss function of each sample, we take the partial derivative with respect to each θ_j to get the corresponding gradient component:

    ∂cost/∂θ_j = −(y^(i) − h_θ(x^(i))) · x_j^(i)
Each parameter is then updated along its negative gradient direction:

    θ_j := θ_j + α · (y^(i) − h_θ(x^(i))) · x_j^(i)
Compared with batch gradient descent, stochastic gradient descent uses only one sample per iteration; when the sample size is large, it is common for θ to reach a near-optimal solution after seeing only a subset of the data. Therefore, stochastic gradient descent requires far less computation than batch gradient descent.
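The single-sample update can be sketched as follows (illustrative data; note that, unlike the batch sketch, each step touches exactly one randomly chosen sample):

```python
import random

# Sketch of stochastic gradient descent: each update uses ONE random sample.

def sgd_step(theta, xs, ys, alpha):
    i = random.randrange(len(xs))
    xv, yv = xs[i], ys[i]
    residual = sum(t * xj for t, xj in zip(theta, xv)) - yv
    # theta_j := theta_j - alpha * residual * x_j  (no sum over samples)
    return [t - alpha * residual * xj for t, xj in zip(theta, xv)]

random.seed(0)
xs = [(1, 0.), (1, 1.), (1, 2.)]   # illustrative data: y = 1 + 2*x1
ys = [1.0, 3.0, 5.0]
theta = [0.0, 0.0]
for _ in range(5000):
    theta = sgd_step(theta, xs, ys, alpha=0.05)
print(theta)  # wanders noisily but ends up near (1, 2)
```

Each step here costs O(1) in the sample count, which is the source of SGD's speed advantage and of the noisy path discussed below.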


One drawback of SGD is that it is noisier than BGD, so not every iteration moves toward the overall optimum. Moreover, because each iteration uses a single sample, the final solution is often not the global optimum but only a local one. The overall trend, however, is toward the global optimum, and the final result usually lands near it.


Here are the graphical presentations of the two methods (figure omitted in this copy):

As can be seen from the figure, SGD searches along the gradient of a single sample point at each step, so its path toward the optimum looks more erratic (this is also the origin of the name "stochastic" gradient descent).


Comparing the advantages and disadvantages of the two methods:

Batch gradient descent:
Advantages: reaches the global optimum (for convex objectives), easy to parallelize, relatively few iterations needed.
Disadvantages: when the number of samples is large, training is slow and each iteration takes a long time.

Stochastic gradient descent:
Advantages: fast training, little computation per iteration.
Disadvantages: lower accuracy, not globally optimal, hard to parallelize, larger total number of iterations.



============ Split ============
Above we explained what gradient descent is and how to derive it; next we implement gradient descent in Python.

# _*_ coding: utf-8 _*_
# Author: yhao
# Blog: http://blog.csdn.net/yhao2014
# email: [Email protected]

# Training set: 3 components per sample point (x0, x1, x2)
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
# y[i] is the output of sample point i
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

# Iteration threshold: stop when the loss changes between two
# consecutive iterations by less than this value
epsilon = 0.0001
# Learning rate
alpha = 0.01

diff = [0, 0]
error1 = 0
error0 = 0
cnt = 0
m = len(x)

# Initialize parameters
theta0 = 0
theta1 = 0
theta2 = 0

while True:
    cnt += 1
    # Parameter update: one pass over the training set,
    # updating after each sample
    for i in range(m):
        # Fitted function: h(theta) = theta0*x[0] + theta1*x[1] + theta2*x[2]
        # Compute the residual
        diff[0] = (theta0 + theta1 * x[i][1] + theta2 * x[i][2]) - y[i]
        # Gradient component = diff[0] * x[i][j]
        theta0 -= alpha * diff[0] * x[i][0]
        theta1 -= alpha * diff[0] * x[i][1]
        theta2 -= alpha * diff[0] * x[i][2]
    # Compute the loss function
    error1 = 0
    for lp in range(len(x)):
        error1 += (y[lp] - (theta0 + theta1 * x[lp][1] + theta2 * x[lp][2])) ** 2 / 2
    if abs(error1 - error0) < epsilon:
        break
    else:
        error0 = error1
    print('theta0: %f, theta1: %f, theta2: %f, error1: %f' % (theta0, theta1, theta2, error1))

print('Done: theta0: %f, theta1: %f, theta2: %f' % (theta0, theta1, theta2))
print('Iteration count: %d' % cnt)


Results (excerpt):

theta0: 2.782632, theta1: 3.207850, theta2: 7.998823, error1: 7.508687
theta0: 4.254302, theta1: 3.809652, theta2: 11.972218, error1: 813.550287
theta0: 5.154766, theta1: 3.351648, theta2: 14.188535, error1: 1686.507256
theta0: 5.800348, theta1: 2.489862, theta2: 15.617995, error1: 2086.492788
theta0: 6.326710, theta1: 1.500854, theta2: 16.676947, error1: 2204.562407
theta0: 6.792409, theta1: 0.499552, theta2: 17.545335, error1: 2194.779569
...
theta0: 74.892395, theta1: -13.494257, theta2: 8.587471, error1: 87.700881
theta0: 74.942294, theta1: -13.493667, theta2: 8.571632, error1: 87.372640
theta0: 74.992087, theta1: -13.493079, theta2: 8.555828, error1: 87.045719
theta0: 75.041771, theta1: -13.492491, theta2: 8.540057, error1: 86.720115
theta0: 75.091349, theta1: -13.491905, theta2: 8.524321, error1: 86.395820
theta0: 75.140820, theta1: -13.491320, theta2: 8.508618, error1: 86.072830
theta0: 75.190184, theta1: -13.490736, theta2: 8.492950, error1: 85.751139
theta0: 75.239442, theta1: -13.490154, theta2: 8.477315, error1: 85.430741
...
theta0: 97.986390, theta1: -13.221172, theta2: 1.257259, error1: 1.553781
theta0: 97.986505, theta1: -13.221170, theta2: 1.257223, error1: 1.553680
theta0: 97.986620, theta1: -13.221169, theta2: 1.257186, error1: 1.553579
theta0: 97.986735, theta1: -13.221167, theta2: 1.257150, error1: 1.553479
theta0: 97.986849, theta1: -13.221166, theta2: 1.257113, error1: 1.553379
theta0: 97.986963, theta1: -13.221165, theta2: 1.257077, error1: 1.553278
Done: theta0: 97.987078, theta1: -13.221163, theta2: 1.257041
Iteration count: 3443


You can see that the parameters finally converge to stable values.

Note: alpha and epsilon must be chosen carefully here; a poor choice may cause the iteration to fail to converge.

