Gradient descent method and its Python implementation


The gradient descent method, also known as steepest descent, is the most common method for solving unconstrained optimization problems. It is an iterative method whose main operation at each step is to evaluate the gradient vector of the objective function. The negative gradient direction at the current position is used as the search direction, because the objective function decreases fastest in that direction; this is also where the name "steepest descent" comes from.
A characteristic of gradient descent: the closer the iterate gets to the target value, the smaller the step becomes and the slower the descent proceeds.
Visually, it looks as follows:

[Figure: contour plot of the objective function, with iterates stepping toward the center]

Each circle here represents a level curve (contour) of the objective function, and the center is the function's extreme point. At each iteration, the gradient at the current position (which determines the search direction and, together with the step size, the speed of movement) and the step size are used to find a new position, so that the iterates eventually reach a local optimum of the objective function (if the objective function is convex, the global optimum is reached).
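To make the iteration concrete, here is a minimal sketch (my own illustration, not from the original post) of gradient descent on the one-variable function f(x) = x², whose gradient is 2x. Note how the steps shrink as the iterate approaches the minimum at x = 0:

# Minimal gradient descent on f(x) = x**2; its gradient is f'(x) = 2*x.
# Starting point and learning rate are made up for illustration.
x = 5.0
alpha = 0.1
for i in range(50):
    grad = 2 * x          # gradient at the current position
    x = x - alpha * grad  # step along the negative gradient direction
print(x)  # close to 0, the minimizer of f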


Below we use formulas to describe the gradient descent method concretely.
The following h_θ(x) is our fitting (hypothesis) function:

    h_θ(x) = θ_0·x_0 + θ_1·x_1 + … + θ_n·x_n    (with x_0 = 1)
It can also be expressed in vector form:

    h_θ(x) = θᵀx
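In code, this hypothesis is simply a dot product between the parameter vector and the feature vector; a quick sketch with made-up names:

# The hypothesis h_theta(x) as a dot product; names are illustrative.
def h(theta, x):
    return sum(t * xj for t, xj in zip(theta, x))

print(h([1.0, 2.0, 3.0], [1, 0.5, 2.0]))  # 1*1 + 2*0.5 + 3*2.0 = 8.0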
The following function is the risk function we need to optimize. Each term of the sum is the residual between our fitted function and y at one point of the training set, and we take its squared loss, over all m training samples, as the risk function we construct (see "Least squares and its Python implementation"):

    J(θ) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

We multiply by 1/2 so that the result is cleaner when we later take partial derivatives; we are free to introduce this constant factor because it has no effect on where the risk function attains its optimum.
Our goal is to minimize the risk function so that our fitting function fits the target y as well as possible:

    min_θ J(θ)
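A short sketch of this risk function in code (the helper names are my own, and the tiny data set is made up):

# Squared-error risk J(theta) = 1/(2m) * sum of squared residuals.
def h(theta, x):
    # hypothesis: dot product of parameters and features
    return sum(t * xj for t, xj in zip(theta, x))

def J(theta, xs, ys):
    m = len(xs)
    return sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2.0 * m)

# Two samples, each with x0 = 1 as the intercept component
print(J([0.0, 0.0], [(1, 1.0), (1, 2.0)], [2.0, 3.0]))  # (4 + 9) / 4 = 3.25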
The specific gradient derivations below all revolve around this goal.


Batch gradient descent (BGD)
Following the conventional approach, we take the partial derivative of the risk function above with respect to each θ_j and obtain the corresponding gradient:

    ∂J(θ)/∂θ_j = (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Here x_j^(i) denotes the j-th component of the i-th sample point x^(i).


Next, since we want to minimize the risk function, we update each parameter θ_j in its negative gradient direction:

    θ_j := θ_j − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)

The α here denotes the step size (learning rate).


Note from the formula above that although this method reaches the global optimum, every iteration step uses all of the data in the training set; if m is large, imagine how slowly this method iterates! This is what motivates the other method: stochastic gradient descent.
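Here is a minimal BGD sketch for the linear hypothesis above, using the same training set as the full script later in this post (the learning rate and iteration count are my own choices):

# Batch gradient descent: every update uses all m samples.
xs = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
ys = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]
m = len(xs)
alpha = 0.01
theta = [0.0, 0.0, 0.0]

for step in range(10000):
    # Full-batch gradient: (1/m) * sum over i of (h(x_i) - y_i) * x_ij
    grad = [0.0, 0.0, 0.0]
    for xi, yi in zip(xs, ys):
        residual = sum(t * xj for t, xj in zip(theta, xi)) - yi
        for j in range(3):
            grad[j] += residual * xi[j] / m
    # Update every parameter along its negative gradient direction
    for j in range(3):
        theta[j] -= alpha * grad[j]

print(theta)  # should approach the least-squares fit of this data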


Stochastic gradient descent (SGD)
Because batch gradient descent is very slow when the training set is large, it is not feasible to use it to optimize the risk function in that case; stochastic gradient descent was proposed for exactly this situation.
We rewrite the risk function described above in the following form:

    J(θ) = (1/m) · Σ_{i=1}^{m} (1/2)(h_θ(x^(i)) − y^(i))² = (1/m) · Σ_{i=1}^{m} cost(θ, (x^(i), y^(i)))

where

    cost(θ, (x^(i), y^(i))) = (1/2)(h_θ(x^(i)) − y^(i))²

is called the loss function at a single sample point.
Next, we take the partial derivative of each sample's loss function with respect to each θ_j to obtain the corresponding gradient:

    ∂cost(θ, (x^(i), y^(i)))/∂θ_j = (h_θ(x^(i)) − y^(i)) · x_j^(i)

Each parameter is then updated along its negative gradient direction, one sample at a time:

    θ_j := θ_j − α · (h_θ(x^(i)) − y^(i)) · x_j^(i)
Compared with batch gradient descent, stochastic gradient descent uses only a single sample per iteration; when the sample size is large, it is common to drive θ to (near) the optimal solution after touching only a subset of the samples. Stochastic gradient descent therefore requires far less computation than batch gradient descent.
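The per-sample update is just one multiply-and-subtract per parameter. A sketch (the names are my own; the full script later in this post implements this same loop over the whole training set):

# One stochastic update from a single sample (x, y); names are illustrative.
def sgd_step(theta, x, y, alpha):
    residual = sum(t * xj for t, xj in zip(theta, x)) - y   # h(x) - y
    return [t - alpha * residual * xj for t, xj in zip(theta, x)]

theta = [0.0, 0.0, 0.0]
theta = sgd_step(theta, (1, 0., 3), 95.364, 0.01)
print(theta)  # parameters nudged toward fitting this one sample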


One drawback of SGD is that it is noisier than BGD, so not every iteration moves toward the overall optimum. Because each iteration uses a single sample, the final solution is often not the global optimum but only a local one; still, the overall direction of movement is toward the global optimum, and the final result usually lands near it.


Here is how the two methods look graphically:

[Figures: search paths of batch gradient descent and stochastic gradient descent in parameter space]

As can be seen from such figures, because SGD searches along the gradient of a single sample point at each step, its path toward the optimum looks much more erratic (this is also where the name "stochastic" gradient descent comes from).


To compare the advantages and disadvantages of the two methods:

Batch gradient descent:
Advantages: reaches the global optimum; easy to implement in parallel; needs relatively few iterations.
Disadvantages: when the number of samples is large, training is slow and each iteration takes a long time.

Stochastic gradient descent:
Advantages: fast training; little computation per iteration.
Disadvantages: lower accuracy, the result is not the global optimum; not easy to implement in parallel; a larger total number of iterations.



========================================
Above we explained what gradient descent is and how its updates are derived; next we implement the gradient descent method in Python.

# _*_ coding: utf-8 _*_
# Author: yhao
# Blog: http://blog.csdn.net/yhao2014
# Email: [email protected]

# Training set: each sample point has 3 components (x0, x1, x2)
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
# y[i] is the output corresponding to each sample point
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

# Iteration threshold: stop when the loss function changes by less
# than epsilon between two consecutive iterations
epsilon = 0.0001

# Learning rate
alpha = 0.01
diff = [0, 0]
max_itor = 1000
error1 = 0
error0 = 0
cnt = 0
m = len(x)

# Initialize the parameters
theta0 = 0
theta1 = 0
theta2 = 0

while True:
    cnt += 1

    # Parameter update: one sample at a time (stochastic gradient descent)
    for i in range(m):
        # Fitted function: h(x) = theta0 * x[0] + theta1 * x[1] + theta2 * x[2]
        # Residual for sample i
        diff[0] = (theta0 * x[i][0] + theta1 * x[i][1] + theta2 * x[i][2]) - y[i]

        # Gradient = diff[0] * x[i][j]
        theta0 -= alpha * diff[0] * x[i][0]
        theta1 -= alpha * diff[0] * x[i][1]
        theta2 -= alpha * diff[0] * x[i][2]

    # Compute the loss function over the whole training set
    error1 = 0
    for lp in range(len(x)):
        error1 += (y[lp] - (theta0 * x[lp][0] + theta1 * x[lp][1] + theta2 * x[lp][2])) ** 2 / 2

    if abs(error1 - error0) < epsilon:
        break
    else:
        error0 = error1

    print('theta0: %f, theta1: %f, theta2: %f, error1: %f' % (theta0, theta1, theta2, error1))

print('Done: theta0: %f, theta1: %f, theta2: %f' % (theta0, theta1, theta2))
print('Iteration count: %d' % cnt)

Results (excerpt):
theta0: 2.782632, theta1: 3.207850, theta2: 7.998823, error1: 7.508687
theta0: 4.254302, theta1: 3.809652, theta2: 11.972218, error1: 813.550287
theta0: 5.154766, theta1: 3.351648, theta2: 14.188535, error1: 1686.507256
theta0: 5.800348, theta1: 2.489862, theta2: 15.617995, error1: 2086.492788
theta0: 6.326710, theta1: 1.500854, theta2: 16.6769, error1: 2204.562407
theta0: 6.792409, theta1: 0.499552, theta2: 17.545335, error1: 2194.779569
...
theta0: 74.892395, theta1: -13.494257, theta2: 8.587471, error1: 87.700881
theta0: 74.942294, theta1: -13.493667, theta2: 8.571632, error1: 87.372640
theta0: 74.992087, theta1: -13.493079, theta2: 8.555828, error1: 87.045719
theta0: 75.041771, theta1: -13.492491, theta2: 8.540057, error1: 86.720115
theta0: 75.091349, theta1: -13.491905, theta2: 8.524321, error1: 86.395820
theta0: 75.140820, theta1: -13.491320, theta2: 8.508618, error1: 86.072830
theta0: 75.190184, theta1: -13.490736, theta2: 8.4929, error1: 85.751139
theta0: 75.239442, theta1: -13.490154, theta2: 8.477315, error1: 85.430741
...
theta0: 97.986390, theta1: -13.221172, theta2: 1.257259, error1: 1.553781
theta0: 97.986505, theta1: -13.221170, theta2: 1.257223, error1: 1.553680
theta0: 97.986620, theta1: -13.221169, theta2: 1.257186, error1: 1.553579
theta0: 97.986735, theta1: -13.221167, theta2: 1.257150, error1: 1.553479
theta0: 97.986849, theta1: -13.221166, theta2: 1.257113, error1: 1.553379
theta0: 97.986963, theta1: -13.221165, theta2: 1.257077, error1: 1.553278
Done: theta0: 97.987078, theta1: -13.221163, theta2: 1.257041
Iteration count: 3443

You can see that the parameters finally converge to stable values.
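As a sanity check (my own addition, assuming NumPy is available), you can compare the converged parameters against the closed-form least-squares solution for the same data:

import numpy as np

# Same training data as the script above
A = np.array([(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)])
b = np.array([95.364, 97.217205, 75.195834, 60.105519, 49.342380])

# Closed-form least-squares fit; the SGD result should land nearby
theta, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(theta)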

Note: the values of alpha and epsilon here need to be chosen carefully; inappropriate values may cause the iteration to fail to converge.
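For example, a made-up probe (not from the original post) that runs a fixed number of passes at several candidate learning rates and prints the resulting loss; a loss that grows instead of shrinking signals divergence:

# Probe several learning rates on the same data; illustrative only.
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

for alpha in (0.001, 0.01, 0.1):
    theta = [0.0, 0.0, 0.0]
    for _ in range(50):  # a fixed number of SGD passes
        for xi, yi in zip(x, y):
            residual = sum(t * xj for t, xj in zip(theta, xi)) - yi
            theta = [t - alpha * residual * xj for t, xj in zip(theta, xi)]
    loss = sum((yi - sum(t * xj for t, xj in zip(theta, xi))) ** 2 / 2
               for xi, yi in zip(x, y))
    print('alpha = %g, loss = %g' % (alpha, loss))  # a huge loss means divergence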


