Gradient descent method and its Python implementation

Source: Internet
Author: User

Gradient descent, also known as steepest descent, is the most common method for solving unconstrained optimization problems. It is an iterative method whose main operation at each step is computing the gradient vector of the objective function. The negative gradient direction at the current position is used as the search direction, because the objective function decreases fastest in that direction (which is also the origin of the name "steepest descent").
A characteristic of gradient descent: the closer the iterate gets to the target value, the smaller the step becomes and the slower the descent.
Visually, this is shown in the figure below (figure omitted in this copy):

Each circle represents a level set of the function, and the center represents the function's extreme point. Each iteration uses the gradient at the current position (which determines the search direction and, together with the step size, the pace) and the step size to find a new position, so that the iteration eventually reaches a local optimum of the objective function (if the objective function is convex, the global optimum is reached).


Below we specify the gradient descent method with formulas.
The following h_θ(x) is our fitting function:

    h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n
It can also be expressed in vector form:

    h_θ(x) = θᵀx
The following function is the risk function we need to optimize. Each term is the residual between our fitted function and y on the training set, and we take its squared loss as the risk function we construct (see the article on least squares and its Python implementation):

    J(θ) = (1/2m) · Σ_{i=1}^{m} (y^(i) − h_θ(x^(i)))²

We multiply by 1/2 so that the subsequent partial derivatives come out cleaner. We are free to include this factor because a constant coefficient has no effect on where the risk function attains its optimum.
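To make the risk function concrete, here is a minimal sketch in Python (the sample data and parameter values below are illustrative only, not from this article):

```python
# A minimal sketch of the squared-loss risk function J(theta).
# The sample data and parameter values here are illustrative only.

def h(theta, xvec):
    # Fitted function: dot product of theta and the feature vector
    return sum(t * xj for t, xj in zip(theta, xvec))

def risk(theta, xs, ys):
    # J(theta) = 1/(2m) * sum of squared residuals;
    # the 1/2 factor makes the derivative cleaner
    m = len(xs)
    return sum((h(theta, xv) - yv) ** 2 for xv, yv in zip(xs, ys)) / (2.0 * m)

xs = [(1, 0.), (1, 1.), (1, 2.)]   # each sample: (x0 = 1, x1)
ys = [1.0, 3.0, 5.0]               # exactly y = 1 + 2*x1

print(risk((1.0, 2.0), xs, ys))    # perfect fit, so the risk is 0.0
print(risk((0.0, 0.0), xs, ys))
```

Dropping the 1/2 (or the 1/m) rescales J(θ) everywhere by a constant, so the minimizing θ is unchanged.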
Our goal is to minimize the risk function so that our fitted function approximates the target function y as closely as possible:

    min_θ J(θ)
The specific gradient derivations below are carried out around this goal.


Batch gradient descent (BGD)
Following the conventional approach, we take the partial derivative of the risk function above with respect to each θ_j, obtaining the corresponding gradient component:

    ∂J(θ)/∂θ_j = −(1/m) · Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) · x_j^(i)

Here x_j^(i) denotes the j-th component of the i-th sample point.
Next, because we want to minimize the risk function, we update each parameter in its negative gradient direction:

    θ_j := θ_j + (α/m) · Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) · x_j^(i)

Here α denotes the step size.


From the above formula, note that although BGD moves toward a global optimal solution (for a convex objective), each iteration step uses all of the data in the training set; if m is large, you can imagine how slow this method's iterations become. This motivates another method: stochastic gradient descent.
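The full-batch update above can be sketched as follows (the data here is illustrative; the point is that the gradient is summed over every sample before θ is touched):

```python
# Sketch of batch gradient descent: each update accumulates the
# gradient over ALL m samples before changing the parameters.

def bgd_step(theta, xs, ys, alpha):
    m = len(xs)
    grad = [0.0] * len(theta)
    for xv, yv in zip(xs, ys):
        residual = sum(t * xj for t, xj in zip(theta, xv)) - yv
        for j, xj in enumerate(xv):
            grad[j] += residual * xj
    # theta_j := theta_j - alpha * (1/m) * sum_i residual_i * x_j^(i)
    return [t - alpha * g / m for t, g in zip(theta, grad)]

xs = [(1, 0.), (1, 1.), (1, 2.)]   # illustrative data: y = 1 + 2*x1
ys = [1.0, 3.0, 5.0]
theta = [0.0, 0.0]
for _ in range(2000):
    theta = bgd_step(theta, xs, ys, alpha=0.1)
print(theta)  # approaches (1, 2), the exact fit for this data
```

Each `bgd_step` call costs O(m) gradient evaluations, which is exactly the expense the paragraph above warns about when m is large.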


Stochastic gradient descent (SGD)
Because batch gradient descent is very slow when the training set is large, it is impractical for solving the risk-minimization problem in that setting; stochastic gradient descent was proposed for this case.
We rewrite the risk function described above in the following form:

    J(θ) = (1/m) · Σ_{i=1}^{m} (1/2)·(y^(i) − h_θ(x^(i)))² = (1/m) · Σ_{i=1}^{m} cost(θ, (x^(i), y^(i)))

where

    cost(θ, (x^(i), y^(i))) = (1/2)·(y^(i) − h_θ(x^(i)))²

is called the loss function of a single sample point.


Next, for the loss function of each sample, we take the partial derivative with respect to each θ_j to get the corresponding gradient component:

    ∂cost/∂θ_j = −(y^(i) − h_θ(x^(i))) · x_j^(i)
Each parameter is then updated along its negative gradient direction:

    θ_j := θ_j + α · (y^(i) − h_θ(x^(i))) · x_j^(i)
Compared with batch gradient descent, stochastic gradient descent uses only one sample per iteration; when the sample size is large, it is common for θ to reach a near-optimal solution after seeing only a subset of the data. Therefore, stochastic gradient descent requires far less computation than batch gradient descent.
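The single-sample update can be sketched as follows (illustrative data; note that, unlike the batch sketch, each step touches exactly one randomly chosen sample):

```python
import random

# Sketch of stochastic gradient descent: each update uses ONE random sample.

def sgd_step(theta, xs, ys, alpha):
    i = random.randrange(len(xs))
    xv, yv = xs[i], ys[i]
    residual = sum(t * xj for t, xj in zip(theta, xv)) - yv
    # theta_j := theta_j - alpha * residual * x_j  (no sum over samples)
    return [t - alpha * residual * xj for t, xj in zip(theta, xv)]

random.seed(0)
xs = [(1, 0.), (1, 1.), (1, 2.)]   # illustrative data: y = 1 + 2*x1
ys = [1.0, 3.0, 5.0]
theta = [0.0, 0.0]
for _ in range(5000):
    theta = sgd_step(theta, xs, ys, alpha=0.05)
print(theta)  # wanders noisily but ends up near (1, 2)
```

Each step here costs O(1) in the sample count, which is the source of SGD's speed advantage and of the noisy path discussed below.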


One drawback of SGD is that it is noisier than BGD, so not every iteration moves toward the overall optimum. Moreover, because each iteration uses a single sample, the final solution is often not the global optimum but only a local one. The overall trend, however, is toward the global optimum, and the final result usually lands near it.


Here are the graphical presentations of the two methods (figure omitted in this copy):

As can be seen from the figure, SGD searches along the gradient of a single sample point at each step, so its path toward the optimum looks more erratic (this is also the origin of the name "stochastic" gradient descent).


Comparing the advantages and disadvantages of the two methods:

Batch gradient descent:
Advantages: reaches the global optimum (for convex objectives), easy to parallelize, relatively few iterations needed.
Disadvantages: when the number of samples is large, training is slow and each iteration takes a long time.

Stochastic gradient descent:
Advantages: fast training, little computation per iteration.
Disadvantages: lower accuracy, not globally optimal, hard to parallelize, larger total number of iterations.



============ Split ============
Above we explained what gradient descent is and how to derive it; next we implement gradient descent in Python.

# _*_ coding: utf-8 _*_
# Author: yhao
# Blog: http://blog.csdn.net/yhao2014
# email: [Email protected]

# Training set: 3 components per sample point (x0, x1, x2)
x = [(1, 0., 3), (1, 1., 3), (1, 2., 3), (1, 3., 2), (1, 4., 4)]
# y[i] is the output of sample point i
y = [95.364, 97.217205, 75.195834, 60.105519, 49.342380]

# Iteration threshold: stop when the loss changes between two
# consecutive iterations by less than this value
epsilon = 0.0001
# Learning rate
alpha = 0.01

diff = [0, 0]
error1 = 0
error0 = 0
cnt = 0
m = len(x)

# Initialize parameters
theta0 = 0
theta1 = 0
theta2 = 0

while True:
    cnt += 1
    # Parameter update: one pass over the training set,
    # updating after each sample
    for i in range(m):
        # Fitted function: h(theta) = theta0*x[0] + theta1*x[1] + theta2*x[2]
        # Compute the residual
        diff[0] = (theta0 + theta1 * x[i][1] + theta2 * x[i][2]) - y[i]
        # Gradient component = diff[0] * x[i][j]
        theta0 -= alpha * diff[0] * x[i][0]
        theta1 -= alpha * diff[0] * x[i][1]
        theta2 -= alpha * diff[0] * x[i][2]
    # Compute the loss function
    error1 = 0
    for lp in range(len(x)):
        error1 += (y[lp] - (theta0 + theta1 * x[lp][1] + theta2 * x[lp][2])) ** 2 / 2
    if abs(error1 - error0) < epsilon:
        break
    else:
        error0 = error1
    print('theta0: %f, theta1: %f, theta2: %f, error1: %f' % (theta0, theta1, theta2, error1))

print('Done: theta0: %f, theta1: %f, theta2: %f' % (theta0, theta1, theta2))
print('Iteration count: %d' % cnt)


Results (excerpt):

theta0: 2.782632, theta1: 3.207850, theta2: 7.998823, error1: 7.508687
theta0: 4.254302, theta1: 3.809652, theta2: 11.972218, error1: 813.550287
theta0: 5.154766, theta1: 3.351648, theta2: 14.188535, error1: 1686.507256
theta0: 5.800348, theta1: 2.489862, theta2: 15.617995, error1: 2086.492788
theta0: 6.326710, theta1: 1.500854, theta2: 16.676947, error1: 2204.562407
theta0: 6.792409, theta1: 0.499552, theta2: 17.545335, error1: 2194.779569
...
theta0: 74.892395, theta1: -13.494257, theta2: 8.587471, error1: 87.700881
theta0: 74.942294, theta1: -13.493667, theta2: 8.571632, error1: 87.372640
theta0: 74.992087, theta1: -13.493079, theta2: 8.555828, error1: 87.045719
theta0: 75.041771, theta1: -13.492491, theta2: 8.540057, error1: 86.720115
theta0: 75.091349, theta1: -13.491905, theta2: 8.524321, error1: 86.395820
theta0: 75.140820, theta1: -13.491320, theta2: 8.508618, error1: 86.072830
theta0: 75.190184, theta1: -13.490736, theta2: 8.492950, error1: 85.751139
theta0: 75.239442, theta1: -13.490154, theta2: 8.477315, error1: 85.430741
...
theta0: 97.986390, theta1: -13.221172, theta2: 1.257259, error1: 1.553781
theta0: 97.986505, theta1: -13.221170, theta2: 1.257223, error1: 1.553680
theta0: 97.986620, theta1: -13.221169, theta2: 1.257186, error1: 1.553579
theta0: 97.986735, theta1: -13.221167, theta2: 1.257150, error1: 1.553479
theta0: 97.986849, theta1: -13.221166, theta2: 1.257113, error1: 1.553379
theta0: 97.986963, theta1: -13.221165, theta2: 1.257077, error1: 1.553278
Done: theta0: 97.987078, theta1: -13.221163, theta2: 1.257041
Iteration count: 3443


You can see that the parameters finally converge to stable values.

Note: alpha and epsilon must be chosen carefully here; a poor choice may cause the iteration to fail to converge.

