Machine Learning Knowledge Point 04 - Gradient Descent Algorithm

Source: Internet
Author: User

Gradient descent is an algorithm for finding the minimum of a function. We will use gradient descent to find the minimum of the cost function J(θ0, θ1).

The idea behind gradient descent is that we start by choosing a random combination of parameters (θ0, θ1, ..., θn), compute the cost function, and then look for the next combination of parameters that lowers the value of the cost function. We keep doing this until we reach a local minimum. Because we do not try every possible combination of parameters, we cannot be sure that the local minimum we reach is the global minimum; starting from a different initial combination of parameters may lead to a different local minimum.

Imagine you are standing at a point on a mountain, on the red hill of the park in your imagination. In the gradient descent algorithm, what we do is turn 360 degrees, look all around, and ask ourselves: in which direction should I take a small step to get downhill as quickly as possible? If you stand at this point on the hillside and look around, you will find the best direction to go downhill. You take a small step in that direction, then, from the new point, look around again and ask once more: in which direction should my next small step go to take me downhill? You follow your own judgment, take another step, and repeat: from each new point you look around, decide which direction takes you downhill fastest, take a small step, and so on, until you end up near a local lowest point.

The formula for the batch gradient descent algorithm is:

θj := θj − α · ∂J(θ0, θ1)/∂θj   (simultaneously for j = 0 and j = 1)

Here α is the learning rate, which determines how large a step we take downhill in the direction in which the cost function decreases most steeply. In batch gradient descent, on every step we update all of the parameters, subtracting from each one the learning rate times the corresponding partial derivative of the cost function.
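
As a rough Python sketch of this update rule (not code from the original article; grad_J, alpha, and num_iters are illustrative names, with grad_J assumed to return both partial derivatives of J over the whole training set), the batch gradient descent loop might look like this:

```python
# A minimal sketch of the batch gradient descent loop described above.
# grad_J is a hypothetical helper returning the partial derivatives of the
# cost function J with respect to theta0 and theta1, computed over the whole
# training set (which is what makes this the "batch" variant).

def batch_gradient_descent(grad_J, theta0, theta1, alpha=0.01, num_iters=1000):
    for _ in range(num_iters):
        grad0, grad1 = grad_J(theta0, theta1)
        # Simultaneous update: both new values are computed from the old ones.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```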

There is a subtler point in gradient descent: we want to update both θ0 and θ1, that is, the update happens for j = 0 and for j = 1, so both θ0 and θ1 get new values. The subtlety in implementing gradient descent is that, in this expression, the two parameters must be updated simultaneously. In other words, we update

θ0 := θ0 − α · ∂J(θ0, θ1)/∂θ0   and   θ1 := θ1 − α · ∂J(θ0, θ1)/∂θ1

by first evaluating the right-hand sides of both formulas using the current values of θ0 and θ1, and only then assigning the new values to θ0 and θ1 together.

In the gradient descent algorithm, this is the correct way to implement simultaneous updating. I will not explain here why the updates need to be simultaneous; simultaneous updating is simply the standard method in gradient descent. As we will discuss later, it is also the more natural way to implement the algorithm, and when people talk about gradient descent, they mean the simultaneously updating version. (Note: "correct simultaneous implementation" means that you should not compute temp0 and update θ0 first, and only then compute temp1 and update θ1.)
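
To make the note above concrete, here is a sketch of the correct simultaneous update next to the incorrect sequential one; dJ_dtheta0 and dJ_dtheta1 are assumed helper functions returning the two partial derivatives of J:

```python
# Correct: evaluate both derivatives at the OLD (theta0, theta1), then assign.
temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
theta0 = temp0
theta1 = temp1

# Incorrect: theta0 is overwritten first, so the derivative used for theta1
# is evaluated at the NEW theta0 instead of the old one.
theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
```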

A deeper understanding of gradient descent:

We have given a mathematical definition of gradient descent; now let us dig deeper into what the algorithm actually does and what its update step means. For a single parameter the gradient descent update is:

θ1 := θ1 − α · dJ(θ1)/dθ1

Description: θ is repeatedly adjusted so that J(θ) decreases in the direction of steepest descent, iterating until a local minimum is reached. Here α is the learning rate, which determines how large a step we take downhill in the direction in which the cost function decreases most steeply.

As for the purpose of the derivative term: it is essentially the slope of the tangent to the function at the red point, shown as the red line just touching the curve there. The slope of that tangent line equals the height of the triangle divided by its horizontal length. Here the line has a positive slope, that is, a positive derivative, so the new θ1 equals the old θ1 minus α times a positive number, which moves θ1 to the left, toward the minimum.
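
To make the sign argument explicit (a restatement of the same update rule, with the negative-slope case added for completeness):

```latex
\theta_1 := \theta_1 - \alpha \,\frac{d}{d\theta_1} J(\theta_1)
\qquad
\begin{cases}
\dfrac{d}{d\theta_1} J(\theta_1) > 0 & \Rightarrow\ \theta_1 \text{ decreases (moves left, toward the minimum)},\\[4pt]
\dfrac{d}{d\theta_1} J(\theta_1) < 0 & \Rightarrow\ \theta_1 \text{ increases (moves right, toward the minimum)}.
\end{cases}
```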

Let's take a look at what happens if α is too small or too large:

If α is too small, that is, if the learning rate is too small, then gradient descent takes tiny baby steps as it tries to approach the lowest point, so it needs a great many steps to get there. In short, if α is too small, gradient descent can be very slow.

If α is too large, gradient descent may overshoot the lowest point and may not even converge: each iteration takes a big step, crossing the lowest point again and again, and you may find it getting farther and farther away from it. So if α is too large, gradient descent can fail to converge, or even diverge.

Now another question: suppose we place θ1 at a local minimum in advance. What will the next gradient descent step do? If θ1 is initialized at a local minimum, it is already at a local optimum, and the derivative there equals zero, because the derivative is the slope of the tangent line. This means the update leaves θ1 unchanged: the new θ1 equals the old θ1. So if your parameter is already at a local minimum, the gradient descent update does nothing; it does not change the parameter's value. This also explains why gradient descent can converge to a local minimum even when the learning rate α is held constant.
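
As a small illustration of all three cases (the quadratic cost J(θ) = θ², with derivative 2θ, is made up here for demonstration and is not from the original text):

```python
# Gradient descent on J(theta) = theta**2, whose derivative is 2 * theta.

def run(theta, alpha, steps=10):
    history = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # one gradient descent step
        history.append(theta)
    return history

print(run(theta=1.0, alpha=0.01))  # alpha too small: creeps toward 0 very slowly
print(run(theta=1.0, alpha=1.1))   # alpha too large: overshoots 0 and diverges
print(run(theta=0.0, alpha=0.1))   # already at the minimum: theta never changes
```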

Let's look at an example using the cost function J(θ).

I want to find its minimum. First I initialize my gradient descent algorithm at the magenta point. If I take one gradient descent step, it might bring me to the next point, because the derivative there is quite steep. Now, at this green point, if I take one more step, you will notice that the derivative, that is, the slope, is not as steep. As I approach the lowest point, the derivative gets closer and closer to 0, so after one step of gradient descent the new derivative is a little smaller. Taking another step from this green point, I naturally take a somewhat smaller step than I did from the magenta point, arriving at a new red point that is closer to the global minimum, where the derivative is smaller still. So with each further step of gradient descent the derivative term is smaller, and the update to θ1 is smaller. As gradient descent runs, the size of each move automatically becomes smaller, until the moves are very small and you find that the algorithm has converged to a local minimum.

Looking back: in gradient descent, as we approach a local minimum, the method automatically takes smaller steps, because at a local minimum the derivative is zero, so as we approach it the derivative automatically becomes smaller, and therefore the step gradient descent takes automatically becomes smaller. That is simply how gradient descent works, so there is no need to decrease α separately over time. This is the gradient descent algorithm, and you can use it to minimize any cost function J, not just the cost function J of linear regression.
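
A quick sketch of this behaviour with a fixed α, again on the illustrative quadratic J(θ) = θ²: the printed step sizes shrink on their own because the derivative shrinks, even though α never changes.

```python
# With alpha held fixed, the step alpha * dJ/dtheta still shrinks automatically,
# because the derivative itself shrinks as theta approaches the minimum.
theta, alpha = 1.0, 0.1
for i in range(5):
    step = alpha * 2 * theta              # derivative of theta**2 is 2 * theta
    theta -= step
    print(f"iteration {i}: step = {step:.4f}, theta = {theta:.4f}")
# Step sizes: 0.2000, 0.1600, 0.1280, 0.1024, 0.0819 -- smaller every iteration.
```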

In the next article we will return the cost function J to its original setting: the cost function of linear regression, that is, the squared-error function we obtained earlier. Combining the gradient descent method with the squared-error cost function will give us our first machine learning algorithm: linear regression.
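
As a preview of that combination, here is a minimal sketch of batch gradient descent applied to the squared-error cost of simple linear regression (the data, learning rate, and iteration count below are made up for illustration):

```python
import numpy as np

# Squared-error cost: J(theta0, theta1) = (1/(2m)) * sum((theta0 + theta1*x - y)**2)

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up inputs
y = np.array([2.0, 4.1, 6.1, 8.2])   # made-up targets, roughly y = 2x
m = len(x)

theta0, theta1, alpha = 0.0, 0.0, 0.05
for _ in range(2000):
    error = theta0 + theta1 * x - y          # prediction errors, all examples
    grad0 = error.sum() / m                  # dJ/dtheta0
    grad1 = (error * x).sum() / m            # dJ/dtheta1
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous

print(theta0, theta1)   # ends up close to the least-squares fit (about 0 and 2)
```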
