Coursera "Machine learning" Wunda-week1-03 gradient Descent algorithm _ machine learning

Gradient descent is an algorithm for minimizing the cost function J. It is used throughout machine learning for minimization, so first look at the general problem of minimizing a function J().
We have J(θ0, θ1) and we want min J(θ0, θ1); gradient descent also works for more general functions:
J(θ0, θ1, θ2, ..., θn), minimizing min J(θ0, θ1, θ2, ..., θn).
How the algorithm works: start from an initial guess, for example θ0 = 0, θ1 = 0 (or any other values). Keep changing θ0 and θ1 a little bit, each time choosing the change that reduces J(θ0, θ1) the most, and repeat until you converge to a local minimum. The algorithm has an interesting property:
your starting position can determine which minimum you end up at.
Here we can see that one initialization point leads to one local minimum, while a different initialization point leads to another local minimum. The formal definition: repeat the following update until convergence.
What does that mean? For each parameter θj, set θj to θj minus α times the partial derivative of the cost function with respect to θj:
θj := θj − α · (∂/∂θj) J(θ0, θ1)
The symbol := denotes assignment; by contrast, a = b is a truth assertion that the two sides are equal. α (alpha)
is a number called the learning rate, and it controls how big each update is:
if α is large, gradient descent takes larger steps downhill; if α is small, the steps are smaller. The remaining factor is the derivative term.

-Next, an important implementation detail of the gradient descent algorithm: the simultaneous update.
Because we update θ0 and θ1 for both j = 0 and j = 1, the values of θ0 and θ1 must be updated simultaneously.
That means first computing the right-hand side of the update equation for both θ0 and θ1,
storing the results in temporary variables, and only then assigning the new values to θ0 and θ1 (a sketch follows below).
The lecture shows this graphically.
If you implement a non-simultaneous update instead, it is not gradient descent and it can behave unexpectedly,
even though the results may look plausible, so it is important to keep this in mind.
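A minimal sketch of one simultaneous update step, assuming a generic cost function; the helpers `dJ_dtheta0` and `dJ_dtheta1` are placeholders for whatever computes the two partial derivatives:

```python
# One simultaneous gradient descent step for two parameters.
# dJ_dtheta0 and dJ_dtheta1 are assumed helpers returning the
# partial derivatives of J at the current (theta0, theta1).
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Evaluate both right-hand sides with the *old* parameter values.
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # Only now overwrite the parameters (simultaneous update).
    return temp0, temp1
```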
Understanding the algorithm
To understand gradient descent better, we return to a simpler function with a single parameter: θ1 is a real number and we want min over θ1 of J(θ1). Two key terms in the algorithm:
-α (alpha), the learning rate
-the partial derivative / derivative term
-When a function has several variables but we differentiate with respect to only one of them, we use the partial derivative
-When we differentiate a function of a single variable, we use the ordinary derivative (differential)

The derivative term is the slope of the tangent line to J(θ1) at the current point. Since α is always positive, a negative slope (the function heading downward) makes θ1 := θ1 − α · (negative number) move θ1 up toward the minimum, updating J(θ1) to a smaller value; similarly, a positive slope (the function heading upward) moves θ1 down, again making J(θ1) smaller.
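A quick worked example with made-up numbers (not from the lecture), taking J(θ1) = θ1² so that the derivative is 2θ1:

```latex
% J(\theta_1) = \theta_1^2 has its minimum at \theta_1 = 0.
% At \theta_1 = -2 the slope is 2(-2) = -4 (negative):
\theta_1 := -2 - \alpha(-4) = -2 + 4\alpha \quad\text{(moves right, toward the minimum)}
% At \theta_1 = 3 the slope is 2(3) = 6 (positive):
\theta_1 := 3 - \alpha(6) = 3 - 6\alpha \quad\text{(moves left, toward the minimum)}
```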
What happens if α is too small or too big?
Too small:
the steps are tiny, so the algorithm takes a very long time to converge. Too big: a step can overshoot the minimum, and the algorithm may fail to converge or even diverge.
When θ1 sits exactly at a local minimum, the tangent slope (the derivative) is 0, so the update term is α·0 = 0 and θ1 := θ1 − 0 leaves θ1 unchanged. And as you approach the minimum, the derivative term becomes smaller, so even with a fixed α your updates become smaller; a short demo of this follows.
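A tiny demo of that behaviour, again on the made-up cost J(θ1) = θ1² with a fixed learning rate; the printed step size shrinks at every iteration because the derivative shrinks near the minimum:

```python
# Gradient descent on J(theta) = theta**2 with a fixed learning rate.
theta = 4.0
alpha = 0.1
for step in range(5):
    grad = 2 * theta                       # derivative of theta**2
    new_theta = theta - alpha * grad
    print(f"step {step}: theta = {theta:.4f}, step size = {abs(new_theta - theta):.4f}")
    theta = new_theta
# Step sizes come out as 0.8, 0.64, 0.512, ... even though alpha never changes.
```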
So as the algorithm runs, gradient descent automatically takes smaller steps as it approaches the minimum, and there is no need to decrease α over time.
Gradient descent for linear regression
We now use gradient descent to minimize the squared-error cost function J(θ0, θ1). For that we need its partial derivatives.
First expand the expression being differentiated:
J(θ0, θ1) = (1/2m) · Σ_{i=1..m} (hθ(x^(i)) − y^(i))², with hθ(x) = θ0 + θ1·x. We need the partial derivative with respect to each parameter:
for j = 0 (θ0) and for j = 1 (θ1).
Differentiating this expression for j = 0 and j = 1, we get the following results:
∂/∂θ0 J(θ0, θ1) = (1/m) · Σ_{i=1..m} (hθ(x^(i)) − y^(i))
∂/∂θ1 J(θ0, θ1) = (1/m) · Σ_{i=1..m} (hθ(x^(i)) − y^(i)) · x^(i)
Checking these requires multivariate calculus.
We can now plug these expressions back into the gradient descent algorithm, and that is how it works for linear regression.
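A compact sketch of the resulting algorithm, batch gradient descent for one-variable linear regression; the training data, learning rate, and iteration count below are illustrative choices, not values from the lecture:

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x,
# minimizing J = (1/2m) * sum((h(x_i) - y_i)**2).
x = [1.0, 2.0, 3.0, 4.0]       # illustrative inputs
y = [2.0, 2.5, 3.5, 4.0]       # illustrative targets
m = len(x)

theta0, theta1 = 0.0, 0.0      # start from (0, 0)
alpha = 0.05

for _ in range(1000):
    errors = [(theta0 + theta1 * x[i]) - y[i] for i in range(m)]
    grad0 = sum(errors) / m                                  # dJ/d(theta0)
    grad1 = sum(errors[i] * x[i] for i in range(m)) / m      # dJ/d(theta1)
    # Simultaneous update of both parameters.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)          # fitted intercept and slope
```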
In general, gradient descent can run into different local optima depending on where it starts, but the linear regression cost function is always a convex function: it always has a single minimum,
a global optimum on a bowl-shaped surface, with no other local optima.
Therefore, on this cost function, gradient descent always converges to the global optimum (assuming the learning rate is not too large).
Initialization in Practice:
θ0 = 900, θ1 = −0.1

Running gradient descent from there, it ends up at the global minimum.
What we have described is actually batch gradient descent, which means that every step looks at all of the training data:
each step computes over all m training examples. Non-batch versions also exist, which look at small subsets of the data at a time;
we will study those other forms of gradient descent (used when m is very large) later in the course. There is also a numerical method that finds the minimum of the function directly, without iteration:
the normal equation method. Gradient descent, however, scales better to large datasets and is useful in a great many contexts across machine learning. Next: two important extensions.

First extension: the normal equation method, an exact numerical solution. To solve the minimization problem min J(θ0, θ1), it computes the answer directly rather than iterating the way gradient descent does. Advantages and disadvantages of the normal equation method:
Advantages
-no learning rate α and no iteration; for some problems it can be quicker
Disadvantages
-it is more complex
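The lecture only names the method at this point; as a preview, the standard normal equation is θ = (XᵀX)⁻¹ Xᵀ y, and the sketch below applies it with NumPy on made-up data:

```python
import numpy as np

# Illustrative data: a column of ones (for theta0) plus one feature column.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.0, 2.5, 3.5, 4.0])

# Normal equation: solve (X^T X) theta = X^T y exactly -- no alpha, no iteration.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # [theta0, theta1]
```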

Second extension: we can learn with more features, since other attributes may help predict the price.
For example, with housing:
size, age, number of bedrooms, and number of floors become features x1, x2, x3, x4, so the hypothesis takes the form hθ(x) = θ0 + θ1·x1 + θ2·x2 + θ3·x3 + θ4·x4. With multiple features the problem becomes difficult to draw,
since you cannot really plot beyond three dimensions, and the notation also becomes more complex.
The best way around this is the notation of linear algebra, which gives us a toolbox of objects such as matrices and vectors.

Here the lecture shows a matrix holding the
size, number of bedrooms, number of floors, and age of each home: all of the data sits in one variable,
a block of numbers that organizes every training example,
together with a vector y that holds the prices. We need linear algebra to build these more complex linear regression models. Linear algebra is also good for making computationally efficient implementations (as described later)
and provides a good way to work with large datasets; vectorizing a problem is a common optimization technique.
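A small sketch of what that organization and a vectorized computation look like in NumPy; the numbers follow the style of the lecture's housing table but should be treated as illustrative, and the θ values are arbitrary:

```python
import numpy as np

# Each row is one house: [size, bedrooms, floors, age].
X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 3, 2, 30]], dtype=float)
y = np.array([460.0, 232.0, 315.0])             # prices, one per row of X

# Prepend a column of ones so theta0 acts as the intercept.
X = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.array([80.0, 0.1, 10.0, 5.0, -1.0])  # arbitrary parameter values

# Vectorized hypothesis: one matrix-vector product instead of a Python loop.
predictions = X @ theta
print(predictions)
```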
