Andrew Ng Machine Learning Notes 2 -- Gradient Descent and Least Squares Fitting


Today we formally began studying machine learning algorithms. The teacher opened with an example: given a dataset of house areas and prices for a region, how do we predict the price of a house with a given area? What most of us would think of is to draw a scatter plot of house area versus price, fit a curve of price against area, and then, for a known house area, read the predicted price off the fitted curve. This kind of problem is called regression.

To treat this problem mathematically, we must first define some notation:
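(The image with the definitions is missing; the standard notation from this lecture is presumably:)

```latex
\begin{aligned}
x^{(i)} &: \text{input variables (features), e.g. the house area} \\
y^{(i)} &: \text{output or target variable, e.g. the house price} \\
(x^{(i)}, y^{(i)}) &: \text{the $i$-th training example} \\
m &: \text{the number of training examples} \\
h &: \text{the hypothesis, mapping an input $x$ to an estimated $y$}
\end{aligned}
```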


With the notation defined, let us look at what we need to solve:


First we take a training set and feed it to the learning algorithm, which produces an output function. We denote this function by h (for hypothesis); it accepts an input and outputs an estimate of the true value, i.e., it maps inputs to estimates. Next we must decide how to represent this hypothesis. To keep the analysis simple, the relationship between price and house area is assumed to be linear, which makes this a linear regression problem.
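(The formula image is missing; with features x_1, ..., x_n and the convention x_0 = 1, the linear hypothesis is presumably written as:)

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
            = \sum_{j=0}^{n} \theta_j x_j = \theta^T x,
\qquad x_0 = 1
```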


Based on the description above, we need to choose an appropriate θ so that, given an input x, the hypothesis h produces a good estimate of the true value y. With m training samples, the problem can be expressed as:
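(Reconstructing the missing formula: the criterion is presumably to choose θ minimizing the sum of squared errors over the m training examples:)

```latex
\min_\theta \; \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2
```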


So that the factor of 2 produced by differentiating the square cancels later, the formula is conventionally multiplied by 1/2. The problem to solve can then be expressed as:
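(Again reconstructing the missing image, the cost function with the conventional 1/2 factor is presumably:)

```latex
J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2,
\qquad \min_\theta \; J(\theta)
```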


As for how to find the θ that minimizes the objective function, the teacher presented two methods: least squares and gradient descent.

Gradient descent is introduced first:


As shown in the figure, imagine standing at the point marked with a star, looking all the way around, and asking yourself: if I may take only one small step, which direction gets me downhill fastest? This is exactly how the gradient descent algorithm works: the direction it walks is the negative gradient direction, and it keeps walking until it reaches a local minimum of the function. Now go back, pick a starting point slightly to the right of the original starred point, and walk again: you may arrive at a different local minimum. In other words, gradient descent can depend on the initial value of the parameters.

Question:

- How do you look all the way around, 360 degrees, and find the direction of fastest descent?

- In fact, you do not look around at all; you only compute the partial derivatives of the function. Since we are seeking a minimum, the direction of fastest descent is the direction opposite to the gradient. For the problem described above:

Given an initial value of θ, we obtain an initial hypothesis h (since h is determined by the input features and θ), and hence an initial function relating output to input. Then, for each training sample, we compute the squared deviation between the estimate and the true value and accumulate these, which gives the current value of the objective function. Next, θ is assigned a new value: from the original θ we subtract a multiple of the gradient of the objective function at that point. The objective value is computed again, and the process iterates until the objective function reaches a local minimum. The formula derivation process is described below.

First, consider the case of a single training sample:
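(The derivation image is missing; the standard LMS derivation for one example (x, y), which it presumably showed, is:)

```latex
\frac{\partial}{\partial \theta_j} J(\theta)
  = \frac{\partial}{\partial \theta_j}\,\frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2
  = \bigl(h_\theta(x) - y\bigr)\,\frac{\partial}{\partial \theta_j}\Bigl(\textstyle\sum_{k=0}^{n}\theta_k x_k - y\Bigr)
  = \bigl(h_\theta(x) - y\bigr)\,x_j
```

so the update rule for a single sample, with learning rate α, is

```latex
\theta_j := \theta_j + \alpha\,\bigl(y - h_\theta(x)\bigr)\,x_j
```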


When the training set is extended to m samples (m > 1):
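(Reconstructing the missing formula: the batch gradient descent rule presumably sums the single-sample term over all m samples, repeated for every j until convergence:)

```latex
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\, x_j^{(i)}
```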


For the vector θ, whose dimension equals the number of sample features, each component θ_j yields one partial derivative; taken together these form the gradient, and its opposite is the overall direction of fastest descent.

In the algorithm above, two loops are required: one traverses all the training samples, and the other traverses each dimension of the vector θ, so as to obtain the overall direction of fastest descent.
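As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent for this linear regression setting. Everything in it (the function name, the toy data, the learning rate, the iteration count) is illustrative, not from the original note:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, n_iters=5000):
    """Batch gradient descent for linear regression.

    X: (m, n) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) vector of target values.
    Returns the learned parameter vector theta, shape (n,).
    """
    m, n = X.shape
    theta = np.zeros(n)                 # initial value of theta
    for _ in range(n_iters):
        errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i at once
        gradient = X.T @ errors         # sum_i (h - y) x_j^(i), one entry per j
        theta -= alpha * gradient       # step in the negative gradient direction
    return theta

# Toy usage: fit price ~ area on made-up data (illustrative only).
areas = np.array([50.0, 80.0, 100.0, 120.0])
prices = np.array([150.0, 230.0, 290.0, 350.0])
X = np.column_stack([np.ones_like(areas), areas / 100.0])  # rescale for stability
print(batch_gradient_descent(X, prices))
```

Note that every iteration touches all m training samples; the stochastic variant below trades that for one sample per update.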

Stochastic gradient descent:

The principle of stochastic gradient descent can be expressed with the following pseudo-code:
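(The original pseudo-code image did not survive. A minimal NumPy sketch of the same idea, assuming the same design matrix X and targets y as in the batch version above:)

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=200, seed=0):
    """Stochastic (incremental) gradient descent for linear regression.

    Unlike the batch version, theta is updated after every individual
    training sample, so each step costs O(n) instead of O(m * n).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):        # visit samples in random order
            error = X[i] @ theta - y[i]     # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]   # update from this single sample
    return theta
```

With a fixed learning rate the iterates hover around the minimum rather than settling exactly on it; decaying alpha over time is a common remedy.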

The least squares fitting method:

To derive the formula below, some preliminary knowledge is required:
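(The image listing the prerequisites is missing; the derivation below needs, at minimum, these standard vector-derivative facts, assuming that is what the original listed:)

```latex
\nabla_\theta\, b^T \theta = b,
\qquad
\nabla_\theta\, \theta^T A\, \theta = 2A\theta \quad (A \text{ symmetric})
```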

Formula derivation process:
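(The derivation images are missing; in matrix form, with X the m-by-(n+1) design matrix whose i-th row is x^{(i)T} and y the vector of targets, the standard derivation is presumably:)

```latex
J(\theta) = \frac{1}{2}\,(X\theta - y)^T (X\theta - y)
          = \frac{1}{2}\bigl(\theta^T X^T X\,\theta - 2\,y^T X \theta + y^T y\bigr)
```

Setting the gradient to zero and solving gives the closed-form (normal equation) solution:

```latex
\nabla_\theta J(\theta) = X^T X\,\theta - X^T y = 0
\;\Longrightarrow\;
\theta = (X^T X)^{-1} X^T y
```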

