Machine Learning-Stanford: Learning Note 2-supervised learning application and gradient descent

Source: Internet
Author: User

Supervised learning application and gradient descent

Contents of this lesson:

1. Linear regression

2. Gradient Descent

3. Normal equations

(Review) Supervised learning: the algorithm is told the correct answer for each training sample, and it learns to produce the correct answer for new inputs.

1. Linear regression

Example: the ALVINN car. A person drives first while ALVINN's camera watches (training); the car then drives itself.

In essence this is a regression problem: the car tries to predict the steering direction.

Example: the house size and price data set from the previous lesson

Common notation:

m = number of training samples

x = input variable (feature)

y = output variable (target variable)

(x, y) = one training sample

$(x^{(i)}, y^{(i)})$ = the i-th training sample

In this example: m is the number of data points, x is the house size, y is the price.

The supervised learning process:

1) Provide training samples to the learning algorithm

2) The algorithm produces an output function (usually denoted h and called the hypothesis)

3) This function takes an input and returns a result (in this example it takes the house area and returns the price); that is, h maps x to y.

[Figure: the training set is fed to the learning algorithm, which outputs h; h maps x to y.]

Linear representation of the hypothesis:

Generally speaking, regression problems have multiple input features. In the example above, suppose we also know the number of bedrooms in each house as a second feature. Let $x_1$ denote the size and $x_2$ the number of bedrooms; the hypothesis can then be written as:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

To write the formula compactly, define $x_0 = 1$; h can then be written as:

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$

n = number of features, θ: the parameters

θ is chosen to minimize the squared difference between h(x) and y. Because there are m training samples, the squared difference is computed for each sample and summed, and the result is multiplied by 1/2 to simplify the later derivation, namely:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
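The hypothesis and the squared-error cost can be sketched in a few lines of Python. This is only an illustration: the function names `h` and `cost` and the toy data are assumptions made for the example, not part of the lecture.

```python
def h(theta, x):
    """Linear hypothesis: sum of theta_j * x_j, where x[0] is the constant 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/2 * sum over all samples of (h(x) - y)^2."""
    return 0.5 * sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))


# Made-up toy data: each row is [x0 = 1, size, bedrooms]; y is the price.
X = [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 3.0, 2.0]]
y = [2.0, 3.0, 5.0]

print(cost([0.0, 0.0, 0.0], X, y))  # all-zero parameters -> 19.0
print(cost([0.0, 1.0, 1.0], X, y))  # parameters that fit the toy data -> 0.0
```

A smaller value of the cost means the hypothesis is closer to the targets on the training set.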

What we need to find is: $\min_\theta J(\theta)$

Methods for minimizing $J(\theta)$: gradient descent and the normal equations

2. Gradient Descent

Gradient descent is a search algorithm. The basic idea: give the parameter vector θ an initial value, such as the zero vector, then keep changing θ so that J(θ) keeps decreasing.

How θ is changed: gradient descent

[Figure: surface plot; the horizontal axes represent the parameters θ, the vertical axis represents J(θ).]

Suppose the zero vector is chosen as the initial value, and picture the plot as a three-dimensional surface with the zero-vector point sitting on a "hill". Gradient descent amounts to looking all the way around for the direction of fastest descent, which is the direction opposite to the gradient, taking one small step that way, then looking around again and continuing downhill, and so on. The result is a local minimum.

Of course, if the initial point is different, the result may be a completely different local minimum.

The result of gradient descent therefore depends on the initial value of the parameters.
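The dependence on the starting point can be illustrated with a small Python sketch. The function f(x) = (x^2 - 1)^2 is a hypothetical one-dimensional example chosen only because it has two minima; it is not the cost function J from this note, and the step size and iteration count are arbitrary assumptions.

```python
def grad(x):
    """Derivative of the hypothetical function f(x) = (x^2 - 1)^2."""
    return 4.0 * x * (x * x - 1.0)


def descend(x0, alpha=0.05, steps=200):
    """Run plain gradient descent from x0 and return the final point."""
    x = x0
    for _ in range(steps):
        x -= alpha * grad(x)
    return x


left = descend(-0.5)   # starts on the left side of the central bump
right = descend(0.5)   # starts on the right side
print(round(left, 3), round(right, 3))  # two different minima, near -1 and +1
```

The two runs use the same algorithm and step size; only the initial value differs.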

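Mathematical representation of the gradient descent algorithm:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Here := is the assignment operator; it stands for an assignment statement in a program.

At each step, α times the partial derivative is subtracted from $\theta_j$, which moves θ down the steepest "hillside".

Expanding the partial derivative, for the case of a single training sample (x, y):

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x) - y\right)^2 = \left(h_\theta(x) - y\right) \frac{\partial}{\partial \theta_j}\left(\sum_{k=0}^{n} \theta_k x_k - y\right) = \left(h_\theta(x) - y\right) x_j$$

Substituting:

$$\theta_j := \theta_j - \alpha \left(h_\theta(x) - y\right) x_j$$

α: the learning rate, which determines how large each step down the hill is. If it is set too small, convergence takes a long time; if it is set too large, a step may overshoot the minimum.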
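As a rough illustration, a single update for one training sample might look like the following Python sketch. The name `lms_step` and the numbers are made up for the example.

```python
def lms_step(theta, x, y, alpha):
    """One update theta_j := theta_j - alpha * (h(x) - y) * x_j for every j."""
    error = sum(t * xj for t, xj in zip(theta, x)) - y
    # Every new theta_j is computed from the old theta (simultaneous update).
    return [t - alpha * error * xj for t, xj in zip(theta, x)]


# One sample with x0 = 1 and x1 = 2, target y = 4, starting from theta = 0.
theta = lms_step([0.0, 0.0], x=[1.0, 2.0], y=4.0, alpha=0.1)
print(theta)
```

The prediction starts too low (0 instead of 4), so the error is negative and both parameters are increased, each in proportion to its feature value.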

(1) Batch gradient descent algorithm:

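The formula above handles a single training sample. Extending it to m training samples gives the following algorithm, which is repeated until convergence:

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$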

Complexity analysis:

For each $\theta_j$, one update as shown above takes O(m) time.

Each iteration (step) computes the gradient component for all n features, so the total complexity per iteration is O(mn).
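A minimal Python sketch of batch gradient descent follows. The data set, the learning rate and the fixed iteration count are assumptions chosen so that this small example converges; they are not values from the lecture.

```python
def batch_gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for least-squares linear regression."""
    n = len(X[0])
    theta = [0.0] * n  # start from the zero vector
    for _ in range(iterations):
        # Errors h(x_i) - y_i for every sample, using the current theta.
        errors = [sum(t * v for t, v in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        # Gradient component j sums error_i * x_ij over all m samples,
        # so one iteration costs on the order of m * n operations.
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X))
                    for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Made-up data lying exactly on y = 1 + 2 * x1; x0 = 1 is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = batch_gradient_descent(X, y, alpha=0.05, iterations=2000)
print([round(t, 3) for t in theta])  # close to [1.0, 2.0]
```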

In general, the three-dimensional plot of this quadratic function is bowl-shaped and has a unique global minimum. Its contours are a set of ellipses, and gradient descent converges quickly toward their center.

A property of gradient descent: as it approaches convergence, each step becomes smaller. The reason is that the amount subtracted at each step is α times the gradient, and the gradient itself keeps getting smaller.

[Figure: the line fitted to the house size and price data using gradient descent.]

Methods for detecting convergence:

1) Check how much θ changes between two successive iterations; if it no longer changes, declare convergence

2) The more commonly used method: check J(θ); if it no longer changes, declare convergence
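The second test can be sketched in Python as a stopping rule inside the descent loop. The tolerance `tol`, the cap `max_iters` and the toy data are assumptions for this example; "no longer changes" is interpreted here as "changes by less than a small tolerance".

```python
def fit_until_converged(X, y, alpha=0.05, tol=1e-12, max_iters=100000):
    """Batch gradient descent that stops once J(theta) stops changing."""
    def predict(theta, xi):
        return sum(t * v for t, v in zip(theta, xi))

    def cost(theta):
        return 0.5 * sum((predict(theta, xi) - yi) ** 2
                         for xi, yi in zip(X, y))

    theta = [0.0] * len(X[0])
    previous = cost(theta)
    for iteration in range(1, max_iters + 1):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        theta = [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X))
                 for j, t in enumerate(theta)]
        current = cost(theta)
        if abs(previous - current) < tol:  # J has effectively stopped changing
            return theta, iteration
        previous = current
    return theta, max_iters


# Made-up data lying exactly on y = 1 + 2 * x1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta, iterations = fit_until_converged(X, y)
print([round(t, 3) for t in theta], iterations)
```

Checking the change in θ instead (method 1) would follow the same pattern, comparing the old and new parameter vectors.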

The advantage of batch gradient descent is that it finds a local optimum. However, if the number of training samples m is large, every iteration must sum the derivative terms over all samples, which is too slow, so the following variant of gradient descent is used instead.

(2) Stochastic gradient descent (also called incremental gradient descent):

Each update does not need to traverse all the data; it uses only sample i:

$$\theta_j := \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$

That is, batch gradient descent considers all m samples for every step, whereas stochastic gradient descent considers a single sample per step.

The complexity of each iteration is O(n). Once all m samples have been used, the loop continues again from the 1st sample.
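A minimal Python sketch of this per-sample variant is given below. The data, the learning rate and the number of passes are assumptions for the example; the samples are visited in a fixed order, as described above, rather than shuffled.

```python
def stochastic_gradient_descent(X, y, alpha, passes):
    """Update theta after every single sample instead of after all m."""
    theta = [0.0] * len(X[0])
    for _ in range(passes):
        for xi, yi in zip(X, y):  # after the last sample, start over at the first
            error = sum(t * v for t, v in zip(theta, xi)) - yi
            theta = [t - alpha * error * v for t, v in zip(theta, xi)]
    return theta


# Made-up, noise-free data lying exactly on y = 1 + 2 * x1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = stochastic_gradient_descent(X, y, alpha=0.05, passes=2000)
print([round(t, 3) for t in theta])  # close to [1.0, 2.0]
```

Because this toy data has no noise, the parameters settle at the exact fit; with noisy data the parameters would typically keep moving slightly around the minimum.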

The methods above find the minimum iteratively. For this specific kind of least-squares regression problem (ordinary least squares), there is another method that gives the minimum directly: it yields a closed-form expression for the parameter vector, so no iterative solution is needed.

3. Normal equations

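Given a function J of the parameter vector θ, define the gradient of J with respect to θ. The gradient is itself a vector, of dimension n+1 (indices 0 to n), as follows:

$$\nabla_\theta J = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} \in \mathbb{R}^{n+1}$$

Therefore, the gradient descent algorithm can be written as:

$$\theta := \theta - \alpha \nabla_\theta J$$

More generally, let f be a function that maps an m×n matrix to a real number, namely:

$$f : \mathbb{R}^{m \times n} \to \mathbb{R}$$

For an input matrix A of size m×n, define the derivative of f with respect to A as:

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$

The derivative is itself a matrix, containing the partial derivative of f with respect to each element of A.

If A is square, i.e. an n×n matrix, the trace of A is defined as the sum of its diagonal elements, namely:

$$\mathrm{tr}\,A = \sum_{i=1}^{n} A_{ii}$$

trA is shorthand for tr(A).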

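Some facts about the trace operator and derivatives:

1) trAB = trBA

2) trABC = trCAB = trBCA

3) $\nabla_A \mathrm{tr}AB = B^T$

4) $\mathrm{tr}A = \mathrm{tr}A^T$

5) If $a \in \mathbb{R}$, then tr a = a

6) $\nabla_A \mathrm{tr}ABA^TC = CAB + C^TAB^T$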
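The cyclic properties 1) and 2) are easy to check numerically. The sketch below does so in plain Python on three arbitrary 2×2 integer matrices chosen for the example; a check on specific matrices is a sanity test, not a proof.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def trace(A):
    """Sum of the diagonal elements of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


A = [[1, 2], [3, 4]]
B = [[0, 1], [5, 2]]
C = [[2, 0], [1, 3]]

# Property 1: trAB = trBA
print(trace(matmul(A, B)), trace(matmul(B, A)))  # 21 21

# Property 2: trABC = trCAB = trBCA
print(trace(matmul(matmul(A, B), C)),
      trace(matmul(matmul(C, A), B)),
      trace(matmul(matmul(B, C), A)))  # 58 58 58
```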

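With the above properties, the derivation can begin:

Define the matrix X, called the design matrix, which contains all the inputs in the training set; its i-th row is the i-th input $(x^{(i)})^T$, namely:

$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$

Since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, it follows that:

$$X\theta - y = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$$

where y is the vector of targets $y^{(1)}, \dots, y^{(m)}$. And because for any vector z:

$$z^T z = \sum_i z_i^2$$

it follows that:

$$\frac{1}{2}(X\theta - y)^T(X\theta - y) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = J(\theta)$$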

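Using the six properties above, we derive:

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2}(X\theta - y)^T(X\theta - y) \\ &= \frac{1}{2}\nabla_\theta\, \mathrm{tr}\left(\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y\right) \\ &= \frac{1}{2}\nabla_\theta \left(\mathrm{tr}\,\theta\theta^T X^T X - 2\,\mathrm{tr}\,y^T X\theta\right) \\ &= \frac{1}{2}\left(X^T X\theta + X^T X\theta - 2X^T y\right) \\ &= X^T X\theta - X^T y \end{aligned}$$

The step to the penultimate line uses the last property, with A = θ, B = I and C = $X^T X$.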

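Setting $\nabla_\theta J(\theta)$ to 0 gives:

$$X^T X\theta = X^T y$$

These are called the normal equations.

Solving for θ (assuming $X^T X$ is invertible):

$$\theta = (X^T X)^{-1} X^T y$$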
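As a sketch of how the closed-form solution could be computed without any library, the Python code below builds $X^T X$ and $X^T y$ and solves the resulting linear system by Gaussian elimination. The function name, the toy data and the choice of elimination instead of an explicit matrix inverse are all assumptions of this example, and it assumes $X^T X$ is invertible.

```python
def solve_normal_equations(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination with pivoting."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y], one row per parameter.
    M = [
        [sum(row[j] * row[k] for row in X) for k in range(n)]
        + [sum(row[j] * yi for row, yi in zip(X, y))]
        for j in range(n)
    ]
    # Forward elimination; assumes X^T X is invertible (nonzero pivots).
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    # Back substitution.
    theta = [0.0] * n
    for j in reversed(range(n)):
        known = sum(M[j][k] * theta[k] for k in range(j + 1, n))
        theta[j] = (M[j][n] - known) / M[j][j]
    return theta


# Made-up data lying exactly on y = 1 + 2 * x1; x0 = 1 is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = solve_normal_equations(X, y)
print([round(t, 6) for t in theta])  # close to [1.0, 2.0]
```

Unlike gradient descent, this needs no learning rate and no iteration until convergence; the cost is solving an (n+1)×(n+1) linear system.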
