Supervised learning application and gradient descent
Contents of this lesson:
1. Linear regression
2. Gradient Descent
3. Normal equations
(Review) Supervised learning: the algorithm is told the correct answer for each training sample, and it learns to produce the correct answer for a new input.
1. Linear regression
Example: ALVINN, a self-driving car. A person drives first while ALVINN's camera watches (training); the car then drives itself.
This is essentially a regression problem: the car tries to predict the steering direction.
Example: the house size and price data set from the previous lesson.
Introduce common symbols:
m = Number of training samples
x = input variable (feature)
y = output variable (target variable)
(x, y) = one training sample
$(x^{(i)}, y^{(i)})$ = the i-th training sample
In this example, m is the number of houses in the data set, x is the house size, and y is the price.
The supervised learning process:
1) Provide training samples to the learning algorithm.
2) The algorithm generates an output function (usually denoted h, for "hypothesis").
3) This function takes an input and outputs a result (here it takes the house area and outputs the price). That is, h maps x to y.
[Figure: the training set is fed to the learning algorithm, which outputs h; h takes an input x and outputs the predicted y]
Linear representation of the hypothesis:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
In general, regression problems have multiple input features. In the example above, suppose we also know the number of bedrooms in each house, which gives a second feature. With $x_1$ denoting the size and $x_2$ the number of bedrooms, the hypothesis can be written as:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
To write the formula compactly, define $x_0 = 1$; then h can be written as:
$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$
n = number of features, $\theta$ = parameters
The parameters $\theta$ are chosen to minimize the squared difference between h(x) and y. Because there are m training samples, we compute the squared difference for each sample, sum the results, and multiply by 1/2 to simplify the later derivation, namely:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
What we need to find is: $\min_\theta J(\theta)$
Methods for minimizing $J(\theta)$: gradient descent and the normal equations
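To make the hypothesis and the cost function concrete, here is a minimal Python sketch. The function names `h` and `cost` and the toy data are made up for illustration; each input row is assumed to start with the constant feature $x_0 = 1$.

```python
def h(theta, x):
    """Linear hypothesis: sum over j of theta_j * x_j, with x[0] == 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/2 * sum over samples of (h(x_i) - y_i)^2."""
    return 0.5 * sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))


# Made-up toy data: each row is [x0 = 1, size, bedrooms]; y is the price.
X = [[1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0],
     [1.0, 3.0, 2.0]]
y = [2.0, 3.0, 5.0]

print(cost([0.0, 0.0, 0.0], X, y))  # 0.5 * (4 + 9 + 25) = 19.0
print(cost([0.0, 1.0, 1.0], X, y))  # predictions 2, 3, 5 match y: 0.0
```

A smaller value of `cost` means the hypothesis fits the training samples more closely, which is what the minimization is after.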
2. Gradient Descent
Gradient descent is a search algorithm. The basic idea: give the parameter vector $\theta$ an initial value, such as the zero vector, then keep changing $\theta$ so that $J(\theta)$ keeps shrinking.
How $\theta$ is changed: gradient descent
[Figure: surface plot; the horizontal axes represent the parameters $\theta$, the vertical axis represents $J(\theta)$]
Suppose the zero vector is chosen as the initial value and the plot is a three-dimensional surface, with the zero-vector point sitting on a "mountain". In gradient descent you look all the way around for the direction of fastest descent, which is the direction of the negative gradient, take one small step that way, then look around again and continue downhill, and so on. Eventually you reach a local minimum.
Of course, if the initial point is different, the result may be another, completely different local minimum.
The result of gradient descent therefore depends on the initial value of the parameters.
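The dependence on the starting point can be seen in a small Python sketch. The function `f` below is a made-up one-dimensional function with two local minima, chosen only for illustration; it is not the regression cost. The step size and step count are also arbitrary choices.

```python
def f(x):
    # Made-up non-convex function with two local minima.
    return x**4 - 3 * x**2 + x


def df(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1


def descend(x, alpha=0.01, steps=1000):
    """Plain gradient descent on f, starting from x."""
    for _ in range(steps):
        x = x - alpha * df(x)
    return x


left = descend(-2.0)   # start on the left slope
right = descend(2.0)   # start on the right slope
print(left, right)     # two different local minima: one negative, one positive
```

Both runs use the same update rule and the same step size; only the initial value differs.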
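Mathematical representation of the gradient descent algorithm:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
Here := is the assignment operator, as in an assignment statement in a program.
At each step, $\alpha$ times the partial derivative is subtracted from $\theta_j$, which is a step down the steepest "hillside".
Expanding the partial derivative for a single training sample (x, y):
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x) - y\right)^2 = \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial \theta_j}\left(\sum_{k=0}^{n} \theta_k x_k - y\right) = \left(h_\theta(x) - y\right) x_j$$
Substituting:
$$\theta_j := \theta_j - \alpha \left(h_\theta(x) - y\right) x_j$$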
$\alpha$: the learning rate, which determines how large each step down the mountain is. If it is set too small, convergence takes a long time; if it is set too large, a step may overshoot the minimum.
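The effect of the learning rate can be illustrated with a tiny Python sketch. It assumes a made-up one-parameter cost $J(\theta) = \theta^2$, whose gradient is $2\theta$ and whose minimum is at 0; the three learning rates are arbitrary values picked to show the three regimes.

```python
def run(alpha, steps=20, theta=1.0):
    """Gradient descent on the made-up cost J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # gradient of theta**2 is 2*theta
    return theta


print(run(0.01))  # too small: after 20 steps theta is still far from 0
print(run(0.4))   # reasonable: theta is very close to 0
print(run(1.1))   # too large: each step overshoots and |theta| grows
```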
(1) Batch gradient descent algorithm:
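The formula above handles a single training sample. Extending it to m training samples gives the following algorithm, which is repeated until convergence:
$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$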
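A minimal Python sketch of batch gradient descent follows. The data, the learning rate, and the fixed iteration count are assumptions made for illustration: the data is generated from y = 1 + 2x, so the parameters should approach [1, 2], and each input row begins with $x_0 = 1$.

```python
def batch_gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Repeat the batch update; every step sums over all m samples."""
    n = len(X[0])
    theta = [0.0] * n                      # start from the zero vector
    for _ in range(iterations):
        # Prediction error h(x_i) - y_i for every sample, using the current theta.
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        # One gradient component per parameter: sum_i error_i * x_i[j].
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X)) for j in range(n)]
        # Update all parameters simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Made-up data from y = 1 + 2x; rows are [x0 = 1, x].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = batch_gradient_descent(X, y)
print(theta)  # close to [1.0, 2.0]
```

The errors are computed once per iteration before any parameter changes, so all parameters are updated from the same old value of theta.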
Complexity analysis:
For each parameter $\theta_j$, one update as written above takes O(m) time.
Each iteration (step) computes the gradient component for all n features, so its complexity is O(mn).
In general, the three-dimensional plot of this quadratic function is bowl-shaped, with a unique global minimum. Its contours are a set of ellipses, and gradient descent converges quickly toward their center.
A property of gradient descent: as it approaches convergence, each step becomes smaller. The reason is that each update subtracts $\alpha$ times the gradient, and the gradient itself keeps getting smaller.
[Figure: the line fitted to the house size and price data by gradient descent]
Methods for detecting convergence:
1) Compare $\theta$ between two successive iterations; if it no longer changes, declare convergence.
2) More commonly: check $J(\theta)$; if it no longer changes, declare convergence.
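The second check can be sketched in Python as follows. The tolerance `tol`, the cap `max_iterations`, the learning rate, and the data are all assumptions chosen for illustration. In floating point, "no longer changes" is taken here to mean "changes by less than `tol`".

```python
def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared prediction errors."""
    return 0.5 * sum((sum(t * xj for t, xj in zip(theta, xi)) - yi) ** 2
                     for xi, yi in zip(X, y))


def descend_until_converged(X, y, alpha=0.05, tol=1e-12, max_iterations=10000):
    """Batch gradient descent that stops once J(theta) stops changing."""
    theta = [0.0] * len(X[0])
    previous = cost(theta, X, y)
    for iteration in range(1, max_iterations + 1):
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        theta = [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X))
                 for j, t in enumerate(theta)]
        current = cost(theta, X, y)
        if abs(previous - current) < tol:   # J no longer changes noticeably
            return theta, iteration
        previous = current
    return theta, max_iterations            # gave up without converging


# Made-up data from y = 1 + 2x; rows are [x0 = 1, x].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta, iterations = descend_until_converged(X, y)
print(theta, iterations)
```

The first check would compare successive values of theta in the same way, instead of successive values of the cost.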
The advantage of the batch gradient descent algorithm is that it finds a local optimum. But when the number of training samples m is large, every iteration must sum the derivative over all samples, which is too slow, so the following variant of gradient descent is used.
(2) Stochastic gradient descent algorithm (incremental gradient descent algorithm):
Each update does not need to traverse all the data; it uses only sample i:
$$\theta_j := \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \quad \text{(for every } j\text{)}$$
That is, in batch gradient descent each step considers all m samples; in stochastic gradient descent each step considers a single sample.
The complexity of each iteration is O(n). When all m samples have been used, continue looping from the 1st sample.
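A minimal Python sketch of the stochastic variant follows. The data, the learning rate, and the number of passes (`epochs`) are assumptions for illustration. The data is noise-free (y = 1 + 2x), which is why the parameters settle exactly at [1, 2]; on noisy data a fixed learning rate generally leaves the parameters hovering near the minimum instead.

```python
def stochastic_gradient_descent(X, y, alpha=0.05, epochs=1000):
    """Update theta from one sample at a time, cycling through the data."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):                # after the last sample, start again
        for xi, yi in zip(X, y):           # one sample per update: O(n) work
            error = sum(t * xj for t, xj in zip(theta, xi)) - yi
            theta = [t - alpha * error * xj for t, xj in zip(theta, xi)]
    return theta


# Made-up noise-free data from y = 1 + 2x; rows are [x0 = 1, x].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = stochastic_gradient_descent(X, y)
print(theta)  # close to [1.0, 2.0]
```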
The methods above find the minimum iteratively. For this specific kind of least squares regression problem (ordinary least squares), there is another method that gives an analytic expression for the parameter vector, so no iterative solution is needed.
3. Normal equations
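Given a function J of the parameter vector $\theta$, define the gradient of J with respect to $\theta$. The gradient is itself a vector, with n+1 dimensions (indices 0 to n), as follows:
$$\nabla_\theta J = \begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix} \in \mathbb{R}^{n+1}$$
Therefore, the gradient descent algorithm can be written as:
$$\theta := \theta - \alpha \nabla_\theta J$$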
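More generally, let f be a function that maps an $m \times n$ matrix to the real numbers, namely:
$$f: \mathbb{R}^{m \times n} \to \mathbb{R}$$
For an $m \times n$ input matrix A, define the derivative of f with respect to A as:
$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
The derivative is itself a matrix, containing the partial derivative of f with respect to each element of A.
If A is square, i.e. an $n \times n$ matrix, the trace of A is defined as the sum of its diagonal elements, namely:
$$\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$$
tr A is shorthand for tr(A).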
Some theorems about trace operators and derivatives:
1) $\operatorname{tr} AB = \operatorname{tr} BA$
2) $\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$
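3) $\nabla_A \operatorname{tr} AB = B^T$
4) $\operatorname{tr} A = \operatorname{tr} A^T$
5) If $a \in \mathbb{R}$, $\operatorname{tr} a = a$
6) $\nabla_A \operatorname{tr} ABA^T C = CAB + C^T A B^T$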
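The two cyclic identities, tr AB = tr BA and tr ABC = tr CAB = tr BCA, can be spot-checked numerically. The sketch below uses plain Python lists to stay dependency-free. The helpers `matmul` and `trace` and the small integer matrices are made up for illustration, and a numeric check on a few matrices is of course not a proof.

```python
def matmul(A, B):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def trace(A):
    """Sum of the diagonal elements of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


# tr AB = tr BA holds even when A and B are not square,
# as long as both products are defined.
A = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]   # 3 x 2
print(trace(matmul(A, B)), trace(matmul(B, A)))  # 212 212

# Cyclic permutations of a triple product share the same trace.
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 1]]
R = [[2, 0], [1, 3]]
t_pqr = trace(matmul(matmul(P, Q), R))
t_rpq = trace(matmul(matmul(R, P), Q))
t_qrp = trace(matmul(matmul(Q, R), P))
print(t_pqr == t_rpq == t_qrp)  # True
```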
With the above properties, you can begin to deduce:
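Define the matrix X, called the design matrix, which contains all the inputs in the training set; its i-th row is the i-th input, namely:
$$X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
Since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, and with $\vec{y}$ the vector of targets $y^{(1)}, \dots, y^{(m)}$, it follows that:
$$X\theta - \vec{y} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}$$
And because for a vector z we have:
$$z^T z = \sum_i z_i^2$$
this last property gives:
$$\frac{1}{2}(X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = J(\theta)$$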
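Using the 6 properties above, we deduce:
$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2}(X\theta - \vec{y})^T (X\theta - \vec{y}) \\
&= \frac{1}{2}\nabla_\theta \left(\theta^T X^T X\theta - \theta^T X^T \vec{y} - \vec{y}^T X\theta + \vec{y}^T \vec{y}\right) \\
&= \frac{1}{2}\nabla_\theta \operatorname{tr}\left(\theta^T X^T X\theta - \theta^T X^T \vec{y} - \vec{y}^T X\theta + \vec{y}^T \vec{y}\right) \\
&= \frac{1}{2}\nabla_\theta \left(\operatorname{tr} \theta^T X^T X\theta - 2\operatorname{tr} \vec{y}^T X\theta\right) \\
&= \frac{1}{2}\left(X^T X\theta + X^T X\theta - 2X^T \vec{y}\right) \\
&= X^T X\theta - X^T \vec{y}
\end{aligned}$$
The penultimate line uses the last of the six properties (for the $\theta^T X^T X\theta$ term).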
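The gradient that this derivation arrives at, $X^T X\theta - X^T \vec{y}$, can be sanity-checked against a finite-difference approximation of J. The Python sketch below does this on made-up data at an arbitrary point theta. The function names and the step `eps` are assumptions for illustration, and the analytic gradient is computed in the equivalent form $X^T (X\theta - \vec{y})$.

```python
def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared prediction errors."""
    return 0.5 * sum((sum(t * xj for t, xj in zip(theta, xi)) - yi) ** 2
                     for xi, yi in zip(X, y))


def analytic_gradient(theta, X, y):
    """X^T (X theta - y), which equals X^T X theta - X^T y."""
    residuals = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                 for xi, yi in zip(X, y)]
    return [sum(r * xi[j] for r, xi in zip(residuals, X))
            for j in range(len(theta))]


def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences of J, one parameter at a time."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad


# Made-up data; rows are [x0 = 1, x]. theta is an arbitrary test point.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 8.0]
theta = [0.5, -1.0]

exact = analytic_gradient(theta, X, y)
approx = numeric_gradient(theta, X, y)
print(exact)   # [-21.0, -48.0]
print(approx)  # agrees with the line above up to rounding error
```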
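Setting $\nabla_\theta J(\theta)$ to 0 gives:
$$X^T X\theta = X^T \vec{y}$$
These are called the normal equations.
Solving for $\theta$:
$$\theta = (X^T X)^{-1} X^T \vec{y}$$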
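As a small worked instance, the Python sketch below solves the normal equations for the two-parameter case, with one feature plus the intercept. In that case $X^T X$ is a 2 x 2 matrix that can be inverted by hand. The function name and data are made up for illustration; for more features a linear-algebra library would be the usual choice.

```python
def solve_normal_equations(xs, ys):
    """Fit y ~ theta0 + theta1 * x by solving X^T X theta = X^T y."""
    m = len(xs)
    # Entries of X^T X when each row of X is [1, x]: [[m, sx], [sx, sxx]].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    # Entries of X^T y: [sy, sxy].
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx            # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular (all x values are equal)")
    # Apply the explicit inverse of the 2 x 2 matrix to X^T y.
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return [theta0, theta1]


# Made-up data, slightly off the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 8.0]

theta = solve_normal_equations(xs, ys)
print(theta)  # approximately [0.8, 2.3]
```

No learning rate or iteration count is needed here: the parameters come out of a single solve.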
Machine Learning-Stanford: Learning Note 2-supervised learning application and gradient descent