Contents of this lesson:
1. Linear regression
2. Gradient descent
3. The normal equations

Supervised learning: give the learning algorithm the correct answer for each training sample; the algorithm can then produce the correct answer for new inputs.
1. Linear regression

Problem introduction: suppose we have house sales data as follows:
Common notation:
m = number of training samples
x = input variable (feature)
y = output variable (target variable)
(x, y) = one training sample
(x(i), y(i)) = the i-th training sample
In this example: m = number of data points, x = house size, y = price.
The supervised learning process:
1) Provide training samples to the learning algorithm.
2) The algorithm outputs a function, usually denoted h (the hypothesis).
3) The hypothesis takes an input and produces an output (here it takes the house size and outputs the price); that is, h maps x to y, as shown below:
The hypothesis, represented linearly:

h(x) = θ0 + θ1·x
Generally speaking, regression problems have multiple input features. In the example above, suppose we also know the number of bedrooms of each house as a second feature. Letting x1 denote the size and x2 the number of bedrooms, the hypothesis can be written:

h(x) = θ0 + θ1·x1 + θ2·x2
To write the formula neatly, define x0 = 1; h can then be written:

h(x) = Σ_{i=0..n} θi·xi = θᵀx
n = number of features; θ = the parameters.
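As a quick check of the compact form above, a minimal sketch with made-up parameter values (not from the lecture):

```python
import numpy as np

# Hypothetical parameters and one sample; x0 = 1 is the intercept term.
theta = np.array([50.0, 0.1, 20.0])   # theta0, theta1, theta2
x = np.array([1.0, 2104.0, 3.0])      # x0 = 1, size x1 = 2104, bedrooms x2 = 3

# h(x) = theta^T x: the summation collapses to a single dot product.
h = theta @ x                          # 50 + 0.1*2104 + 20*3
print(h)
```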
The purpose of selecting θ is to minimize the squared difference between h(x) and y. Since there are m training samples, we sum the squared error over every sample, and for convenience in the later derivation we multiply by 1/2:

J(θ) = (1/2) · Σ_{i=1..m} (h(x(i)) − y(i))²

What we have to do is solve min over θ of J(θ). Two methods for this are the gradient descent method and the normal equations method.
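The cost J(θ) can be sketched directly from the formula; the data below is a toy example, not the house data from the lecture:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared errors over the m samples."""
    residuals = X @ theta - y
    return 0.5 * residuals @ residuals

# Toy data (hypothetical): X already has a leading column of ones (x0 = 1).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit -> 0.0
print(cost(np.zeros(2), X, y))           # 0.5 * (1 + 4 + 9) = 7.0
```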
2. Gradient descent

Gradient descent is a search algorithm. The basic idea: give the parameter vector an initial value, such as the zero vector, then change it repeatedly so that J(θ) keeps shrinking.

How to change it: picture the horizontal axes as θ0 and θ1 and the vertical axis as J(θ). Viewing this three-dimensional graph as a surface, the chosen initial point sits somewhere on a "mountain". Gradient descent means looking all around, finding the path of fastest descent (the direction of the gradient), taking one small step, then looking around again and continuing downward, and so on. The result reaches a local minimum, such as:
Of course, if a different initial point is chosen, the result may be a completely different local minimum, as shown:
That is, the result of gradient descent depends on the initial values of the parameters.
Mathematical representation of the gradient descent algorithm:

θi := θi − α · ∂J(θ)/∂θi
':=' is the assignment operator, as in a program. Each update subtracts from θi the partial derivative of J(θ) with respect to θi (scaled by α), i.e. walking down the steepest part of the "hillside". Assuming there is only one training sample, expand the partial derivative:

∂J(θ)/∂θi = (h(x) − y) · xi
Substituting this into the update rule gives:

θi := θi − α · (h(x) − y) · xi
α is the learning rate; it determines how big each descending step is. Set too small, convergence takes a long time; set too large, the update may step over the minimum.
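The effect of α can be illustrated on a one-dimensional quadratic (an illustrative function, not from the lecture): a small α shrinks toward the minimum, while a too-large α steps over it and diverges.

```python
# Minimize J(t) = t^2 (gradient 2t) starting from t = 1.0.
def descend(alpha, steps=50):
    t = 1.0
    for _ in range(steps):
        t -= alpha * 2 * t   # gradient descent update
    return t

print(abs(descend(0.1)))   # small alpha: |t| shrinks toward the minimum at 0
print(abs(descend(1.1)))   # too-large alpha: each step overshoots, |t| grows
```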
Special note: the θi must be updated synchronously. For example, with i = 0 and i = 1, the update proceeds as:

temp0 := θ0 − α · ∂J(θ)/∂θ0
temp1 := θ1 − α · ∂J(θ)/∂θ1
θ0 := temp0
θ1 := temp1
(1) Batch gradient descent algorithm: the formula above handles a single training sample. Extending it to an algorithm over m training samples, loop until convergence:

repeat until convergence:
    θi := θi − α · Σ_{k=1..m} (h(x(k)) − y(k)) · xi(k)    (simultaneously for every i)
Complexity analysis: each iteration updates each θi in O(m) time as shown above, and each iteration (step) must compute the gradient for all n features, so the complexity per iteration is O(mn). In general, J(θ) is a quadratic function whose three-dimensional graph is bowl-shaped, with a unique global minimum. Its contours form a set of ellipses, and gradient descent converges quickly to their center.
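A minimal batch gradient descent sketch in NumPy, on toy data where y = 2x (illustrative values, not the lecture's house data):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, iters=1000):
    """Batch gradient descent for linear regression.

    Every iteration sums the error over ALL m samples, so one
    step costs O(m*n), matching the complexity analysis above.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        gradient = X.T @ (X @ theta - y)  # sum of (h(x) - y) * x over samples
        theta -= alpha * gradient
    return theta

# Toy data: y = 2*x, so theta should approach [0, 2].
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])  # leading column of ones is x0 = 1
y = np.array([2.0, 4.0, 6.0])
theta = batch_gradient_descent(X, y)
print(theta)  # close to [0, 2]
```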
Although the value of α is fixed, the gradient descent algorithm still converges to the local minimum, because each update subtracts a quantity proportional to the gradient, and the gradient becomes smaller and smaller near the minimum, so the steps shrink as well. The curve of house size versus price fitted with gradient descent:

Methods for detecting convergence:
1) Check the change of the θi across two iterations; if they no longer change, declare convergence.
2) More commonly used: check J(θ); if it no longer changes, declare convergence.

The advantage of the batch gradient descent algorithm is that it can find the local optimum, but if the number of training samples m is large, every iteration must compute the derivative over all the samples, which is too slow. Hence the following method.

(2) Stochastic gradient descent algorithm (incremental gradient descent algorithm):
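A minimal sketch of the stochastic (incremental) update, on toy data where y = 2x (illustrative values only):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=500):
    """Stochastic (incremental) gradient descent: theta is updated
    after EACH sample (O(n) per update), cycling through the m
    samples and starting over when they are exhausted."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for k in range(m):
            error = X[k] @ theta - y[k]     # h(x(k)) - y(k)
            theta -= alpha * error * X[k]   # one-sample update
    return theta

# Toy data: y = 2*x, so theta should approach [0, 2].
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = stochastic_gradient_descent(X, y)
print(theta)
```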
Thus θ is updated once for each training sample until convergence. This is faster than batch gradient descent, because batch gradient descent must traverse all the samples for every update of θi.
In batch gradient descent, one step considers all m samples; in stochastic gradient descent, one step considers a single sample, so each update costs only O(n). When the m samples are exhausted, loop back to the first sample.

The above uses iteration to find the minimum. In fact, for this particular least-squares regression problem (the ordinary least-squares problem), there is another method that gives the minimum directly: it yields an analytic expression for the parameter vector, so no iteration is needed.

3. The normal equations
Suppose we have m samples, and the feature vector has dimension n. The samples are {(x(1), y(1)), (x(2), y(2)), ..., (x(m), y(m))}, where each x(i) = (x1(i), x2(i), ..., xn(i)). Let h(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn. Stacking each sample (with x0 = 1) as a row of the matrix X, and the targets into the vector y, we have:
If we want h(x) = y on every sample, we need Xθ = y.
Let us recall two concepts, the identity matrix and the matrix inverse, and see what their properties are.
(1) The identity matrix E
AE = EA = A
(2) The inverse of a matrix, A⁻¹
Requirement: A must be a square matrix.
Property: AA⁻¹ = A⁻¹A = E
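These two properties can be checked numerically (E here is NumPy's `np.eye`; the matrix A is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # an arbitrary invertible square matrix
E = np.eye(2)                # the identity matrix (written E above)

# Property (1): AE = EA = A
print(np.allclose(A @ E, A) and np.allclose(E @ A, A))  # True

# Property (2): A A^-1 = A^-1 A = E (A must be square and invertible)
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, E) and np.allclose(A_inv @ A, E))  # True
```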
Now look again at Xθ = y. To solve for θ, we need some transformations:
Step 1: turn the matrix to the left of θ into a square matrix. This is achieved by multiplying both sides by Xᵀ:
XᵀXθ = Xᵀy
Step 2: turn the part to the left of θ into the identity matrix, so that it vanishes. Multiply both sides by (XᵀX)⁻¹:
(XᵀX)⁻¹(XᵀX)θ = (XᵀX)⁻¹Xᵀy
Step 3: since (XᵀX)⁻¹(XᵀX) = E, the formula becomes
Eθ = (XᵀX)⁻¹Xᵀy
E can be dropped, thus getting
θ = (XᵀX)⁻¹Xᵀy
This is what we call the normal equation.
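A direct NumPy translation of the normal equation, on toy data (y = 2x; illustrative only):

```python
import numpy as np

# Toy data: y = 2*x; the leading column of ones is x0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# theta = (X^T X)^-1 X^T y, exactly the normal equation above.
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # approximately [0, 2]
```

In practice `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq(X, y)` is preferred to forming the inverse explicitly, for numerical stability.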
Normal equation vs. gradient descent

The normal equation, like gradient descent, can be used to compute the weight vector θ. Compared with gradient descent, it has both advantages and disadvantages.
Advantages: the normal equation does not care about the scale of the features in X. For example, with feature vector x = {x1, x2}, where x1 ranges over 1~2000 and x2 over 1~4, their ranges differ by a factor of 500. With gradient descent this makes the contour ellipses very narrow and long, so descent becomes difficult, and the step may even jump out of the ellipse (because the derivative multiplied by the step size can overshoot). With the normal equation there is no such worry, because it is purely a matrix computation.
Disadvantages: compared with gradient descent, the normal equation requires a large amount of matrix computation, especially inverting a matrix. When the matrix is large, the computational cost and the demands on memory increase greatly.

Studying regression, one cannot avoid the concept of the gradient. My notion of the gradient used to be blurry; after reading many blog posts I finally gained a rough understanding.
my simple understanding of the gradient descent method
(1) Why is "direction" introduced when studying the independent variables of a multivariate function?

When the argument is one-dimensional, it can be regarded as a scalar: a single real number represents it. To change its value, it either increases or decreases; there is nothing but "left or right". So the idea of "moving the argument in a certain direction" is not very meaningful there.

When the argument is n-dimensional (n ≥ 2), the concept becomes useful. Suppose the argument x is 3-dimensional, so each x is a point (x1, x2, x3), where x1, x2 and x3 are each a real number, i.e. a scalar. If we want to change x, moving from one point to another, how do we move? There are many choices: we can keep x1 and x2 unchanged and vary only x3, or keep x1 and x3 unchanged and vary only x2, and so on. These choices give rise to the concept of "direction": in 3-dimensional space, moving from one point to another is not just "left-to-right" as in the one-dimensional case, but happens along a direction. Finding a suitable direction of movement, so that the function value changes in the way we require (for example, decreases by as much as possible), then becomes essential.

(2) Why does the function value fall fastest in the direction opposite to the gradient?

Take the Taylor expansion of the objective function f at the point xk:

f(xk + α·d) = f(xk) + α·gkᵀd + o(α)    [1]

where:
xk: the argument at the k-th point (a vector).
d: a unit direction (a vector), i.e. |d| = 1.
α: the step length (a real number).
gk: the gradient of the objective function at xk (a vector).
o(α): a higher-order infinitesimal of α.

The expansion written with Taylor's formula looks a bit intimidating, so compare it with the Taylor expansion in the one-dimensional case:

f(xk + α) = f(xk) + α·f′(xk) + o(α)
Comparing the two, you can see what the Taylor expansion is doing in the multidimensional case.
In expansion [1], the higher-order infinitesimal can be ignored, so for [1] to reach its minimum value, gkᵀd should be made as small as possible. This is the dot product (scalar product) of two vectors; in what case is its value smallest? Recall how the cosine of the angle between two vectors is defined:

cos θ = (a·b) / (|a|·|b|)
Assuming that the angle between the vector d and the negative gradient −gk is θ, the dot product can be written as:

gkᵀd = −(−gk)ᵀd = −|gk|·|d|·cos θ = −|gk|·cos θ    (since |d| = 1)
Visibly, when θ = 0 the expression above attains its minimum. That is, when d takes the direction of −gk, the objective function value drops fastest, which is why the negative gradient direction is called the direction of "steepest descent". Conversely, the gradient direction of a multivariate function is the direction in which the function value increases fastest. For a one-variable function, the gradient direction points along the tangent of the curve, toward increasing function values. As shown in the figure.
For a two-variable or general multivariate function, the gradient vector consists of the partial derivatives of f with respect to each variable, and the direction of that vector is the gradient direction. As shown in the figure.
The arrow direction in the figure is negative gradient direction.
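The "steepest descent" claim can also be checked numerically: among unit directions d, the dot product gᵀd is smallest when d is the normalized negative gradient. A sketch with an arbitrary example function f(x, y) = x² + 3y²:

```python
import numpy as np

# f(x, y) = x^2 + 3*y^2 (arbitrary example); gradient g = (2x, 6y).
point = np.array([1.0, 2.0])
g = np.array([2.0 * point[0], 6.0 * point[1]])

# Normalized negative gradient: the claimed steepest-descent direction.
d_star = -g / np.linalg.norm(g)

# Compare g.d for many random unit directions d.
rng = np.random.default_rng(0)
ds = rng.normal(size=(1000, 2))
ds /= np.linalg.norm(ds, axis=1, keepdims=True)

print(g @ d_star)                          # equals -|g|, the smallest possible
print(bool((ds @ g).min() >= g @ d_star))  # True: no sampled d beats d_star
```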
The end!
Stanford University Machine Learning public Class (II): Supervised learning application and gradient descent