Linear regression model

Recall the example from the first lesson that predicts the price per square unit of a house. In that example, we can draw a straight line that tries to match the trend of the data points. We already know this is a regression problem, that is, predicting a continuous-valued output. It is in fact a typical linear regression model, so named because the regression equation can be represented by a linear function.

We can assume that this linear function is:

h(x) = θ₀ + θ₁x

This is a linear equation in the single variable x. The two parameters θ₀ and θ₁ are unknown and must be solved for using the data in the training set. Let us define a few concepts here. The data we already have, that is, the corresponding pairs of housing area and unit price, is called the training set. x, the housing area, is called the input; y, the unit price of the house, is called the output. A training sample can be written as (x, y), and the number of training samples is denoted m. A model with only one input variable x, like the one above, is called a univariate linear regression model.
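As a minimal sketch in Python, the hypothesis is just a function of x with two parameters; the parameter values 50 and 2 below are made up purely for illustration:

```python
# Hypothesis of univariate linear regression: h(x) = theta0 + theta1 * x.
# theta0 and theta1 are the unknown parameters learned from the training set.
def h(x, theta0, theta1):
    return theta0 + theta1 * x

# Made-up parameter values: predict the unit price of a 100-square-unit house.
print(h(100, 50, 2))  # 50 + 2 * 100 = 250
```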

Solving this problem with a machine learning method actually means using a training algorithm to process the data in the training set and obtain our regression equation h. Then, given new data where we know only the input x, we can use the regression equation h to compute the corresponding output y. Here x is the size of the house and y is the price per square unit of the house.

So how do we solve for h(x)? Of course, we could judge intuitively with the naked eye, but drawing a line by hand is very inaccurate. The key to solving for h(x) is actually solving for the two unknown parameters. Intuitively, our goal is to choose the parameters of h so that, for each x in the training set, the value h(x) is as close as possible to the true value y; the closer the better. We can compute the parameters of h according to this criterion.

Therefore, we introduce the concept of a cost function:

J(θ₀, θ₁) = (1 / 2m) · Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

The cost function J above is the mathematical expression of the "goal" we just described. First, for each x in the training set, we compute the value given by the h function and subtract the true value of y to get the difference. This difference may be positive or negative, and we do not care about its sign, only its size, so we square it to make it positive. We then sum the squared differences over all the data in the training set and divide by 2m. Why divide by 2m rather than simply by the number of training samples m? The extra factor of 2 is a mathematical convenience: it cancels when we later take the derivative, and it does not change where the minimum is.

The function J is a squared-error function. As a cost function, J could of course be of other types, but for this linear regression problem the squared-error function is the most reasonable choice. So our goal is to find the parameter values that minimize the cost function J.
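The squared-error cost can be sketched directly from its definition; the toy training set below is invented so that the data lies exactly on y = 1 + 2x, making the minimum cost zero:

```python
def cost(xs, ys, theta0, theta1):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum of (h(x) - y)^2."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy training set lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
print(cost(xs, ys, 1.0, 2.0))  # 0.0 at the true parameters
print(cost(xs, ys, 0.0, 0.0))  # positive anywhere else
```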

To understand this problem intuitively, we can first simplify it and then look at a diagram. If we simplify the problem by setting the first parameter θ₀ to 0, then h becomes a straight line through the origin, and J depends on only one variable. Its graph is a quadratic function in one variable, so we just have to locate the lowest point of that parabola; the value there is the parameter we want, the one that makes the cost function J attain its minimum.

Back to the original problem: if the first parameter is not 0, then J has two parameters, that is, it is a quadratic function of two variables, and its graph obviously cannot be drawn on a two-dimensional plane. In fact, the graph of this function is a three-dimensional surface.

Intuitively, there is clearly a lowest point, but how should that lowest point be found?

A contour plot is introduced here. This is not essential knowledge, but it helps with understanding. The three-dimensional surface above can actually be represented by a two-dimensional graph in a clever way. In the figure below, the graph on the right is a typical contour plot: each line connects points where the function takes the same value, that is, along one line the two parameter values differ but the corresponding value of J is the same. The point marked with the small red × represents the value of the cost function J whose h function is the line shown in the left image. So where is the lowest point in the picture on the right? Looking back at the three-dimensional surface, you can intuitively see that the lowest point is at the center of the blue-purple rings.

Let's return to the question: how do we use a mathematical method to find the two parameter values at the lowest point of the cost function?

Gradient Descent algorithm

Yes, the algorithm we use is the gradient descent algorithm (gradient descent). It operates as follows:

(1) Guess initial values for the two parameters, for example initialize both to 0; our goal is to move gradually from these values toward the values we want;

(2) Keep modifying the two parameters until the cost function J reaches a local minimum.

Let's try to visualize this on the three-dimensional surface. Suppose we pick arbitrary starting values, landing at some position on the surface; we then have to modify the two parameters, but how? Certainly so that they move closer to the lowest point of the surface. This is a bit like mountain climbing in reverse: we are on the mountainside and want to reach the lowest point, so we walk step by step, each step taking us a little lower, until we finally reach the bottom.

The update rule for this algorithm is:

θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ   (for j = 0 and j = 1, repeated until convergence)

Note that the key to the gradient descent algorithm is that the two parameters must be updated simultaneously. This is easy to understand: J depends on both parameters, so if they are not updated simultaneously, the already-updated parameter will affect the computation of the other parameter's update. A code implementation of the simultaneous update can be seen in the lower part of the diagram.
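A minimal sketch of the simultaneous update: both gradients are evaluated at the old parameter values and stored in temporaries before either parameter is assigned. The stand-in cost J = θ₀² + θ₁² and its gradients below are invented purely for illustration:

```python
def step(theta0, theta1, alpha, grad0, grad1):
    # Compute both gradients from the OLD parameter values first...
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    # ...and only then assign, so neither update contaminates the other.
    return temp0, temp1

# Stand-in cost J = t0^2 + t1^2, whose partial derivatives are 2*t0 and 2*t1.
g0 = lambda t0, t1: 2 * t0
g1 = lambda t0, t1: 2 * t1
print(step(1.0, 1.0, 0.25, g0, g1))  # (0.5, 0.5): one step toward the minimum
```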

Two points in this formula deserve attention. The first is the parameter α: why multiply the derivative of the cost function J by this factor? α is the learning rate. You can think of it as controlling whether we take small steps or big strides while "walking downhill" on the surface of the function; that is, α determines how much we change the two parameter values at each update. One might expect that α should be decreased as we approach the target, but a sufficiently small fixed α is also guaranteed to converge to the minimum, because the partial derivative itself becomes smaller as we approach the local minimum, so the steps shrink naturally.
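The effect of α can be demonstrated on a one-parameter stand-in cost J(t) = t², whose derivative is 2t (chosen here only for illustration); a small α converges toward the minimum, while a too-large α overshoots on every step and diverges:

```python
def descend(alpha, steps=10):
    """Run a few gradient-descent steps on J(t) = t^2 (derivative 2t)."""
    t = 1.0
    for _ in range(steps):
        t -= alpha * 2 * t
    return t

print(descend(0.1))  # shrinks toward the minimum at 0
print(descend(1.1))  # each step overshoots: |t| grows, the iteration diverges
```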

The second noteworthy point is the partial derivative, namely

∂J(θ₀, θ₁)/∂θⱼ

If you have studied calculus you will know exactly what this represents, but it is fine if you have not. Intuitively, this quantity represents the slope at a point of the three-dimensional surface: in plain terms, how steep it is, and in which direction. In the formula, this partial derivative ensures that our algorithm steps in the downhill direction and gradually approaches the minimum.
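For readers without calculus, the partial derivative can be approximated numerically as "rise over run" along one parameter axis; the stand-in cost below is invented for illustration:

```python
def partial(f, t0, t1, i, eps=1e-6):
    """Finite-difference estimate of dJ/dtheta_i at the point (t0, t1)."""
    if i == 0:
        return (f(t0 + eps, t1) - f(t0 - eps, t1)) / (2 * eps)
    return (f(t0, t1 + eps) - f(t0, t1 - eps)) / (2 * eps)

# Stand-in cost J = t0^2 + 3 * t1^2; its true partials at (1, 1) are 2 and 6.
J = lambda t0, t1: t0 ** 2 + 3 * t1 ** 2
print(partial(J, 1.0, 1.0, 0))  # ≈ 2.0
print(partial(J, 1.0, 1.0, 1))  # ≈ 6.0
```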

The gradient descent method can also be used for other algorithms, not just linear regression models. Here we discuss the gradient descent algorithm for linear regression.

Gradient Descent for linear regression

Substituting the formula for the cost function J into the gradient descent algorithm and simplifying with partial derivatives, we finally obtain:

θ₀ := θ₀ − α · (1/m) Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)
θ₁ := θ₁ − α · (1/m) Σᵢ₌₁ᵐ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾

The detailed derivation requires some knowledge of calculus.

In practice we can use these formulas directly. That is, the algorithm repeatedly applies these two update rules to revise the values of the two parameters until the function J reaches a minimum. With these formulas in hand, we can apply the gradient descent algorithm.
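Putting the pieces together, here is a minimal batch gradient-descent sketch for the univariate model. The toy data is invented to lie exactly on y = 1 + 2x, so the parameters should converge to roughly 1 and 2:

```python
def gradient_descent(xs, ys, alpha=0.05, iters=10000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent (sketch)."""
    m = len(xs)
    theta0 = theta1 = 0.0  # step (1): initialize both parameters to 0
    for _ in range(iters):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errs) / m
        grad1 = sum(e * x for e, x in zip(errs, xs)) / m
        # step (2): simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

t0, t1 = gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(round(t0, 3), round(t1, 3))  # ≈ 1.0 2.0
```

Each iteration sweeps the whole training set once to compute the two gradients, which is exactly the "batch" behaviour described below.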

If, in this process, every step uses all of the training data, the method is called batch gradient descent.

Machine Learning (Andrew Ng) Notes (2): Linear regression model & gradient descent algorithm