**What is linear regression?** Linear regression (taking the single-variable case as an example) works like this: you are given a pile of points, and you need to find a straight line through them. See the figure below.

This screenshot is from Andrew Ng's *Machine Learning* open course.

What can you do once you have this line? Say we find the a and b that define it; the line's expression is y = a + b*x, so when a new x comes along, we can compute y. In his first lecture, Andrew Ng answered the question "what is machine learning" with this definition:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

This sentence does not translate smoothly: the program learns from experience E how to accomplish task T, and uses measure P to evaluate how well T is done; through continued learning from experience E, the program's performance on task T, as measured by P, keeps improving.

OK, so linear regression does one thing: given a pile of historical data, you train a straight line (in the univariate case), and then when a new x arrives, the line's expression tells you y. That is how prediction is achieved.
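In code, prediction with an already-fitted line is just evaluating that expression. A minimal sketch (the values of a and b here are made-up example values, not fitted from any data):

```python
a, b = 1.0, 2.0  # intercept and slope of the fitted line (example values)

def predict(x):
    """Predict y for a new input x using the line y = a + b*x."""
    return a + b * x

print(predict(3.0))  # -> 7.0
```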

**How to find this straight line?** It is easy to see from the green line above. The mathematical logic behind it: minimize the total squared vertical distance from all the points to the line. Andrew Ng calls this the cost function; see the figure below.
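For reference, the cost function shown in that screenshot can be written out as follows (standard course notation: h is the hypothesis, m the number of training examples; the 1/2 is for convenience when differentiating):

```latex
h_\theta(x) = \theta_0 + \theta_1 x
\qquad
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
```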

This screenshot is from Andrew Ng's *Machine Learning* open course.

J is a quadratic function of theta0 and theta1, and the minimum of a quadratic can of course be found by least squares; when there are few thetas it can even be computed by hand, but that is not a long-term solution. Enter gradient descent.

Select an initial value for theta (e.g. 1) and update theta with the following formula, where alpha > 0. If the partial derivative (slope) of J at the current theta is > 0, the current theta sits to the right of the lowest point, so the formula moves theta left (decreases it); as long as the alpha value is reasonable, theta keeps approaching the lowest point. And vice versa.
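Written out, the update formula from the screenshot is the following, where both thetas are updated simultaneously and the term multiplied by alpha is the partial derivative of J:

```latex
\theta_0 := \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)
\qquad
\theta_1 := \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x^{(i)}
```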

This screenshot is from Andrew Ng's *Machine Learning* open course.

Once the thetas come out, the line can be written down. The algorithm is described as follows:

Important things are said three times: this screenshot is from Andrew Ng's *Machine Learning* open course.

Talk is cheap, show me the code. Below is the MATLAB implementation; it is very simple, based on Andrew Ng's homework templates.

function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
% Initialize some useful values
m = length(y);                    % number of training examples
J_history = zeros(num_iters, 1);

% The number of iterations is up to you; how many depends on how the experiment converges
for iter = 1:num_iters
    % Hint: while debugging, it can be useful to print out the values
    % of the cost function (computeCostMulti) and gradient here.
    h = X * theta;                % the dimensions of X and theta line up, so compute h first
    error = h - y;                % current error
    % Iteration formula: the factor after alpha is the partial derivative.
    % sum() adds up each column, i.e. each variable's error summed over all
    % the training examples; error .* X multiplies the error vector
    % element-wise into each column of X (error .* X(:,1) for the first column, and so on)
    theta = theta - alpha * (1/m * sum(error .* X))';
    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);
end
end
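For readers without MATLAB, here is a minimal NumPy sketch of the same batch gradient descent loop. The function name and the tiny noise-free dataset are made up for illustration; `X` carries a leading column of ones for the intercept, just as in the course's homework:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent; X includes a leading column of ones."""
    m = len(y)
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        error = X @ theta - y                      # current error, shape (m,)
        J_history[i] = (error @ error) / (2 * m)   # cost before this update
        theta = theta - alpha / m * (X.T @ error)  # simultaneous update of all thetas
    return theta, J_history

# Fit y = 1 + 2*x on noise-free data
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
theta, J_history = gradient_descent(X, y, np.zeros(2), alpha=0.03, num_iters=5000)
```

On this clean data the recovered theta converges to roughly (1, 2), and `J_history` decreases toward zero, which is exactly the convergence check discussed below.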

The fitted line looks roughly like the following.

Two more remarks: to confirm that the GD method is running correctly, you can record the J value of each iteration and plot the following curve; the ones below it are not normal.

Summary: if alpha is too small, convergence will be slow; if alpha is too large, the J value may not converge at all. As for how to choose alpha: try ... 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1 ...
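A quick way to see this trade-off in action is to sweep that grid and compare the cost J after a fixed number of iterations. This is a sketch on a made-up toy dataset (same y = 1 + 2*x line as above, helper name invented here), not the course's code:

```python
import numpy as np

def cost_after(alpha, iters=100):
    """Run `iters` gradient descent steps on a toy dataset; return the final cost J."""
    x = np.arange(10, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    y = 1 + 2 * x
    theta = np.zeros(2)
    m = len(y)
    for _ in range(iters):
        error = X @ theta - y
        theta = theta - alpha / m * (X.T @ error)
    error = X @ theta - y
    return (error @ error) / (2 * m)

# Small alphas converge but slowly; on this unscaled data alpha = 0.1 already diverges
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1]:
    print(alpha, cost_after(alpha))
```

The usual recipe is to keep the largest alpha on the grid for which J still decreases steadily.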

To make the GD algorithm run more smoothly, there are also techniques such as feature scaling and mean normalization to preprocess the data first. That is all for today.
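Mean normalization can be sketched in a few lines: subtract each feature's mean and divide by its standard deviation, so all features end up on a comparable scale. The helper name and the sample matrix (house size vs. number of rooms) are illustrative, not from the course:

```python
import numpy as np

def mean_normalize(X):
    """Center each feature at 0 and scale it to unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Two features on very different scales, e.g. house size and number of rooms
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, sigma = mean_normalize(X)
```

The stored `mu` and `sigma` must be reused to normalize any new x before prediction, otherwise the learned thetas no longer apply.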