Linear regression (Linear Regression) is regression represented by a straight line, as opposed to curve regression. If the regression equation of the dependent variable y on the independent variables x1, x2, ..., xm is a linear equation, i.e. μy = β0 + β1*x1 + β2*x2 + ... + βm*xm, where β0 is the constant term, βi is the regression coefficient of the independent variable xi, and m is any natural number, then the regression of y on x1, x2, ..., xm is called linear regression.
Simple regression:
A linear regression with only one independent variable is called simple regression, as in the following example:
x represents the quantity of an item, and y represents the total price for different quantities of that item:
X=[0, 1, 2, 3, 4, 5]
Y=[0, 17, 45, 55, 85, 99]
These points can be plotted in two-dimensional coordinates.
Now, what is the estimated total price of the goods when x = 6?
We can clearly see that the total price increases with the quantity of goods; this is a typical linear regression problem.
Because there is only one independent variable x, we assume a linear regression model: y = a * x + b
We need to find the most appropriate values of a and b so that the line y = a * x + b fits the trend of the data; we can then use it to predict the total price y for any quantity x.
Least squares:
To find the most suitable a and b, we introduce the least squares method.
Least squares, also known as least squares estimation, is a common method for estimating population parameters from sample observations. Given n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn), it determines the best estimate of the relationship y = f(x) between x and y by minimizing the sum H of the squared differences between the observed and estimated values (that is, the deviations): H = Σ (yi − f(xi))^2.
The least squares method eliminates the influence of accidental errors, so that the most reliable and most probable result is obtained from a set of observations.
We can clearly see that the line y = a * x + b passes through the origin (when x = 0, y = 0), i.e. b = 0.
We try different values of a; the resulting values of H are as follows:
When a = 19, H = 154
When a = 20, H = 85
When a = 21, H = 126
We can draw a rough conclusion that when a = 20 and b = 0, the linear model y = 20 * x fits the sample data well.
So when the quantity of goods is x = 6, we can roughly estimate the total price as y = 20 * 6 = 120.
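As a quick check, here is a minimal Python sketch (not code from the original article) that computes H for the three candidate values of a and then makes the prediction at x = 6:

# Sketch: compute H = sum of squared deviations for candidate slopes a (with b = 0)
X = [0, 1, 2, 3, 4, 5]
Y = [0, 17, 45, 55, 85, 99]

def squared_error(a):
    # sum of (estimated value - observed value)^2 over all sample points
    return sum((a * x - y) ** 2 for x, y in zip(X, Y))

for a in (19, 20, 21):
    print(a, squared_error(a))  # 19 -> 154, 20 -> 85, 21 -> 126

print(20 * 6)  # estimated total price at x = 6 -> 120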
Multivariate regression:
A linear regression with more than one independent variable is called multivariate regression.
The example above has only one independent variable, which is easy to handle. If there are many independent variables, suppose there are m of them: [x1, x2, x3, x4, ..., xm].
In that case we also need m regression coefficients (i.e. weights) w1, ..., wm plus a constant term w0; that is, we assume the linear model is: y = w0 + x1*w1 + x2*w2 + x3*w3 + ... + xm*wm
For computational convenience, we set x0 = 1,
so that: y = x0*w0 + x1*w1 + x2*w2 + x3*w3 + ... + xm*wm
Written in vector form:
w = [w0, w1, w2, w3, ..., wm]
x = [x0, x1, x2, x3, ..., xm]
y = wᵀ * x (wᵀ denotes the transpose of the vector w)
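For instance, here is a minimal NumPy sketch (with made-up numbers, not from the article) of evaluating y = wᵀ * x:

import numpy as np

# made-up example with m = 2 features plus the constant term x0 = 1
w = np.array([1.0, 4.0, 5.0])  # [w0, w1, w2]
x = np.array([1.0, 2.0, 3.0])  # [x0, x1, x2], with x0 fixed at 1
y = np.dot(w, x)               # w transpose times x = 1*1 + 4*2 + 5*3 = 24
print(y)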
The sum of the squared differences between the observed and estimated values (i.e. the deviations) is:
H = Σ_{j=1..n} ( Σ_{i=0..m} wi * xi^(j) − y^(j) )^2
To simplify the later calculations, we multiply H by one half, i.e.:
H = (1/2) * Σ_{j=1..n} ( Σ_{i=0..m} wi * xi^(j) − y^(j) )^2
In the formula above, n is the number of training samples, m is the number of features (independent variables) of each training sample, the superscript (j) denotes the j-th sample, the subscript i denotes the i-th feature (independent variable) value, and y^(j) denotes the observed value (total price) of the j-th sample.
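As an illustration (a sketch with made-up numbers, not the article's code), H can be computed for a whole data set with NumPy as follows; the first column of ones plays the role of x0:

import numpy as np

def cost(data, labels, weights):
    # H = 1/2 * sum over the n samples of (estimated value - observed value)^2
    errors = np.dot(data, weights) - labels
    return 0.5 * np.sum(errors ** 2)

# n = 2 samples, m = 2 features, first column is x0 = 1
data = np.array([[1.0, 2.0, 3.0],
                 [1.0, 1.0, 4.0]])
labels = np.array([24.0, 26.0])
print(cost(data, labels, np.array([1.0, 4.0, 5.0])))  # 0.5 * (0^2 + 1^2) = 0.5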
H is now a function of w0, w1, w2, ..., wm. We need to find the most suitable w values by an appropriate method in order to obtain a good linear regression equation. Unlike in simple regression, it is hard to find a solution by observation or by trying different w values, so we need to use an optimization algorithm.
Gradient algorithm:
Common optimization algorithms include gradient descent (Gradient Descent), Newton's method and quasi-Newton methods (Newton's Method & Quasi-Newton Methods), the conjugate gradient method (Conjugate Gradient), heuristic optimization methods, and so on. This article introduces the gradient algorithm in detail.
To make our current goal clear: we need a gradient algorithm to find the values of w0, w1, w2, w3, ..., wm that minimize H, so that we can write out the regression equation.
The gradient algorithm comes in two forms: gradient ascent and gradient descent. The basic idea of gradient descent is that, to find the minimum of a function, the best way is to search along the negative gradient direction of that function; gradient ascent does the opposite. For a function f(x, y) of two variables x and y, the gradient is:
∇f(x, y) = ( ∂f/∂x, ∂f/∂y )
For z = f(x, y), using the gradient descent algorithm means moving by steps proportional to ∂f/∂x along the x axis and to ∂f/∂y along the y axis, in the direction opposite to the gradient; the function f(x, y) must be defined and differentiable at the point being evaluated.
It can be understood intuitively:
The gradient is the direction in which the function value changes fastest. For example, if you are standing on a hillside, the gradient points in the direction in which the height changes fastest. If you walk in that direction, you change (increase or decrease) your altitude as quickly as possible; if you wander in other directions, you may walk for half a day without your altitude changing much. In other words, if you keep walking along the gradient, you will reach a peak or a valley of the hill as quickly as possible. So the gradient algorithm is really a way of searching for local minima or maxima; in practice it is a very efficient, fast and reliable method.
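As a tiny illustration of the idea (not part of the original article), gradient descent on the simple function f(x, y) = x^2 + y^2, whose gradient is (2x, 2y), walks quickly toward the minimum at (0, 0):

# sketch: gradient descent on f(x, y) = x^2 + y^2, whose gradient is (2x, 2y)
x, y = 3.0, -2.0
alpha = 0.1  # step size
for _ in range(100):
    # move against the gradient
    x, y = x - alpha * 2 * x, y - alpha * 2 * y
print(x, y)  # both values are now very close to 0, the minimum of f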
Using gradient descent to find the minimum of H:
We saw earlier that:
H = (1/2) * Σ_{j=1..n} ( Σ_{i=0..m} wi * xi^(j) − y^(j) )^2
H is a function of w = [w0, w1, w2, w3, ..., wm], and the gradient of H is:
∇H = ( ∂H/∂w0, ∂H/∂w1, ..., ∂H/∂wm )
The partial derivative of H with respect to each wi is:
∂H/∂wi = Σ_{j=1..n} ( Σ_{k=0..m} wk * xk^(j) − y^(j) ) * xi^(j)
Assuming each update moves a step of size α in the (negative) gradient direction, the update formula for w can be written as:
wi := wi − α * ∂H/∂wi
So the pseudocode of the gradient descent algorithm is as follows:
Initialize each regression coefficient (each w value) to 1
Repeat R times:
    Calculate the gradient over the entire data set
    Update the regression coefficients w using the gradient
Example:
We use the gradient descent algorithm to find the linear regression equation for the commodity data.
We assume that the linear regression model is: total price y = a + b * x1 + c * x2 (where x1 and x2 are the quantities of two kinds of goods).
We need to find the regression coefficients w = [a, b, c].
The gradient descent algorithm is as follows:
import numpy as np

def grad_desc(train_data, train_labels):
    """Gradient descent"""
    data_mat = np.matrix(train_data)
    label_mat = np.matrix(train_labels).transpose()
    n = np.shape(data_mat)[1]
    # step size
    alpha = 0.001
    # maximum number of cycles
    max_cycles = 100
    # initialize the regression coefficients (weights)
    weights = np.ones((n, 1))
    for index in range(max_cycles):
        # deviations of the current estimates from the observed values
        h = data_mat * weights - label_mat
        weights = weights - alpha * data_mat.transpose() * h
    # return a flattened array of coefficients
    return np.asarray(weights).flatten()
Running the above algorithm, we obtain the regression coefficients:
[1.7218815 4.24881047 5.28838946]
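For reference, a minimal sketch of how grad_desc might be called; the training data below is made up for illustration (the article's actual data set is in the repository linked at the end), and the first column of ones corresponds to x0 = 1:

# made-up commodity data: each row is [x0, x1, x2] with x0 = 1; labels are total prices
train_data = [[1, 1, 1], [1, 2, 1], [1, 1, 2], [1, 3, 2], [1, 2, 3]]
train_labels = [10, 14, 16, 23, 24]
print(grad_desc(train_data, train_labels))  # prints three fitted coefficients [a, b, c]

With only 100 cycles and a step size of 0.001 the fit is still rough; increasing max_cycles brings the coefficients closer to the least squares solution.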
Stochastic gradient algorithm:
In the gradient algorithm above, with R = 100 cycles, each update of the regression coefficients requires a pass over the entire data set. If the number of samples is large, the computational time complexity becomes very high. Updating the regression coefficients with a single sample point at a time is therefore called the stochastic gradient algorithm. The pseudocode of stochastic gradient descent is as follows:
Initialize all regression coefficients to 1
Repeat R times:
    For each sample:
        Calculate the gradient for that sample
        Update the regression coefficients w using the gradient
The modified algorithm is as follows:
import numpy as np

def advanced_random_grad_desc(train_data, train_labels):
    """Improved stochastic gradient descent"""
    data_mat = np.asarray(train_data)
    label_mat = np.asarray(train_labels)
    m, n = np.shape(data_mat)
    # step size
    alpha = 0.001
    # initialize the regression coefficients (weights)
    weights = np.ones(n)
    max_cycles = 500
    for j in range(max_cycles):
        # visit the samples in random order, without replacement, in each cycle
        data_index = list(range(m))
        for i in range(m):
            random_index = int(np.random.uniform(0, len(data_index)))
            sample_index = data_index[random_index]
            # deviation of the estimate from the observed value for this sample
            h = sum(data_mat[sample_index] * weights) - label_mat[sample_index]
            weights = weights - alpha * h * data_mat[sample_index]
            del data_index[random_index]
    return weights
The regression coefficients are calculated as:
[1.27137416 4.31393524 5.2757683]
We can get the linear regression equation as:
y = 1.27 + 4.31 * x1 + 5.28 * x2
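For example (purely as an illustration), with x1 = 2 and x2 = 3 this equation predicts y ≈ 1.27 + 4.31 * 2 + 5.28 * 3 ≈ 25.73.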
A few closing remarks:
The complete code for this article has been uploaded to: https://gitee.com/beiyan/machine_learning/tree/master/gradient
The stochastic gradient descent (ascent) algorithm is widely used and works very well; subsequent articles will use gradient algorithms to solve other problems. That said, the gradient algorithm has its flaws: convergence slows down near the minimum, line search can cause problems, the descent path may "zigzag", and so on. In addition, the choice of the descent (or ascent) step size also affects the resulting regression coefficients; you can vary these parameters to test their effect on the regression.