1. Preface
The linear regression model is simple in form and easy to fit, yet it contains some of the most important basic ideas in machine learning. Many of the more powerful nonlinear models can be obtained by adding layers or high-dimensional mappings on top of a linear model. In addition, because the solution \(\theta\) of linear regression directly expresses how important each attribute is for the prediction, linear regression is highly interpretable.
2. Linear regression principle
The problem addressed by linear regression is generally of the following form: we have \(m\) samples, each with \(n\)-dimensional features and one output.
Training data in the form of:
\[(x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)}, y_0),\ (x_1^{(1)}, x_2^{(1)}, \ldots, x_n^{(1)}, y_1),\ \ldots,\ (x_1^{(m)}, x_2^{(m)}, \ldots, x_n^{(m)}, y_m)\]
What we want to do is find parameters \((\theta_0, \theta_1, \ldots, \theta_n)\) such that the following linear regression model fits the samples:
\[h_\theta(x_1, x_2, \ldots, x_n) = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n\]
Written in matrix form:
\[h_\theta(\mathbf{X}) = \mathbf{X}\theta\]
Once we have the model, we need a loss function; for linear regression the mean squared error is generally used. In algebraic form the loss function is:
\[J(\theta_0, \theta_1, \ldots, \theta_n) = \sum\limits_{i=1}^{m} \left(h_\theta(x_1^{(i)}, x_2^{(i)}, \ldots, x_n^{(i)}) - y_i\right)^2\]
In matrix form:
\[J(\theta) = \frac{1}{2}(\mathbf{X}\theta - \mathbf{Y})^T(\mathbf{X}\theta - \mathbf{Y})\]
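As a concrete illustration, here is a minimal NumPy sketch of evaluating this matrix-form loss; the names `X`, `y`, and `theta` are just placeholders for this example.

```python
import numpy as np

def mse_loss(X, y, theta):
    """Matrix-form loss: J(theta) = 1/2 * (X theta - y)^T (X theta - y)."""
    residual = X @ theta - y          # vector of per-sample errors, shape (m,)
    return 0.5 * residual @ residual  # scalar sum of squared errors
```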
3. Algorithm for linear regression
Given the linear regression loss function \(J(\theta) = \frac{1}{2}(\mathbf{X}\theta - \mathbf{Y})^T(\mathbf{X}\theta - \mathbf{Y})\), two methods are commonly used to find the \(\theta\) that minimizes it: gradient descent and least squares.
If gradient descent is used, the iteration formula for \(\theta\) is:
\[\theta = \theta - \alpha\mathbf{X}^T(\mathbf{X}\theta - \mathbf{Y})\]
After a number of iterations, we obtain the final \(\theta\).
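A minimal sketch of this iteration in NumPy, assuming the column of ones for \(\theta_0\) has already been appended to `X`; the step size `alpha` and iteration count are placeholders that need tuning (in practice the gradient is often also scaled by the sample count, which can be folded into `alpha`).

```python
import numpy as np

def gradient_descent(X, y, alpha=0.001, n_iters=1000):
    """Batch gradient descent: theta = theta - alpha * X^T (X theta - y)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y)  # gradient of 1/2 ||X theta - y||^2
        theta = theta - alpha * gradient
    return theta
```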
If the least squares method is used, the closed-form solution for \(\theta\) is:
\[\theta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\]
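The closed-form solution takes only a couple of lines in NumPy; this sketch again assumes `X` already contains a column of ones, and uses `np.linalg.solve` rather than an explicit matrix inverse for numerical stability.

```python
import numpy as np

def normal_equation(X, y):
    """Least squares solution theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```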
Of course, linear regression has other commonly used optimization algorithms, such as Newton's method and quasi-Newton methods, which are not described in detail here.
4. Polynomial linear regression
The data we encounter is not necessarily linear. If, for example, the true model is \(y = x_1^2 + x_2^2\), plain linear regression will have difficulty fitting it, and polynomial regression is needed.
Going back to the linear model we started with, \(h_\theta(x_1, x_2, \ldots, x_n) = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n\): if the features appear not only as first-order terms in \(x\) but also as second-order terms, the model becomes a polynomial regression. Here is a degree-2 polynomial regression model with only two features:
\[h_\theta(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2\]
We let \(x_0 = 1,\ x_1 = x_1,\ x_2 = x_2,\ x_3 = x_1^2,\ x_4 = x_2^2,\ x_5 = x_1 x_2\), and obtain the following formula:
\[h_\theta(x_1, x_2, \ldots, x_5) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \theta_5 x_5\]
We find that we are back to linear regression: this is a five-variable linear regression, which can be solved with the ordinary linear regression method. For each two-feature sample \((x_1, x_2)\) we obtain a five-dimensional feature vector \((1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)\). Through this expanded feature vector we turn a problem that is not linear regression back into linear regression, while still achieving the effect of a nonlinear fit.
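A sketch of that two-feature, degree-2 expansion; the helper name `expand_degree2` is invented for this example (scikit-learn's `PolynomialFeatures` offers the same kind of expansion more generally).

```python
import numpy as np

def expand_degree2(x1, x2):
    """Map a two-feature sample (x1, x2) to the five-variable
    feature vector (1, x1, x2, x1^2, x2^2, x1*x2)."""
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# Each expanded sample can then be fitted with ordinary linear regression.
```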
5. Generalized linear regression
In the polynomial regression of the previous section, we transformed the sample features and achieved the effect of nonlinear regression with linear regression. Here we generalize on the output \(y\) instead. For example, suppose our output \(y\) does not have a linear relationship with \(x\), but \(\log y\) and \(x\) do; the model function is then:
\[\log y = \mathbf{X}\theta\]
In this way, for each sample output \(y\) we work with \(\log y\) instead, so we can still handle the problem with the linear regression algorithm. Generalizing beyond \(\log y\): assume there is a monotone function \(g(\cdot)\) such that \(g(y) = \mathbf{X}\theta\), or equivalently \(y = g^{-1}(\mathbf{X}\theta)\); this is the generalized form of linear regression. The function \(g(\cdot)\) is usually called the link function. The logistic regression we will discuss later performs classification based on such a link function.
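For the log-link case above, one simple approach is to fit ordinary least squares on \(\log y\) and invert the link at prediction time; a minimal sketch, assuming all outputs `y` are strictly positive and `X` already includes a bias column.

```python
import numpy as np

def fit_log_linear(X, y):
    """Fit log y = X theta by ordinary least squares (requires y > 0)."""
    return np.linalg.solve(X.T @ X, X.T @ np.log(y))

def predict_log_linear(X, theta):
    """Invert the link function: y_hat = g^{-1}(X theta) = exp(X theta)."""
    return np.exp(X @ theta)
```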
6. Regularization of linear regression
In order to prevent the model from overfitting, we often need to add regularization when building a linear model. The common choices are L1 regularization and L2 regularization.
6.1 L1 regularization: Lasso regression
L1 regularization is often referred to as Lasso regression. It differs from ordinary linear regression in that it adds an L1 regularization term to the loss function, with a constant coefficient \(\alpha\) that balances the weight of the mean squared error term against the regularization term. The loss function of Lasso regression is:
\[J(\theta) = \frac{1}{2n}(\mathbf{X}\theta - \mathbf{Y})^T(\mathbf{X}\theta - \mathbf{Y}) + \alpha\|\theta\|_1\]
Here \(n\) is the number of samples, \(\alpha\) is a constant coefficient that needs to be tuned, and \(\|\theta\|_1\) is the L1 norm.
Lasso regression can shrink the coefficients of some features, and even drive coefficients with small absolute values exactly to 0, which enhances the generalization ability of the model.
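In practice the Lasso is rarely solved by hand (the L1 term is not differentiable at zero, so methods such as coordinate descent are used); here is a minimal sketch with scikit-learn on toy data, where the `alpha` value is only a placeholder to be tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: 100 samples, 5 features, only the first two actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1)  # alpha weights the L1 term; tune it, e.g. by cross-validation
lasso.fit(X, y)
print(lasso.coef_)  # coefficients of irrelevant features are shrunk toward (or exactly to) 0
```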
6.2 L2 regularization: Ridge regression
L2 regularization is often referred to as Ridge regression. It differs from ordinary linear regression in that it adds an L2 regularization term to the loss function. The difference from Lasso regression is that the regularization term of Ridge regression is the L2 norm, while that of Lasso regression is the L1 norm. The loss function of Ridge regression is:
\[J(\theta) = \frac{1}{2}(\mathbf{X}\theta - \mathbf{Y})^T(\mathbf{X}\theta - \mathbf{Y}) + \frac{1}{2}\alpha\|\theta\|_2^2\]
Here \(\alpha\) is a constant coefficient that needs to be tuned, and \(\|\theta\|_2\) is the L2 norm.
Ridge regression shrinks the regression coefficients without discarding any feature, which makes the model relatively stable; however, compared with Lasso regression, it keeps more features in the model, so the model is less interpretable.
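A parallel sketch with scikit-learn's `Ridge` on the same kind of toy data; unlike the Lasso above, the coefficients are shrunk but not set exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same toy data as the Lasso sketch: 5 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0)  # alpha weights the L2 term and needs tuning
ridge.fit(X, y)
print(ridge.coef_)  # every feature keeps a nonzero coefficient, just shrunk toward 0
```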
7. Summary
The linear regression algorithm itself is not complex, but the content that extends from it is quite rich: it touches on the feature transformations of polynomial regression (feature engineering), regularization terms against overfitting, the very widely used logistic regression, and so on. To really understand it requires bringing together knowledge from many parts of machine learning.
(Reprints are welcome; please indicate the source. Contact: [email protected])