One: Introduction
Definition: Linear regression assumes a linear relation in the hypothesis, trains a model from the given training data, and then uses this model to make predictions. To understand this definition, let us start with a simple example. Assume the linear equation y = 2x + 1, where the variable x is the size of a commodity and y is the sales volume; when x = 5, the linear model predicts sales of y = 11. In this simple example we can roughly regard y = 2x + 1 as the regression model: for any given commodity size we can predict the sales. How such a model is obtained is exactly the linear regression content we want to work through below. In reality, many factors affect the sales (y); here we take the commodity size (x₁) and the commodity price (x₂) as an example.
Before any machine learning, getting the data is the first step (even a skilled cook cannot cook without rice). Suppose our sample is as follows: x₁ is the size of the commodity, x₂ is the price of the commodity, and y is the sales volume.
Two: Model derivation
To derive the model, we assume that the data satisfies a linear model, where the x₁ feature is the size of the commodity and the x₂ feature is the price of the commodity.
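As a minimal sketch of the assumed hypothesis (θ₀ denotes the intercept term; this notation is introduced here for the derivation, not fixed by the data):

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = \theta^T x, \qquad x = (1, x_1, x_2)^T$$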
Once the model form is assumed, we feed the training data into it, and the model produces a predicted value for each sample.
Between the true value y of a sample and the value predicted by the model there is an error ε. Assuming the amount of training data is very large, by the central limit theorem the errors ε (each being the sum of many small independent effects) can be assumed to follow a Gaussian distribution (μ, σ²); since the equation already has an intercept term, we can take μ = 0, so ε follows a Gaussian distribution (0, σ²).
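In symbols, a sketch under the assumptions above (with x^(i), y^(i) denoting the i-th training sample):

$$y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)}, \qquad \varepsilon^{(i)} \sim N(0, \sigma^2)$$

so that

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$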
As can be seen above, for each sample x, substituting into p(y | x; θ) gives the probability of its y; and because the samples are assumed to be independent and identically distributed, the maximum likelihood function is obtained:
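A sketch of the likelihood over m training samples (m is introduced here as the sample count):

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$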
The simplification is as follows:
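Taking the logarithm turns the product into a sum; sketched in the same notation:

$$\ell(\theta) = \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Maximizing ℓ(θ) is therefore equivalent to minimizing

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$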
This is exactly the least-squares loss function of regression; many general introductions simply give this squared-error formula directly as the loss function of linear regression. A brief summary of the above: for linear regression, by the law of large numbers and the central limit theorem, assuming the sample size tends to infinity, the errors ε between the true values and the predicted values are independent and obey a Gaussian distribution with mean μ = 0 and variance σ²; substituting ε = y − θᵀx into that distribution and simplifying the likelihood yields the loss function of linear regression.
The second step is to optimize the loss function, that is, to find the w, b (equivalently the θ above) that minimize it. The first method uses matrices (and requires the relevant matrix to be invertible).
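A sketch of this closed-form (normal equation) solution, writing the training samples as a design matrix X and a target vector y, and assuming XᵀX is invertible: setting the gradient of J(θ) to zero gives

$$\theta = (X^T X)^{-1} X^T y$$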
The above is the matrix method of optimizing the loss function, but it has a limitation: the matrix must be invertible. Let us now talk about another optimization method, gradient descent. There is plenty of material describing and explaining gradient descent, so we do not go into depth here; you can refer to the blog at http://www.cnblogs.com/ooon/p/4947688.html, where the author explains the gradient descent method. Here we only give the simplest outline of the procedure.
The overall flow, as shown above, is to compute the gradient with respect to each variable and then update each variable along the gradient direction by a certain step size α. Below we derive the gradient for each θ, as follows:
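A sketch of the partial derivative of J(θ) with respect to one component θ_j (same notation as above, with x_j^(i) the j-th feature of the i-th sample):

$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$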
Having found the gradients of the variables as above, we then iterate the following update formula until convergence:
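A sketch of the batch update rule with learning rate α:

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$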
In the update above, every step sums over all the samples, which is inefficient when the data set is large; there is also a variant that updates using a single sample at a time, namely stochastic gradient descent:
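A minimal NumPy sketch of the two variants (the function and variable names here are illustrative, not code from the original post; X is assumed to already contain a leading column of ones for the intercept):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    # X: (m, n) design matrix, y: (m,) targets
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)   # gradient summed over all m samples
        theta -= alpha * grad / m      # scaled by 1/m for a stable step size
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=100):
    # same inputs, but each update uses only one randomly chosen sample
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):
            err = X[i] @ theta - y[i]
            theta -= alpha * err * X[i]
    return theta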
By optimizing with the steps above we can obtain w, b and thus the fitted equation. Enough theory; here is the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import numpy as np
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
import matplotlib as mpl
import matplotlib.pyplot as plt

if __name__ == "__main__":
    warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
    np.random.seed(0)
    np.set_printoptions(linewidth=1000, suppress=True)
    # a small noisy quadratic data set
    N = 9
    x = np.linspace(0, 6, N) + np.random.randn(N)
    x = np.sort(x)
    y = x ** 2 - 4 * x - 3 + np.random.randn(N)
    x.shape = -1, 1
    y.shape = -1, 1
    # polynomial features followed by ordinary linear regression
    p = Pipeline([('poly', PolynomialFeatures()),
                  ('linear', LinearRegression(fit_intercept=False))])
    mpl.rcParams['font.sans-serif'] = [u'SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(8, 6), facecolor='w')
    d_pool = np.arange(1, N, 1)  # polynomial orders to try
    m = d_pool.size
    clrs = []  # one color per order
    for c in np.linspace(16711680, 255, m):
        clrs.append('#%06x' % int(c))
    line_width = np.linspace(5, 2, m)
    plt.plot(x, y, 'ro', ms=10, zorder=N)
    for i, d in enumerate(d_pool):
        p.set_params(poly__degree=d)
        p.fit(x, y.ravel())
        lin = p.get_params()['linear']
        output = u'%s: order %d, coefficients: ' % (u'linear regression', d)
        print(output, lin.coef_.ravel())
        x_hat = np.linspace(x.min(), x.max(), num=100)
        x_hat.shape = -1, 1
        y_hat = p.predict(x_hat)
        s = p.score(x, y)
        z = N - 1 if (d == 2) else 0
        label = u'order %d, $R^2$=%.3f' % (d, s)
        plt.plot(x_hat, y_hat, color=clrs[i], lw=line_width[i],
                 alpha=0.75, label=label, zorder=z)
    plt.legend(loc='upper left')
    plt.grid(True)
    # plt.title(u'Linear regression', fontsize=18)
    plt.xlabel('X', fontsize=16)
    plt.ylabel('Y', fontsize=16)
    plt.show()
After running the code, the coefficients printed to the console for each polynomial order can be seen as follows:
The resulting plot is displayed as follows:
As can be seen from the image above, as the model complexity increases, the fit to the training data gets better and better, but overfitting appears. To prevent this overfitting, we add a penalty term to the loss function; according to the penalty term used, the methods are divided as follows:
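A sketch of the two common penalized losses (λ > 0 is the regularization strength; the notation follows the derivation above):

Lasso (L1 penalty):
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 + \lambda \sum_{j} |\theta_j|$$

Ridge (L2 penalty):
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 + \lambda \sum_{j} \theta_j^2$$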
Finally, Elastic Net regression combines the L1 and L2 penalties in a certain proportion:
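A sketch of the Elastic Net loss (with ρ ∈ [0, 1] controlling the mix of L1 and L2), followed by a minimal example of swapping the penalized variants into the pipeline used above (the alpha and l1_ratio grids are arbitrary illustrative values):

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 + \lambda \left( \rho \sum_{j} |\theta_j| + (1 - \rho) \sum_{j} \theta_j^2 \right)$$

import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

warnings.filterwarnings('ignore', category=ConvergenceWarning)
np.random.seed(0)
# the same small noisy quadratic data set as in the script above
x = np.sort(np.linspace(0, 6, 9) + np.random.randn(9)).reshape(-1, 1)
y = x.ravel() ** 2 - 4 * x.ravel() - 3 + np.random.randn(9)

alphas = np.logspace(-3, 2, 50)
models = [
    ('LinearRegression', LinearRegression(fit_intercept=False)),
    ('Ridge (L2)', RidgeCV(alphas=alphas, fit_intercept=False)),
    ('Lasso (L1)', LassoCV(alphas=alphas, fit_intercept=False)),
    ('ElasticNet', ElasticNetCV(alphas=alphas, l1_ratio=[0.1, 0.5, 0.9, 0.99],
                                fit_intercept=False)),
]
for name, reg in models:
    p = Pipeline([('poly', PolynomialFeatures(degree=8)),
                  ('linear', reg)])
    p.fit(x, y)
    # print the fitted coefficients; the L1-based models typically shrink many of them to zero
    print(name, p.named_steps['linear'].coef_.ravel())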
L1 regularization tends to produce a solution with only a small number of nonzero features, the rest being exactly 0, whereas L2 keeps more features but shrinks their coefficients towards 0. Lasso is therefore very useful for feature selection, while ridge only provides shrinkage. In cases where only a few of the features play an important role, it is appropriate to choose Lasso because it selects features automatically; if most of the features contribute, and roughly equally, then ridge may be more appropriate. For a comparison of the various regressions, you can see the following diagram: