First, introduction to regression prediction
We know that the word "regression" was coined by Darwin's cousin Francis Galton. Galton first used regression prediction to predict the size of the next generation of pea seeds based on the size of the previous generation's seeds. He later applied regression analysis to many subjects, including human height. He noticed that if the parents were taller than average, their children also tended to be taller than average, but less tall than their parents: the children's heights "regress" toward the average height. Galton observed this phenomenon in a number of studies, so even though the word has nothing to do with numerical prediction, the method is still called regression.
Second, linear regression
To understand the regression method, we can start from regression on the two-dimensional plane to grasp its meaning intuitively. For example, suppose there is a series of points in the plane (x1, y1), (x2, y2), ..., (xn, yn), as shown in Figure 1. We can see intuitively that these data have a clear linear relationship. The purpose of regression is to find a straight line such that all data points are as close to it as possible. The equation of this straight line is called the regression equation, and with it we can predict the target value for a new input.
Figure 1
The same approach can be used when generalizing to multidimensional space, except that what we look for is a hyperplane, in the same sense as the hyperplane in a support vector machine. Below we show how to find such a hyperplane.
2.1 Mathematical theory of linear regression
For a data set in which every sample has an m-dimensional feature vector (x1, x2, ..., xm), we can write the regression model in the following form:

    y = b0 + b1*x1 + b2*x2 + ... + bm*xm + ε
where x1, x2, ..., xm are the m independent variables, and ε is an error term that follows the normal distribution N(0, σ²).
b1, b2, ..., bm are the coefficients of the m independent variables (b0 is the intercept term). For convenience, we introduce matrix notation for the n groups of sample data that are actually observed:

    X = the n*(m+1) matrix whose i-th row is (1, xi1, xi2, ..., xim),
    y = (y1, y2, ..., yn)^T,  B = (b0, b1, ..., bm)^T,  ε = (ε1, ε2, ..., εn)^T
where X is the design matrix of the model, y and ε are random vectors, and

    E(ε) = 0,  Cov(ε) = σ² * En

(En is the n-order identity matrix).
So a regression model with n samples can be written in matrix form as:

    y = X*B + ε
Here ε is an unobservable random error vector and B is the vector of regression coefficients, which are the parameters to be determined.
2.2 Estimating the regression parameters with the least squares method
Let bi' be the estimate of bi (i = 1, 2, ..., m). Then the sum of squared errors when each bi takes the value bi' is:

    Q = Σ (yi - b0' - b1'*xi1 - ... - bm'*xim)²   (summed over the n samples)
According to the extreme value theorem, Q attains its minimum when the partial derivative of Q with respect to each parameter in B is zero, that is:

    ∂Q/∂bj' = 0   (j = 0, 1, ..., m)

Expanding these partial derivatives and rearranging, we obtain the system of normal equations:

    X^T * X * B' = X^T * y

From linear algebra we know that when X is of full column rank, the solution of this system is:

    B' = (X^T * X)^(-1) * X^T * y
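The normal-equation solution can be checked numerically. Below is a minimal sketch (the toy data and variable names are made up purely for illustration) that compares the closed-form formula with NumPy's built-in least squares routine:

import numpy as np

# toy data: 5 samples, 2 features, plus a column of ones for the intercept
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.1, 0.7],
              [1.0, 1.9, 2.3],
              [1.0, 3.0, 1.8],
              [1.0, 4.2, 3.1]])
y = np.array([1.1, 1.6, 3.9, 4.8, 7.0])

# closed-form solution B' = (X^T X)^(-1) X^T y
xTx = X.T @ X
b_hat = np.linalg.inv(xTx) @ X.T @ y

# NumPy's least squares solver should give (almost) the same coefficients
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)
print(b_lstsq)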
2.3 Python code implementation
First we need to define a function standRegres(), which takes the feature matrix and the labels of the training set as input and returns the weight vector, as shown below:
from numpy import mat, linalg

def standRegres(xArr, yArr):                     # standard regression prediction
    xMat = mat(xArr); yMat = mat(yArr).T         # convert the arrays to matrix form
    xTx = xMat.T * xMat
    if linalg.det(xTx) == 0.0:                   # check whether the determinant is zero
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)                 # compute the weight values
    return ws
For an intuitive understanding, we still use univariate data to illustrate regression prediction. The data set is ex0.txt; each record in the file contains only one feature and one target value. Using the function above we obtain the coefficients of the regression equation, and we can then use the plotting functions in the Matplotlib library to draw the regression line and a scatter plot of the training set in the same figure, giving the result shown in Figure 2.
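A minimal sketch of how Figure 2 can be reproduced is shown below. The loadDataSet() helper is an assumption (a simple tab-separated file parser, not defined in the text), and it is assumed that the first column of ex0.txt is a constant 1.0 for the intercept and the second column is the feature:

from numpy import mat
import matplotlib.pyplot as plt

def loadDataSet(fileName):                        # hypothetical helper: parse a tab-separated file
    xArr, yArr = [], []
    with open(fileName) as fr:
        for line in fr:
            vals = list(map(float, line.strip().split('\t')))
            xArr.append(vals[:-1])                # all columns except the last are features
            yArr.append(vals[-1])                 # the last column is the target value
    return xArr, yArr

xArr, yArr = loadDataSet('ex0.txt')
ws = standRegres(xArr, yArr)                      # regression coefficients

xMat = mat(xArr); yMat = mat(yArr)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:, 1].flatten().A[0], yMat.T[:, 0].flatten().A[0])   # scatter plot of the training set
xCopy = xMat.copy(); xCopy.sort(0)                # sort so the line is drawn left to right
ax.plot(xCopy[:, 1], xCopy * ws)                  # regression line
plt.show()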
Figure 2 regression line
Third, locally weighted linear regression
From the example above, we find that the data points are evenly distributed on both sides of the line, but visually the straight line does not fit the data points very well. As a result, the model's predictions can deviate considerably and we cannot obtain a good prediction result. The problem with linear regression is that the fitting process produces the unbiased estimate with minimum mean squared error, which easily leads to underfitting. Some methods allow a certain amount of bias to be introduced into the estimate, thereby reducing the mean squared error of the prediction. Locally weighted linear regression is one such method.
The idea of the locally weighted linear regression algorithm is to give a certain weight to each point near the point to be predicted, and then perform ordinary regression on this subset based on minimum mean squared error. The error of the linear regression therefore becomes the weighted sum of squared errors:

    Q = Σ w(i) * (y(i) - x(i)*B)²   (summed over the n samples)
Taking the partial derivative of the above expression with respect to B and setting it to zero, we finally obtain:

    B' = (X^T * W * X)^(-1) * X^T * W * y
Here W is a weight matrix that assigns a weight to each data point. In this algorithm a "kernel" is used to assign weights to nearby points. The type of kernel can be chosen freely, but the most commonly used is the Gaussian kernel, whose corresponding weights are:

    w(i,i) = exp( -|x(i) - x|² / (2*k²) )
This builds a weight matrix W that contains only diagonal elements, and the closer a sample point x(i) is to the query point x, the larger w(i,i) will be. The Gaussian kernel has only one parameter k to be determined; the value of k decides how much weight is given to nearby points. In the x-w curve, the larger the value of k, the flatter the curve, and the smaller the value of k, the steeper the curve. Figure 3 shows the x-w curves of the weights assigned to points near the query point x = 0.5 under different values of k.
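The curves of Figure 3 can be sketched with a few lines of code. This only illustrates the Gaussian weight formula; the query point 0.5 comes from the text, while the k values and the x range are merely examples chosen for the plot:

import numpy as np
import matplotlib.pyplot as plt

query = 0.5                                          # the point to be predicted
x = np.linspace(0.0, 1.0, 200)                       # nearby sample positions
for k in (0.5, 0.1, 0.01):                           # larger k gives a flatter curve
    w = np.exp(-(x - query) ** 2 / (2.0 * k ** 2))   # Gaussian kernel weight
    plt.plot(x, w, label='k = %s' % k)
plt.xlabel('x'); plt.ylabel('w')
plt.legend()
plt.show()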
Figure 3 Weights versus distance under different values of k
Now we can use the same data set as before to observe the effect of the locally weighted linear regression algorithm, with k = 0.003, 0.01 and 1.0 respectively. The results are shown in Figure 4.
Figure 4 Results of the locally weighted regression algorithm under different values of k
A disadvantage of locally weighted linear regression is that the weights have to be recomputed for every point to be predicted, so the amount of computation is very large. In practical applications we also need to control the size of the parameter k in order to avoid underfitting and overfitting. The appropriate value of k is generally related to the data set itself and needs to be chosen based on experience.
Code implementation: we define two functions, lwlr() and lwlrTest(). The first computes the locally weighted regression coefficients and the predicted value for a single query point, and the second uses it to predict the values of all the points to be tested.
from numpy import mat, shape, eye, exp, zeros, linalg

def lwlr(testPoint, xArr, yArr, k=1.0):             # locally weighted linear regression
    xMat = mat(xArr); yMat = mat(yArr).T
    m = shape(xMat)[0]                               # number of samples in x
    weights = mat(eye(m))                            # initialize the weights as an identity matrix
    for j in range(m):                               # the next two lines build the weight matrix
        diffMat = testPoint - xMat[j, :]             # distance between the sample point and the query point
        weights[j, j] = exp(diffMat * diffMat.T / (-2.0 * k ** 2))   # Gaussian kernel
    xTx = xMat.T * (weights * xMat)
    if linalg.det(xTx) == 0.0:                       # check whether the determinant is zero
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws

def lwlrTest(testArr, xArr, yArr, k=1.0):            # predict the value of every query point with lwlr
    m = shape(testArr)[0]
    yHat = zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testArr[i], xArr, yArr, k)
    return yHat
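Below is a sketch of how the plots in Figure 4 can be produced, again assuming the hypothetical loadDataSet() helper from the earlier example and an ex0.txt whose first column is the constant 1.0 and second column is the feature:

import matplotlib.pyplot as plt

xArr, yArr = loadDataSet('ex0.txt')
# sort the samples by their feature value (column 1; column 0 is the constant 1.0)
order = sorted(range(len(xArr)), key=lambda i: xArr[i][1])
xSorted = [xArr[i] for i in order]

fig = plt.figure()
for i, k in enumerate((1.0, 0.01, 0.003)):                 # the three k values used in the text
    yHat = lwlrTest(xSorted, xArr, yArr, k)                # predict at every (sorted) sample position
    ax = fig.add_subplot(3, 1, i + 1)
    ax.plot([p[1] for p in xSorted], yHat)                 # fitted curve for this k
    ax.scatter([p[1] for p in xArr], yArr, s=2, c='red')   # training data
plt.show()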
Fourth, ridge regression
The linear regression and locally weighted linear regression described above can no longer be used when the data have more features than sample points. Having more features than sample points means that the input data matrix X has more columns than rows, so X is not of full column rank, and a matrix that is not of full rank causes an error when we compute (X^T X)^(-1).
For example, the matrix X = [1, 2, 3; 4, 5, 6] is not of full column rank. If we compute X^T X, its determinant is exactly zero, so (X^T X)^(-1) does not exist. The same result holds in general: it can be proved that r(A^T A) = r(A), so when A is not of full column rank, A^T A is not of full rank either and its determinant is 0.
In order to solve the above problem, statisticians proposed the concept of ridge regression. In short, ridge regression adds a term λE to X^T X (where E is the m*m identity matrix). The formula for the regression coefficients then becomes:

    B' = (X^T * X + λE)^(-1) * X^T * y
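A small NumPy sketch of this effect is shown below; the matrix is the 2x3 example from the previous paragraph, and λ = 0.2 is simply the default value used later in ridgeRegres():

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # 2 samples, 3 features: not of full column rank
xTx = X.T @ X

print(np.linalg.matrix_rank(X))        # 2, smaller than the number of columns (3)
print(np.linalg.det(xTx))              # (numerically) 0: xTx cannot be inverted

lam = 0.2
ridged = xTx + lam * np.eye(3)         # add lambda*E on the diagonal
print(np.linalg.det(ridged))           # nonzero, so the inverse now exists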
It is worth explaining the meaning of "ridge" in ridge regression. Ridge regression uses the identity matrix multiplied by the constant λ. Looking at the identity matrix E, the value 1 runs along the whole diagonal while the remaining elements are 0; within this plane of 0s there is a "ridge" made up of 1s, which is where the name "ridge regression" comes from.
We will still use the data set above to test the characteristics of ridge regression. Below is the Python implementation of ridge regression:
from numpy import mat, eye, shape, mean, var, exp, zeros, linalg

def ridgeRegres(xMat, yMat, lam=0.2):                 # compute the ridge regression coefficients
    xTx = xMat.T * xMat
    denom = xTx + eye(shape(xMat)[1]) * lam           # add lambda on the diagonal
    if linalg.det(denom) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T * yMat)                    # .I denotes the matrix inverse
    return ws
def ridgeTest(xArr, yArr):
    xMat = mat(xArr); yMat = mat(yArr).T
    yMean = mean(yMat, 0)                             # mean of y
    yMat = yMat - yMean                               # center y on its mean
    # standardize the data
    xMeans = mean(xMat, 0)                            # mean of x
    xVar = var(xMat, 0)                               # variance of x
    xMat = (xMat - xMeans) / xVar                     # standardized x
    numTestPts = 30                                   # compute the weight matrix for 30 different lambdas
    wMat = zeros((numTestPts, shape(xMat)[1]))
    for i in range(numTestPts):
        ws = ridgeRegres(xMat, yMat, exp(i - 10))
        wMat[i, :] = ws.T
    return wMat
The first function, ridgeRegres(), implements the ridge regression solution for a given lambda, which defaults to 0.2. In addition, to use ridge regression and shrinkage techniques, the features must be standardized first. The second function, ridgeTest(), therefore first standardizes the data so that each feature dimension carries equal importance, and then outputs the weight matrix for 30 different values of lambda. The ridge trace can then be plotted, as shown in Figure 5. To determine the ridge regression parameter, we can use the ridge trace method, i.e. choose the lambda value in the region of the ridge trace where the coefficients are relatively stable; the GCV (generalized cross-validation) method can also be used.
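A minimal sketch of plotting the ridge trace of Figure 5 is shown below, assuming xArr and yArr already hold the training features and targets (loaded, for example, with the hypothetical loadDataSet() helper used earlier):

import matplotlib.pyplot as plt

ridgeWeights = ridgeTest(xArr, yArr)      # one row of coefficients per lambda = exp(i - 10), i = 0..29

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ridgeWeights)                     # each curve shows one coefficient as lambda increases
ax.set_xlabel('i (lambda = exp(i - 10))')
ax.set_ylabel('regression coefficients')
plt.show()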
Figure 5 Change curves of the ridge regression coefficients
Ridge regression was originally used to deal with the case of more features than samples, and it is now also used to add bias to the estimate in order to obtain a better estimate. The λ introduced here limits the sum of all the w values; by introducing this penalty term, the technique shrinks the unimportant parameters. Ridge regression is a supplement to least squares regression: it loses unbiasedness but gains higher numerical stability.