R Language Data Analysis Series, Part 8

--by Comaple.zhang

This installment returns to polynomial regression. We look at the overfitting phenomenon and then take a closer look at cross-validation and the regularization framework as ways to avoid it, covering both the theoretical basis and how to put the ideas into practice in R.

Key points in this section, and dataset generation

1, Plotting with ggplot2;

2, To fit a more complex dataset, we use the sin function plus random white noise drawn from a normal distribution;

3, Higher-degree fitting with the poly function, and modifying the dataset with transform;

4, Understanding overfitting, and how to use the regularization framework to avoid it.

Test data generation:

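library(ggplot2)
# 101 evenly spaced points on [0, 1]
x <- seq(0, 1, by = 0.01)
# One period of a sine wave plus Gaussian noise (mean 0, sd 0.1)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(x, y)
ggplot(df, aes(x = x, y = y)) + geom_point() + ggtitle("sin(x)")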

Higher-degree polynomial fitting

Let's first look at the result of a straight-line fit:

fit1 <- lm(y ~ x, df)
# Store the fitted values in a new column yy
df <- transform(df, yy = predict(fit1))
ggplot(df, aes(x = x, y = y)) + geom_point() + geom_smooth(aes(x = x, y = yy), data = df)

We use the poly function for higher-degree fitting. It automatically expands the predictor into higher-order terms, and the degree parameter controls the highest power, as follows:

# Matrix of polynomial terms of x up to degree 3
po <- poly(x, degree = 3)
fit3 <- lm(y ~ poly(x, degree = 3), df)
df <- transform(df, yy3 = predict(fit3))
ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(aes(x = x, y = yy3), data = df, col = 'blue')

Let's keep increasing the degree:

fit26 <- lm(y ~ poly(x, degree = 26), df)
df <- transform(df, yy26 = predict(fit26))
ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(aes(x = x, y = yy26), data = df, col = 'blue')

Clearly, a higher degree is not always better. So how do we choose an appropriate degree? In general, the higher the degree, the more complex the model. A complex model can fit the original data very well, but its generalization ability is not necessarily the best, so we prefer a model that is simple and still generalizes well. We can start by looking at the numbers: calling the summary function on the fit shows that only the terms of degree 1, 3 and 5 pass the t-test, while the other terms do not. So how do we make sure the model is usable? The following sections introduce cross-validation and the regularization framework.
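The t-test check mentioned above can be sketched as follows. This is a minimal, self-contained example: the seed and the 0.05 threshold are my own assumptions, and because the noise is random, the set of significant terms may differ from run to run.

```r
set.seed(1)  # assumed seed, only so the example is repeatable
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(x, y)

fit26 <- lm(y ~ poly(x, degree = 26), df)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(fit26)$coefficients

# Drop the intercept row, then keep the names of terms with p < 0.05
terms <- coefs[-1, , drop = FALSE]
significant <- rownames(terms)[terms[, "Pr(>|t|)"] < 0.05]
print(significant)
```

The trailing number in each printed name is the degree of the corresponding polynomial term.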

Model validation: cross-validation

To choose a suitable model, we need a way to evaluate models. Here we use the RMSE (root mean square error).

The RMSE can be used to compare the quality of two models: the smaller the RMSE, the closer the predictions are to the data.
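The formula itself does not appear in the text. Reading it off the R implementation given in this section, with $y_i$ the observed values, $\hat{y}_i$ the predicted values and $n$ the number of observations (the symbols are my own notation), it is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```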

We use R to implement this evaluation method:

# y: observed values, ry: predicted values
rmse <- function(y, ry) {
  sqrt(sum((y - ry)^2) / length(ry))
}

The idea of cross-validation is to divide a dataset into two parts: one for training the model, called the training set (df$train), and one for testing it, called the test set (df$test). The following function splits the dataset into these two parts:

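# rate: fraction of rows assigned to the training set
# Note: this definition masks base::split in the global environment
split <- function(df, rate) {
  n <- nrow(df)
  index <- sample(1:n, round(rate * n))
  train <- df[index, ]
  test <- df[-index, ]
  list(train = train, test = test, data = df)
}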

Let's look at how the RMSE changes with the polynomial degree:

performance_gen <- function(df, n) {
  performance <- data.frame()
  for (index in 1:n) {
    # Fit on the training set only
    fit <- lm(y ~ poly(x, degree = index), data = df$train)
    performance <- rbind(performance, data.frame(degree = index, type = 'train', rmse = rmse(df$train$y, predict(fit))))
    performance <- rbind(performance, data.frame(degree = index, type = 'test', rmse = rmse(df$test$y, predict(fit, newdata = df$test))))
  }
  performance
}
df_split <- split(df, 0.5)
performance <- performance_gen(df_split, 10)
ggplot(performance, aes(x = degree, y = rmse, linetype = type)) + geom_point() + geom_line()

We can see that at degree=3 the model already fits the test set well. As the degree increases further, the test RMSE grows, which means the model's error on unseen data keeps increasing even though the fit to the training set remains very good. This approach therefore selects the model with degree=3. Is there a better way? The regularization framework gives us another way to avoid overfitting.

Regularization Framework

The loss function of polynomial regression was introduced in the previous section (http://blog.csdn.net/comaple/article/details/44959695).

We now introduce the concept of model complexity to improve the loss function. The lower the complexity the better: a highly complex model is undesirable, while a model that fits the training set well at low complexity is a good one. Penalizing complexity avoids overfitting and improves the generalization ability of the model. The regularization framework therefore adds to the loss a complexity term f(w), weighted by a factor lambda.
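The formulas themselves are not reproduced in this text. As a sketch, assuming the loss from the previous section is the usual sum of squared errors (its exact form and scaling there may differ), the plain and regularized losses can be written as:

```latex
L(w) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i(w) \right)^2
\qquad
L_{\lambda}(w) = L(w) + \lambda \, f(w)
```

Here $\hat{y}_i(w)$ is the model's prediction for the $i$-th observation, and the symbol names are my own notation.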

Here lambda is the penalty factor of the regularization framework: the larger lambda is, the more weight we put on model complexity, and the smaller lambda is, the less we care about it. f(w) is a function of the parameters w.

1, L1 norm: f(w) is the sum of the absolute values of the parameters w;

2, L2 norm: f(w) is the sum of the squares of the parameters w;

3, Lp norm: f(w) is the sum of the p-th powers of the absolute values of the parameters w.

Of course, the choice of norm ultimately depends on experience and on what the model is for, and the function can take more complex forms to handle special cases.
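The three penalty terms listed above can be computed in a few lines. This is only an illustrative sketch: lp_penalty and penalized_loss are hypothetical helper names, and the coefficient vector and loss value are made up.

```r
# Hypothetical helper: sum of |w|^p over the coefficients
lp_penalty <- function(w, p) sum(abs(w)^p)

w <- c(0.5, -2, 0, 1.5)  # made-up coefficients for illustration

l1 <- lp_penalty(w, 1)  # 0.5 + 2 + 0 + 1.5
l2 <- lp_penalty(w, 2)  # 0.25 + 4 + 0 + 2.25
l3 <- lp_penalty(w, 3)  # 0.125 + 8 + 0 + 3.375

# Hypothetical helper: loss plus lambda times the complexity term
penalized_loss <- function(loss, w, lambda, p) loss + lambda * lp_penalty(w, p)

print(c(l1 = l1, l2 = l2, l3 = l3))
print(penalized_loss(loss = 1.2, w = w, lambda = 0.1, p = 1))
```

A larger lambda makes the same coefficients cost more, which is how the penalty pushes the fit toward simpler models.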

We illustrate this with the dataset above:

# Absolute value of each coefficient of the degree-3 fit, plotted against the coefficient
l1 <- abs(sort(coef(fit3)))
df_l1 <- data.frame(x = sort(coef(fit3)), y = l1)
ggplot(df_l1, aes(x = x, y = y)) + geom_line()

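# Square of each coefficient of the degree-3 fit, plotted against the coefficient
l2 <- sort(coef(fit3))^2
df_l2 <- data.frame(x = sort(coef(fit3)), y = l2)
ggplot(df_l2, aes(x = x, y = y)) + geom_line() + ggtitle('L2norm')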

The L1 norm performs parameter selection: it drives some unimportant parameters to zero, which simplifies the model and improves its generalization ability.
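To see this effect in practice, here is a minimal sketch that assumes the glmnet package is installed. It regenerates the sine data with an arbitrary seed and counts how many polynomial coefficients stay non-zero at a small and at a large lambda. The two lambda values and the helper name nonzero are my own choices for illustration.

```r
library(glmnet)  # assumes the package is already installed

set.seed(1)  # assumed seed, only so the example is repeatable
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
X <- poly(x, degree = 10)

# alpha = 1 (the glmnet default) selects the L1 penalty
fit <- glmnet(X, y, alpha = 1)

# Hypothetical helper: number of non-zero coefficients, intercept excluded
nonzero <- function(s) sum(as.matrix(coef(fit, s = s))[-1, 1] != 0)

print(c(small_lambda = nonzero(0.001), large_lambda = nonzero(0.1)))
```

With the larger lambda, fewer terms should survive, which is the parameter selection described above.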

But how should the value of lambda be chosen? Different values of lambda express different degrees of concern about model complexity. Here we again use cross-validation to check which lambda suits our data.

In R, the glmnet package implements the regularization framework, so we install and load it. A single call to the glmnet function returns a whole sequence of lambda values together with the corresponding models. We can then evaluate each of them with the RMSE as above and plot the RMSE against lambda to see the effect:

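install.packages("glmnet")
library(glmnet)
getperform <- function(df) {
  # Fit the whole regularization path on the training set
  fit <- with(df$train, glmnet(poly(x, degree = 10), y))
  lambdas <- fit$lambda
  performance <- data.frame()
  for (lambda in lambdas) {
    # Note: poly() is recomputed on the test x here, so its basis differs slightly from the training basis
    pred <- with(df$test, predict(fit, poly(x, degree = 10), s = lambda))
    performance <- rbind(performance, data.frame(lambda = lambda, rmse = rmse(df$test$y, pred)))
  }
  performance
}
performance <- getperform(df_split)
ggplot(performance, aes(x = lambda, y = rmse)) + geom_point() + geom_line()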

As you can see, the RMSE reaches its minimum near lambda = 0.06, so for this dataset we can use the model corresponding to lambda = 0.06.
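As an alternative to the manual train/test loop, glmnet also ships a built-in cross-validation helper, cv.glmnet. The sketch below is not the method used above. It assumes glmnet is installed and uses an arbitrary seed, so the selected lambda will generally differ from 0.06.

```r
library(glmnet)  # assumes the package is already installed

set.seed(1)  # assumed seed, only so the example is repeatable
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
X <- poly(x, degree = 10)

# 10-fold cross-validation over the lambda path (the default nfolds)
cv_fit <- cv.glmnet(X, y)

print(cv_fit$lambda.min)  # lambda with the lowest mean cross-validated error
print(cv_fit$lambda.1se)  # largest lambda within one standard error of that minimum

# Coefficients of the model at the selected lambda
print(coef(cv_fit, s = "lambda.min"))
```

Choosing lambda.1se instead of lambda.min trades a little accuracy for a simpler model, in line with the preference for simple models stated earlier.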
