Nine algorithms for machine learning---regression
Reposted from: http://blog.csdn.net/xiaohai1232/article/details/59551240
Regression analysis quantifies how much the dependent variable is affected by the independent variables, establishing a linear or nonlinear regression equation that can then be used to predict the dependent variable or to explain its variation.
The regression analysis process is as follows:
① exploratory analysis: draw scatter plots between the variables, run correlation tests, get an overall picture of the data, and identify the variables to focus on;
② variable and model selection;
③ verification of the regression assumptions;
④ checks for collinearity and influential points;
⑤ model revision, repeating ③ and ④;
⑥ model validation.
Basic principle
Correlation coefficients can only describe whether variables are related; they cannot quantify how the dependent variable responds to the independent variables. Regression analysis can.
The multiple linear regression equation is: y = β0 + β1x1 + β2x2 + ... + βixi + ε
Linear regression analysis tests the significance of the overall linear relationship and the significance of the individual regression coefficients, and also examines the residual ε.
Significance of the linear relationship: the null hypothesis H0 is β1 = β2 = ... = βi = 0, i.e., there is no linear relationship between the variables. The test statistic is the F statistic; if F > F(1-α), the null hypothesis is rejected.
Significance of the regression coefficients: once the overall linear relationship is significant, the coefficient of each variable must also be tested, so that dispensable variables can be dropped and the linear equation simplified. The t test is generally used, with the null hypothesis that the coefficient is not significant, i.e., βi = 0.
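For reference, these are the standard forms of the two statistics (not spelled out in the original post); with k independent variables and n observations:

\[
F = \frac{SSR/k}{SSE/(n-k-1)} \sim F(k,\; n-k-1), \qquad
t_i = \frac{\hat\beta_i}{\mathrm{se}(\hat\beta_i)} \sim t(n-k-1)
\]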
Residual test: the residuals should look random and follow a normal distribution. Otherwise, the model has not extracted all the systematic information, and other remedies are needed. If the residual scatter plot follows a quadratic pattern, a quadratic term of the variable should be added (a sketch follows). If the residuals are not independent, i.e., autocorrelated, later blog posts will cover this further. If the residual variance is not homogeneous, e.g., the variance grows as the independent variable grows, the dependent variable should be transformed.
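As an illustration of the quadratic-term remedy, here is a hedged sketch; the dataset ex.retail and variables square and revenue are reused from later in this post, and square2 is a variable name introduced only for this example:

/* If the residual plot shows a quadratic pattern,
   add a squared term and refit the model */
DATA retail2;
  SET ex.retail;
  square2 = square**2;  /* quadratic term */
RUN;

PROC REG DATA=retail2;
  MODEL revenue = square square2;
RUN;
QUIT;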
Correlation test of exploratory analysis
The null hypothesis H0 is: the correlation coefficient ρ = 0.
PROC CORR DATA=ex.retail RANK NOSIMPLE
          PLOTS(ONLY)=SCATTER(ELLIPSE=NONE NVAR=ALL);
  VAR member square inventory loyalty population tenure;
  WITH revenue;
RUN;
Options used above: RANK orders the Pearson correlation coefficients from largest to smallest in the output report. NOSIMPLE suppresses the table of basic descriptive statistics. PLOTS(ONLY) outputs only the specified graphics, suppressing PROC CORR's other default graphics. SCATTER requests scatter plots between the variables. NVAR=n analyzes the first n variables in the VAR statement; NVAR=ALL plots all of them, up to 10 variables. Without a WITH statement, the variables in the VAR statement are analyzed pairwise against each other; with a WITH statement, each WITH variable is paired against each VAR variable.
The results are as follows:
The output shows the correlation coefficients between the WITH variable (revenue) and each VAR variable, along with the p-values. square and population are strongly correlated with the dependent variable.
Only the scatter plots of the strongly correlated variables are shown here:
Selection of variables and models
All-possible-regressions selection: assuming we have no prior knowledge about the variables, the procedure step can automatically fit all possible combinations of the variables.
PROC REG DATA=ex.retail PLOTS(ONLY)=(RSQUARE ADJRSQ CP);
  all_reg: MODEL revenue = member square inventory loyalty population tenure
           / SELECTION=RSQUARE ADJRSQ CP;
RUN;
QUIT;
PLOTS(ONLY)=(RSQUARE ADJRSQ CP) displays only the R-square, adjusted R-square, and Cp plots, and SELECTION=RSQUARE ADJRSQ CP sorts the output report by the value of the first statistic listed. The results are as follows.
The output lists the fit statistics for every possible combination of variables. An earlier blog post noted how adjusted R-square differs from R-square: it avoids R-square's tendency to increase whenever variables are added, which would mislead users into believing that more variables are always better. The star symbol marks the best model for each fixed number of parameters; by adjusted R-square, the best models contain two or three parameters.
The Cp scatter plot has two reference lines: Mallows' criterion is the line Cp = p (p is the number of parameters), and Hocking's criterion is the line Cp = 2p - pfull + 1. When Cp <= p, the model is suitable for prediction; when Cp <= 2p - pfull + 1, the model is suitable for parameter estimation and for explaining the dependent variable. Because the Cp scatter plot contains too many points, you can use PLOTS(ONLY)=(CP) and add BEST=20 to the MODEL statement options so that only the Cp chart with its 20 best points is displayed, making the diagram easier to read; a sketch follows. The conclusion: the variable selected for prediction is square, and the variables selected for explanation are square and inventory.
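A hedged sketch of that clearer Cp-only run, reusing the same dataset and variables as above:

PROC REG DATA=ex.retail PLOTS(ONLY)=(CP);
  /* show only the Cp plot, limited to the 20 best subsets */
  all_reg: MODEL revenue = member square inventory loyalty population tenure
           / SELECTION=CP BEST=20;
RUN;
QUIT;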
After selecting the variables, fit the models and estimate the parameters by submitting:
PROC REG DATA=ex.retail;
  predict: MODEL revenue = square;
  explain: MODEL revenue = square inventory;
RUN;
QUIT;
The labels predict: and explain: in the code identify the two models in the report. The results are as follows:
Table 1 tests the significance of the linear model: F = 112.78, p < 0.001 indicates that the predictive model of revenue on square has a significant linear relationship.
In Table 2, R-square = 0.5351, meaning the regression model explains about 54% of the variation in the dependent variable.
In Table 3, the parameter estimates are β0 = 31.47 and β1 = 1.48, so the regression equation is: y = 31.47 + 1.48 × square.
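As a quick worked check of the equation (the input value here is illustrative, not from the original post): for square = 10, the predicted value is y = 31.47 + 1.48 × 10 = 46.27.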
The fit diagnostics validate the residuals, flagging influential points and showing the residual distribution. For example, in the diagnostic plots a few shaded points fall outside the two reference lines, suggesting possible influential observations, and the residual-distribution plots show that the residuals are approximately normal.
The residuals plotted against square are spread almost uniformly across the plane, further supporting normality.
The dashed lines show the 95% prediction limits, indicating the accuracy of the fitted model: for a given value of square, a new revenue observation falls within the prediction limits with 95% probability.
The dark shaded area is the 95% confidence limit: for a given value of square, the mean of revenue falls within the confidence limits with 95% probability.
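For reference, these are the standard simple-regression formulas behind the two bands (not given in the original post), evaluated at a point x0, where s is the root MSE and Sxx = Σ(xi - x̄)²:

\[
\hat y_0 \pm t_{1-\alpha/2,\,n-2}\, s \sqrt{\frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}} \quad \text{(confidence limits for the mean)}
\]
\[
\hat y_0 \pm t_{1-\alpha/2,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}} \quad \text{(prediction limits for a new observation)}
\]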
The output of the explanatory model is interpreted in the same way as that of the predictive model, so it is not repeated here.
The above is the all-possible-regressions method. Other variable selection methods include forward selection, backward elimination, and stepwise selection; their principles were introduced in an earlier blog post and are not repeated here. The code:
PROC REG DATA=ex.retail PLOTS(ONLY)=ADJRSQ;
  forward:  MODEL revenue = member square inventory loyalty population tenure / SELECTION=FORWARD;
  backward: MODEL revenue = member square inventory loyalty population tenure / SELECTION=BACKWARD;
  stepwise: MODEL revenue = member square inventory loyalty population tenure / SELECTION=STEPWISE;
RUN;
QUIT;
Each of the three methods produces a model; here the stepwise selection model is used as the example:
Once the square variable is selected, the linear relationship is tested and the parameters are estimated.
The output then indicates that no remaining variable qualifies to enter the model, followed by the summary table of the stepwise selection.
Finally, the adjusted R-square plot is shown, marking the step at which the best model occurs.
The final conclusions of the three selection methods may differ, and the user must weigh them against each other (use R-square to compare overall explanatory contribution).
Collinearity diagnosis between independent variables
Collinearity between independent variables can easily make the model unstable. The VIF option (variance inflation factor) on the MODEL statement performs the collinearity diagnosis.
VIFi = 1 / (1 - Ri²), where Ri² is the R-square from regressing the i-th independent variable on all the other independent variables.
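To make the formula concrete, here is a hedged sketch that computes the VIF of member by hand (dataset and variable names reused from this post; the RSQUARE option on the PROC REG statement writes the R-square into the OUTEST= dataset as _RSQ_):

/* Regress member on the remaining independent variables */
PROC REG DATA=ex.retail OUTEST=fit RSQUARE NOPRINT;
  MODEL member = square inventory loyalty population tenure;
RUN;
QUIT;

DATA vif_member;
  SET fit;
  vif = 1 / (1 - _RSQ_);  /* VIF = 1/(1 - Ri^2) */
RUN;

PROC PRINT DATA=vif_member;
  VAR vif;
RUN;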
To obtain the VIFs of the full model directly, submit the following code:
PROC REG DATA=ex.retail PLOTS(ONLY)=ADJRSQ;
  fullmodel: MODEL revenue = member square inventory loyalty population tenure / VIF;
RUN;
QUIT;
The following results are output:
The output includes the linear-model tests and the variance inflation factors; a VIF value > 10 indicates collinearity. Remove one collinear variable at a time and re-run the diagnosis, e.g., remove member first and test again, until no collinear variables remain. The final model should be free of collinear variables. The VIF option can also be combined with the SELECTION= option, so that variable selection and the collinearity test are performed together.
Model validation
Once the regression equation is obtained, it can be used to validate and predict the dependent variable. You can apply the regression equation by hand, or, as in the earlier blog post on discriminant analysis, use PROC SCORE to produce the predictions. Submit the code:
/* The data lines (square values to score) were omitted in the original post */
DATA need;
  INPUT square @@;
  DATALINES;
;
RUN;

PROC REG DATA=ex.retail NOPRINT OUTEST=betas;
  prerev: MODEL revenue = square;
RUN;
QUIT;

PROC PRINT DATA=betas;
RUN;

PROC SCORE DATA=need SCORE=betas OUT=scored TYPE=PARMS;
  VAR square;
RUN;

PROC PRINT DATA=scored;
RUN;
PROC REG writes the estimated parameters of the model to the OUTEST= dataset:
Given any dataset containing the scoring variable, PROC SCORE predicts the dependent variable; in other words, we can make predictions simply by creating such a dataset ourselves.
The output of PROC SCORE is:
This is the prediction of the dependent variable; the model can also be validated by scoring the square values of several observations whose revenue is already known.
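Equivalently, the "by hand" route mentioned above is just a DATA step applying the fitted equation; a sketch using the coefficients estimated earlier, with scored_manual and predicted_revenue as names introduced for this example:

DATA scored_manual;
  SET need;
  /* apply the fitted equation y = 31.47 + 1.48 * square */
  predicted_revenue = 31.47 + 1.48 * square;
RUN;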