Generalized linear models in R -- the glm function (I)


Learning the R glm function:

"Please specify the source when reproduced": http://www.cnblogs.com/runner-ljt/

Ljt

I am a beginner and my understanding is limited; corrections and discussion are welcome.

Description of the glm function:

glm(formula, family = family.generator, data, control = list(...))


family: specifies the response distribution (a member of the exponential family); each distribution allows various link functions to relate the mean of the response to the linear predictor.

Common family values:

binomial(link = 'logit') ---- the response variable follows a binomial distribution and the link function is logit, i.e. logistic regression

binomial(link = 'probit') ---- the response variable follows a binomial distribution and the link function is probit

poisson(link = 'log') ---- the response variable follows a Poisson distribution, i.e. Poisson regression
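
As a quick illustration, a minimal sketch using R's built-in mtcars data (my own example; the variables am, wt, and carb are not from this post):

# logistic regression: binary response (am is 0/1), logit link
m1 <- glm(am ~ wt, family = binomial(link = 'logit'), data = mtcars)
# probit regression: same binary response, probit link
m2 <- glm(am ~ wt, family = binomial(link = 'probit'), data = mtcars)
# Poisson regression: count response (carb), log link
m3 <- glm(carb ~ wt, family = poisson(link = 'log'), data = mtcars)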

control: controls the convergence tolerance and the maximum number of iterations of the fitting algorithm.

glm.control(epsilon = 1e-8, maxit = 25, trace = FALSE)

----- maxit: the maximum number of iterations of the algorithm (default 25). To change it, pass control = list(maxit = 100).
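
A minimal sketch of passing the option (the formula, data frame, and variable names here are placeholders, not from the original post):

# raise the iteration cap explicitly via glm.control
fit <- glm(y ~ x1 + x2, family = binomial, data = mydata,
           control = glm.control(epsilon = 1e-8, maxit = 100, trace = FALSE))
# shorthand: a plain list is passed on to glm.control()
fit <- glm(y ~ x1 + x2, family = binomial, data = mydata,
           control = list(maxit = 100))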

Using the glm function:

> data <- iris[1:100, ]
> samp <- sample(100, 80)
> names(data) <- c('sl', 'sw', 'pl', 'pw', 'species')
> testdata <- data[samp, ]
> traindata <- data[-samp, ]
> lgst <- glm(testdata$species ~ pl, binomial(link = 'logit'), data = testdata)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(lgst)

Call:
glm(formula = testdata$species ~ pl, family = binomial(link = "logit"),
    data = testdata)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.836e-05  -2.110e-08  -2.110e-08   2.110e-08   1.915e-05

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -83.47   88795.25  -0.001    0.999
pl             32.09   32635.99   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1085e+02  on 79  degrees of freedom
Residual deviance: 1.4102e-09  on 78  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

Note that when the glm function is used for this logistic regression, two warnings appear:

Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

We can also see that the p-values of both coefficients are 0.999, which indicates that the regression coefficients are not significant.

First warning: the algorithm did not converge.
In logistic regression the coefficients are estimated iteratively by maximum likelihood, and the glm function's default iteration limit is maxit = 25. When the data are poorly behaved, the algorithm may not converge within 25 iterations, so we can first try raising the iteration limit. If the algorithm still fails to converge after the limit is increased, the problem likely lies in the data themselves, and they need further inspection (for example, checking for outliers or singular values).

> lgst <- glm(testdata$species ~ pl, binomial(link = 'logit'), data = testdata,
+             control = list(maxit = 100))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(lgst)

Call:
glm(formula = testdata$species ~ pl, family = binomial(link = "logit"),
    data = testdata, control = list(maxit = 100))

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.114e-05  -2.110e-08  -2.110e-08   2.110e-08   1.162e-05

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -87.18  146399.32  -0.001        1
pl             33.52   53808.49   0.001        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1085e+02  on 79  degrees of freedom
Residual deviance: 5.1817e-10  on 78  degrees of freedom

As shown above, increasing the number of iterations resolves the first warning: the algorithm now converges.

But the second warning remains, and the coefficient p-values are now 1, still not significant.

Second warning: the fitted probabilities are numerically 0 or 1.

First of all, what does this warning mean?
Let's look at the fitted results of the logistic regression on the training sample: what fitted probability does each sample get for belonging to the 'setosa' class?

> lgst <- glm(testdata$species ~ pl, binomial(link = 'logit'), data = testdata,
+             control = list(maxit = 100))
> p <- predict(lgst, type = 'response')
> plot(seq(-2, 2, length = 80), sort(p), col = 'blue')

It can be seen that the fitted probability of each training sample being 'setosa' is almost always either nearly 0 or nearly 1, not the S-shaped curve of the logistic model we expected. This is what the second warning means.

So the question is: why does this happen?
(What follows is only my personal understanding, based on several explanations I have consulted; I cannot guarantee it is correct.)

This situation can be understood as a kind of overfitting: because of the structure of the data, during the optimization search for the regression coefficients, the linear fitted values of samples in one class (y = 1) are pushed ever larger, while the linear fitted values of samples in the other class (y = 0) are pushed ever smaller.

Because maximum likelihood estimation is used to solve for the regression coefficients, the search looks for coefficients that maximize the likelihood function:
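
For the logistic model, writing $h(x) = \frac{1}{1 + e^{-\theta^{T}x}}$ for the fitted probability, the likelihood being maximized is

$$L(\theta) = \prod_{i=1}^{n} h(x_i)^{\,y_i}\,\bigl(1 - h(x_i)\bigr)^{\,1 - y_i}$$

Maximizing $L(\theta)$ rewards coefficient values for which $h(x_i)$ is close to 1 when $y_i = 1$ and close to 0 when $y_i = 0$.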

Therefore, during the search, the coefficients tend to make h(x) large for the y = 1 samples and small for the y = 0 samples.

That is, the coefficients θ drive θᵀx toward +∞ for the y = 1 class and toward −∞ for the y = 0 class, with the result that the fitted probability P(y = 1 | x; θ) → 1 for y = 1 samples and P(y = 1 | x; θ) → 0 for y = 0 samples.
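
A quick way to see this numerically is to inspect the linear predictor of the lgst fit from earlier (a sketch of my own; predict with type = 'link' returns θᵀx):

# linear predictor theta'x for each sample of the separable fit
eta <- predict(lgst, type = 'link')
range(eta)                       # enormous magnitudes on both sides of 0
p <- 1 / (1 + exp(-eta))         # logistic transform; matches predict(type = 'response')
table(round(p))                  # every fitted probability is numerically 0 or 1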

So the question arises again: what kind of data leads to this sort of overfitting?

Let's look at the pl values of the setosa and versicolor samples used in the logistic regression above. (The horizontal axis is the pl value; to keep the sample points from overlapping, an unrelated random y value is added to spread them out.)
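
A minimal sketch of how such a plot can be drawn (the jitter range and colors are my own choices):

# pl on the x-axis; a random, meaningless y value spreads out overlapping points
a <- testdata$species == 'setosa'
plot(testdata$pl[a], runif(sum(a)), col = 'blue', xlim = c(0, 6), ylim = c(0, 1),
     xlab = 'pl', ylab = 'random jitter')
points(testdata$pl[!a], runif(sum(!a)), col = 'red')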

It can be seen that the two classes are clearly linearly separable.

So as long as the absolute value of the slope of the univariate linear predictor is large enough, h(x) tends toward 1 for the y = 1 class and toward 0 for the y = 0 class, and the coefficient search keeps increasing that slope without bound.

So when the sample data are linearly separable, logistic regression often overfits in this way and produces the second warning: the fitted probabilities are numerically 0 or 1.

When the second warning occurs, the logistic model is usually not appropriate. For linearly separable sample data like this, a direct rule-based judgment is simple and effective (for example, when pl < 2.5, classify the sample as setosa; when pl > 2.5, classify it as versicolor).
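
A minimal sketch of such a rule on the same data (my own illustration):

# rule-based classifier using the 2.5 threshold from the text
pred <- ifelse(testdata$pl < 2.5, 'setosa', 'versicolor')
# cross-tabulate rule predictions against the true classes
table(predicted = pred, actual = testdata$species)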

Below, the logistic regression process is demonstrated on two-dimensional training data that are not linearly separable.

> data <- iris[51:150, ]
> samp <- sample(100, 80)
> names(data) <- c('sl', 'sw', 'pl', 'pw', 'species')
> testdata <- data[samp, ]
> traindata <- data[-samp, ]
> lgst <- glm(testdata$species ~ sw + pw, binomial(link = 'logit'), data = testdata)
> summary(lgst)

Call:
glm(formula = testdata$species ~ sw + pw, family = binomial(link = "logit"),
    data = testdata)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.82733  -0.16423   0.00429   0.11512   2.12846

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -12.915      5.021  -2.572   0.0101 *
sw            -3.796      1.760  -2.156   0.0310 *
pw            14.735      3.642   4.046 5.21e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 110.85  on 79  degrees of freedom
Residual deviance:  24.40  on 77  degrees of freedom
AIC: 30.4

Number of Fisher Scoring iterations: 7

> # plot the fitted probability curve
> p <- predict(lgst, type = 'response')
> plot(seq(-2, 2, length = 80), sort(p), col = 'blue')
> # scatter plot of the training sample data
> a <- testdata$species == 'versicolor'
> x1 <- testdata[a, 'sw']
> y1 <- testdata[a, 'pw']
> x2 <- testdata[!a, 'sw']
> y2 <- testdata[!a, 'pw']
> summary(testdata$sw)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.700   2.900   2.881   3.100   3.800
> summary(testdata$pw)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.300   1.600   1.672   2.000   2.500
> plot(x1, y1, xlim = c(1.5, 4), ylim = c(0.5, 3), xlab = 'sw', ylab = 'pw', col = 'blue')
> points(x2, y2, col = 'red')
> # plot the classification boundary, i.e. the line where h(x) = 0.5
> x3 <- seq(1.5, 4, length = 100)
> y3 <- (3.796/14.735) * x3 + 12.915/14.735
> lines(x3, y3)

Fitted probability curve:

(An S-shaped curve that basically conforms to the logistic model.)

Training sample scatter plot and classification boundary:

(The classification boundary drawn for the logistic regression is the line where h(x) = 0.5.)
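
To see where the x3 and y3 lines in the code come from (a short derivation from the fitted coefficients above): h(x) = 0.5 exactly when the linear predictor θᵀx is zero, i.e.

$$-12.915 - 3.796\,\mathrm{sw} + 14.735\,\mathrm{pw} = 0
\;\Longrightarrow\;
\mathrm{pw} = \frac{12.915 + 3.796\,\mathrm{sw}}{14.735}$$

which is the straight line drawn by lines(x3, y3).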

