Generalized linear models in R -- the glm function (I)


Learning the R glm function:

"Please specify the source when reproduced": http://www.cnblogs.com/runner-ljt/

Ljt

I am a beginner and my understanding is limited; corrections and discussion are welcome.

Description of the glm function:

glm(formula, family = family.generator, data, control = list(...))


family: specifies the response distribution (a member of the exponential family); each distribution allows various link functions to relate the mean of the response to the linear predictor.

Common family values:

binomial(link = 'logit') ---- the response variable follows a binomial distribution and the link function is logit, i.e. logistic regression

binomial(link = 'probit') ---- the response variable follows a binomial distribution and the link function is probit

poisson(link = 'log') ---- the response variable follows a Poisson distribution, i.e. Poisson regression
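
As a quick illustration, a minimal sketch using R's built-in mtcars data (my own example; the variables am, wt, and carb are not from this post):

# logistic regression: binary response (am is 0/1), logit link
m1 <- glm(am ~ wt, family = binomial(link = 'logit'), data = mtcars)
# probit regression: same binary response, probit link
m2 <- glm(am ~ wt, family = binomial(link = 'probit'), data = mtcars)
# Poisson regression: count response (carb), log link
m3 <- glm(carb ~ wt, family = poisson(link = 'log'), data = mtcars)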

control: controls the convergence tolerance and the maximum number of iterations of the fitting algorithm.

glm.control(epsilon = 1e-8, maxit = 25, trace = FALSE)

----- maxit: the maximum number of iterations of the algorithm (default 25). To change it, pass control = list(maxit = 100).
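
A minimal sketch of passing the option (the formula, data frame, and variable names here are placeholders, not from the original post):

# raise the iteration cap explicitly via glm.control
fit <- glm(y ~ x1 + x2, family = binomial, data = mydata,
           control = glm.control(epsilon = 1e-8, maxit = 100, trace = FALSE))
# shorthand: a plain list is passed on to glm.control()
fit <- glm(y ~ x1 + x2, family = binomial, data = mydata,
           control = list(maxit = 100))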

Using the glm function:

> data <- iris[1:100, ]
> samp <- sample(100, 80)
> names(data) <- c('sl', 'sw', 'pl', 'pw', 'species')
> testdata <- data[samp, ]
> traindata <- data[-samp, ]
> lgst <- glm(testdata$species ~ pl, binomial(link = 'logit'), data = testdata)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(lgst)

Call:
glm(formula = testdata$species ~ pl, family = binomial(link = "logit"),
    data = testdata)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.836e-05  -2.110e-08  -2.110e-08   2.110e-08   1.915e-05

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -83.47   88795.25  -0.001    0.999
pl             32.09   32635.99   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1085e+02  on 79  degrees of freedom
Residual deviance: 1.4102e-09  on 78  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

Note that when the glm function is used for this logistic regression, two warnings appear:

Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

We can also see that the p-values of both coefficients are 0.999, which indicates that the regression coefficients are not significant.

First warning: the algorithm did not converge.
In logistic regression the coefficients are estimated iteratively by maximum likelihood, and the glm function's default iteration limit is maxit = 25. When the data are poorly behaved, the algorithm may not converge within 25 iterations, so we can first try raising the iteration limit. If the algorithm still fails to converge after the limit is increased, the problem likely lies in the data themselves, and they need further inspection (for example, checking for outliers or singular values).

> lgst <- glm(testdata$species ~ pl, binomial(link = 'logit'), data = testdata,
+             control = list(maxit = 100))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(lgst)

Call:
glm(formula = testdata$species ~ pl, family = binomial(link = "logit"),
    data = testdata, control = list(maxit = 100))

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.114e-05  -2.110e-08  -2.110e-08   2.110e-08   1.162e-05

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -87.18  146399.32  -0.001        1
pl             33.52   53808.49   0.001        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1085e+02  on 79  degrees of freedom
Residual deviance: 5.1817e-10  on 78  degrees of freedom

As shown above, increasing the number of iterations resolves the first warning: the algorithm now converges.

But the second warning remains, and the coefficient p-values are now 1, still not significant.

Second warning: the fitted probabilities are numerically 0 or 1.

First of all, what does this warning mean?
Let's look at the fitted results of the logistic regression on the training sample: what fitted probability does each sample get for belonging to the 'setosa' class?

> lgst <- glm(testdata$species ~ pl, binomial(link = 'logit'), data = testdata,
+             control = list(maxit = 100))
> p <- predict(lgst, type = 'response')
> plot(seq(-2, 2, length = 80), sort(p), col = 'blue')

It can be seen that the fitted probability of each training sample being 'setosa' is almost always either nearly 0 or nearly 1, not the S-shaped curve of the logistic model we expected. This is what the second warning means.

So the question is: why does this happen?
(What follows is only my personal understanding, based on several explanations I have consulted; I cannot guarantee it is correct.)

This situation can be understood as a kind of overfitting: because of the structure of the data, during the optimization search for the regression coefficients, the linear fitted values of samples in one class (y = 1) are pushed ever larger, while the linear fitted values of samples in the other class (y = 0) are pushed ever smaller.

Because maximum likelihood estimation is used to solve for the regression coefficients, the search looks for coefficients that maximize the likelihood function:
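
For the logistic model, writing $h(x) = \frac{1}{1 + e^{-\theta^{T}x}}$ for the fitted probability, the likelihood being maximized is

$$L(\theta) = \prod_{i=1}^{n} h(x_i)^{\,y_i}\,\bigl(1 - h(x_i)\bigr)^{\,1 - y_i}$$

Maximizing $L(\theta)$ rewards coefficient values for which $h(x_i)$ is close to 1 when $y_i = 1$ and close to 0 when $y_i = 0$.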

Therefore, during the search, the coefficients tend to make h(x) large for the y = 1 samples and small for the y = 0 samples.

That is, the coefficients θ drive θᵀx toward +∞ for the y = 1 class and toward −∞ for the y = 0 class, with the result that the fitted probability P(y = 1 | x; θ) → 1 for y = 1 samples and P(y = 1 | x; θ) → 0 for y = 0 samples.
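
A quick way to see this numerically is to inspect the linear predictor of the lgst fit from earlier (a sketch of my own; predict with type = 'link' returns θᵀx):

# linear predictor theta'x for each sample of the separable fit
eta <- predict(lgst, type = 'link')
range(eta)                       # enormous magnitudes on both sides of 0
p <- 1 / (1 + exp(-eta))         # logistic transform; matches predict(type = 'response')
table(round(p))                  # every fitted probability is numerically 0 or 1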

So the question arises again: what kind of data leads to this sort of overfitting?

Let's look at the pl values of the setosa and versicolor samples used in the logistic regression above. (The horizontal axis is the pl value; to keep the sample points from overlapping, an unrelated random y value is added to spread them out.)
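
A minimal sketch of how such a plot can be drawn (the jitter range and colors are my own choices):

# pl on the x-axis; a random, meaningless y value spreads out overlapping points
a <- testdata$species == 'setosa'
plot(testdata$pl[a], runif(sum(a)), col = 'blue', xlim = c(0, 6), ylim = c(0, 1),
     xlab = 'pl', ylab = 'random jitter')
points(testdata$pl[!a], runif(sum(!a)), col = 'red')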

It can be seen that the two classes are clearly linearly separable.

So as long as the absolute value of the slope of the univariate linear predictor is large enough, h(x) tends toward 1 for the y = 1 class and toward 0 for the y = 0 class, and the coefficient search keeps increasing that slope without bound.

So when the sample data are linearly separable, logistic regression often overfits in this way and produces the second warning: the fitted probabilities are numerically 0 or 1.

When the second warning occurs, the logistic model is usually not appropriate. For linearly separable sample data like this, a direct rule-based judgment is simple and effective (for example, when pl < 2.5, classify the sample as setosa; when pl > 2.5, classify it as versicolor).
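
A minimal sketch of such a rule on the same data (my own illustration):

# rule-based classifier using the 2.5 threshold from the text
pred <- ifelse(testdata$pl < 2.5, 'setosa', 'versicolor')
# cross-tabulate rule predictions against the true classes
table(predicted = pred, actual = testdata$species)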

Below, the logistic regression process is demonstrated on two-dimensional training data that are not linearly separable.

> data <- iris[51:150, ]
> samp <- sample(100, 80)
> names(data) <- c('sl', 'sw', 'pl', 'pw', 'species')
> testdata <- data[samp, ]
> traindata <- data[-samp, ]
> lgst <- glm(testdata$species ~ sw + pw, binomial(link = 'logit'), data = testdata)
> summary(lgst)

Call:
glm(formula = testdata$species ~ sw + pw, family = binomial(link = "logit"),
    data = testdata)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.82733  -0.16423   0.00429   0.11512   2.12846

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -12.915      5.021  -2.572   0.0101 *
sw            -3.796      1.760  -2.156   0.0310 *
pw            14.735      3.642   4.046 5.21e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 110.85  on 79  degrees of freedom
Residual deviance:  24.40  on 77  degrees of freedom
AIC: 30.4

Number of Fisher Scoring iterations: 7

> # plot the fitted probability curve
> p <- predict(lgst, type = 'response')
> plot(seq(-2, 2, length = 80), sort(p), col = 'blue')
> # scatter plot of the training sample data
> a <- testdata$species == 'versicolor'
> x1 <- testdata[a, 'sw']
> y1 <- testdata[a, 'pw']
> x2 <- testdata[!a, 'sw']
> y2 <- testdata[!a, 'pw']
> summary(testdata$sw)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.700   2.900   2.881   3.100   3.800
> summary(testdata$pw)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.300   1.600   1.672   2.000   2.500
> plot(x1, y1, xlim = c(1.5, 4), ylim = c(0.5, 3), xlab = 'sw', ylab = 'pw', col = 'blue')
> points(x2, y2, col = 'red')
> # plot the classification boundary, i.e. the line where h(x) = 0.5
> x3 <- seq(1.5, 4, length = 100)
> y3 <- (3.796/14.735) * x3 + 12.915/14.735
> lines(x3, y3)

Fitted probability curve:

(An S-shaped curve that basically conforms to the logistic model.)

Training sample scatter plot and classification boundary:

(The classification boundary drawn for the logistic regression is the line where h(x) = 0.5.)
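
To see where the x3 and y3 lines in the code come from (a short derivation from the fitted coefficients above): h(x) = 0.5 exactly when the linear predictor θᵀx is zero, i.e.

$$-12.915 - 3.796\,\mathrm{sw} + 14.735\,\mathrm{pw} = 0
\;\Longrightarrow\;
\mathrm{pw} = \frac{12.915 + 3.796\,\mathrm{sw}}{14.735}$$

which is the straight line drawn by lines(x3, y3).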

