This post was prompted by frustration. I am not familiar with R, but my experiment required it, so I simply learned it. Whether in the countless tutorials online or in the examples in books, discussions of logistic regression give a simple function call and a description of its output, yet several things were never made clear to me:
1. How do I train the model on the training data and then validate it on the test data (the test data and training data may overlap)?
2. How do I evaluate the predictions, that is, compute the recall, precision, and F-measure values? (A standalone sketch of these three metrics follows this list.)
3. How do I compute goodness-of-fit measures such as the Nagelkerke R²?
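To pin down what question 2 is asking for, here is what the three metrics mean in code. This is a minimal sketch with hypothetical 0/1 label vectors; the names actual and predicted are mine, not from any library:

actual    <- c(1, 0, 1, 1, 0, 1, 0, 0)   # hypothetical true labels
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)   # hypothetical predicted labels
tp <- sum(actual == 1 & predicted == 1)  # true positives
precision <- tp / sum(predicted == 1)    # fraction of predicted 1s that are correct
recall    <- tp / sum(actual == 1)       # fraction of actual 1s that are found
f_measure <- 2 * precision * recall / (precision + recall)
cat("precision =", precision, " recall =", recall, " F-measure =", f_measure, "\n")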
The books and many blog posts I found were muddled on these points. When we read a tutorial, we do not just want to see a simple function call or hear the theory explained; we want to be able to use the method quickly and correctly. In my experience, the existing online tutorials are too poor for that.
I will not describe the process in detail here; I believe you will understand it at a glance:
Train ("training.csv", header?false=testing=read.csv ("testing.csv", header = false) # import training and test data respectively GLM. Fit = GLM (V16 ~ V7, Data = training, family = binomial (link = "Logit") # generate a model using training data. Here I Use 7th columns of data to predict 16th columns. n = nrow (training) # Number of training data rows, that is, the number of samples R2 <-1-exp (GLM. fit $ deviance-glm.fit $ null. deviance)/n) # Calculate Cox-Snell goodness of fit CAT ("Cox-Snell r2 =", R2, "\ n ") r2 <-R2/(1-exp (-GLM. fit $ null. deviance)/n) # Calculate the Nagelkerke goodness of fit. At the end, we output this goodness of fit value p = predict (GLM. fit, testing) # use a model to predict the test data. P = exp (P)/(1 + exp (p )) # calculate the value of the dependent variable testing $ v16_predicted = 1 * (P> 0.5) # Add a column to the test data, that is, the prediction of V16. When P> 0.5, predicted value: 1 true_value = testing [, 16] predict_value = testing [, 17] # retrieve 16 and 17 columns respectively retrieved = sum (predict_value) precision = sum (true_value & predict_value) /retrievedrecall = sum (predict_value & true_value)/sum (true_value) f_measure = 2 * precision * recall/(precision + recall) # Calculate recall, precision, and F-measure Summary (GLM. fit) CAT ("Nagelkerke R2 =", R2, "\ n") print (precision) print (recall) print (f_measure)
I don't know why many people are confused about such a simple thing.
Here is a brief explanation of the output of summary():
Call:
glm(formula = V16 ~ V7, family = binomial(link = "logit"), data = training)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5212  -0.9990  -0.4249   1.1352   1.4978

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.744804   0.207488  -3.590 0.000331 ***
V7           0.005757   0.001362   4.226 2.38e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 307.76  on 221  degrees of freedom
Residual deviance: 277.85  on 220  degrees of freedom
AIC: 281.85

Number of Fisher Scoring iterations: 5
The key part is the Coefficients table: Estimate is the coefficient of V7 in the final prediction equation, and Pr(>|z|) is the p-value. Judging by these two values, the prediction results are acceptable.
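If you want these numbers programmatically rather than reading them off the printout, coef(summary(...)) returns the coefficient table as a matrix. A minimal sketch, assuming the glm.fit object from above:

coefs <- coef(summary(glm.fit))            # coefficient table as a matrix
estimate_v7 <- coefs["V7", "Estimate"]     # slope of V7 in the prediction equation
p_value_v7  <- coefs["V7", "Pr(>|z|)"]     # p-value for V7
cat("V7 estimate:", estimate_v7, " p-value:", p_value_v7, "\n")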