Logistic regression in R


This article covers implementing logistic regression in R, checking the model, evaluating it with a confusion matrix and ROC curve, and validating it with 10-fold cross-validation.

Reference blogs: http://blog.csdn.net/tiaaaaa/article/details/58116346 and http://blog.csdn.net/ai_vivi/article/details/43836641

1. Test set and training set (3:7 split). Data source: http://archive.ics.uci.edu/ml/datasets/statlog+(Australian+credit+approval)

austra <- read.table("australian.dat")
head(austra)                         # preview the first 6 rows
N <- length(austra$V15)              # 690 rows, 15 columns
# ind = 1 with probability 0.7 (training set), ind = 2 with probability 0.3 (test set)
ind <- sample(2, N, replace = TRUE, prob = c(0.7, 0.3))
aus_train <- austra[ind == 1, ]
aus_test  <- austra[ind == 2, ]

  

2. Fitting the logistic regression and predicting

pre <- glm(V15 ~ ., data = aus_train, family = binomial(link = "logit"))
summary(pre)
real <- aus_test$V15
predict_ <- predict.glm(pre, type = "response", newdata = aus_test)
predict <- ifelse(predict_ > 0.5, 1, 0)
aus_test$predict <- predict
head(aus_test)
# write.csv(aus_test, "aus_test.csv")

  

3. Model Checking

res <- data.frame(real, predict)
n <- nrow(aus_train)

# Cox-Snell goodness of fit
R2 <- 1 - exp((pre$deviance - pre$null.deviance) / n)
cat("Cox-Snell R2 =", R2, "\n")

# Nagelkerke goodness of fit
R2 <- R2 / (1 - exp(-pre$null.deviance / n))
cat("Nagelkerke R2 =", R2, "\n")     # Nagelkerke R2 = 0.7379711

# Other model diagnostics
# residuals(pre)      # residuals
# coefficients(pre)   # coefficients
# anova(pre)          # analysis of deviance

  

4. Accuracy, precision, recall and F-measure

true_value <- aus_test[, 15]
predict_value <- aus_test[, 16]

# Model accuracy: proportion of correct predictions in the test set
error <- predict_value - true_value
accuracy <- (nrow(aus_test) - sum(abs(error))) / nrow(aus_test)

# Quantities from the confusion matrix (explained below)
# precision: of the samples predicted as 1, how many are truly 1
precision <- sum(true_value & predict_value) / sum(predict_value)
# recall: of the samples that are truly 1, how many are predicted as 1
recall <- sum(predict_value & true_value) / sum(true_value)
# Precision and recall can pull in opposite directions, so they are combined;
# the most common combination is the F-measure (also called F-score),
# a weighted harmonic mean of precision and recall
F_measure <- 2 * precision * recall / (precision + recall)

# Print the results
print(accuracy)
print(precision)
print(recall)
print(F_measure)

# The confusion matrix gives the counts TP, FN, FP, TN

  

5. ROC Curve

# ROC curve (explained in detail below)

# Method 1: ROCR
# install.packages("ROCR")
library(ROCR)
pred <- prediction(predict_, true_value)   # predicted probabilities (before the 0.5 cut) and true labels
performance(pred, "auc")@y.values          # AUC value
perf <- performance(pred, "tpr", "fpr")
plot(perf)

# Method 2: pROC
# install.packages("pROC")
library(pROC)
modelroc <- roc(true_value, predict_)
plot(modelroc, print.auc = TRUE, auc.polygon = TRUE, legacy.axes = TRUE,
     grid = c(0.1, 0.2), grid.col = c("green", "red"),
     max.auc.polygon = TRUE, auc.polygon.col = "skyblue", print.thres = TRUE)
# Draws the ROC curve, marks the best threshold, and prints the AUC value

# Method 3: compute the ROC curve by hand
tpr <- rep(0, 1000)
fpr <- rep(0, 1000)
p <- predict_
for (i in 1:1000) {
  p0 <- i / 1000
  ypred <- 1 * (p > p0)
  tpr[i] <- sum(ypred * true_value) / sum(true_value)
  fpr[i] <- sum(ypred * (1 - true_value)) / sum(1 - true_value)
}
plot(fpr, tpr, type = "l", col = 2)
points(c(0, 1), c(0, 1), type = "l", lty = 2)

  

6. An alternative way to select the training and test sets: 10-fold cross-validation

australian <- read.table("australian.dat")

# Split the data into ten random folds
# install.packages("caret")
set.seed(7)                        # fix the grouping produced by createFolds
library(caret)
folds <- createFolds(y = australian$V15, k = 10)

# Loop over the ten folds, reporting test-set and training-set accuracy
max <- 0
num <- 0
for (i in 1:10) {
  fold_test  <- australian[folds[[i]], ]    # folds[[i]] as the test set
  fold_train <- australian[-folds[[i]], ]   # the rest as the training set
  print("**********")
  fold_pre <- glm(V15 ~ ., family = binomial(link = "logit"), data = fold_train)
  fold_predict <- predict(fold_pre, type = "response", newdata = fold_test)
  fold_predict <- ifelse(fold_predict > 0.5, 1, 0)
  fold_test$predict <- fold_predict
  fold_error <- fold_test[, 16] - fold_test[, 15]
  fold_accuracy <- (nrow(fold_test) - sum(abs(fold_error))) / nrow(fold_test)
  print(i)
  print("***test-set accuracy***")
  print(fold_accuracy)
  print("***training-set accuracy***")
  fold_predict2 <- predict(fold_pre, type = "response", newdata = fold_train)
  fold_predict2 <- ifelse(fold_predict2 > 0.5, 1, 0)
  fold_train$predict <- fold_predict2
  fold_error2 <- fold_train[, 16] - fold_train[, 15]
  fold_accuracy2 <- (nrow(fold_train) - sum(abs(fold_error2))) / nrow(fold_train)
  print(fold_accuracy2)
  if (fold_accuracy > max) {
    max <- fold_accuracy
    num <- i
  }
}
print(max)
print(num)
# The results identify the fold with the highest test-set accuracy:
# take folds[[num]] as the test set and the rest as the training set.

  

7. Accuracy of the best fold from 10-fold cross-validation

# Test-set accuracy for the most accurate of the ten folds
testi  <- australian[folds[[num]], ]
traini <- australian[-folds[[num]], ]   # the remaining folds as the training set
prei <- glm(V15 ~ ., family = binomial(link = "logit"), data = traini)
predicti <- predict.glm(prei, type = "response", newdata = testi)
predicti <- ifelse(predicti > 0.5, 1, 0)
testi$predict <- predicti
# write.csv(testi, "ausfold_test.csv")
errori <- testi[, 16] - testi[, 15]
accuracyi <- (nrow(testi) - sum(abs(errori))) / nrow(testi)

# Training-set accuracy for the same fold
predicti2 <- predict.glm(prei, type = "response", newdata = traini)
predicti2 <- ifelse(predicti2 > 0.5, 1, 0)
traini$predict <- predicti2
errori2 <- traini[, 16] - traini[, 15]
accuracyi2 <- (nrow(traini) - sum(abs(errori2))) / nrow(traini)

# Test-set accuracy, fold index, and training-set accuracy
accuracyi; num; accuracyi2
# write.csv(traini, "ausfold_train.csv")

  

Confusion matrix

                   Predicted 1             Predicted 0
Actual 1           True Positive (TP)      False Negative (FN)     Actual positive (TP+FN)
Actual 0           False Positive (FP)     True Negative (TN)      Actual negative (FP+TN)
                   Predicted positive      Predicted negative      Total
                   (TP+FP)                 (TN+FN)                 (TP+TN+FP+FN)

Accuracy: (TP+TN)/(TP+TN+FN+FP)

Error rate: (FN+FP)/(TP+TN+FN+FP)

Recall (hit rate): TP/(TP+FN), of all ground-truth positive samples, how many are identified as positive;

Precision: TP/(TP+FP), of all samples identified as positive, how many are truly positive;

TPR (true positive rate): TP/(TP+FN), which is the same as recall;

FAR (false acceptance rate), also called FPR (false positive rate): FP/(FP+TN), the false alarm rate, i.e. of all ground-truth negative samples, how many are identified as positive;

FRR (false rejection rate): FN/(TP+FN), the miss rate, i.e. of all ground-truth positive samples, how many are identified as negative; it equals 1 - recall.
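
To make the definitions above concrete, here is a minimal sketch (not part of the original code) that derives the four counts and the metrics from the true_value and predict_value vectors built in section 4:

# Minimal sketch: confusion-matrix counts and metrics from section 4's vectors
TP <- sum(true_value == 1 & predict_value == 1)
FN <- sum(true_value == 1 & predict_value == 0)
FP <- sum(true_value == 0 & predict_value == 1)
TN <- sum(true_value == 0 & predict_value == 0)

table(Actual = true_value, Predicted = predict_value)   # same counts as a 2x2 table

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)          # TPR
fpr       <- FP / (FP + TN)          # FAR / FPR
frr       <- FN / (TP + FN)          # 1 - recall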

ROC Curve (receiver operating characteristic curve)

  1. The horizontal axis is FAR (FPR) and the vertical axis is recall (TPR);

  2. Each threshold gives one recognition result, which corresponds to one point (FPR, TPR). When the threshold is at its maximum, every sample is classified as negative, which gives the point (0,0) in the lower left corner; when the threshold is at its minimum, every sample is classified as positive, which gives the point (1,1) in the upper right corner. As the threshold moves from its maximum to its minimum, both TP and FP increase gradually.

  3. A good classification model should lie as close to the upper left corner of the plot as possible, while a random-guessing model lies on the main diagonal connecting the points (TPR=0, FPR=0) and (TPR=1, FPR=1);

  4. The algorithm can be measured by the AUC (area under the ROC curve): if the model is perfect, its AUC is 1; if the model is simple random guessing, its AUC is 0.5; if one model is better than another, it has a larger area under the curve;

  5. EER (equal error rate): FAR and FRR are two parameters of the same system and can be plotted against the threshold in the same coordinate system. FAR decreases as the threshold increases, while FRR increases with the threshold, so the two curves must intersect. The intersection is the point where FAR and FRR are equal at some threshold, and this value is customarily used to measure the overall performance of an algorithm. For a good fingerprint-matching algorithm, both FAR and FRR should be as small as possible at the same threshold (see the sketch below).
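
As an illustration of point 5, here is a minimal sketch, assuming the tpr and fpr vectors from the hand-rolled threshold sweep in section 5 (method 3) are still in the workspace; it approximates the EER as the threshold where FAR and FRR are closest:

# Minimal sketch (not from the original post): approximate the EER using the
# 1000-step threshold sweep from section 5, where FAR = FPR and FRR = 1 - TPR
far <- fpr                           # false acceptance rate at each threshold
frr <- 1 - tpr                       # false rejection rate at each threshold
i_eer <- which.min(abs(far - frr))   # index where the two curves are closest
cat("EER threshold =", i_eer / 1000,
    " FAR =", far[i_eer], " FRR =", frr[i_eer], "\n")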
