Machine Learning Quick Start (2) - Classification
Abstract: This article briefly describes how to evaluate a bank's loan-approval model using classification metrics.
Note: The content of this article is not original; it has been translated and summarized by me. Please cite the source when reprinting.
Source: https://www.dataquest.io/mission/57/classification-basics
When you apply for a credit card or a loan, the bank uses a model built from past data to decide, based on your particular situation, whether to accept your application.
Raw data
To train this scoring model, the bank at one time granted loans to all applicants and then recorded whether each applicant repaid, together with the score the model had given them. The paid field indicates whether the applicant actually repaid the loan: 1 means normal repayment and 0 means default. The model_score field is the score the model assigned to the applicant before the loan was granted.
import pandas
credit = pandas.read_csv("credit.csv")

# Set the threshold to 0.5 and calculate the accuracy of this model
# (predict repayment when model_score is greater than 0.5, then compare with the actual paid field)
pred = credit["model_score"] > 0.5
accuracy = sum(pred == credit["paid"]) / len(pred)
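As a quick sanity check (a minimal sketch of my own, not from the original mission, assuming the credit DataFrame loaded above), the columns and the class balance can be inspected like this:

print(credit.head())                     # first few rows: the paid and model_score columns
print(credit["paid"].value_counts())     # how many applicants repaid (1) vs defaulted (0)
print(credit["model_score"].describe())  # distribution of the model's scores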
The general form of the confusion matrix, where Fij is the number of samples whose actual class is i and whose predicted class is j:

|                  | Predicted class = 1 | Predicted class = 0 |
|------------------|---------------------|---------------------|
| Actual class = 1 | F11                 | F10                 |
| Actual class = 0 | F01                 | F00                 |
The confusion matrix in this article is as follows:

|                        | Actually repaid | Actually defaulted |
|------------------------|-----------------|--------------------|
| Predicted to repay     | TP              | FP                 |
| Predicted not to repay | FN              | TN                 |
True Positive (TP): a positive sample that the model predicts as positive.
False Negative (FN): a positive sample that the model predicts as negative.
False Positive (FP): a negative sample that the model predicts as positive.
True Negative (TN): a negative sample that the model predicts as negative.
True Positive Rate (TPR), also called sensitivity: TPR = TP / (TP + FN) (positive samples predicted as positive / all actual positive samples)
False Negative Rate (FNR): FNR = FN / (TP + FN) (positive samples predicted as negative / all actual positive samples)
False Positive Rate (FPR): FPR = FP / (FP + TN) (negative samples predicted as positive / all actual negative samples)
True Negative Rate (TNR), also called specificity: TNR = TN / (TN + FP) (negative samples predicted as negative / all actual negative samples)
These rates follow directly from the confusion-matrix counts, as the sketch after the next code block shows.
# Set the threshold to 0.5 and calculate the confusion matrix above
TP = sum((credit['model_score'] > 0.5) & (credit['paid'] == 1))
FN = sum((credit['model_score'] <= 0.5) & (credit['paid'] == 1))
FP = sum((credit['model_score'] > 0.5) & (credit['paid'] == 0))
TN = sum((credit['model_score'] <= 0.5) & (credit['paid'] == 0))
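Building on those counts, here is a minimal sketch (my own addition, not from the original mission) of the four rates defined above:

# Compute the four rates from the confusion-matrix counts (threshold = 0.5)
TPR = float(TP) / (TP + FN)   # true positive rate (sensitivity)
FNR = float(FN) / (TP + FN)   # false negative rate
FPR = float(FP) / (FP + TN)   # false positive rate
TNR = float(TN) / (TN + FP)   # true negative rate (specificity)
print('TPR: {}  FNR: {}  FPR: {}  TNR: {}'.format(TPR, FNR, FPR, TNR))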
As long as our model's true positive rate (TPR) is greater than its false positive rate (FPR), the accepted applicants who repay outnumber those who default, so the bank does not lose money.
ROC curve calculation
The ROC curve is also known as the sensitivity curve. The name comes from the fact that every point on the curve reflects the same sensitivity: each point is a response to the same signal stimulus, just obtained under a different decision criterion. The receiver operating characteristic curve is plotted with the false positive rate (false alarm probability) on the horizontal axis and the true positive rate (hit probability) on the vertical axis, and it is traced out by the different results obtained under different decision criteria for a given stimulus condition.
As described above, we are now looking for a threshold that makes the true positive rate greater than the false positive rate.
import numpy
import matplotlib.pyplot as plt

def roc_curve(observed, probs):
    # split the threshold range from 1 down to 0 into 100 steps
    thresholds = numpy.asarray([(100 - j) / 100 for j in range(100)])
    # initialize the rate arrays to all zeros
    fprs = numpy.asarray([0. for j in range(100)])
    tprs = numpy.asarray([0. for j in range(100)])
    # loop over each threshold
    for j, thres in enumerate(thresholds):
        pred = probs > thres
        FP = sum((observed == 0) & (pred == 1))
        TN = sum((observed == 0) & (pred == 0))
        FPR = float(FP) / (FP + TN)
        TP = sum((observed == 1) & (pred == 1))
        FN = sum((observed == 1) & (pred == 0))
        TPR = float(TP) / (TP + FN)
        fprs[j] = FPR
        tprs[j] = TPR
    return fprs, tprs, thresholds

fpr, tpr, thres = roc_curve(credit["paid"], credit["model_score"])
idx = numpy.where(fpr > 0.20)[0][0]  # pick the first threshold whose false positive rate exceeds 0.2
print('fpr: {}'.format(fpr[idx]))
print('tpr: {}'.format(tpr[idx]))
print('threshold: {}'.format(thres[idx]))

# plot with the false positive rate on the x axis and the true positive rate on the y axis
plt.plot(fpr, tpr)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.show()
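scikit-learn ships an equivalent function; as a side sketch (my own addition, assuming the same credit DataFrame), the same plot can be produced with sklearn.metrics.roc_curve, which picks its thresholds from the distinct score values instead of a fixed grid:

from sklearn.metrics import roc_curve as sk_roc_curve
import matplotlib.pyplot as plt

sk_fpr, sk_tpr, sk_thresholds = sk_roc_curve(credit["paid"], credit["model_score"])
plt.plot(sk_fpr, sk_tpr)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.show()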
The plot shows that when the threshold is set to 0.38, FPR = 0.2 and TPR = 0.93.
# A simple example
from sklearn.metrics import roc_auc_score

probs = [0.98200848, 0.92088976, 0.13125231, 0.0130085, 0.35719083,
         0.34381803, 0.46938187, 0.53918899, 0.63485958]
obs = [1, 1, 0, 0, 1, 0, 0, 1, 0]
testing_auc = roc_auc_score(obs, probs)
print("Example AUC: {auc}".format(auc=testing_auc))
# Calculate the AUC of the credit scoring model
auc = roc_auc_score(credit["paid"], credit["model_score"])
print("Credit model AUC: {}".format(auc))
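As a rough cross-check (my own addition, not from the original mission), the AUC can also be approximated by trapezoidal integration of the fpr/tpr arrays returned by the hand-written roc_curve function above; with only 100 fixed thresholds this is just an approximation of what roc_auc_score computes:

import numpy

# fpr increases as the threshold decreases, so the pairs are already ordered for integration
approx_auc = numpy.trapz(tpr, fpr)
print("Approximate AUC from the manual curve: {}".format(approx_auc))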
# Calculate precision and recall at a threshold of 0.5
pred = credit["model_score"] > 0.5
# True Positives
TP = sum((pred == 1) & (credit["paid"] == 1))
# False Positives
FP = sum((pred == 1) & (credit["paid"] == 0))
# False Negatives
FN = sum((pred == 0) & (credit["paid"] == 1))
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
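The same two numbers can be obtained from scikit-learn; a minimal sketch (my own addition) using precision_score and recall_score at the same 0.5 threshold:

from sklearn.metrics import precision_score, recall_score

pred = credit["model_score"] > 0.5
print('precision: {}'.format(precision_score(credit["paid"], pred)))
print('recall: {}'.format(recall_score(credit["paid"], pred)))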
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(credit["paid"], credit["model_score"])
plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()
In the plot, the curve drops sharply around Recall = 0.8, where Precision is still about 0.9; at such an operating point only a small number of potential customers are lost while the default rate remains very low.
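To read that operating point off the arrays returned by precision_recall_curve, here is a small sketch (my own addition; scikit-learn returns recall in decreasing order, so the last index with recall >= 0.8 is the point just before the drop):

import numpy

idx = numpy.where(recall >= 0.8)[0][-1]  # last point whose recall is still at least 0.8
print('recall: {}  precision: {}'.format(recall[idx], precision[idx]))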