Machine Learning Quick Start (2)


Machine Learning Quick Start (2): Classification

Abstract: This article briefly describes how to use a classification algorithm to evaluate the bank's loan issuance model.

Statement: the content of this article is not original; it is my own translation and summary. Please credit the source when reprinting.

Source: https://www.dataquest.io/mission/57/classification-basics

When you apply to a bank for a credit card or a loan, the bank uses a model built from past data and decides, based on your actual situation, whether to accept your application.

 

Raw data presentation

In the past, the bank granted loans to all applicants in order to train this scoring model, then recorded whether each applicant repaid, along with their score. The paid field indicates whether the applicant actually repaid the loan: 1 means normal repayment, 0 means default. The model_score field is the score the model assigned to the applicant before the loan was granted.

import pandas

credit = pandas.read_csv("credit.csv")

# Set the threshold to 0.5 and compute the model's accuracy
# (predict repayment when the score is greater than 0.5)
pred = credit["model_score"] > 0.5
accuracy = sum(pred == credit["paid"]) / len(pred)
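Since credit.csv is not bundled with this article, here is a minimal self-contained sketch of the same accuracy computation; the scores and labels below are invented purely for illustration:

```python
import pandas

# Invented scores and repayment labels, for illustration only
credit = pandas.DataFrame({
    "model_score": [0.9, 0.8, 0.6, 0.4, 0.3, 0.2],
    "paid":        [1,   1,   0,   1,   0,   0],
})

# Predict repayment when the score exceeds the 0.5 threshold
pred = credit["model_score"] > 0.5
# Accuracy = fraction of predictions that match the actual outcome
accuracy = sum(pred == credit["paid"]) / len(pred)
print(accuracy)  # 4 of 6 predictions are correct
```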

 

                        Predicted class
                    Class = 1    Class = 0
Actual   Class = 1  F11          F10
class    Class = 0  F01          F00

The confusion matrix in this article is as follows:

 

                          Actually repaid    Actually defaulted
Predicted repayment       TP                 FP
Predicted no repayment    FN                 TN

 

True Positive (TP): a positive sample that the model predicts as positive;

False Negative (FN): a positive sample that the model predicts as negative;

False Positive (FP): a negative sample that the model predicts as positive;

True Negative (TN): a negative sample that the model predicts as negative.
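These four counts can also be read straight off sklearn's confusion matrix; a minimal sketch with invented labels (for 0/1 labels, sklearn orders the matrix as [[TN, FP], [FN, TP]]):

```python
from sklearn.metrics import confusion_matrix

# Invented actual and predicted labels, for illustration only
actual = [1, 1, 1, 0, 0, 0, 1, 0]
pred   = [1, 0, 1, 0, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(actual, pred).ravel()
print(tp, fn, fp, tn)  # 3 1 1 3
```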

 

True Positive Rate (TPR), or sensitivity: TPR = TP / (TP + FN) (positives predicted as positive / actual positives)

False Negative Rate (FNR): FNR = FN / (TP + FN) (positives predicted as negative / actual positives)

False Positive Rate (FPR): FPR = FP / (FP + TN) (negatives predicted as positive / actual negatives)

True Negative Rate (TNR), or specificity: TNR = TN / (TN + FP) (negatives predicted as negative / actual negatives)
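To make the four rates concrete, here is a small sketch that computes them from an assumed confusion matrix (the counts are invented for illustration); note that TPR + FNR = 1 and FPR + TNR = 1 by construction:

```python
# Invented confusion-matrix counts, for illustration only
TP, FN, FP, TN = 80, 20, 10, 90

TPR = TP / (TP + FN)  # sensitivity: share of actual positives caught
FNR = FN / (TP + FN)  # share of actual positives missed
FPR = FP / (FP + TN)  # share of actual negatives wrongly flagged
TNR = TN / (TN + FP)  # specificity: share of actual negatives kept

print(TPR, FNR, FPR, TNR)  # 0.8 0.2 0.1 0.9
```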

# Set the threshold to 0.5 and compute the confusion matrix above
pred = credit["model_score"] > 0.5
TP = sum((pred == 1) & (credit["paid"] == 1))
FN = sum((pred == 0) & (credit["paid"] == 1))
FP = sum((pred == 1) & (credit["paid"] == 0))
TN = sum((pred == 0) & (credit["paid"] == 0))

As long as the model's true positive rate (TPR) is greater than its false positive rate (FPR), the repaying borrowers outnumber the defaulters among accepted applicants, so the bank does not lose money.

 

ROC curve Calculation

The ROC curve is also known as the sensitivity curve. The name reflects the fact that the points on the curve are all responses to the same signal stimulus, differing only in the decision criterion applied. The receiver operating characteristic curve plots the false positive rate on the horizontal axis against the hit rate (true positive rate) on the vertical axis; the curve is traced out by the results obtained under different decision criteria for a given stimulus condition.

As described above, we now look for a threshold that makes the true positive rate greater than the false positive rate.

import numpy
import matplotlib.pyplot as plt

def roc_curve(observed, probs):
    # 100 threshold values stepping from 1.0 down to 0.01
    thresholds = numpy.asarray([(100 - j) / 100 for j in range(100)])
    # initialize to all zeros
    fprs = numpy.asarray([0. for j in range(100)])
    tprs = numpy.asarray([0. for j in range(100)])
    # loop over each threshold
    for j, thres in enumerate(thresholds):
        pred = probs > thres
        FP = sum((observed == 0) & (pred == 1))
        TN = sum((observed == 0) & (pred == 0))
        FPR = FP / (FP + TN)
        TP = sum((observed == 1) & (pred == 1))
        FN = sum((observed == 1) & (pred == 0))
        TPR = TP / (TP + FN)
        fprs[j] = FPR
        tprs[j] = TPR
    return fprs, tprs, thresholds

fpr, tpr, thres = roc_curve(credit["paid"], credit["model_score"])
# select the first threshold whose false positive rate exceeds 0.20
idx = numpy.where(fpr > 0.20)[0][0]
print('fpr: {}'.format(fpr[idx]))
print('tpr: {}'.format(tpr[idx]))
print('threshold: {}'.format(thres[idx]))
# plot the false positive rate on the x axis against the true positive rate on the y axis
plt.plot(fpr, tpr)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.show()

This shows that when the threshold is set to 0.38, FPR = 0.2 and TPR = 0.93.

# A simple example; obs must have the same length as probs
from sklearn.metrics import roc_auc_score

probs = [0.98200848, 0.92088976, 0.13125231, 0.0130085, 0.35719083,
         0.34381803, 0.46938187, 0.53918899, 0.63485958]
obs = [1, 1, 0, 0, 1, 0, 0, 1, 0]
testing_auc = roc_auc_score(obs, probs)
print("Example AUC: {auc}".format(auc=testing_auc))

auc = roc_auc_score(credit["paid"], credit["model_score"])
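The AUC is literally the area under the ROC curve, so it can also be obtained numerically from the (FPR, TPR) points. The sketch below, with invented labels and scores, checks a trapezoidal estimate (sklearn's auc helper) against roc_auc_score; sklearn's roc_curve is aliased to avoid clashing with the hand-written function above:

```python
import numpy
from sklearn.metrics import roc_curve as sk_roc_curve
from sklearn.metrics import roc_auc_score, auc

# Invented labels and scores, for illustration only
obs = numpy.array([1, 1, 0, 1, 0, 0, 1, 0])
probs = numpy.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])

fpr, tpr, _ = sk_roc_curve(obs, probs)
trapezoid_auc = auc(fpr, tpr)           # trapezoidal area under the curve
sklearn_auc = roc_auc_score(obs, probs) # direct AUC computation
print(trapezoid_auc, sklearn_auc)       # the two estimates agree
```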

# Compute precision and recall at a threshold of 0.5
pred = credit["model_score"] > 0.5
# True Positives
TP = sum((pred == 1) & (credit["paid"] == 1))
# False Positives
FP = sum((pred == 1) & (credit["paid"] == 0))
# False Negatives
FN = sum((pred == 0) & (credit["paid"] == 1))
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))

from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(credit["paid"], credit["model_score"])
plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()

In the plot, the curve drops sharply at Recall = 0.8, where Precision = 0.9. This means that only a small number of potential customers are lost while the default rate also stays very low.
