Machine Learning Quick Start (2) - Classification
Abstract: This article briefly describes how to evaluate a bank's loan-approval model using classification metrics.
Note: The content of this article is not original; it has been translated and summarized by me. Please cite the source when reprinting.
Source: https://www.dataquest.io/mission/57/classification-basics
When you apply for a credit card or a loan, the bank uses a model built from past data to decide, based on your particular situation, whether to accept your application.
Raw data
To train this scoring model, the bank at one time granted loans to all applicants and then recorded whether each applicant repaid, together with the score the model had given them. The paid field indicates whether the applicant actually repaid the loan: 1 means normal repayment and 0 means default. The model_score field is the score the model assigned to the applicant before the loan was granted.
import pandas
credit = pandas.read_csv("credit.csv")

# Set the threshold to 0.5 and calculate the accuracy of this model
# (predict repayment when model_score is greater than 0.5, then compare with the actual paid field)
pred = credit["model_score"] > 0.5
accuracy = sum(pred == credit["paid"]) / len(pred)
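As a quick sanity check (a minimal sketch of my own, not from the original mission, assuming the credit DataFrame loaded above), the columns and the class balance can be inspected like this:

print(credit.head())                     # first few rows: the paid and model_score columns
print(credit["paid"].value_counts())     # how many applicants repaid (1) vs defaulted (0)
print(credit["model_score"].describe())  # distribution of the model's scores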
The general form of the confusion matrix, where Fij is the number of samples whose actual class is i and whose predicted class is j:

|                  | Predicted class = 1 | Predicted class = 0 |
|------------------|---------------------|---------------------|
| Actual class = 1 | F11                 | F10                 |
| Actual class = 0 | F01                 | F00                 |
The confusion matrix in this article is as follows:

|                        | Actually repaid | Actually defaulted |
|------------------------|-----------------|--------------------|
| Predicted to repay     | TP              | FP                 |
| Predicted not to repay | FN              | TN                 |
True Positive (TP): a positive sample that the model predicts as positive.
False Negative (FN): a positive sample that the model predicts as negative.
False Positive (FP): a negative sample that the model predicts as positive.
True Negative (TN): a negative sample that the model predicts as negative.
True Positive Rate (TPR), also called sensitivity: TPR = TP / (TP + FN) (positive samples predicted as positive / all actual positive samples)
False Negative Rate (FNR): FNR = FN / (TP + FN) (positive samples predicted as negative / all actual positive samples)
False Positive Rate (FPR): FPR = FP / (FP + TN) (negative samples predicted as positive / all actual negative samples)
True Negative Rate (TNR), also called specificity: TNR = TN / (TN + FP) (negative samples predicted as negative / all actual negative samples)
These rates follow directly from the confusion-matrix counts, as the sketch after the next code block shows.
# Set the threshold to 0.5 and calculate the confusion matrix above
TP = sum((credit['model_score'] > 0.5) & (credit['paid'] == 1))
FN = sum((credit['model_score'] <= 0.5) & (credit['paid'] == 1))
FP = sum((credit['model_score'] > 0.5) & (credit['paid'] == 0))
TN = sum((credit['model_score'] <= 0.5) & (credit['paid'] == 0))
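Building on those counts, here is a minimal sketch (my own addition, not from the original mission) of the four rates defined above:

# Compute the four rates from the confusion-matrix counts (threshold = 0.5)
TPR = float(TP) / (TP + FN)   # true positive rate (sensitivity)
FNR = float(FN) / (TP + FN)   # false negative rate
FPR = float(FP) / (FP + TN)   # false positive rate
TNR = float(TN) / (TN + FP)   # true negative rate (specificity)
print('TPR: {}  FNR: {}  FPR: {}  TNR: {}'.format(TPR, FNR, FPR, TNR))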
As long as our model's true positive rate (TPR) is greater than its false positive rate (FPR), the accepted applicants who repay outnumber those who default, so the bank does not lose money.
ROC curve calculation
The ROC curve is also known as the sensitivity curve. The name comes from the fact that every point on the curve reflects the same sensitivity: each point is a response to the same signal stimulus, just obtained under a different decision criterion. The receiver operating characteristic curve is plotted with the false positive rate (false alarm probability) on the horizontal axis and the true positive rate (hit probability) on the vertical axis, and it is traced out by the different results obtained under different decision criteria for a given stimulus condition.
As described above, we are now looking for a threshold that makes the true positive rate greater than the false positive rate.
import numpy
import matplotlib.pyplot as plt

def roc_curve(observed, probs):
    # split the threshold range from 1 down to 0 into 100 steps
    thresholds = numpy.asarray([(100 - j) / 100 for j in range(100)])
    # initialize the rate arrays to all zeros
    fprs = numpy.asarray([0. for j in range(100)])
    tprs = numpy.asarray([0. for j in range(100)])
    # loop over each threshold
    for j, thres in enumerate(thresholds):
        pred = probs > thres
        FP = sum((observed == 0) & (pred == 1))
        TN = sum((observed == 0) & (pred == 0))
        FPR = float(FP) / (FP + TN)
        TP = sum((observed == 1) & (pred == 1))
        FN = sum((observed == 1) & (pred == 0))
        TPR = float(TP) / (TP + FN)
        fprs[j] = FPR
        tprs[j] = TPR
    return fprs, tprs, thresholds

fpr, tpr, thres = roc_curve(credit["paid"], credit["model_score"])
idx = numpy.where(fpr > 0.20)[0][0]  # pick the first threshold whose false positive rate exceeds 0.2
print('fpr: {}'.format(fpr[idx]))
print('tpr: {}'.format(tpr[idx]))
print('threshold: {}'.format(thres[idx]))

# plot with the false positive rate on the x axis and the true positive rate on the y axis
plt.plot(fpr, tpr)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.show()
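scikit-learn ships an equivalent function; as a side sketch (my own addition, assuming the same credit DataFrame), the same plot can be produced with sklearn.metrics.roc_curve, which picks its thresholds from the distinct score values instead of a fixed grid:

from sklearn.metrics import roc_curve as sk_roc_curve
import matplotlib.pyplot as plt

sk_fpr, sk_tpr, sk_thresholds = sk_roc_curve(credit["paid"], credit["model_score"])
plt.plot(sk_fpr, sk_tpr)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.show()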
The plot shows that when the threshold is set to 0.38, FPR = 0.2 and TPR = 0.93.
# A simple example
from sklearn.metrics import roc_auc_score

probs = [0.98200848, 0.92088976, 0.13125231, 0.0130085, 0.35719083,
         0.34381803, 0.46938187, 0.53918899, 0.63485958]
obs = [1, 1, 0, 0, 1, 0, 0, 1, 0]
testing_auc = roc_auc_score(obs, probs)
print("Example AUC: {auc}".format(auc=testing_auc))
# Calculate the AUC of the credit scoring model
auc = roc_auc_score(credit["paid"], credit["model_score"])
print("Credit model AUC: {}".format(auc))
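As a rough cross-check (my own addition, not from the original mission), the AUC can also be approximated by trapezoidal integration of the fpr/tpr arrays returned by the hand-written roc_curve function above; with only 100 fixed thresholds this is just an approximation of what roc_auc_score computes:

import numpy

# fpr increases as the threshold decreases, so the pairs are already ordered for integration
approx_auc = numpy.trapz(tpr, fpr)
print("Approximate AUC from the manual curve: {}".format(approx_auc))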
# Calculate precision and recall at a threshold of 0.5
pred = credit["model_score"] > 0.5
# True Positives
TP = sum((pred == 1) & (credit["paid"] == 1))
# False Positives
FP = sum((pred == 1) & (credit["paid"] == 0))
# False Negatives
FN = sum((pred == 0) & (credit["paid"] == 1))
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
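The same two numbers can be obtained from scikit-learn; a minimal sketch (my own addition) using precision_score and recall_score at the same 0.5 threshold:

from sklearn.metrics import precision_score, recall_score

pred = credit["model_score"] > 0.5
print('precision: {}'.format(precision_score(credit["paid"], pred)))
print('recall: {}'.format(recall_score(credit["paid"], pred)))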
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(credit["paid"], credit["model_score"])
plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()
In the plot, the curve drops sharply around Recall = 0.8, where Precision is still about 0.9; at such an operating point only a small number of potential customers are lost while the default rate remains very low.
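To read that operating point off the arrays returned by precision_recall_curve, here is a small sketch (my own addition; scikit-learn returns recall in decreasing order, so the last index with recall >= 0.8 is the point just before the drop):

import numpy

idx = numpy.where(recall >= 0.8)[0][-1]  # last point whose recall is still at least 0.8
print('recall: {}  precision: {}'.format(recall[idx], precision[idx]))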