Classification Model Assessment:
| Indicator | Description | Scikit-learn function |
| --- | --- | --- |
| Precision | Precision | from sklearn.metrics import precision_score |
| Recall | Recall rate | from sklearn.metrics import recall_score |
| F1 | F1 score | from sklearn.metrics import f1_score |
| Confusion Matrix | Confusion matrix | from sklearn.metrics import confusion_matrix |
| ROC | ROC curve | from sklearn.metrics import roc_curve |
| AUC | Area under the ROC curve | from sklearn.metrics import auc |
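As a quick illustration of how these functions are called, here is a minimal sketch with made-up labels (the y_true and y_pred values below are assumptions for illustration, not data from the article):

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [0, 1, 1, 0, 1, 1]  # assumed ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1]  # assumed model predictions

print(precision_score(y_true, y_pred))   # precision = TP / (TP + FP)
print(recall_score(y_true, y_pred))      # recall    = TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class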
Regression Model Evaluation:
| Indicator | Description | Scikit-learn function |
| --- | --- | --- |
| Mean Squared Error (MSE, RMSE) | Mean squared error | from sklearn.metrics import mean_squared_error |
| Absolute Error (MAE, MedAE) | Absolute error | from sklearn.metrics import mean_absolute_error, median_absolute_error |
| R-squared | R-squared value | from sklearn.metrics import r2_score |
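Similarly, a minimal sketch of the regression metrics (the y_true and y_pred arrays are made-up values for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # assumed ground-truth values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # assumed predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is the square root of MSE
mae = mean_absolute_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, rmse, mae, medae, r2)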
ROC and AUC definitions
ROC stands for Receiver Operating Characteristic. The area under the ROC curve is the AUC (Area Under the Curve). AUC is used to measure the performance (generalization ability) of a machine learning algorithm on a binary classification problem.
Key concepts needed to calculate the ROC
First, let's explain some concepts that come up in binary classification problems: True Positive, False Positive, True Negative, and False Negative. They are distinguished by the combination of the true category and the predicted category.
Suppose we have a batch of test samples with only two categories: positive and negative. Picture the usual confusion-matrix layout: the left half of the predicted categories is positive and the right half is negative, while the truly positive samples sit in the upper part and the truly negative samples in the lower part. A predicted positive is marked P (Positive) and a predicted negative N (Negative); when the prediction agrees with the true value it is marked T (True), and when it disagrees it is marked F (False).
TP: the predicted category is P (positive) and the true category is P
FP: the predicted category is P and the true category is N (negative)
TN: the predicted category is N and the true category is N
FN: the predicted category is N and the true category is P
The total number of truly positive samples is TP + FN. TPR is the True Positive Rate: TPR = TP / (TP + FN).
Similarly, the total number of truly negative samples is FP + TN. FPR is the False Positive Rate: FPR = FP / (TN + FP).
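As a quick check of these two formulas, here is a tiny sketch with assumed counts (the numbers are illustrative only, not from the article):

# Assumed counts from some hypothetical classifier
TP, FN, FP, TN = 8, 2, 3, 7

TPR = TP / (TP + FN)  # true positive rate  = 8 / 10 = 0.8
FPR = FP / (TN + FP)  # false positive rate = 3 / 10 = 0.3
print(TPR, FPR)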
There is also a concept called the "truncation point" (threshold). After scoring the test samples, a machine learning algorithm can output the probability that each sample belongs to a certain category. For example, if sample t1 has a probability of 0.3 of belonging to class P, and we usually decide that a probability below 0.5 means the sample belongs to class N, then 0.5 is the truncation point.
To sum up, the three most important concepts for the ROC are TPR, FPR, and the truncation point.
As the truncation point takes different values, the resulting TPR and FPR change. Plotting the (FPR, TPR) pairs obtained at the different truncation points in a two-dimensional coordinate system gives the ROC curve, with FPR on the horizontal axis and TPR on the vertical axis.
Calculating the ROC with sklearn
sklearn gives an example of the ROC calculation [1].
import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
The calculation yields the following results (FPR, TPR, truncation points):
fpr = array([0. , 0.5, 0.5, 1. ])
tpr = array([0.5, 0.5, 1. , 1. ])
thresholds = array([0.8, 0.4, 0.35, 0.1])  # truncation points
Plotting the FPR and TPR values from the result in a two-dimensional coordinate system gives the ROC curve (the blue line in the figure), and the area under the ROC curve is the AUC (the light yellow shaded region).
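Since the figure itself is not reproduced here, the following sketch (matplotlib is assumed to be available) shows one way to draw the curve from the fpr/tpr arrays above and compute its area with sklearn's auc:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
roc_auc = metrics.auc(fpr, tpr)            # area under the curve, 0.75 for this data

plt.plot(fpr, tpr, label='ROC curve (AUC = %.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')   # diagonal = random guessing
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc='lower right')
plt.show()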
Detailed calculation process
The example above uses the following data:
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
How are TPR and FPR calculated from this data?
1. Analyze the data
y is a one-dimensional array (the true classification of the samples). Its values represent the categories (there are two classes, 1 and 2). We take 1 in y to represent a negative example and 2 a positive example, and rewrite y as:
y_true = [0, 0, 1, 1]
scores is the probability that each sample belongs to the positive class.
2. Sort the data by score
| Sample | Predicted probability of P (score) | True category |
| --- | --- | --- |
| y[0] | 0.1 | N |
| y[2] | 0.35 | P |
| y[1] | 0.4 | N |
| y[3] | 0.8 | P |
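One way to produce this ordering programmatically (a small sketch using numpy's argsort; the variable names are mine):

import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8])
y_true = np.array([0, 0, 1, 1])

order = np.argsort(scores)   # indices that sort the scores in ascending order
print(scores[order])         # [0.1  0.35 0.4  0.8 ]
print(y_true[order])         # [0 1 0 1] -> N, P, N, P, as in the table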
3. Take each score value in turn as the truncation point
Compute TPR and FPR when the truncation point is, in turn, 0.1, 0.35, 0.4, and 0.8.
3.1 Truncation point 0.1
Note that any sample with a score >= 0.1 is predicted to be a positive example.
Since all 4 samples have a score greater than or equal to 0.1, every sample is predicted to be P.
scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 1, 1]
TPR = TP / (TP + FN) = 2 / (2 + 0) = 1
FPR = FP / (TN + FP) = 2 / (0 + 2) = 1
3.2 Truncation point 0.35
This means that any sample with a score >= 0.35 is predicted to be P.
Here, 3 of the 4 samples have a score greater than or equal to 0.35, so 3 samples are predicted to be P (2 correct predictions, 1 incorrect), and the remaining sample is predicted to be N (correctly).
scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
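To close the loop, here is a small sketch (the variable names are mine, the data is the example above) that repeats the manual procedure for every candidate truncation point; it should reproduce the fpr and tpr arrays returned by roc_curve:

import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8])
y_true = np.array([0, 0, 1, 1])                    # 1 = positive example, 0 = negative example

for threshold in sorted(scores, reverse=True):
    y_pred = (scores >= threshold).astype(int)     # score >= threshold -> predict P
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(threshold, tp / (tp + fn), fp / (fp + tn))   # threshold, TPR, FPR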