Feature engineering: the definition and use of ROC and AUC

Classification model evaluation:

Indicator        | Description               | scikit-learn function
Precision        | Precision                 | from sklearn.metrics import precision_score
Recall           | Recall                    | from sklearn.metrics import recall_score
F1               | F1 score                  | from sklearn.metrics import f1_score
Confusion matrix | Confusion matrix          | from sklearn.metrics import confusion_matrix
ROC              | ROC curve                 | from sklearn.metrics import roc_curve
AUC              | Area under the ROC curve  | from sklearn.metrics import auc

Regression model evaluation:

Indicator                       | Description                            | scikit-learn function
Mean squared error (MSE, RMSE)  | Mean squared error                     | from sklearn.metrics import mean_squared_error
Absolute error (MAE, MedAE)     | Mean / median absolute error           | from sklearn.metrics import mean_absolute_error, median_absolute_error
R-squared                       | Coefficient of determination (R²)      | from sklearn.metrics import r2_score
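As a quick illustration of how these functions are called, here is a minimal sketch; the toy label and target arrays below are made up for the example and are not from the original article.

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix, mean_squared_error,
                             mean_absolute_error, r2_score)

# Toy classification labels (illustrative only)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(precision_score(y_true, y_pred))   # precision
print(recall_score(y_true, y_pred))      # recall
print(f1_score(y_true, y_pred))          # F1 score
print(confusion_matrix(y_true, y_pred))  # confusion matrix

# Toy regression targets (illustrative only)
y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

print(mean_squared_error(y, y_hat))      # MSE
print(mean_absolute_error(y, y_hat))     # MAE
print(r2_score(y, y_hat))                # R-squared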


ROC and AUC definitions

The full name of ROC is Receiver Operating Characteristic. The area under the ROC curve is the AUC (Area Under the Curve). AUC is used to measure the performance (generalization ability) of a machine learning algorithm on binary classification problems.

Key concepts needed to calculate the ROC

First, let's explain some common concepts in binary classification: True Positive, False Positive, True Negative, and False Negative. They are distinguished by the combination of the true category and the predicted category.

Suppose we have a batch of test samples with only two categories: positive and negative. Picture the usual confusion-matrix layout: the left half holds the samples predicted as positive and the right half those predicted as negative, while the upper half holds the samples that are actually positive and the lower half those that are actually negative. A prediction of the positive class is marked P (Positive) and a prediction of the negative class is marked N (Negative); a prediction that matches the true value is marked T (True) and one that contradicts it is marked F (False).

TP: the predicted category is P (positive) and the true category is P.
FP: the predicted category is P and the true category is N (negative).
TN: the predicted category is N and the true category is N.
FN: the predicted category is N and the true category is P.

The total number of actually positive samples is TP + FN. TPR (True Positive Rate) = TP / (TP + FN).
Similarly, the total number of actually negative samples is FP + TN. FPR (False Positive Rate) = FP / (TN + FP).
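Written out in code, the two rates follow directly from the four counts. The helper below is a minimal sketch for illustration; it assumes labels are encoded as 1 = positive and 0 = negative and is not part of the original article.

def tpr_fpr(y_true, y_pred):
    # Count the four confusion-matrix cells (1 = positive, 0 = negative).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)   # false positive rate
    return tpr, fpr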

There is also a concept called the "truncation point" (the classification threshold). After the machine learning algorithm scores the test samples, it outputs, for each sample, the probability of belonging to a certain category. For example, suppose sample T1 has a probability of 0.3 of belonging to class P; since we usually assign probabilities below 0.5 to the other class, T1 is classified as N. That 0.5 is the "truncation point".
To sum up, the three most important concepts for the ROC are TPR, FPR, and the truncation point.
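In code, a truncation point simply converts predicted probabilities into class labels. A minimal sketch with the usual 0.5 cutoff (the probability values are made up for illustration):

import numpy as np

probs = np.array([0.3, 0.6, 0.45, 0.9])   # predicted probability of the positive class
y_pred = (probs >= 0.5).astype(int)       # 0.5 is the truncation point
print(y_pred)                             # [0 1 0 1]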

As the truncation point takes different values, the resulting TPR and FPR change. Plotting the (FPR, TPR) pairs obtained at the different truncation points in a two-dimensional coordinate system gives the ROC curve; the horizontal axis is FPR and the vertical axis is TPR.

Calculating the ROC with sklearn

Sklearn gives an example of the ROC calculation [1].

import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)

The computed results (FPR, TPR, and the truncation points) are:

fpr = array([0. ,  0.5,  0.5,  1. ])
tpr = array([0.5,  0.5,  1. ,  1. ])
thresholds = array([0.8,  0.4,  0.35,  0.1])  # truncation points

Plotting the resulting FPR and TPR values in two-dimensional coordinates gives the ROC curve (drawn as the blue line in the original figure); the AUC is the area under that curve (the light yellow shaded region).
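Continuing the same example, the AUC can be computed from the fpr and tpr arrays with metrics.auc, or directly from the scores with metrics.roc_auc_score; both give 0.75 here:

import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)

print(metrics.auc(fpr, tpr))                  # 0.75, the area under this ROC curve
print(metrics.roc_auc_score(y == 2, scores))  # 0.75, computed directly from the scores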

Detailed calculation process

The example above uses the following data:

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])

How are TPR and FPR calculated from this data?

1. Analyze the data

y is a one-dimensional array (the true class of each sample). Its values represent the categories (there are two classes, 1 and 2). Since pos_label=2, class 2 is treated as the positive example and class 1 as the negative example. Rewriting y in 0/1 form:

Y_true = [0, 0, 1, 1]

scores is the probability that each sample belongs to the positive class.

2. Sort the data by score

Sample | Predicted probability of P (score) | True category
y[0]   | 0.1                                | N
y[2]   | 0.35                               | P
y[1]   | 0.4                                | N
y[3]   | 0.8                                | P
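The sorting step can be reproduced with a short NumPy sketch (illustrative only, not from the original article):

import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8])
y_true = np.array([0, 0, 1, 1])   # 0 = N, 1 = P

# Indices of the samples ordered by ascending score, as in the table above.
for i in np.argsort(scores):
    print("y[%d]  score=%.2f  true=%s" % (i, scores[i], "P" if y_true[i] else "N"))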
3. Take each score value in turn as the truncation point

TPR and FPR are computed with the truncation point set, in turn, to 0.1, 0.35, 0.4, and 0.8.

3.1 Truncation point = 0.1

Note that whenever score >= 0.1, the sample's predicted category is the positive example.
Since all 4 samples have a score greater than or equal to 0.1, every sample is predicted as P.

scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1] 
y_pred = [1, 1, 1, 1]

TPR = TP / (TP + FN) = 2 / (2 + 0) = 1
FPR = FP / (TN + FP) = 2 / (0 + 2) = 1
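These numbers can be double-checked with scikit-learn's confusion_matrix; the snippet below is a verification sketch, not part of the original walkthrough:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [1, 1, 1, 1]   # every sample predicted positive at truncation point 0.1

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp / (tp + fn))   # TPR = 1.0
print(fp / (fp + tn))   # FPR = 1.0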

3.2 Truncation point = 0.35

Whenever score >= 0.35, the sample's predicted category is P.
Of the 4 samples, 3 have a score greater than or equal to 0.35, so 3 samples are predicted as P (2 correctly, 1 incorrectly) and 1 sample is predicted as N (correctly).

scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

TPR = TP / (TP + FN) = 2 / (2 + 0) = 1
FPR = FP / (TN + FP) = 1 / (1 + 1) = 0.5
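The remaining truncation points (0.4 and 0.8) follow the same pattern. As a check, the short loop below repeats the whole manual calculation for every truncation point; its output matches the fpr, tpr, and thresholds arrays returned by roc_curve above (a verification sketch using the same toy data):

import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Use each distinct score as the truncation point, from highest to lowest,
# and recompute TPR and FPR exactly as in steps 3.1 and 3.2.
for threshold in sorted(set(scores), reverse=True):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    print(threshold, "TPR =", tp / (tp + fn), "FPR =", fp / (fp + tn))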
