A classification model assigns each instance to a class, but its raw output is usually a real number rather than a label. Logistic regression, for example, returns a value between 0 and 1. The question is how to choose a threshold: instances whose score is above the threshold are assigned to one class, and instances whose score is below it are assigned to the other.
Consider a binary problem, in which each instance is assigned to either the positive or the negative class. There are four possible outcomes. If an instance is positive and is predicted as positive, it is a true positive. If an instance is negative but is predicted as positive, it is a false positive. Correspondingly, if an instance is negative and is predicted as negative, it is a true negative, and if an instance is positive but is predicted as negative, it is a false negative.
The following contingency table (confusion matrix) summarizes these outcomes. 1 indicates the positive class, and 0 indicates the negative class.
| Actual \ Prediction | 1 | 0 | Total |
|---|---|---|---|
| 1 | True positive (TP) | False negative (FN) | Actual positive (TP + FN) |
| 0 | False positive (FP) | True negative (TN) | Actual negative (FP + TN) |
| Total | Predicted positive (TP + FP) | Predicted negative (FN + TN) | TP + FP + FN + TN |
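The four cells of the table can be counted directly from paired labels. The following is a minimal sketch; the function name `confusion_counts` and the sample labels are made up for illustration and do not come from the original data.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # positive predicted as positive
        elif a == 1 and p == 0:
            fn += 1   # positive predicted as negative
        elif a == 0 and p == 1:
            fp += 1   # negative predicted as positive
        else:
            tn += 1   # negative predicted as negative
    return tp, fn, fp, tn

# Hypothetical labels, for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)  # 3 1 1 3
```

The four counts always sum to the number of instances, which matches the bottom-right cell of the table.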
Two new terms follow from the contingency table. The first is the true positive rate (TPR), TPR = TP / (TP + FN), which is the proportion of all positive instances that the classifier identifies as positive. The second is the false positive rate (FPR), FPR = FP / (FP + TN), which is the proportion of all negative instances that are incorrectly identified as positive. There is also the true negative rate (TNR), also known as specificity: TNR = TN / (FP + TN) = 1 - FPR.
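As a quick check of these definitions, the rates can be computed from the four counts. This is only a sketch: the helper `rates` and the counts passed to it are hypothetical, chosen so the arithmetic is easy to follow.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # share of actual positives that were found
    fpr = fp / (fp + tn)   # share of actual negatives flagged as positive
    tnr = tn / (fp + tn)   # specificity
    return tpr, fpr, tnr

# Hypothetical counts: 50 actual positives, 50 actual negatives.
tpr, fpr, tnr = rates(tp=40, fn=10, fp=5, tn=45)
print(tpr, fpr, tnr)
```

With these counts, TPR is 40/50, FPR is 5/50, and TNR is 45/50, so TNR and FPR add up to 1 as the formula states.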
In a binary classification model that produces continuous scores, suppose a threshold has been fixed, for example 0.6. Instances with a score above this value are classified as positive, and instances with a score below it are classified as negative. If the threshold is lowered to 0.5, more positive instances are identified, so the proportion of positives found among all positives (the TPR) rises. At the same time, more negative instances are treated as positive, so the FPR rises as well. The ROC curve is introduced to visualize this trade-off.
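The effect of moving the threshold from 0.6 to 0.5 can be sketched on a toy example. The scores and labels below are invented for illustration and are not taken from the logistic regression discussed in this article; `rates_at` is a hypothetical helper.

```python
def rates_at(threshold, scores, labels):
    """Return (TPR, FPR) when scores above `threshold` are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    n_pos = sum(labels)             # actual positives
    n_neg = len(labels) - n_pos     # actual negatives
    return tp / n_pos, fp / n_neg

# Made-up model scores and true labels (5 positives, 5 negatives).
scores = [0.95, 0.85, 0.70, 0.65, 0.55, 0.52, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

tpr_06, fpr_06 = rates_at(0.6, scores, labels)
tpr_05, fpr_05 = rates_at(0.5, scores, labels)
print(tpr_06, fpr_06)  # 0.6 0.2
print(tpr_05, fpr_05)  # 0.8 0.4
```

Lowering the threshold admits one more true positive and one more false positive, so both rates go up together. Each threshold gives one point on the ROC curve.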
ROC stands for receiver operating characteristic, and the plot is usually called the receiver operating characteristic curve. The curve plots two variables against each other: 1-specificity and sensitivity. 1-specificity equals FPR, the false positive rate. Sensitivity is the true positive rate, which reflects how well the positive class is covered. The curve therefore plots 1-specificity against sensitivity, that is, cost against benefit.
The following table shows the results of a logistic regression. The instances are sorted by predicted score in descending order and divided into 10 groups of equal size.
| Percentile | Number of instances | Number of positive examples | 1-specificity (%) | Sensitivity (%) |
|---|---|---|---|---|
| 10 | 6180 | 4879 | 2.73 | 34.64 |
| 20 | 6180 | 2804 | 9.80 | 54.55 |
| 30 | 6180 | 2165 | 18.22 | 69.92 |
| 40 | 6180 | 1506 | 28.01 | 80.62 |
| 50 | 6180 | 987 | 38.90 | 87.62 |
| 60 | 6180 | 529 | 50.74 | 91.38 |
| 70 | 6180 | 365 | 62.93 | 93.97 |
| 80 | 6180 | 294 | 75.26 | 96.06 |
| 90 | 6180 | 297 | 87.59 | 98.17 |
| 100 | 6177 | 258 | 100.00 | 100.00 |
The number of positive examples is the actual number of positives in each group, while the 1-specificity and sensitivity columns are cumulative. The logistic regression scores are sorted in descending order. If the cutoff is placed after the first 10% of instances, those 6180 instances are classified as positive. Of these, 4879 are actually positive, which is 4879/14084 * 100% = 34.64% of all positives; this is the sensitivity. The remaining 6180 - 4879 = 1301 are negative instances incorrectly classified as positive, which is 1301/47713 * 100% = 2.73% of all negatives; this is 1-specificity. Using 1-specificity as the x value and sensitivity as the y value, plot the points as a scatter chart in Excel. The ROC curve is as follows:
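The cumulative percentages can be recomputed from the per-group counts in the table. The sketch below uses only the instance and positive counts listed above; the variable names are illustrative.

```python
# Per-group counts copied from the table (groups ordered by descending score).
instances = [6180] * 9 + [6177]
positives = [4879, 2804, 2165, 1506, 987, 529, 365, 294, 297, 258]

total_pos = sum(positives)                # 14084 actual positives
total_neg = sum(instances) - total_pos    # 47713 actual negatives

cum_pos = cum_neg = 0
roc_points = []   # (1-specificity %, sensitivity %) per cutoff
for n, pos in zip(instances, positives):
    cum_pos += pos        # positives captured so far (TP)
    cum_neg += n - pos    # negatives captured so far (FP)
    sensitivity = 100 * cum_pos / total_pos
    one_minus_specificity = 100 * cum_neg / total_neg
    roc_points.append((round(one_minus_specificity, 2), round(sensitivity, 2)))

for point in roc_points:
    print(point)
```

The first point should come out as (2.73, 34.64), in agreement with the worked calculation, and the last as (100.0, 100.0).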
The diagonal line represents the result of random selection and serves as the baseline. How should the threshold be chosen? This is where AUC (area under the ROC curve) comes in.
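The text does not say how the area is computed. One common way to estimate it from a handful of ROC points is the trapezoidal rule, which is an assumption here rather than the author's stated method. The sketch below applies it to the table values, converted to fractions, with the origin (0, 0) added as the starting point.

```python
# ROC points from the table as fractions, with (0, 0) prepended.
fpr = [0.0, 0.0273, 0.0980, 0.1822, 0.2801, 0.3890,
       0.5074, 0.6293, 0.7526, 0.8759, 1.0]
tpr = [0.0, 0.3464, 0.5455, 0.6992, 0.8062, 0.8762,
       0.9138, 0.9397, 0.9606, 0.9817, 1.0]

def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through (xs[i], ys[i])."""
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))

auc = trapezoid_area(fpr, tpr)
baseline = trapezoid_area([0.0, 1.0], [0.0, 1.0])  # the diagonal line
print(round(auc, 3), baseline)
```

For these points the estimate comes out at roughly 0.83, compared with 0.5 for the diagonal baseline.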