The ROC Curve and AUC Value as Machine Learning Classifier Performance Metrics

I. The ROC Curve

1. ROC curve: receiver operating characteristic curve. Each point on the ROC curve reflects the sensitivity to the same signal stimulus, i.e., the classifier's behaviour at one decision threshold.

Horizontal axis: false positive rate (FPR), the proportion of all actual negative instances that are incorrectly predicted as positive; it equals 1 - specificity.

Vertical axis: true positive rate (TPR), also known as sensitivity; it measures positive-class coverage.

2. For a binary classification problem, each instance is predicted as either positive or negative. In practice, four cases can occur:

(1) If an instance is actually positive and is predicted as positive, it is a true positive (TP).

(2) If an instance is actually positive but is predicted as negative, it is a false negative (FN).

(3) If an instance is actually negative but is predicted as positive, it is a false positive (FP).

(4) If an instance is actually negative and is predicted as negative, it is a true negative (TN).

TP: the number of positive instances correctly identified as positive.

FN: the number of positive instances missed, i.e., wrongly predicted as negative.

FP: the number of negative instances wrongly predicted as positive.

TN: the number of negative instances correctly rejected as negative.

The confusion matrix is as follows, where 1 denotes the positive class and 0 denotes the negative class:

                      Predicted 1           Predicted 0
Actual 1 (positive)   TP (true positive)    FN (false negative)
Actual 0 (negative)   FP (false positive)   TN (true negative)

From the table above, the formulas for the horizontal and vertical axes can be derived:

(1) True positive rate (TPR) = TP / (TP + FN): the proportion of all actual positive instances that the classifier predicts as positive. Also called sensitivity.

(2) False positive rate (FPR) = FP / (FP + TN): the proportion of all actual negative instances that the classifier predicts as positive. Equal to 1 - specificity.

(3) True negative rate (TNR) = TN / (FP + TN): the proportion of all actual negative instances that the classifier predicts as negative. TNR = 1 - FPR; also called specificity.
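
These rates follow directly from the four counts. Below is a minimal Python sketch (not from the original article; the labels and predictions are made up) that counts TP, FN, FP, and TN and derives TPR, FPR, and TNR:

# Minimal sketch: count TP/FN/FP/TN and compute TPR, FPR, TNR
# for a binary problem where 1 = positive, 0 = negative.
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # hypothetical true labels
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0])   # hypothetical predictions

tp = np.sum((y_true == 1) & (y_pred == 1))  # positives predicted as positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # positives missed
fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives wrongly flagged
tn = np.sum((y_true == 0) & (y_pred == 0))  # negatives correctly rejected

tpr = tp / (tp + fn)   # sensitivity
fpr = fp / (fp + tn)   # 1 - specificity
tnr = tn / (fp + tn)   # specificity, equal to 1 - fpr

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"TPR={tpr:.2f} FPR={fpr:.2f} TNR={tnr:.2f}")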

Suppose a logistic regression classifier gives, for each instance, the probability of belonging to the positive class. By setting a threshold, say 0.6, instances with probability greater than or equal to 0.6 are labeled positive and the rest negative. This yields one pair (FPR, TPR), i.e., one coordinate point in the plane. As the threshold decreases, more and more instances are labeled positive, but more actual negatives are mixed in as well, so TPR and FPR grow together. At the maximum threshold the corresponding point is (0, 0); at the minimum threshold it is (1, 1).
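
A short sketch of this thresholding step (hypothetical probabilities; the 0.6 threshold follows the text): applying one threshold yields one (FPR, TPR) point, and the extreme thresholds give the endpoints (0, 0) and (1, 1).

# Sketch: turn predicted probabilities into one (FPR, TPR) point at a given
# threshold; sweeping the threshold traces out the whole ROC curve.
import numpy as np

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0])                    # hypothetical labels
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])   # hypothetical P(positive)

def point_at(threshold):
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return fp / (fp + tn), tp / (tp + fn)    # (FPR, TPR)

print(point_at(0.6))   # one point on the ROC curve
print(point_at(1.1))   # threshold above every score -> (0.0, 0.0)
print(point_at(0.0))   # threshold below every score -> (1.0, 1.0)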

In figure (a) below, the solid line is the ROC curve; each point on the line corresponds to one threshold value.

Horizontal axis FPR: equal to 1 - TNR, i.e., 1 - specificity. The larger the FPR, the more actual negative instances are predicted as positive.

Vertical axis TPR: sensitivity (positive-class coverage). The larger the TPR, the more actual positive instances are predicted as positive.

Ideal target: TPR = 1 and FPR = 0, i.e., the point (0, 1) in the figure. The closer the ROC curve is to the point (0, 1), and the farther it deviates from the 45-degree diagonal, the better: higher sensitivity and specificity mean a better classifier.

II. How to Draw an ROC Curve

Suppose we have obtained, for a series of samples, the probability of belonging to the positive class, and have sorted the samples by that probability. There are 20 test samples; the "Class" column gives each sample's true label (p for positive, n for negative), and the "Score" column gives the probability that the sample belongs to the positive class.

Next, going from high to low, we use each "Score" value in turn as the threshold: when a test sample's probability of being positive is greater than or equal to the threshold, it is considered positive; otherwise it is considered negative. For example, for the 4th sample, whose "Score" is 0.6, samples 1 through 4 are considered positive because their "Score" values are greater than or equal to 0.6, while the remaining samples are considered negative. Each choice of threshold yields one pair of FPR and TPR values, i.e., one point on the ROC curve. In this way we obtain 20 pairs of FPR and TPR values, which are plotted to give the ROC curve.
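
This is exactly the procedure that scikit-learn's roc_curve implements: it sweeps the sorted score values from high to low as thresholds and returns one (FPR, TPR) pair per threshold. A sketch with 20 stand-in scores (the original sample table is not reproduced here, so these values are only illustrative):

# Sketch: draw an ROC curve from 20 hypothetical (label, score) pairs.
# 1 = positive (p), 0 = negative (n).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true  = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0,
                    1, 0, 1, 0, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.90, 0.80, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505,
                    0.40, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.30, 0.10])

# roc_curve uses the score values as thresholds and returns one
# (FPR, TPR) pair per threshold, as in the manual procedure above.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")   # 45-degree diagonal (random guessing)
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.title("ROC curve")
plt.show()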

AUC (Area Under Curve): the area under the ROC curve, a value between 0 and 1 (a classifier better than random guessing has an AUC above 0.5). As a single number, the AUC gives an intuitive assessment of classifier quality: the larger, the better.

The AUC value can also be read as a probability: if you randomly pick one positive sample and one negative sample, the AUC is the probability that the current classification algorithm assigns the positive sample a higher score than the negative sample. The larger the AUC, the more likely the algorithm is to rank positive samples ahead of negative samples, and hence the better it classifies.
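
This pairwise reading of AUC can be checked numerically: count, over all (positive, negative) pairs, how often the positive sample receives the higher score (counting ties as one half), and compare the result with the area reported by scikit-learn. A small sketch on hypothetical scores:

# Sketch: AUC as the fraction of (positive, negative) pairs in which the
# positive sample scores higher (ties count as 0.5), versus roc_auc_score.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50, 0.40, 0.30, 0.20])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc_pairwise = np.mean(pairs)

print(auc_pairwise)                      # probability a positive outranks a negative
print(roc_auc_score(y_true, y_score))    # same value from the area under the curve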

III. Why Use ROC and AUC to Evaluate a Classifier

Since there are already many evaluation metrics, why use ROC and AUC? Because the ROC curve has a useful property: when the distribution of positive and negative samples in the test set changes, the ROC curve stays essentially the same. Real data sets often exhibit class imbalance, i.e., many more negative samples than positive ones (or vice versa), and the class distribution in the test data may also shift over time. The figure below compares ROC curves with precision-recall curves:

In the figure, (a) and (c) are ROC curves, while (b) and (d) are precision-recall curves.

(a) and (b) show the classifier's results on the original test set (with a balanced distribution of positive and negative samples); (c) and (d) show the results after increasing the number of negative samples in the test set to 10 times the original. It is clear that the ROC curves stay essentially unchanged, while the precision-recall curves change dramatically.
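
This behaviour is easy to reproduce. The sketch below (synthetic scores, not the data behind the original figure) replicates the negative class 10 times: the ROC AUC stays the same because FPR and TPR are each normalized within their own class, while the precision-recall summary (average precision) drops sharply.

# Sketch: class imbalance leaves the ROC summary unchanged but shifts the
# precision-recall summary. Scores are synthetic, drawn from two overlapping normals.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
pos_scores = rng.normal(0.65, 0.15, 500)   # scores for positive samples
neg_scores = rng.normal(0.45, 0.15, 500)   # scores for negative samples

def evaluate(neg_copies):
    neg = np.tile(neg_scores, neg_copies)            # inflate the negative class
    y_true  = np.concatenate([np.ones_like(pos_scores), np.zeros_like(neg)])
    y_score = np.concatenate([pos_scores, neg])
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)

print("balanced   (1x negatives):", evaluate(1))
print("imbalanced (10x negatives):", evaluate(10))
# ROC AUC is identical in both cases (per-class rates are unaffected by duplication),
# while average precision (the PR-curve summary) falls sharply under imbalance.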

References:

http://alexkong.net/2013/06/introduction-to-auc-and-roc/

http://blog.csdn.net/abcjennifer/article/details/7359370
