AUC is a standard metric for measuring the quality of a classification model.
ROC analysis is a performance evaluation method for classification models that was adopted from the medical analysis field.
The full name of ROC is receiver operating characteristic. Its main analysis tool is the ROC curve, a curve drawn on a two-dimensional plane whose horizontal axis is the false positive rate (FPR) and whose vertical axis is the true positive rate (TPR). For a given classifier, we can obtain one (FPR, TPR) pair from its performance on the test samples, so the classifier maps to a single point on the ROC plane. By adjusting the threshold the classifier uses to make its decision, we obtain a curve that passes through (0, 0) and (1, 1). This is the ROC curve of the classifier. In general, this curve should lie above the line connecting (0, 0) and (1, 1), because that line is the ROC curve of a random classifier. The ROC curve is an intuitive and easy-to-use way to express classifier performance, but people often want a single number that summarizes the quality of a classifier. This is where the area under the ROC curve (AUC) comes in.
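The threshold sweep described above can be sketched in a few lines of Python. This is only an illustration: the function name roc_points, the 0/1 label encoding, and the sample data are assumptions made for the example, not part of the original notes.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    Assumes labels are 1 (positive) or 0 (negative) and that a sample is
    predicted positive when its score is >= the threshold.
    """
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = len(labels) - num_pos
    # A threshold above every score predicts nothing as positive: (0, 0).
    points = [(0.0, 0.0)]
    # Use each distinct score as a threshold, from highest to lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / num_neg, tp / num_pos))
    return points


# Hypothetical test set: three positives and three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
```

The first point is always (0, 0), and the lowest threshold predicts every sample as positive, which gives (1, 1).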
As the name suggests, the AUC value is the area under the ROC curve. In general, the AUC value lies between 0.5 and 1.0, and a larger AUC indicates better performance. A value below 0.5 means the classifier ranks samples worse than a random classifier would.
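As a rough illustration of "area under the curve", the area can be approximated from a list of ROC points with the trapezoidal rule. The function name auc_trapezoid and the sample points below are assumptions made for this sketch.

```python
def auc_trapezoid(points):
    """Approximate the area under a ROC curve given as (FPR, TPR) points."""
    pts = sorted(points)  # order by FPR, then by TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighboring points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# The diagonal from (0, 0) to (1, 1) is the random classifier.
random_auc = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])

# A hypothetical staircase curve that lies above the diagonal.
better_auc = auc_trapezoid(
    [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
)
```

The diagonal gives an area of 0.5, which is why 0.5 is the usual lower bound for a useful classifier.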
Summary of AUC calculation methods:
(The AUC value is the area under the ROC curve)
Calculating the AUC directly as an area is cumbersome, so in practice a property of AUC is used instead: it is equivalent to the Wilcoxon-Mann-Whitney test statistic. That statistic estimates the probability that, given one randomly chosen positive sample and one randomly chosen negative sample, the score of the positive sample is greater than the score of the negative sample. With this definition, we get another way to calculate AUC: estimate this probability.
Method 1: Enumerate all M×N pairs made of one positive and one negative sample (M is the number of positive samples and N is the number of negative samples), and count the pairs in which the score of the positive sample is greater than that of the negative sample. When the two scores in a pair are equal, count the pair as 0.5. Then divide the count by M×N. The complexity of this method is O(n^2), where n is the total number of samples (n = M + N).
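A minimal sketch of Method 1, assuming labels are encoded as 1 (positive) and 0 (negative). The function name auc_pairwise and the sample data are made up for the example.

```python
def auc_pairwise(labels, scores):
    """AUC by enumerating every (positive, negative) pair: O(M * N)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0   # positive sample scored higher
            elif p == q:
                total += 0.5   # a tie counts as half a pair
    return total / (len(pos) * len(neg))


# Hypothetical test set: three positives and three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_pairwise(labels, scores)
```

In this data, 8 of the 9 pairs have the positive sample scoring higher, so the result is 8/9.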
Method 2 is essentially the same as Method 1, but with lower complexity: the cost is dominated by a sort, so it is O(n log n). First sort the scores from largest to smallest, then assign rank n to the sample with the largest score, rank n-1 to the sample with the second largest score, and so on. Then sum the ranks of all positive samples and subtract M × (M + 1) / 2, which is the smallest possible rank sum for M positive samples (the sum obtained when every positive sample scores below every negative sample). What remains is the number of pairs in which the positive sample scores higher than the negative sample. Then divide by M×N. That is:
AUC = (sum of the ranks of all positive samples - M × (M + 1) / 2) / (M × N)
In addition, when scores are equal, the tied samples must be assigned the same rank (this applies whether the tied samples belong to the same class or to different classes). Specifically, give every sample in a group of tied scores the average of the ranks that the group occupies, then apply the formula above.
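The rank-based method, including the average-rank treatment of ties, might look like the sketch below. It assumes 0/1 labels and subtracts M × (M + 1) / 2, the smallest possible rank sum for M positive samples, before dividing by M × N. The function name auc_rank and the sample data are hypothetical.

```python
def auc_rank(labels, scores):
    """AUC from the rank sum of the positive samples (ties get average ranks)."""
    n = len(scores)
    # Indices sorted by ascending score, so the largest score gets rank n.
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the group of samples tied with position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        # Positions i..j hold ranks i+1..j+1; every tied sample gets the average.
        avg_rank = (i + 1 + j + 1) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    m = sum(1 for y in labels if y == 1)   # number of positive samples
    num_neg = n - m                        # number of negative samples
    pos_rank_sum = sum(r for y, r in zip(labels, ranks) if y == 1)
    return (pos_rank_sum - m * (m + 1) / 2.0) / (m * num_neg)


# Hypothetical data without ties, and a second set with a tie across classes.
auc_no_ties = auc_rank([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
auc_with_tie = auc_rank([1, 0, 1, 0], [0.8, 0.8, 0.3, 0.1])
```

On the first data set the positive samples hold ranks 6, 5, and 3, so the result is (14 - 6) / 9 = 8/9, the same value that enumerating all pairs gives.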
Reference: http://blog.csdn.net/chjjunking/article/details/5933105
AUC (area under ROC curve) Study Notes