Drawing an ROC Curve and Calculating the AUC Value with Python
Preface
The ROC curve and AUC are often used to evaluate the quality of a binary classifier. This article first gives a brief introduction to ROC and AUC, and then demonstrates with an example how to draw an ROC curve and compute the AUC in Python.
AUC Introduction
AUC (Area Under Curve) is a very common evaluation metric for binary classification models in machine learning; compared with the F1-score, it gives more consideration to class imbalance. Common machine learning libraries (such as scikit-learn) generally have this metric built in. However, when a model is written independently, you have to build your own AUC computation module in order to evaluate the trained model. The AUC calculation in libsvm-tools turned out to be very easy to understand, so it is adapted here and kept for future use.
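As a point of reference, when scikit-learn is available the metric can be computed directly. Below is a minimal sketch; the labels and scores are made-up toy values, not data from this article.

# Minimal scikit-learn usage sketch; y_true / y_score are illustrative toy values.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # ground-truth binary labels
y_score = [0.1, 0.4, 0.35, 0.8]  # scores predicted by some classifier

print("AUC = %.2f" % roc_auc_score(y_true, y_score))  # prints AUC = 0.75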
AUC Calculation
AUC calculation is divided into the following three steps:
1. Prepare the data. If only a training set is available, cross-validation is generally used; if a separate evaluation set is available, it can be used directly. The data generally consists of the predicted score together with the corresponding target category (note: the target category, not the predicted category).
2. Sweep the threshold over the scores and, for each threshold, compute the horizontal coordinate (X: False Positive Rate) and the vertical coordinate (Y: True Positive Rate).
3. Connect the coordinate points into a curve and compute the area under it; that area is the AUC value.
The complete Python code is as follows:
# -*- coding: utf-8 -*-
import pylab as pl

evaluate_result = "your file path"

db = []  # each entry is [score, nonclk, clk]
pos, neg = 0, 0  # total numbers of positive (click) and negative (non-click) samples
with open(evaluate_result, 'r') as fs:
    for line in fs:
        nonclk, clk, score = line.strip().split('\t')
        nonclk = int(nonclk)
        clk = int(clk)
        score = float(score)
        db.append([score, nonclk, clk])
        pos += clk
        neg += nonclk

# Sort by score in descending order so the threshold sweep goes from high to low
db = sorted(db, key=lambda x: x[0], reverse=True)

# Calculate the ROC coordinate points
xy_arr = []
tp, fp = 0., 0.
for i in range(len(db)):
    tp += db[i][2]  # accumulated clicks -> true positives
    fp += db[i][1]  # accumulated non-clicks -> false positives
    xy_arr.append([fp / neg, tp / pos])

# Calculate the area under the curve (AUC)
auc = 0.
prev_x = 0
for x, y in xy_arr:
    if x != prev_x:
        auc += (x - prev_x) * y
        prev_x = x
print("the auc is %s." % auc)

x = [_v[0] for _v in xy_arr]
y = [_v[1] for _v in xy_arr]
pl.title("ROC curve of %s (AUC = %.4f)" % ('svm', auc))
pl.xlabel("False Positive Rate")
pl.ylabel("True Positive Rate")
pl.plot(x, y)  # use pylab to plot x and y
pl.show()      # show the plot on the screen
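If scikit-learn happens to be installed, the hand-rolled result can be cross-checked against its implementation. The sketch below is an assumption, not part of the original script: it reuses the db list built above and expands the grouped counts into per-row sample weights.

# Hedged cross-check sketch: let scikit-learn compute the AUC on the same grouped counts.
from sklearn.metrics import roc_auc_score

y_true, y_score, weights = [], [], []
for score, nonclk, clk in db:
    if clk > 0:      # clicks are positive samples, weighted by their count
        y_true.append(1)
        y_score.append(score)
        weights.append(clk)
    if nonclk > 0:   # non-clicks are negative samples
        y_true.append(0)
        y_score.append(score)
        weights.append(nonclk)

print("scikit-learn AUC: %.4f" % roc_auc_score(y_true, y_score, sample_weight=weights))

The two values may differ slightly, because scikit-learn uses the trapezoidal rule while the script above approximates the area with rectangles.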
The input data set used here is an SVM prediction result file.
The format is:
nonclk\tclk\tscore
Where:
1. nonclk: the number of non-clicks, i.e. the number of negative samples with this score.
2. clk: the number of clicks, i.e. the number of positive samples with this score.
3. score: the predicted score. Grouping the positive and negative sample counts by score in advance reduces the amount of computation needed for the AUC; a sketch of this aggregation step follows this list.
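For illustration, a hypothetical aggregation step that turns raw per-sample (label, score) predictions into this grouped format might look like the following; the output file name and the in-memory toy data are assumptions, not part of the original article.

# Hypothetical sketch: aggregate raw (label, score) pairs into nonclk\tclk\tscore lines.
from collections import defaultdict

# raw (label, score) pairs; label 1 = click/positive, 0 = non-click/negative
raw = [(1, 0.92), (0, 0.92), (1, 0.77), (0, 0.31), (0, 0.31)]

counts = defaultdict(lambda: [0, 0])  # score -> [nonclk, clk]
for label, score in raw:
    counts[score][label] += 1

with open("evaluate_result.txt", "w") as out:
    for score, (nonclk, clk) in sorted(counts.items(), reverse=True):
        out.write("%d\t%d\t%f\n" % (nonclk, clk, score))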
Running the script prints the AUC value and shows the ROC curve plot (figure omitted here).
If pylab/matplotlib is not installed on your machine, you can simply comment out the import and the plotting lines.
Note that the above code:
1. Only handles binary classification results (binary labels can be preprocessed into this format as needed).
2. Uses every score as a threshold, which is rather inefficient; you can sample the scores, or merge samples with equal scores when computing the coordinates, as sketched after this list.
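As a rough illustration of point 2, the sketch below is an assumption rather than part of the original code: it merges consecutive rows of the sorted db list that share the same score, so each distinct score contributes only one ROC point. Down-sampling the resulting list of points (e.g. keeping every k-th one) would be another option.

# Hedged sketch: emit one ROC point per distinct score instead of one per row.
def roc_points_merged(db, pos, neg):
    """db is the score-descending [score, nonclk, clk] list built above."""
    xy_arr = []
    tp, fp = 0., 0.
    i = 0
    while i < len(db):
        j = i
        # accumulate all rows that share the same score before emitting a point
        while j < len(db) and db[j][0] == db[i][0]:
            tp += db[j][2]
            fp += db[j][1]
            j += 1
        xy_arr.append([fp / neg, tp / pos])
        i = j
    return xy_arr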
Summary
That is all for this article. I hope it is helpful for your study or work. If you have any questions, please leave a comment.