When using sklearn's roc_curve() function, I found that the returned results were not what I expected: in theory, the thresholds should cover every value in y_score (i.e. the model's predicted scores), but roc_curve() returned only some of the thresholds. The reason can be found in the source code.
Initial data:
y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.31689620142873609, 0.32367439192936548, 0.42600526758001989, 0.38769987193780364, 0.3667541015524296, 0.39760831479768338, 0.42017521636505745, 0.41936155918127238, 0.33803961944475219, 0.33998332945141224]
sklearn's roc_curve function computes the false positive rate, the true positive rate, and the corresponding thresholds:
fpr_skl, tpr_skl, thresholds_skl = roc_curve(y_true, y_score, drop_intermediate=False)
The calculated values are as follows:
fpr_skl
[0. 0.14285714 0.14285714 0.14285714 0.28571429 0.42857143
0.57142857 0.71428571 0.85714286 1. ]
tpr_skl
[0.33333333 0.33333333 0.66666667 1. 1. 1.
1. 1. 1. 1. ]
thresholds_skl
[0.42600527 0.42017522 0.41936156 0.39760831 0.38769987 0.3667541
0.33998333 0.33803962 0.32367439 0.3168962]
roc_curve() function
Let's analyze the roc_curve() code to see how these three values are calculated; this is in fact the general process for computing AUC.
The first step is the _binary_clf_curve() function:
fps, tps, thresholds = _binary_clf_curve(
    y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
fps and tps are the FP and TP counts of the confusion matrix at each threshold; thresholds is y_score sorted in descending order (the values printed above look different only because fewer decimal places are shown; they are in fact the same). In this example, the values are as follows:
fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
thresholds = [0.42600526758001989, 0.42017521636505745, 0.41936155918127238, 0.39760831479768338, 0.38769987193780364, 0.3667541015524296, 0.33998332945141224, 0.33803961944475219, 0.32367439192936548, 0.31689620142873609]
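The counts above can also be reproduced with a sort followed by a cumulative sum. The following is only a minimal sketch of that idea, not sklearn's actual implementation: the helper name binary_clf_counts is hypothetical, and it assumes 0/1 labels, no sample weights, and no tied scores.

```python
import numpy as np

def binary_clf_counts(y_true, y_score):
    """Cumulative FP/TP counts, using each score in turn as the threshold.

    Simplified sketch: assumes 0/1 labels, no sample weights, no tied scores.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    # Indices that sort the scores from highest to lowest.
    order = np.argsort(y_score, kind="mergesort")[::-1]
    y_sorted = y_true[order]
    thresholds = y_score[order]
    # Positives among the samples scoring >= each threshold.
    tps = np.cumsum(y_sorted)
    # Everything else predicted positive at that threshold is a false positive.
    fps = np.arange(1, len(y_sorted) + 1) - tps
    return fps, tps, thresholds

y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.31689620142873609, 0.32367439192936548, 0.42600526758001989,
           0.38769987193780364, 0.3667541015524296, 0.39760831479768338,
           0.42017521636505745, 0.41936155918127238, 0.33803961944475219,
           0.33998332945141224]

fps, tps, thresholds = binary_clf_counts(y_true, y_score)
print(fps.tolist())  # [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
print(tps.tolist())  # [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
```

Because the scores are sorted once, every threshold is handled in a single pass instead of rescanning all samples per threshold.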
For ease of understanding, fps and tps can also be computed in a more intuitive way:
for threshold in thresholds:
    # predict 1 if the score is greater than or equal to the threshold, otherwise 0
    y_prob = [1 if i >= threshold else 0 for i in y_score]
    # whether each prediction is correct
    result = [i == j for i, j in zip(y_true, y_prob)]
    # whether each sample is predicted as the positive class
    positive = [i == 1 for i in y_prob]
    tp = [i and j for i, j in zip(result, positive)]  # predicted positive and correct
    fp = [(not i) and j for i, j in zip(result, positive)]  # predicted positive but wrong
    print(fp.count(True), tp.count(True))
# output (FP count, TP count)
0 1
1 1
1 2
1 3
2 3
3 3
4 3
5 3
6 3
7 3
From fps and tps you can compute the corresponding fpr and tpr. Index -1 corresponds to the minimum threshold, at which every sample is classified as positive; accordingly, fps[-1] is the total number of negative samples and tps[-1] is the total number of positive samples. The corresponding source code, simplified, is:
fpr = [i / fps[-1] for i in fps]  # fps / fps[-1]
tpr = [i / tps[-1] for i in tps]  # tps / tps[-1]
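The same normalization can be written with NumPy arrays. This is a rough sketch rather than sklearn's code: rates_from_counts is a hypothetical helper, and raising ValueError when one class is missing is my own assumption about how to handle that case.

```python
import numpy as np

def rates_from_counts(fps, tps):
    """Turn cumulative FP/TP counts into FPR/TPR (hypothetical helper)."""
    fps = np.asarray(fps, dtype=float)
    tps = np.asarray(tps, dtype=float)
    # The last entry is the count at the lowest threshold, i.e. the class total.
    # Assumption: refuse to divide when a class has no samples at all.
    if fps[-1] <= 0 or tps[-1] <= 0:
        raise ValueError("need at least one negative and one positive sample")
    return fps / fps[-1], tps / tps[-1]

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
fpr, tpr = rates_from_counts(fps, tps)
print(np.round(fpr, 4))
print(np.round(tpr, 4))
```

With 7 negatives and 3 positives in the example, fpr moves in steps of 1/7 and tpr in steps of 1/3.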
drop_intermediate parameter
The roc_curve() function has a drop_intermediate parameter; the corresponding source code is:
if drop_intermediate and len(fps) > 2:
    optimal_idxs = np.where(np.r_[True,
                                  np.logical_or(np.diff(fps, 2),
                                                np.diff(tps, 2)),
                                  True])[0]
    fps = fps[optimal_idxs]
    tps = tps[optimal_idxs]
    thresholds = thresholds[optimal_idxs]
In this example, the corresponding intermediate values are:
# second-order difference
np.diff(fps, 2)
[-1 0 1 0 0 0 0 0]
np.diff(tps, 2)
[1 0 -1 0 0 0 0 0]
# logical OR
np.logical_or(np.diff(fps, 2), np.diff(tps, 2))
[True, False, True, False, False, False, False, False]
# add a True at the head and at the tail
np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True]
[True, True, False, True, False, False, False, False, False, True]
# indices of the True entries
np.where(np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True])[0]
[0, 1, 3, 9]
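optimal_idxs are in fact the inflection points (corners) of the ROC curve; for plotting, only these points are needed, because every dropped point lies on a straight segment between two kept points. If you think of fps and tps as a person's displacement on a graph, the first-order difference is the "velocity" and the second-order difference is the "acceleration": a nonzero second-order difference means the direction of movement changes at that point.
The "ROC curve" (plotted here with the raw counts) is as follows:
import matplotlib.pyplot as plt

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
plt.plot(fps, tps, 'b')
plt.xlim([-1, 8])
plt.ylim([-1, 8])
plt.ylabel('tps')
plt.xlabel('fps')
plt.show()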
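The "velocity" analogy can be checked without NumPy. The sketch below uses a hypothetical helper, corner_indices, which keeps a point only when the step taken to reach it differs from the step taken to leave it; it assumes at least two points and is meant as an illustration of the idea, not as sklearn's code.

```python
def corner_indices(fps, tps):
    """Indices where the step direction changes, plus both endpoints.

    Hypothetical helper; assumes fps and tps have at least two entries.
    """
    # "Velocity": the (dx, dy) step between consecutive points.
    steps = [(fps[k + 1] - fps[k], tps[k + 1] - tps[k])
             for k in range(len(fps) - 1)]
    keep = [0]  # always keep the first point
    for k in range(1, len(fps) - 1):
        # "Acceleration" is nonzero when the incoming and outgoing steps differ.
        if steps[k] != steps[k - 1]:
            keep.append(k)
    keep.append(len(fps) - 1)  # always keep the last point
    return keep

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
print(corner_indices(fps, tps))  # [0, 1, 3, 9]
```

Comparing consecutive steps is equivalent to testing whether the second-order difference of fps or of tps is nonzero, which is why the result matches optimal_idxs in this example.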
Therefore, the drop_intermediate parameter is simply an optimization of the ROC computation: it reduces the number of returned points without changing the shape of the ROC curve.
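One way to convince yourself of this is to compare the area under the curve with and without the dropped points. The sketch below is a plain trapezoidal sum over exact fractions; area_under_curve is a hypothetical helper, and the curve is assumed to pass exactly through the listed points.

```python
from fractions import Fraction

def area_under_curve(fps, tps):
    """Trapezoidal area under the (FPR, TPR) points, as an exact fraction."""
    fpr = [Fraction(f, fps[-1]) for f in fps]
    tpr = [Fraction(t, tps[-1]) for t in tps]
    area = Fraction(0)
    for k in range(1, len(fpr)):
        # Width of the strip times the average height of its two ends.
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
    return area

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
kept = [0, 1, 3, 9]  # optimal_idxs from the example

full = area_under_curve(fps, tps)
reduced = area_under_curve([fps[k] for k in kept], [tps[k] for k in kept])
print(full, reduced)  # 19/21 19/21
```

Both areas come out to 19/21, which is consistent with the dropped points lying on straight segments of the curve.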