Analysis of sklearn's roc_curve() function

Source: Internet
Author: User

When using sklearn's roc_curve() function, I found that the returned results were not what I expected: in theory, the thresholds should cover every value in y_score (the model's predicted scores), but roc_curve() returned only some of them. The source code explains why.

Initial data:

y_true = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.31689620142873609, 0.32367439192936548, 0.42600526758001989, 0.38769987193780364, 0.3667541015524296, 0.39760831479768338, 0.42017521636505745, 0.41936155918127238, 0.33803961944475219, 0.33998332945141224]

sklearn's roc_curve function computes the false positive rate, the true positive rate, and the corresponding thresholds:

fpr_skl, tpr_skl, thresholds_skl = roc_curve(y_true, y_score, drop_intermediate=False)
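For readers who want to reproduce the call, a self-contained sketch might look like this. It only adds the imports and wraps the data in NumPy arrays. Depending on the installed scikit-learn version, the returned arrays may carry one extra leading point (0, 0) with a threshold above every score, so the lengths can differ by one from the values quoted in this article.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([
    0.31689620142873609, 0.32367439192936548, 0.42600526758001989,
    0.38769987193780364, 0.3667541015524296, 0.39760831479768338,
    0.42017521636505745, 0.41936155918127238, 0.33803961944475219,
    0.33998332945141224,
])

# drop_intermediate=False keeps every distinct score as a threshold
fpr_skl, tpr_skl, thresholds_skl = roc_curve(
    y_true, y_score, drop_intermediate=False)

print(fpr_skl)
print(tpr_skl)
print(thresholds_skl)
```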

The calculated values are as follows:

fpr_skl
[0.          0.14285714  0.14285714  0.14285714  0.28571429  0.42857143
  0.57142857  0.71428571  0.85714286  1.        ]

tpr_skl
[0.33333333  0.33333333  0.66666667  1.          1.          1.
  1.          1.          1.          1.        ]

thresholds_skl
[0.42600527  0.42017522  0.41936156  0.39760831  0.38769987  0.3667541
  0.33998333  0.33803962  0.32367439  0.3168962]

roc_curve() function

Reading the roc_curve() code shows how these three values are calculated; it is essentially the usual procedure for computing AUC.

The first step is a call to the _binary_clf_curve() function:

    fps, tps, thresholds = _binary_clf_curve(
        y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)

fps and tps are the FP and TP counts from the confusion matrix at each threshold; thresholds is y_score sorted in descending order (the values only look different from those above because of the number of decimal places printed; they are in fact the same). In this example, the values are as follows:

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
thresholds = [0.42600526758001989, 0.42017521636505745, 0.41936155918127238, 0.39760831479768338, 0.38769987193780364, 0.3667541015524296, 0.33998332945141224, 0.33803961944475219, 0.32367439192936548, 0.31689620142873609]
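The same three arrays can be obtained with a sort followed by cumulative sums. The following is a simplified sketch of that idea, not the library's actual implementation: it assumes all scores are distinct and ignores sample weights and pos_label handling. The variable names order and sorted_true are chosen here for illustration.

```python
import numpy as np

y_true = np.array([0, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([
    0.31689620142873609, 0.32367439192936548, 0.42600526758001989,
    0.38769987193780364, 0.3667541015524296, 0.39760831479768338,
    0.42017521636505745, 0.41936155918127238, 0.33803961944475219,
    0.33998332945141224,
])

# Sort samples by score, highest first
order = np.argsort(y_score)[::-1]
sorted_true = y_true[order]
thresholds = y_score[order]

# At the k-th threshold the top k samples are predicted positive:
# tps counts the true positives among them, the rest are false positives
tps = np.cumsum(sorted_true)
fps = np.arange(1, len(sorted_true) + 1) - tps

print(fps.tolist())  # [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
print(tps.tolist())  # [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]
```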

For ease of understanding, fps and tps can also be computed in a more intuitive way:

for threshold in thresholds:
    # predict 1 where the score is greater than or equal to the threshold, otherwise 0
    y_prob = [1 if i >= threshold else 0 for i in y_score]
    # whether each prediction is correct
    result = [i == j for i, j in zip(y_true, y_prob)]
    # whether each sample is predicted to be the positive class
    positive = [i == 1 for i in y_prob]

    tp = [i and j for i, j in zip(result, positive)]  # predicted positive and correct
    fp = [(not i) and j for i, j in zip(result, positive)]  # predicted positive and wrong

    print(fp.count(True), tp.count(True))

# output (fp tp)
0 1
1 1
1 2
1 3
2 3
3 3
4 3
5 3
6 3
7 3


From fps and tps you can calculate the corresponding fpr and tpr. Index -1 corresponds to the minimum threshold, at which every sample is classified as positive, so fps[-1] is the total number of negative samples and tps[-1] is the total number of positive samples. The corresponding source code, simplified, is:

fpr = [i / fps[-1] for i in fps]  # fps / fps[-1]
tpr = [i / tps[-1] for i in tps]  # tps / tps[-1]
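As a quick check, the same normalization can be written with NumPy arrays. This is a sketch using the fps and tps values from this example; the printed values are rounded to eight decimals for readability.

```python
import numpy as np

fps = np.array([0, 1, 1, 1, 2, 3, 4, 5, 6, 7])
tps = np.array([1, 1, 2, 3, 3, 3, 3, 3, 3, 3])

# Divide by the totals: fps[-1] negatives and tps[-1] positives
fpr = fps / fps[-1]
tpr = tps / tps[-1]

print(np.round(fpr, 8))
print(np.round(tpr, 8))
```

Here fps[-1] is 7 and tps[-1] is 3, so fpr rises in steps of 1/7 and tpr in steps of 1/3.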

drop_intermediate parameter

The roc_curve() function has a drop_intermediate parameter; the corresponding source code is:

if drop_intermediate and len(fps) > 2:
    optimal_idxs = np.where(np.r_[True,
                                  np.logical_or(np.diff(fps, 2),
                                                np.diff(tps, 2)),
                                  True])[0]
    fps = fps[optimal_idxs]
    tps = tps[optimal_idxs]
    thresholds = thresholds[optimal_idxs]

In this example, the values of the intermediate expressions are:

# take the second-order difference
np.diff(fps, 2)
[-1  0  1  0  0  0  0  0]
np.diff(tps, 2)
[ 1  0 -1  0  0  0  0  0]

# element-wise OR
np.logical_or(np.diff(fps, 2), np.diff(tps, 2))
[True, False, True, False, False, False, False, False]

# add a True at the head and at the tail
np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True]
[True, True, False, True, False, False, False, False, False, True]

# array indices of the True entries
np.where(np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True])[0]
[0, 1, 3, 9]
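The steps above can be run end to end as follows. This is a small sketch using the fps and tps values from this example; the name keep is introduced here for readability and does not appear in the library source.

```python
import numpy as np

fps = np.array([0, 1, 1, 1, 2, 3, 4, 5, 6, 7])
tps = np.array([1, 1, 2, 3, 3, 3, 3, 3, 3, 3])

# An interior point is kept if the second-order difference of fps or tps
# is non-zero there; the first and last points are always kept.
keep = np.r_[True, np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True]
optimal_idxs = np.where(keep)[0]

print(optimal_idxs.tolist())       # [0, 1, 3, 9]
print(fps[optimal_idxs].tolist())  # [0, 1, 1, 7]
print(tps[optimal_idxs].tolist())  # [1, 1, 3, 3]
```

Only four of the ten points survive, which is why roc_curve() returns fewer thresholds than there are scores when drop_intermediate is enabled.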

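optimal_idxs is in fact the set of inflection points (corners) of the ROC curve; for plotting, only these points are needed. If you think of fps and tps as a person's displacement on a graph, the first-order difference is the "speed" and the second-order difference is the "acceleration". Points where both accelerations are zero lie on a straight segment and can be dropped.

The "ROC plot", drawn here from the raw fps and tps counts, is as follows: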

import matplotlib.pyplot as plt

fps = [0, 1, 1, 1, 2, 3, 4, 5, 6, 7]
tps = [1, 1, 2, 3, 3, 3, 3, 3, 3, 3]

plt.plot(fps, tps, 'b')  # blue line through the (fps, tps) points
plt.xlim([-1, 8])
plt.ylim([-1, 8])
plt.ylabel('tps')
plt.xlabel('fps')
plt.show()

Therefore, the drop_intermediate parameter only optimizes the ROC computation by discarding redundant points; it does not change the ROC plot.
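One way to check this claim is to compare the area under the curve with and without drop_intermediate. The sketch below uses the data from this article together with sklearn.metrics.auc; because the dropped points lie on straight segments, the two areas should agree up to floating-point error.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([
    0.31689620142873609, 0.32367439192936548, 0.42600526758001989,
    0.38769987193780364, 0.3667541015524296, 0.39760831479768338,
    0.42017521636505745, 0.41936155918127238, 0.33803961944475219,
    0.33998332945141224,
])

# Same data, with and without the intermediate points
fpr_full, tpr_full, _ = roc_curve(y_true, y_score, drop_intermediate=False)
fpr_drop, tpr_drop, _ = roc_curve(y_true, y_score, drop_intermediate=True)

auc_full = auc(fpr_full, tpr_full)
auc_drop = auc(fpr_drop, tpr_drop)

print(len(fpr_full), len(fpr_drop))    # fewer points after dropping
print(np.isclose(auc_full, auc_drop))  # the area is unchanged
```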
