[Repost] Drawing an ROC curve in Excel

Source: Internet
Author: User

Comment from Xinwei: I was recently helping Gao review part of his thesis. When I came across the ROC curve, I looked up this post by Stentor and found it a good explanation, so I am reposting it here.

 

A classification model tries to assign each instance to a specific class. Its output is usually a real value; logistic regression, for example, produces a value between 0 and 1. The question is how to choose a threshold so that instances scoring above it are assigned to one class and instances scoring below it to the other.

Consider a binary problem, in which each instance is assigned to either the positive or the negative class. There are four possible outcomes. If an instance is positive and is predicted as positive, it is a true positive. If an instance is negative but is predicted as positive, it is a false positive. Correspondingly, if an instance is negative and is predicted as negative, it is a true negative, and if a positive instance is predicted as negative, it is a false negative.

These outcomes are summarized in the contingency table below, where 1 denotes the positive class and 0 the negative class.

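|          | Predicted 1                  | Predicted 0                  | Total                     |
|----------|------------------------------|------------------------------|---------------------------|
| Actual 1 | True positive (TP)           | False negative (FN)          | Actual positive (TP + FN) |
| Actual 0 | False positive (FP)          | True negative (TN)           | Actual negative (FP + TN) |
| Total    | Predicted positive (TP + FP) | Predicted negative (FN + TN) | TP + FP + FN + TN         |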
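As a minimal sketch of how the four cells of the table can be counted, the snippet below tallies TP, FN, FP, and TN from a list of actual labels and a list of predicted labels. The function name `confusion_counts` and the sample labels are hypothetical and serve only as an illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # positive predicted as positive
        elif a == 1 and p == 0:
            fn += 1          # positive predicted as negative
        elif a == 0 and p == 1:
            fp += 1          # negative predicted as positive
        else:
            tn += 1          # negative predicted as negative
    return tp, fn, fp, tn


# Hypothetical labels, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```

The four counts always sum to the number of instances, which matches the bottom-right cell of the table.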

The contingency table gives rise to two new terms. The first is the true positive rate (TPR), calculated as TPR = TP / (TP + FN), which is the proportion of all positive instances that the classifier identifies as positive. The other is the false positive rate (FPR), calculated as FPR = FP / (FP + TN), which is the proportion of all negative instances that are incorrectly identified as positive. There is also the true negative rate (TNR), also known as specificity, calculated as TNR = TN / (FP + TN) = 1 - FPR.
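The three formulas above translate directly into code. The sketch below assumes the four counts are already known; the function name `rates` and the example counts are made up for illustration and do not come from the article's data.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, TNR) computed from the four contingency-table counts."""
    tpr = tp / (tp + fn)   # true positive rate, also called sensitivity
    fpr = fp / (fp + tn)   # false positive rate, equal to 1 - specificity
    tnr = tn / (fp + tn)   # true negative rate, also called specificity
    return tpr, fpr, tnr


# Hypothetical counts, for illustration only.
tpr, fpr, tnr = rates(tp=40, fn=10, fp=5, tn=45)
print("TPR:", tpr, "FPR:", fpr, "TNR:", tnr)
```

Because FPR and TNR share the same denominator (all actual negatives), they always add up to 1.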

In a binary classification model that produces continuous scores, suppose a threshold has been fixed at, say, 0.6: instances scoring above it are classified as positive and those scoring below it as negative. If the threshold is lowered to 0.5, more positive instances are identified, so the proportion of positives identified among all positives (the TPR) rises. At the same time, more negative instances are treated as positive, so the FPR rises as well. The ROC curve is introduced to visualize this trade-off.
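The effect of moving the threshold from 0.6 to 0.5 can be illustrated with a small sketch. The scores and labels below are invented for the example, and `tpr_fpr_at` is a hypothetical helper name, not part of any library.

```python
def tpr_fpr_at(scores, labels, threshold):
    """Classify scores above the threshold as positive and return (TPR, FPR)."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    positives = sum(labels)               # number of actual positives
    negatives = len(labels) - positives   # number of actual negatives
    return tp / positives, fp / negatives


# Hypothetical model outputs and true labels, for illustration only.
scores = [0.95, 0.85, 0.70, 0.65, 0.58, 0.55, 0.52, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

for threshold in (0.6, 0.5):
    tpr, fpr = tpr_fpr_at(scores, labels, threshold)
    print("threshold", threshold, "TPR", tpr, "FPR", fpr)
```

With this toy data, lowering the threshold raises both rates, which is the trade-off the ROC curve is meant to show.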

ROC stands for receiver operating characteristic; the full name, "receiver operating characteristic curve", is quite a mouthful. The curve is drawn from two variables, 1-specificity and sensitivity. 1-specificity equals the FPR, that is, the false positive rate. Sensitivity is the true positive rate (TPR), which reflects how well the positive class is covered. The curve therefore plots 1-specificity against sensitivity, that is, cost against benefit.

The following table shows the result of a logistic regression. The predicted values are sorted in descending order and divided into 10 groups of (nearly) equal size.

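| Percentile | Number of instances | Number of positive examples | 1-specificity (%) | Sensitivity (%) |
|------------|---------------------|-----------------------------|-------------------|-----------------|
| 10         | 6180                | 4879                        | 2.73              | 34.64           |
| 20         | 6180                | 2804                        | 9.80              | 54.55           |
| 30         | 6180                | 2165                        | 18.22             | 69.92           |
| 40         | 6180                | 1506                        | 28.01             | 80.62           |
| 50         | 6180                | 987                         | 38.90             | 87.62           |
| 60         | 6180                | 529                         | 50.74             | 91.38           |
| 70         | 6180                | 365                         | 62.93             | 93.97           |
| 80         | 6180                | 294                         | 75.26             | 96.06           |
| 90         | 6180                | 297                         | 87.59             | 98.17           |
| 100        | 6177                | 258                         | 100.00            | 100.00          |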

The number of positive examples is the number of actual positives in each group. In other words, the logistic regression outputs are sorted in descending order. If the score at the top 10% is used as the threshold, the first 10% of instances, 6180 in total, are classified as positive. Of these, 4879 are truly positive, which is 4879 / 14084 * 100% = 34.64% of all positives; this is the sensitivity. The remaining 6180 - 4879 = 1301 are negative instances incorrectly classified as positive, which is 1301 / 47713 * 100% = 2.73% of all negatives; this is 1-specificity. Using 1-specificity as the x values and sensitivity as the y values, plot a scatter chart in Excel. The ROC curve is as follows:

[Figure: ROC curve, 1-specificity (%) on the x-axis and sensitivity (%) on the y-axis]
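The arithmetic above can be repeated for every row by accumulating the counts from the top of the table downward. The sketch below does this with the instance and positive counts copied from the table; `roc_points` is a hypothetical helper name, and the resulting (x, y) pairs are what would be pasted into Excel for the scatter chart.

```python
# Counts copied from the table: group sizes and actual positives per group.
instances = [6180] * 9 + [6177]
positives = [4879, 2804, 2165, 1506, 987, 529, 365, 294, 297, 258]


def roc_points(instances, positives):
    """Return cumulative (1-specificity %, sensitivity %) for each group."""
    total_pos = sum(positives)
    total_neg = sum(instances) - total_pos
    cum_pos = cum_neg = 0
    points = []
    for n, p in zip(instances, positives):
        cum_pos += p        # positives captured so far (true positives)
        cum_neg += n - p    # negatives captured so far (false positives)
        points.append((100 * cum_neg / total_neg, 100 * cum_pos / total_pos))
    return points


points = roc_points(instances, positives)
for x, y in points:
    print(f"{x:6.2f}  {y:6.2f}")
```

The totals recovered this way are 14084 positives and 47713 negatives, the same denominators used in the paragraph above.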

The diagonal line reflects the result of random selection and serves as the baseline. How should the threshold be chosen? This involves the AUC (area under the ROC curve).
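One common way to approximate the AUC from a handful of tabulated points is the trapezoidal rule. The sketch below applies it to the 1-specificity and sensitivity columns of the table, with the origin (0, 0) added as a starting point. This is an assumption about how the area might be estimated, not a method stated in the original post, and `trapezoid_auc` is a hypothetical helper name.

```python
# (1-specificity %, sensitivity %) pairs copied from the table,
# with the origin prepended so the curve starts at (0, 0).
roc = [
    (0.00, 0.00), (2.73, 34.64), (9.80, 54.55), (18.22, 69.92),
    (28.01, 80.62), (38.90, 87.62), (50.74, 91.38), (62.93, 93.97),
    (75.26, 96.06), (87.59, 98.17), (100.00, 100.00),
]


def trapezoid_auc(points_pct):
    """Approximate the area under the curve; inputs are percentages."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points_pct, points_pct[1:]):
        # Width of the strip times the average height of its two ends.
        area += (x1 - x0) * (y0 + y1) / 2
    return area / (100 * 100)   # rescale from percent^2 to the 0..1 range


auc = trapezoid_auc(roc)
print("Approximate AUC:", round(auc, 4))
```

For comparison, the diagonal baseline has an area of 0.5, so an estimate well above 0.5 indicates the model ranks positives ahead of negatives more often than chance would.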

 

Reposted from http://www.cnblogs.com/zgw21cn/archive/2009/02/14/1390683.html#commentform
