ROC curve in Excel

Last Update:2018-12-06 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The classification model tries to classify each instance into a specific class, and the result of the classification model is generally a real value, such as logistic regression, the result is a real value from 0 to 1. How to determine the threshold value so that the model result is greater than this value is classified as one type. If the value is smaller than this value, it is classified as another type.

Consider a binary problem, that is, dividing an instance into a positive or negative class ). For a binary problem, there are four situations. If an instance is a positive class and is predicted to be a positive class, that is, a real class (true positive). If the instance is a negative class, this is called false positive ). Correspondingly, if the instance is a negative class, it is predicted as a negative class (true positive), and the positive class is predicted as a negative class, it is false negative ).

The following table shows the join table. 1 indicates the positive class, and 0 indicates the negative class.

		Prediction
		1	0	Total
Actual	1	True positive (TP)	False Negative (FN)	Actual positive (TP + FN)
Actual	0	False Positive (FP)	True negative (TN)	Actual negative (FP + Tn)
Total		Predicted positive (TP + FP)	Predicted negative (FN + Tn)	TP + FP + FN + TN

Two new terms are introduced from the join table. The first is true positive rate (TPR ),
Calculation formula:TPR = TP/(TP+FN), Which depicts the proportion of the positive instance identified by the classifier to all the positive instances. The other is the negative positive rate (FPR). The formula isFPR = FP/(FP + Tn ),The percentage of negative instances that are incorrectly recognized as positive classes to all negative instances is calculated. There is also a true negative rate (tnr), also known as specificity, with the formula tnr =TN/(Fp+TN) = 1−FPR.

In a binary classification model, for the obtained continuous results, assume that a threshold value has been determined, for example, 0.6. instances greater than this value are classified as positive, if the value is smaller than this value, it is included in the negative class. If the threshold value is reduced to 0.5, more positive classes can be identified, that is, the ratio of the identified positive examples to all positive examples is improved, that is, TPR, however, more negative instances are treated as positive instances, which improves FPR. To visualize this change, ROC is introduced here.

Extends er operating characteristic, translated as "receiver operation characteristic curve", which is sufficient for logging. A curve is a combination of two variables, 1-specificity and sensitiicity. Because 1-specificity = FPR, that is, negative positive rate. Sensitivity is the true class rate. True positive rate reflects the coverage of positive classes. This combination is 1-specificity to sensitivity, that is, the cost (Costs) to the benefit (benefits ).

The following table shows the result of a logistic regression. Divide the obtained real value into 10 identical numbers in ascending order.

Percentile	Number of instances	Number of positive examples	1-specificity (%)	Sensitivity (%)
10	6180	4879	2.73	34.64
20	6180	2804	9.80	54.55
30	6180	2165	18.22	69.92
40	6180	1506	28.01	80.62
50	6180	987	38.90	87.62
60	6180	529	50.74	91.38
70	6180	365	62.93	93.97
80	6180	294	75.26	96.06
90	6180	297	87.59	98.17
100	6177	258	100.00	100.00

The number of positive examples is the actual positive number in this part. That is to say, the results obtained by logistic regression are arranged in ascending order. If the previous 10% values are used as the threshold value, the first 10% instances are classified as positive classes and 6180 instances. Among them, the correct number is 4879, accounting for 4879/14084 * 100% = 34.64% of all positive classes, that is, sensitivity. In addition, 6180-4879 = 1301 negative instances are incorrectly classified as positive classes, 1301/47713 * 100% = 2.73% of all negative classes, that is, 1-specificity. Use these two sets of values as the x and y values respectively, and use the scatter chart in Excel. The ROC curve is as follows:

The diagonal lines reflect the result of a random selection, which serves as the control line. How can we choose the threshold? This involves AUC (area under the ROC curve, area under the ROC curve ).

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

ROC curve in Excel

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

ROC curve in Excel

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support