Trade-offs between precision and recall
Continuing with the cancer-prediction example: y = 1 means cancer. In logistic regression we normally predict y = 1 when hθ(x) >= 0.5, and y = 0 otherwise.
When we want to be more confident before predicting cancer (telling a patient they have cancer has a significant effect on them, e.g. sending them to therapy, so we want to be more certain before making that prediction): we can raise the threshold to 0.7. We then have high precision (we only predict cancer when very confident) and low recall. If the threshold is raised to 0.9, precision becomes even higher and recall even lower.
When we want to avoid missing patients who actually have cancer (avoid false negatives, i.e. failing to tell a patient they have cancer and thereby delaying treatment): we can lower the threshold to 0.3. We then have low precision (many of the positive predictions are actually wrong) and high recall (most of the actual cancer cases are caught).
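The effect of moving the threshold can be sketched directly from a classifier's predicted probabilities. The probabilities and labels below are made-up illustrative data, not from any real model:

```python
# Sketch: how the decision threshold trades precision against recall.
# y_true and probs are invented illustrative data.

def precision_recall(y_true, probs, threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
probs  = [0.95, 0.80, 0.75, 0.60, 0.55, 0.45, 0.40, 0.35, 0.20, 0.10]

# Raising the threshold increases precision and decreases recall.
for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Running this shows precision rising and recall falling as the threshold moves from 0.3 up to 0.9, matching the intuition above.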
So for most classifiers of this kind, we need to weigh precision against recall.
The precision-recall curve (traced out as the threshold changes), as shown on the right, can take many different shapes depending on the specific algorithm.
So can we automatically pick the right threshold?
How to choose the right threshold
The three algorithms above have different thresholds, i.e. different precision and recall values, so which of the three models should we choose? We need a single evaluation metric to compare them.
Precision and recall themselves cannot serve as the evaluation metric, because they are two different numbers, so that option is ruled out.
If we use the average (P + R)/2 as the evaluation metric: the average of algorithm 3 is the largest, but algorithm 3 is not a good algorithm, because we could predict y = 1 for everything (by pushing the threshold down) to achieve high recall and low precision. That is obviously not a good algorithm, yet it has a very good average, so we cannot use the average as the evaluation metric.
F score (or F1 score): an evaluation metric used in machine learning to combine precision and recall (e.g. for selecting a threshold): F1 = 2PR/(P + R). When either precision or recall is small, the F value given by this formula is also small, which avoids the problem with the simple average described above. In other words, a large F value means that both precision and recall are reasonably large.
If either precision or recall is 0, then F = 0; for a perfect model, i.e. precision and recall both 1, F = 1. So in practice the F value ranges between 0 and 1.
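A small sketch makes the contrast with the average concrete. The three precision/recall pairs below are illustrative placeholders for the three algorithms discussed above, with "algorithm 3" standing in for the degenerate predict-everything-as-y=1 model:

```python
# Sketch: F1 = 2PR/(P+R) penalizes the degenerate "predict all y=1"
# model, while the simple average (P+R)/2 rewards it.

def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative precision/recall values for three hypothetical models.
models = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3 (predict all y=1)": (0.02, 1.0),
}

for name, (p, r) in models.items():
    avg = (p + r) / 2
    print(f"{name}: average={avg:.3f}, F1={f1(p, r):.3f}")
```

Algorithm 3 wins on the average yet has by far the lowest F1, which is exactly why F1 is the better model-selection metric here. Note also the boundary behavior: f1(0, r) is 0 and f1(1, 1) is 1, matching the 0-to-1 range stated above.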
Summary
- Trade-offs between precision and recall (change their values by changing the threshold)
- Different thresholds correspond to different precision and recall values; to choose an appropriate threshold and get a good model, perform model selection using the F value on the cross-validation set
- If you want to select the threshold automatically, try a series of different thresholds and pick the best one on the cross-validation set
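The automatic selection in the last bullet can be sketched as a simple sweep: evaluate F1 on the cross-validation set at each candidate threshold and keep the best. The labels and probabilities below are invented for illustration:

```python
# Sketch: pick the threshold that maximizes F1 on a cross-validation set.
# y_cv and probs_cv are made-up illustrative data.

def f1_at_threshold(y_cv, probs_cv, threshold):
    preds = [1 if p >= threshold else 0 for p in probs_cv]
    tp = sum(1 for t, p in zip(y_cv, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_cv, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_cv, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_cv = [1, 0, 1, 1, 0, 0, 1, 0]
probs_cv = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Try a series of candidate thresholds; keep the one with the best F1.
candidates = [i / 100 for i in range(5, 100, 5)]
best_threshold = max(candidates, key=lambda t: f1_at_threshold(y_cv, probs_cv, t))
print("best threshold:", best_threshold,
      "F1:", round(f1_at_threshold(y_cv, probs_cv, best_threshold), 3))
```

In a real pipeline the probabilities would come from the trained classifier's output on held-out cross-validation examples, but the selection logic is the same.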
Handling skewed data---trading off precision and recall