In machine learning (ML), natural language processing (NLP), information retrieval (IR) and related fields, evaluation is essential work, and the usual evaluation metrics are: accuracy, precision, recall and F1-measure. (Note: in IR the ground truth is often an ordered list rather than an unordered collection of boolean judgments. When everything relevant is found, ranking an item third instead of fourth costs little, but ranking it first versus 100th, although both count as "found", means something quite different, so metrics such as MAP are more applicable there.)
This article briefly describes these concepts. The Chinese translations of these metric names vary, so using the English terms is generally recommended.
Let's assume a specific scenario as an example.
Suppose a class has 80 boys and 20 girls, 100 people in total. The goal is to find all the girls.
Someone picks out 50 people, of whom 20 are girls, and also wrongly selects 30 boys as girls.
As the evaluator, you need to evaluate his work.
First we can compute the accuracy, which is defined as: for a given test data set, the ratio of the number of samples the classifier classifies correctly to the total number of samples. In other words, it is what the 0-1 loss function measures on the test data set [1].
This sounds a bit abstract. Simply put, in the scenario above there are two actual categories in the class, male and female, and someone (that is, the classifier) divides the class into male and female. What accuracy captures is the proportion of people he classified correctly out of the total. It is easy to work out: he classified 70 people correctly (20 girls + 50 boys), and the total is 100, so his accuracy is 70% (70/100).
Accuracy can indeed tell us, in some sense, whether a classifier is effective, but it is not always a good way to evaluate a classifier's work. For example, suppose Google crawled 100 ARGCV pages and has 10,000,000 pages in its index in total. Now randomly draw a page and classify it: is this an ARGCV page or not? If accuracy alone judged my work, I would simply label every page "not an ARGCV page" (just return false, one line of code), and my accuracy would reach 99.999% (9,999,900/10,000,000), outscoring plenty of classifiers that did real work. Yet my algorithm obviously does not meet the need. How do we solve this? This is where precision, recall and F1-measure come into play.
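As a quick sanity check of the arithmetic above, here is a minimal sketch (the page counts are the made-up figures from the example, and `always_not_argcv` is a hypothetical name for the do-nothing classifier):

```python
# A degenerate classifier that labels every page "not an ARGCV page".
def always_not_argcv(page):
    return False

total_pages = 10_000_000   # pages in the index (example figure)
argcv_pages = 100          # pages that actually belong to ARGCV (example figure)

# The classifier is wrong only on the 100 ARGCV pages and "right" on everything else.
correct = total_pages - argcv_pages
accuracy = correct / total_pages
print(accuracy)   # 0.99999 -> 99.999% accuracy, yet it never finds a single ARGCV page
```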
Before we talk about precision, recall and F1-measure, we first need to define four classification outcomes: TP, FN, FP and TN.
Following the earlier example, we need to find all the girls in the class. If we treat this task as a classifier, then the girls are what we want and the boys are not, so we call the girls the "positive class" and the boys the "negative class".
|  | Relevant (positive class) | Irrelevant (nonrelevant, negative class) |
| --- | --- | --- |
| Retrieved | True positives (TP: a positive instance judged positive; in the example, correctly deciding "this is a girl") | False positives (FP: a negative instance judged positive, "keeping the false"; in the example, a boy mistakenly judged to be a girl) |
| Not retrieved | False negatives (FN: a positive instance judged negative, "discarding the true"; in the example, a girl mistakenly judged to be a boy) | True negatives (TN: a negative instance judged negative; in the example, a boy correctly judged to be a boy) |
With this table, we can easily get these values:
TP = 20
FP = 30
FN = 0
TN = 50
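To make these four counts concrete, here is a small sketch that rebuilds the class example from label lists (the variable names and list layout are my own; only the numbers come from the text):

```python
# 100 students: 20 girls (the positive class) followed by 80 boys (the negative class).
actual = ["girl"] * 20 + ["boy"] * 80

# The classifier labels 50 people as "girl": all 20 girls plus 30 boys by mistake.
predicted = ["girl"] * 20 + ["girl"] * 30 + ["boy"] * 50

tp = sum(1 for a, p in zip(actual, predicted) if a == "girl" and p == "girl")
fp = sum(1 for a, p in zip(actual, predicted) if a == "boy" and p == "girl")
fn = sum(1 for a, p in zip(actual, predicted) if a == "girl" and p == "boy")
tn = sum(1 for a, p in zip(actual, predicted) if a == "boy" and p == "boy")

print(tp, fp, fn, tn)   # 20 30 0 50
```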
The formula for precision is precision = TP / (TP + FP): it is the proportion of correctly retrieved items (TP) among all items actually retrieved (TP + FP).
In the example, we want to know what proportion of the people he picked out are the right ones (that is, girls). So his precision is 40% (20 girls / (20 girls + 30 boys)).
The formula for recall is recall = TP / (TP + FN): it is the proportion of correctly retrieved items (TP) among all items that should have been retrieved (TP + FN).
In the example, we want to know what proportion of all the girls in the class he managed to pick out, so his recall is 100% (20 girls / (20 girls + 0 girls missed)).
The F1 value is the harmonic mean of precision and recall, i.e. F1 = 2 * precision * recall / (precision + recall), or equivalently 2 / F1 = 1 / precision + 1 / recall.
In the example, the F1-measure is about 57.143% (2 * 0.4 * 1.0 / (0.4 + 1.0) ≈ 0.5714).
It should be noted that some authors [2] give a more general formula, F_a = (1 + a^2) * precision * recall / (a^2 * precision + recall), which generalizes the F-measure.
The F1-measure weights precision and recall equally, but in some scenarios we may consider precision (or recall) more important; adjusting the parameter a and using the F_a-measure helps us evaluate results accordingly.
Although this takes many words, it is actually very easy to implement; click here to see a simple implementation of mine.
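Since the linked implementation is not reproduced here, below is a minimal sketch of what such an implementation might look like (the function names are my own; the numbers are the TP/FP/FN values from the example):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

def f_a(p, r, a=1.0):
    # Generalized F-measure: a > 1 weights recall more, a < 1 weights precision more.
    return (1 + a * a) * p * r / (a * a * p + r)

p = precision(20, 30)    # 0.4
r = recall(20, 0)        # 1.0
print(f1(p, r))          # ~0.5714, i.e. about 57.143%
print(f_a(p, r, a=1.0))  # equals F1 when a = 1
```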
References
[1] Li Hang. Statistical Learning Methods [M]. Beijing: Tsinghua University Press, 2012.
[2] Precision, Recall and the comprehensive evaluation index F1-measure.
==================================================
My own understanding, plus my teacher's way of putting it: precision is about finding things right, and recall is about finding everything.
Roughly speaking, when you ask a model whether each item in a pile belongs to a certain class, precision is the probability that an item really belongs to the class when the model says it does, and recall tells you how much the model misses: it fails to find a (1 - recall) share of the items it should have said yes to.
==================================================
In information retrieval and classification systems there is a series of metrics, and understanding them is very important for evaluating retrieval and classification performance, so I recently put together a summary based on other bloggers' posts.
Precision, recall, F1
The two most basic metrics in information retrieval, classification, recognition, translation and other fields are recall (also called the recall ratio) and precision (also called the precision ratio). The concept formulas are:
Recall = number of relevant documents retrieved by the system / total number of relevant documents in the system
Precision = number of relevant documents retrieved by the system / total number of documents retrieved by the system
The diagram divides the collection into four regions:
A: retrieved and relevant (found, and what the user wanted)
B: retrieved but irrelevant (found, but useless)
C: not retrieved but relevant (not found, but actually wanted)
D: not retrieved and irrelevant (not found, and not wanted)
With these regions, recall = A / (A + C) and precision = A / (A + B); a small code sketch follows below.
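A minimal sketch of these two formulas, using made-up document IDs (only the set relationships matter):

```python
retrieved = {1, 2, 3, 4, 5}        # what the system returned
relevant  = {1, 2, 3, 8, 9, 10}    # what the user actually wanted
all_docs  = set(range(1, 21))      # the whole collection

A = retrieved & relevant             # retrieved and relevant
B = retrieved - relevant             # retrieved but irrelevant
C = relevant - retrieved             # relevant but missed
D = all_docs - retrieved - relevant  # neither retrieved nor relevant

recall    = len(A) / (len(A) + len(C))   # 3 / 6 = 0.5
precision = len(A) / (len(A) + len(B))   # 3 / 5 = 0.6
print(precision, recall)
```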
Note: precision and recall affect each other. Ideally both would be high, but in general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low (if both are low, something is wrong). Usually one computes a set of precision and recall values at different thresholds and trades them off according to the scenario. For example:
If you are building search, you improve precision while guaranteeing recall; if you are doing disease monitoring or anti-spam, you improve recall while guaranteeing precision.
Therefore, when both need to be high, F1 can be used as the measure.
F1 = 2 * P * R / (P + R)
The formula is basically that simple, but how do we obtain the A, B, C and D in Figure 1? This requires manual labeling, and manually labeling data takes a lot of time and is tedious; if you only need to run experiments you can use an existing annotated corpus. Another option is to take a more mature algorithm as a baseline and use its output as the gold standard for comparison, but this approach has a problem too: if a good algorithm were already available, there would be no need for further research.
AP and MAP (Mean Average Precision)
MAP addresses the single-point-value limitation of precision, recall and F-measure. To obtain a metric that reflects global performance, look at the figure: the two curves (one marked with squares, one with dots) are the precision-recall curves of two retrieval systems.
As can be seen, although the two performance curves overlap in places, the system marked with dots performs far better than the one marked with squares in most cases.
From this we can conclude that the better a system performs, the more its curve should bulge outward; more precisely, the larger the area between the curve and the axes should be.
Ideally this area is 1, and any system's area should be greater than 0. This is the most commonly used single-number performance metric for evaluating information retrieval systems, mean average precision (MAP), defined as follows (where P and R are precision and recall): the average precision (AP) of a single query is the area under its precision-recall curve, AP = ∫ P(R) dR, and MAP is the mean of AP over all queries.
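The original formula image is not reproduced here; as an illustration, the sketch below uses the common discrete form, where AP averages precision@k over the ranks k at which a relevant document appears, and MAP averages AP over queries (the ranked lists and relevance sets are made-up examples):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP: mean of precision@k over the ranks k where a relevant item appears."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)   # precision at this cutoff
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over all queries; runs is a list of (ranking, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two made-up queries.
runs = [
    ([1, 3, 2, 7, 5], {1, 2, 5}),   # AP = (1/1 + 2/3 + 3/5) / 3
    ([4, 9, 8],       {9}),         # AP = (1/2) / 1
]
print(mean_average_precision(runs))
```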
ROC and AUC
ROC and AUC are metrics for evaluating classifiers. They still use the A, B, C and D from the first table above (equivalently TP, FP, FN, TN); only a slight transformation is needed.
Back to ROC: its full name is Receiver Operating Characteristic.
ROC focuses on two metrics:
True Positive Rate (TPR) = TP / (TP + FN); TPR is the probability that a positive instance is correctly classified as positive.
False Positive Rate (FPR) = FP / (FP + TN); FPR is the probability that a negative instance is incorrectly classified as positive.
In ROC space, each point's horizontal coordinate is FPR and its vertical coordinate is TPR, which depicts the classifier's trade-off between true positives and false positives. The main analytical tool is the ROC curve drawn in ROC space. For a binary classification problem, the score an instance receives is often a continuous value, and we assign instances to the positive or negative class by setting a threshold (for example, scores above the threshold go to the positive class). So we can vary the threshold, classify according to each threshold, compute the corresponding point in ROC space from each classification result, and connect these points to form the ROC curve. The ROC curve passes through (0, 0) and (1, 1); in fact, the straight line from (0, 0) to (1, 1) is the ROC curve of a random classifier. In general, a classifier's curve should lie above this (0, 0)-(1, 1) line.
Representing classifier performance with an ROC curve is intuitive and useful. However, people always want a single number to characterize how good a classifier is.
So the Area Under the ROC Curve (AUC) came about. As the name implies, the AUC value is the area of the region below the ROC curve. In general, the AUC is between 0.5 and 1.0, and a larger AUC indicates better performance.
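To make the threshold-sweeping description concrete, here is a minimal sketch that builds ROC points from scores and labels and then approximates the AUC with the trapezoidal rule (the scores and labels are made-up; this is not the tool linked below):

```python
def roc_points(scores, labels):
    """One (FPR, TPR) point per threshold, using each distinct score as a threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return sorted(points)   # from (0, 0) up to (1, 1)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(auc(roc_points(scores, labels)))   # 0.8125 here; closer to 1.0 is better
```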
AUC Calculation tool:
http://mark.goadrich.com/programs/AUC/
P/R and ROC are two different evaluation metrics and calculation methods; in general, retrieval uses the former, while classification, recognition and similar tasks use the latter.
Reference Links:
http://www.vanjor.org/blog/2010/11/recall-precision/
http://bubblexc.com/y2011/148/
http://wenku.baidu.com/view/ef91f011cc7931b765ce15ec.html
Recall is also known as the recall ratio; thinking of it as "how good the memory is" better reflects its substantive meaning.
Precision
Although "recall" and "precision" are not necessarily correlated (as can be seen from the formulas above), in practical applications they constrain each other, and you have to find a balance according to the actual need.
When we ask the system for all the details about a certain thing (that is, input a search query), recall measures how many of those details the system can "recall"; in plain words, its "ability to remember". The number of details it can recall divided by all the details the system knows is the "memory rate", which is the recall. Put simply, it can be understood as how completely things are recalled.