I learned these concepts a long time ago, but I keep forgetting them because they run counter to my own habits of thinking. So here is a short summary to come back to whenever I forget (other articles are too verbose, and they do not explain the computation clearly either...).
In addition, I will keep adding other machine learning concepts and metrics here, for my own convenience and for the convenience of others.
Note: This article mixes two sets of terms, positive/negative and (+)/(-).
True positive rate, false positive rate
These concepts were actually brought into machine learning from medicine, so the logic behind them differs somewhat from how machine learning people usually think. When we see a doctor, the lab sheet or report shows (+) and (-), meaning positive and negative respectively. For example, if you get tested for some disease, positive (+) means you have it and negative (-) means you do not.
So, is the test reliable? When designing such a test, researchers want to know: if a person really is sick, what is the probability that the test detects it (the true positive rate)? And if a person is not sick, what is the probability that the test wrongly diagnoses the disease (the false positive rate)?
Specifically, look at the following table (from Baidu Encyclopedia):
The true positive rate (TPR) is:
$$\mathrm{TPR} = \frac{a}{a + c}$$
That is, the number of truly positive samples that the test detects as positive (a), divided by the number of all truly positive samples (a + c).
The false positive rate (FPR) is:
$$\mathrm{FPR} = \frac{b}{b + d}$$
That is, the number of truly negative samples that the test wrongly detects as positive (b), divided by the number of all truly negative samples (b + d).
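The two formulas above can be checked with a few lines of Python. This is only a minimal sketch: the meaning of the four cells a, b, c, d is inferred from the formulas (the table itself is not reproduced here), and the counts are made up for illustration.

```python
def rates(a, b, c, d):
    """Return (TPR, FPR) from the four cells of the 2x2 table.

    Assumed cell meanings, inferred from the formulas:
      a: truly positive, tested positive (true positives)
      b: truly negative, tested positive (false positives)
      c: truly positive, tested negative (false negatives)
      d: truly negative, tested negative (true negatives)
    """
    tpr = a / (a + c)  # share of truly positive samples that are detected
    fpr = b / (b + d)  # share of truly negative samples wrongly flagged
    return tpr, fpr


# Hypothetical counts: 100 sick people, 80 of them detected;
# 900 healthy people, 90 of them wrongly flagged.
tpr, fpr = rates(a=80, b=90, c=20, d=810)
print(tpr, fpr)  # 0.8 0.1
```

Note that the two rates use different denominators: TPR only looks at the truly positive column, FPR only at the truly negative column.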
ROC (Receiver Operating Characteristic)
It is very simple: draw a two-dimensional Cartesian coordinate system with the false positive rate on the x-axis and the true positive rate on the y-axis. Then keep adjusting the threshold of the test (or of the classifier, in machine learning), i.e. the value above which a final score counts as positive and below which it counts as negative. Each threshold gives a different pair of true positive rate and false positive rate; plot these as points and you get an ROC curve.
It is important to note that the ROC curve always starts at (0,0) and ends at (1,1). When every sample is judged negative (-), the point is (0,0); when every sample is judged positive (+), the point is (1,1). The straight line through these two points, with slope 1, corresponds to a random classifier (one that cannot tell truly positive samples from truly negative ones). So the curve of any useful classifier should lie above this line.
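The threshold sweep described above might be sketched as follows. The function name `roc_points`, the convention that a score greater than or equal to the threshold counts as positive, and the sample data are all assumptions made for this illustration.

```python
def roc_points(scores, labels):
    """Return a list of (FPR, TPR) points, one per threshold.

    scores: classifier scores, higher means more likely positive
    labels: 1 for truly positive, 0 for truly negative
    A sample is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Start with a threshold above every score (everything negative),
    # then lower it one distinct score at a time.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points[0], points[-1])  # (0.0, 0.0) (1.0, 1.0)
```

The first and last points confirm the two endpoints: with the highest threshold nothing is positive, and with the lowest threshold everything is.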
The plot looks roughly like the one below (reposted from here):
AUC (Area Under Curve)
As the name implies, it is the area under the ROC curve. The closer it is to 1, the better the classifier.
However, computing the AUC directly as an area is cumbersome. Because the AUC is equivalent to the Wilcoxon-Mann-Whitney test statistic, it can be computed that way instead. The statistic is this: pick any positive sample and any negative sample; the AUC is the probability that the score of the positive sample is greater than the score of the negative sample (the score being the one given by the classifier).
Method one:
For a data set with M positive samples and N negative samples, form all M×N pairs of one positive and one negative sample. A pair in which the positive sample scores higher than the negative sample counts 1 point, a pair in which it scores lower counts 0 points, and a pair with equal scores counts 0.5 points. The total divided by M×N is the AUC. The complexity is O(M×N).
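The pair-counting rule above translates almost directly into code. This is a minimal sketch; the function name `auc_pairwise` and the sample data are assumptions made for illustration.

```python
def auc_pairwise(scores, labels):
    """AUC by scoring every (positive, negative) pair. O(M*N) time.

    scores: classifier scores, higher means more likely positive
    labels: 1 for truly positive, 0 for truly negative
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0   # positive ranked above negative
            elif p == q:
                total += 0.5   # tie counts half a point
            # p < q contributes 0 points
    return total / (len(pos) * len(neg))


# Hypothetical scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_pairwise(scores, labels)
print(auc)  # 8 of the 9 pairs are ordered correctly, so this is 8/9
```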
Method two:
The basic idea is the same, but the complexity can be reduced to O((M+N) log(M+N)).
First, sort all samples by score from largest to smallest. The sample with the highest score gets rank M+N, the second gets rank M+N-1, and so on. Then sum the ranks of all positive samples. The idea is that a positive sample with rank k scores higher than at most k-1 negative samples, because exactly k-1 samples rank below it. Summing the ranks of the positive samples and subtracting (1+M)M/2, which removes the cases where a positive sample is counted against itself or against a lower-ranked positive sample, leaves the number of pairs in which the positive sample scores higher than the negative sample. Dividing this by M×N gives the AUC. The formula is as follows:
$$\mathrm{AUC} = \frac{\sum_{i \in \text{positive}} \mathrm{rank}_i - \frac{M(1+M)}{2}}{M \times N}$$
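The rank-based method described above might be sketched as follows. The function name `auc_by_rank` and the sample data are assumptions. The text does not say what to do with equal scores; giving tied samples the average of their ranks is an assumption added here so that ties behave like the 0.5-point rule of the first method.

```python
def auc_by_rank(scores, labels):
    """AUC from the rank sum of the positive samples. O((M+N) log(M+N)).

    scores: classifier scores, higher means more likely positive
    labels: 1 for truly positive, 0 for truly negative
    """
    m = sum(labels)           # number of positive samples (M)
    n = len(labels) - m       # number of negative samples (N)
    # Sort ascending so the lowest score gets rank 1 and the highest M+N,
    # which is the same ranking as described in the text.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run order[i..j] of samples sharing the same score.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        # Assumption: tied samples share the average of ranks i+1 .. j+1.
        avg_rank = (i + 1 + j + 1) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - m * (m + 1) / 2.0) / (m * n)


# Hypothetical scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_by_rank(scores, labels)
print(auc)  # positive ranks are 3, 5, 6: (14 - 6) / 9 = 8/9
```

The sort dominates the running time; everything after it is a single pass over the samples.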