The application of the kappa coefficient in evaluation
Copyright notice: this article is from the Fat Meow blog. Please credit the source when reposting: http://www.cnblogs.com/by-dream/p/7091315.html
Objective
Recently I have been planning to do a proper job of human evaluation of translation quality.
First, a few words about how human evaluation of translation quality is done on my side. We collect a batch of sentences, translate them with different engines, then present the source sentences and translations in the format below and have professionals score them; after scoring, we aggregate the statistics to obtain the evaluation result.
The process looks smooth, and the results do have reference value. In practice, however, we found that if an evaluator has problems with ability or attitude, it distorts the scores. So whether an evaluator is reliable is also a factor we need to consider.
By consulting professionals, we learned that the kappa coefficient can be used to check the consistency and accuracy of a classification, so I decided to try it.
First, let's look at the concept of the kappa coefficient and how it is calculated.
Kappa coefficient concept
It is computed by multiplying the total number of cells (N) by the sum of the diagonal entries of the confusion matrix (the correctly classified cells), subtracting the sum, over all classes, of the product of the number of cells truly in that class and the number of cells assigned to that class, and then dividing by N squared minus that same sum.
--From Wikipedia
The kappa value lies in the range -1 to 1, but it usually falls between 0 and 1 and can be divided into five bands representing different levels of agreement: 0.0~0.20 slight agreement, 0.21~0.40 fair agreement, 0.41~0.60 moderate agreement, 0.61~0.80 substantial agreement, and 0.81~1 almost perfect agreement.
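As a small illustration (my own sketch, not code from the original post), the bands above can be expressed as a helper function:

```python
# Map a kappa value to the agreement level listed above.
def kappa_level(kappa):
    if kappa <= 0.20:
        return "slight"
    elif kappa <= 0.40:
        return "fair"
    elif kappa <= 0.60:
        return "moderate"
    elif kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(kappa_level(0.51))  # moderate
```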
Calculation formula:
k = (po - pe) / (1 - pe)
where po is the sum of the correctly classified samples in each class divided by the total number of samples, i.e. the overall classification accuracy.
Assume the true number of samples in each class is a1, a2, ..., aC, the number of samples predicted for each class is b1, b2, ..., bC, and the total number of samples is n. Then:
pe = (a1×b1 + a2×b2 + ... + aC×bC) / (n×n)
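As a minimal sketch of this formula (my own illustration, not the script from this post), kappa can be computed directly from a square confusion matrix:

```python
# Compute Cohen's kappa from a square confusion matrix (list of lists).
# matrix[i][j] counts items placed in class i by one rater and class j by the other.
def kappa_from_confusion_matrix(matrix):
    n = sum(sum(row) for row in matrix)                      # total number of samples
    po = sum(matrix[i][i] for i in range(len(matrix))) / n   # observed agreement
    row_totals = [sum(row) for row in matrix]                # a1 ... aC
    col_totals = [sum(col) for col in zip(*matrix)]          # b1 ... bC
    pe = sum(a * b for a, b in zip(row_totals, col_totals)) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)
```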
Worked example
To better understand the calculation above, here is an example.
In an essay exam, two teachers each grade every student as good, medium, or poor. Given both teachers' results, we want to compute the kappa coefficient between their scores. The grade counts (rows are teacher A's grades, columns are teacher B's, as implied by the sums below) are:
          good   medium   poor
good        10        2      8
medium       5       35      5
poor         5        2     15
po = (10+35+15)/87 ≈ 0.690
a1 = 10+2+8 = 20; a2 = 5+35+5 = 45; a3 = 5+2+15 = 22
b1 = 10+5+5 = 20; b2 = 2+35+2 = 39; b3 = 8+5+15 = 28
pe = (a1×b1 + a2×b2 + a3×b3)/(87×87) = 2771/7569 ≈ 0.366
k = (po - pe)/(1 - pe) ≈ 0.51
So we get the kappa coefficient.
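As a quick sanity check of the arithmetic above (my own snippet, not the author's code):

```python
# The two teachers' grade counts: rows are teacher A, columns are teacher B.
matrix = [
    [10, 2, 8],    # good
    [5, 35, 5],    # medium
    [5, 2, 15],    # poor
]
n = sum(sum(row) for row in matrix)                  # 87
po = sum(matrix[i][i] for i in range(3)) / n         # (10+35+15)/87 ≈ 0.690
a = [sum(row) for row in matrix]                     # [20, 45, 22]
b = [sum(col) for col in zip(*matrix)]               # [20, 39, 28]
pe = sum(x * y for x, y in zip(a, b)) / (n * n)      # 2771/7569 ≈ 0.366
print(round((po - pe) / (1 - pe), 3))                # ≈ 0.51
```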
Practical application
As mentioned at the beginning, after the real questionnaires come back, I generally compute each evaluator's kappa coefficient before paying out the reward, because the reward is not cheap, and this also saves the company money.
I usually have 5 people fill out each questionnaire. More people would be more accurate, but to balance cost against getting valid results I settled on 5. My first idea was to use the average of the 5 people's scores as the reference answer and compute each person's kappa against that average, but I later realized this is unreasonable: if one person answers randomly, their scores shift the average and therefore affect the whole result. In the end I computed the kappa between each pair of people directly and then averaged. That way, when someone answers carelessly, we can spot them from the pairwise kappas, and when computing the final average kappa we drop the values between that person and everyone else. (A sketch of this pairwise computation follows.)
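A minimal sketch of the pairwise idea (my own illustration; it uses scikit-learn's cohen_kappa_score rather than a hand-rolled kappa, and assumes scores is a list of rating lists, one per evaluator over the same sentences):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappas(scores):
    """Kappa for every pair of evaluators: {(i, j): kappa}."""
    return {(i, j): cohen_kappa_score(scores[i], scores[j])
            for i, j in combinations(range(len(scores)), 2)}

def average_kappa(scores, exclude=()):
    """Average pairwise kappa, optionally dropping flagged evaluators."""
    values = [k for (i, j), k in pairwise_kappas(scores).items()
              if i not in exclude and j not in exclude]
    return sum(values) / len(values)
```

For example, average_kappa(scores, exclude={4}) recomputes the average after removing evaluator 4's pairs, which matches the "drop the careless person" step described above.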
At first I implemented the kappa calculation in Python and ran it on one set of results, only to find that everyone's pairwise kappa was very low, about 0.1-0.2. Later analysis showed that the 5-point scale made the data too discrete, so for translation-engine evaluation I collapse the users' 5-point scores into 3 categories: 1 and 2 become one category, 3 is one category, and 4 and 5 are one category.
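The collapsing step might look like this (a sketch of the 1-2 / 3 / 4-5 bucketing described above):

```python
# Collapse a 5-point score into 3 categories: 1-2 -> 1, 3 -> 2, 4-5 -> 3.
FIVE_TO_THREE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}

def collapse(scores):
    return [FIVE_TO_THREE[s] for s in scores]
```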
Of course, as an extra layer of insurance, one of the 5 people on every questionnaire is a professional evaluator I trust very much, so I also compute each person's kappa against hers, which helps confirm that everyone's scores are reasonable and correlated.
Here is the Python script I implemented.
(Code to be added)
[Description: input file *** to be added]
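Since the original script is not included above, here is a minimal sketch of what it might look like. It assumes a whitespace/tab-separated input file (the name scores.tsv is made up), one row per sentence and one column per evaluator, scores already collapsed to 3 points, and the trusted professional evaluator in the first column; it uses scikit-learn's cohen_kappa_score rather than a hand-written kappa:

```python
# -*- coding: utf-8 -*-
# Sketch only, not the original script from this post.
import sys
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def load_scores(path):
    """Read the questionnaire file and return one score list per evaluator (column)."""
    with open(path, encoding="utf-8") as f:
        rows = [[int(x) for x in line.split()] for line in f if line.strip()]
    return [list(col) for col in zip(*rows)]  # transpose rows -> columns

def main(path):
    scores = load_scores(path)
    n = len(scores)

    # Kappa between every pair of evaluators, and the overall average.
    kappas = {}
    for i, j in combinations(range(n), 2):
        kappas[(i, j)] = cohen_kappa_score(scores[i], scores[j])
        print("kappa(%d, %d) = %.3f" % (i, j, kappas[(i, j)]))
    print("average pairwise kappa = %.3f" % (sum(kappas.values()) / len(kappas)))

    # Kappa between each evaluator and the trusted evaluator (assumed to be column 0).
    for i in range(1, n):
        print("kappa(trusted, %d) = %.3f" % (i, cohen_kappa_score(scores[0], scores[i])))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "scores.tsv")
```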
Here is one of the questionnaires. I calculated both the "average kappa among all evaluators" and "the kappa between each person and the best evaluator"; the result speaks for itself. It is clear that the ratings of the user marked in red are unqualified, and after my manual screening, sure enough, this user's scores were indeed very unreasonable.
(picture to be added)
With the kappa coefficient as a check on scoring like this, we have more confidence in the ratings, and a clearer picture of the accuracy and reliability of our evaluation results.