(1) In big data processing, after data cleaning is finished, the next step is to analyze the phenomenon and build models, and then to verify the effectiveness of those models.
(2) Indicators for validating a model on big data test sets:
Accuracy, Precision, Recall, and F1-measure.
True positives, true negatives, false positives, and false negatives are described below.
(3) Detailed description: recall and precision are currently the most reasonable indicators for measuring retrieval effectiveness.
Recall = (amount of relevant information retrieved / total amount of relevant information in the system) × 100%
Precision = (amount of relevant information retrieved / total amount of information retrieved) × 100%
The former measures the ability of the retrieval system (and the searcher) to find relevant information; the latter measures the ability to reject irrelevant information. Together they characterize retrieval effectiveness.
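As a concrete illustration of these two definitions, here is a minimal Python sketch that computes recall and precision for a single query from the set of retrieved documents and the set of relevant documents; the document IDs are made up purely for illustration.

```python
# Minimal sketch: recall and precision for a single query.
# The document IDs below are hypothetical, used only to illustrate the formulas.
relevant = {"d1", "d2", "d3", "d4", "d5"}   # all relevant documents in the system
retrieved = {"d1", "d2", "d7", "d9"}        # documents returned by the retrieval system

relevant_retrieved = relevant & retrieved    # relevant documents that were actually retrieved

recall = len(relevant_retrieved) / len(relevant) * 100      # 2/5 = 40%
precision = len(relevant_retrieved) / len(retrieved) * 100  # 2/4 = 50%

print(f"Recall    = {recall:.1f}%")
print(f"Precision = {precision:.1f}%")
```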
True Positive (TP): a positive sample that the model predicts as positive.
True Negative (TN): a negative sample that the model predicts as negative.
False Positive (FP): a negative sample that the model predicts as positive; the FP rate is also called the false alarm rate.
False Negative (FN): a positive sample that the model predicts as negative; the FN rate is also called the miss rate.
(4) Illustrated with a table:
                                    Relevant (actual positive)    Irrelevant (actual negative)
Retrieved (predicted positive)      true positives (TP)           false positives (FP)
Not retrieved (predicted negative)  false negatives (FN)          true negatives (TN)

False positives are also commonly referred to as false alarms, and false negatives are also commonly referred to as misses.
Calculation:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
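Written as code, the same two formulas look like the following minimal sketch (the helper name precision_recall is my own, not from the original post):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```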
In practice, we would of course like both to be high: the higher the precision of the returned results the better, and the higher the recall the better. In some cases, however, the two are in conflict. For example, if we return only a single result and it is correct, then P is 100% but R is very low; if instead we return all results, then R is 100% but P is very low. Therefore, in different situations you need to decide whether you care more about P or about R. When running experiments, you can draw a Precision-Recall curve to help with the analysis, as in the sketch below.
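One common way to draw such a curve is with scikit-learn and matplotlib; this is only a sketch, and the labels and scores below are fabricated for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Fabricated ground-truth labels and model scores, purely for illustration.
y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5, 0.6, 0.3]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve")
plt.show()
```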
A practical example:
Suppose a data set contains 100 items that should be detected. Our method returns 10 items, of which 8 are among the 100 items that should be detected and 2 are not; the remaining 92 items that should have been detected are not returned.
Then Precision = 8/10 = 80%, while Recall = 8/100 = 8%.
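Using the precision_recall helper sketched above, the counts for this example are TP = 8, FP = 2, FN = 92:

```python
p, r = precision_recall(tp=8, fp=2, fn=92)
print(f"Precision = {p:.0%}")   # 80%
print(f"Recall    = {r:.0%}")   # 8%
```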
(5) The F-measure is the weighted harmonic mean of precision and recall. In general, F_β = (1 + β²) · P · R / (β² · P + R); with β = 1 this reduces to F1 = 2 · P · R / (P + R).
It is easy to see that F1 combines the results of P and R: the higher the F1, the better the method performs in the experiment.
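A minimal sketch of the F-measure in its general β-weighted form (β = 1 gives F1); the function name is illustrative:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# For the example above: P = 0.8, R = 0.08 -> F1 ≈ 0.145
print(f_measure(0.8, 0.08))
```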
Partly excerpted from: http://blog.csdn.net/t710smgtwoshima/article/details/8215037
(6) Limitations of recall and precision
The limitation of recall is mainly that it is the ratio of the relevant information retrieved to the total relevant information stored in the retrieval system, yet the total amount of relevant information in the system is generally unknown and can only be estimated. In addition, recall rests on an implicit "assumption": that all information retrieved for the user has the same value. In reality this is not so; for the user, the relevance of the information is in a sense more important than its quantity.
The limitation of precision is mainly that if the search results are bibliographic records rather than full text, the titles are brief and it is hard for users to judge from them whether the retrieved information is closely related to the topic; they must consult the original documents to correctly determine whether the information meets the needs of the retrieval task. The "relevant information" counted in precision suffers from the same "assumption" limitation. Experiments show that recall and precision are inversely interdependent: increasing the recall of the output reduces its precision, and vice versa.
For users, the main factors that influence retrieval effectiveness are the comprehensiveness of document indexing and the specificity of the user's search terms.
(7) Summary
These metrics are frequently used in information retrieval (for example, search engines), natural language processing, and detection/classification tasks. Because this material is translated, some misunderstanding of the original wording is unavoidable.
Big data processing methods (experimental methods)