Lucene TF-IDF Relevance Formula
For keyword queries, Lucene by default uses the TF-IDF algorithm to calculate the relevance between the query keywords and each document, and sorts the results by this score.
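TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical weighting method, often described together with the Vector Space Model. The name sounds complicated, but it actually contains only two simple rules:
- TF: the more often a term appears in a document, the more relevant that document is to the term.
- IDF: the fewer documents a term appears in, the more important that term is.
So the TF-IDF relevance of a term is equal to TF * IDF.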
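The TF * IDF product can be sketched in a few lines of Python. This is only an illustration of the two rules, not Lucene's implementation: the function names, the tiny corpus, and the plain log(N / df) form of IDF are all assumptions made for the example.

```python
import math

def term_frequency(term, doc_tokens):
    """Rule 1: how often the term occurs in a single document."""
    return doc_tokens.count(term)

def inverse_document_frequency(term, corpus):
    """Rule 2: terms found in fewer documents get a larger weight."""
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0  # assumption: an unseen term contributes nothing
    return math.log(len(corpus) / docs_with_term)

def tf_idf(term, doc_tokens, corpus):
    """Relevance of one term for one document: TF * IDF."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Hypothetical three-document corpus, already split into tokens.
corpus = [
    "lucene is a search library".split(),
    "lucene scores documents with tf idf".split(),
    "a search engine ranks documents".split(),
]

rare = tf_idf("library", corpus[0], corpus)    # appears in 1 of 3 documents
common = tf_idf("lucene", corpus[0], corpus)   # appears in 2 of 3 documents
```

Both words occur once in the first document, but "library" scores higher than "lucene" because it appears in fewer documents overall.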
These two rules are the core of TF-IDF, and both are very simple. The second rule, however, is flawed: it assumes that words with a low document frequency are always more important and that words with a high document frequency are useless. This is obviously not entirely correct, because it does not reflect how important a word really is or how feature words are distributed. For example, when searching web documents, feature words in different parts of the HTML structure reflect the content of the article to different degrees, so they should carry different weights.
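One way to picture structure-dependent weights is to scale each occurrence of a term by the part of the page it was found in. The sketch below is purely illustrative: the field names and the weight values are hypothetical, and nothing here is taken from Lucene.

```python
# Hypothetical per-field weights; the values are illustrative only.
FIELD_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}

def weighted_term_frequency(term, fields):
    """Sum the term's occurrences per field, scaled by that field's weight."""
    total = 0.0
    for field_name, tokens in fields.items():
        weight = FIELD_WEIGHTS.get(field_name, 1.0)  # unknown fields count once
        total += weight * tokens.count(term)
    return total

# A page represented as already-tokenized fields (assumed structure).
page = {
    "title": "lucene scoring".split(),
    "body": "this page explains how scoring works in lucene".split(),
}

in_title_and_body = weighted_term_frequency("lucene", page)
in_body_only = weighted_term_frequency("explains", page)
```

With these assumed weights, a word that appears in the title counts for more than a word that appears only in the body.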
The advantage of TF-IDF is that the algorithm is simple and fast.
Lucene extends these rules to make scoring more programmable. It adds several programming hooks and normalizes weights across different queries. However, the core formula is still TF * IDF.
The Lucene algorithm formula is as follows:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
- tf(t in d) = √frequency, the square root of the number of times term t occurs in document d
- idf(t) = 1 + log(total number of documents / (number of documents containing t + 1))
- coord(q,d) is a score factor: the more query terms a document contains, the better it matches. For example, for the query "A B C", a document that contains all three terms A, B, and C gets 3 points, while a document that contains only A and B gets 2 points (Lucene's default implementation divides this count by the number of query terms). coord can be disabled
- queryNorm(q) is a normalization factor for the query, so that scores from different queries can be compared
- t.getBoost() and norm(t,d) are both programming hooks that let you adjust the weights of fields, documents, and query terms
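The individual factors above can be written as small Python functions. This is a sketch and not Lucene's API: the names are Python stand-ins chosen for this example. The query_norm formula, 1 / sqrt(sum of squared weights), is not given in this article; it is an assumption based on Lucene's classic default similarity.

```python
import math

def tf(freq):
    """Term frequency factor: square root of the raw occurrence count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + log(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    """Fraction of the query terms that the document actually contains."""
    return overlap / max_overlap

def query_norm(sum_of_squared_weights):
    """Assumed form of queryNorm: makes scores of different queries comparable."""
    return 1.0 / math.sqrt(sum_of_squared_weights)

# A document matching 2 of the 3 terms in the query "A B C".
partial_match = coord(2, 3)
```

Note that query_norm scales every document's score for a given query by the same amount, so it does not change the ranking within one query.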
These programming hooks are rarely needed in everyday use, so we can simplify Lucene's score formula:
score(q,d) = coord(q,d) · ∑ ( tf(t in d) · idf(t)² )
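The simplified formula can be tried on a toy corpus. The sketch below assumes tokenized documents, the tf and idf definitions listed earlier, and coord computed as the fraction of query terms matched. The function name and the sample data are hypothetical.

```python
import math

def simplified_score(query_terms, doc_tokens, corpus):
    """score(q,d) = coord(q,d) * sum(tf(t in d) * idf(t)^2) over the query terms."""
    if not query_terms:
        return 0.0
    num_docs = len(corpus)
    matched = 0
    total = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue  # terms missing from the document add nothing
        matched += 1
        doc_freq = sum(1 for doc in corpus if term in doc)
        tf = math.sqrt(freq)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))
        total += tf * idf ** 2
    coord = matched / len(query_terms)
    return coord * total

# Hypothetical corpus: each document is a list of tokens.
corpus = [
    "a b c".split(),
    "a b d".split(),
    "d e f".split(),
]
query = ["a", "b", "c"]
scores = [simplified_score(query, doc, corpus) for doc in corpus]
```

The first document matches all three query terms and ranks highest, the second matches two and ranks lower, and the third matches none and scores zero.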
Conclusion
Article address: http://lutaf.com/210.htm. Original article by lutaver. Reprints are welcome; please include a link to the original article.