By default, Lucene uses the TF-IDF algorithm to compute how relevant each document is to a queried keyword, and sorts the results by that score. (This is the classic similarity; since Lucene 6.0 the default is BM25.)
TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical weighting method, often used together with the vector space model. The name sounds complex, but it really contains only two simple rules:
- The more often a word or phrase appears in an article, the more relevant that article is
- The fewer documents in the whole collection that contain a word, the more important that word is
So a term's TF-IDF relevance equals TF * IDF
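The two rules can be sketched in a few lines. This is a minimal illustration that assumes the textbook definitions (raw count for TF, log(N / df) for IDF), not Lucene's own variants; the corpus and the names `tf`, `idf` and `tf_idf` are made up for the example.

```python
import math

# Toy corpus: each document is a list of already-tokenized terms (hypothetical data).
docs = [
    ["lucene", "search", "index", "lucene"],
    ["search", "engine"],
    ["cooking", "recipes"],
]

def tf(term, doc):
    """Rule 1: raw count of the term in one document."""
    return doc.count(term)

def idf(term, corpus):
    """Rule 2: the fewer documents contain the term, the larger the value."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    """Relevance of a term to a document: TF * IDF."""
    return tf(term, doc) * idf(term, corpus)
```

In this corpus, "lucene" appears twice in the first document and nowhere else, so it scores higher for that document than "search", which appears once and is shared with another document.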
These two rules are the core of TF-IDF. The second rule, however, is flawed: it simply assumes that words with a low document frequency are more important and words with a high document frequency are less useful, which is clearly not entirely correct. It cannot effectively reflect a word's importance or how the word is distributed. When searching web documents, for example, content in different parts of the HTML structure reflects the article's subject to different degrees and should carry different weights.
The advantage of TF-IDF is that the algorithm is simple and fast to compute.
To make scoring more programmable, Lucene extends the rules above with a number of programming interfaces that weight and normalize different queries, but the core formula is still TF * IDF
The Lucene scoring formula is as follows:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ), summed over each term t in q
- tf(t in d) = frequency^½, the square root of the number of times t occurs in d
- idf(t) = 1 + log(total documents / (number of documents containing t + 1))
- coord(q,d) is a scoring factor: the more of the query's terms a document contains, the better the match. For the query "A B C", a document containing all three of A, B and C counts 3 matches, while a document containing only A and B counts 2. coord can be turned off on the query
- queryNorm(q) normalizes the query so that scores from different queries can be compared
- t.getBoost() and norm(t,d) are programmable hooks that adjust the weights of fields, documents and query terms
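The factors listed above can be combined in a short sketch. It is an illustration, not Lucene's code. It assumes coord is the fraction of query terms matched (so 3/3 and 2/3 in the "A B C" example), that boosts and norms default to 1.0 when not supplied, and that the query is non-empty. All function and parameter names are hypothetical.

```python
import math

def tf(freq):
    # tf(t in d) = frequency^(1/2)
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # idf(t) = 1 + log(total documents / (documents containing t + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    # Assumption: fraction of the query's terms found in the document.
    return overlap / max_overlap

def score(query_terms, term_freqs, doc_freqs, num_docs,
          query_norm=1.0, boosts=None, norms=None):
    """term_freqs: term -> count in this document; doc_freqs: term -> document frequency."""
    boosts = boosts or {}
    norms = norms or {}
    matched = [t for t in query_terms if term_freqs.get(t, 0) > 0]
    total = 0.0
    for t in matched:
        total += (tf(term_freqs[t]) * idf(doc_freqs[t], num_docs) ** 2
                  * boosts.get(t, 1.0) * norms.get(t, 1.0))
    return coord(len(matched), len(query_terms)) * query_norm * total
```

Passing a larger value in `boosts` for one term raises its contribution without touching the TF * IDF core, which is the role the programmable hooks play in the formula.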
These programmable hooks can look cumbersome, and they are optional, so leaving them out simplifies Lucene's formula to:
score(q,d) = coord(q,d) · ∑ ( tf(t in d) · idf(t)² )
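To see the simplified formula rank documents end to end, here is a small sketch over a toy corpus. The corpus, the names `simplified_score` and `rank`, and the choice of coord as the fraction of query terms matched are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus: document id -> list of terms.
docs = {
    "d1": "a b c a".split(),
    "d2": "a b".split(),
    "d3": "c d e".split(),
    "d4": "d e f".split(),
}

def simplified_score(query, doc_terms, corpus):
    """score(q,d) = coord(q,d) * sum(tf(t in d) * idf(t)^2) over matching terms."""
    counts = Counter(doc_terms)
    num_docs = len(corpus)
    matched = [t for t in query if counts[t] > 0]
    if not matched:
        return 0.0
    total = 0.0
    for t in matched:
        doc_freq = sum(1 for terms in corpus.values() if t in terms)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))
        total += math.sqrt(counts[t]) * idf ** 2
    return (len(matched) / len(query)) * total

def rank(query, corpus):
    """Return document ids sorted from most to least relevant."""
    scores = {doc_id: simplified_score(query, terms, corpus)
              for doc_id, terms in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

For the query `["a", "b", "c"]`, `d1` matches all three terms and contains "a" twice, so it ranks first; `d4` matches nothing and scores 0.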
Conclusion
- The TF-IDF algorithm is based on terms, and a term is the smallest unit the tokenizer produces. This makes the word segmentation algorithm extremely important for statistics-based ranking. If Chinese text is segmented naively (for example, character by character), all semantic relevance is lost, and search becomes nothing more than an efficient full-text matching method
- By rule 1 ("the more often a word or phrase appears in an article, the more relevant it is"), be sure to remove stop words. These words occur so frequently that their TF values are very large and would seriously distort the results
- TF and IDF are computed when the index is built: TF is stored together with each doc ID (as part of the term's docIDs list), and IDF = total number of documents / length of the docIDs list owned by the current term
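The last two points can be sketched as a tiny inverted index. This is only an illustration under stated assumptions: whitespace tokenization, a made-up stop-word list, and the plain ratio given above for IDF (without the logarithm used in the scoring formula). The names `STOP_WORDS`, `build_index` and `idf` are hypothetical and are not Lucene APIs.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative list

corpus = {
    1: "the index is the core of lucene",
    2: "lucene is a search library and lucene is fast",
    3: "the art of cooking",
}

def build_index(documents):
    """Map each term to a posting list of (doc_id, term_frequency) pairs."""
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        # Drop stop words before counting so they never enter the index.
        terms = [t for t in text.lower().split() if t not in STOP_WORDS]
        for term, freq in Counter(terms).items():
            postings[term].append((doc_id, freq))
    return postings

def idf(term, postings, num_docs):
    """Ratio from the text: total documents / posting-list length."""
    return num_docs / len(postings[term]) if term in postings else 0.0

postings = build_index(corpus)
```

Here `postings["lucene"]` holds one entry per document containing the term, with the TF stored next to the doc ID, and the length of that list is all that is needed to derive the IDF.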
Original article: http://lutaf.com/210.htm. Reprints are welcome; please include the original link.