Lucene TF-IDF Relevance Scoring Formula (Reprint)

Source: Internet
Author: User
Tags: idf

By default, Lucene uses the TF-IDF algorithm to compute the relevance between a query keyword and each document, and sorts the results by that score

TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical weighting method, commonly used together with the vector space model. The name sounds complex, but it rests on only two simple rules

    1. The more often a word or phrase appears in an article, the more relevant it is
    2. The fewer documents in the entire collection contain a word, the more important the word is

So the TF-IDF relevance of a term equals TF * IDF
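
The two rules can be sketched in a few lines of Python. This is only an illustration of plain TF * IDF, not Lucene's implementation: the toy corpus, the whitespace tokenization, and the helper names `tf`, `idf`, and `tf_idf` are all assumptions made for the example.

```python
import math

# Toy corpus, made up for illustration; each document is a list of tokens.
docs = [
    "lucene is a search library".split(),
    "lucene scores documents with tf idf".split(),
    "a cat sat on a mat".split(),
]

def tf(term, doc):
    """Rule 1: how many times the term occurs in one document."""
    return doc.count(term)

def idf(term, docs):
    """Rule 2: terms contained in fewer documents get a larger weight."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """Relevance of one term to one document: TF * IDF."""
    return tf(term, doc) * idf(term, docs)

rare = tf_idf("cat", docs[2], docs)       # "cat" occurs in 1 of 3 documents
common = tf_idf("lucene", docs[0], docs)  # "lucene" occurs in 2 of 3 documents
```

Here `rare` comes out larger than `common`, because "cat" is contained in fewer documents than "lucene" while both occur once in their document.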

These two rules are very simple, and they are the core of TF-IDF. The second rule is actually flawed: it assumes that the lower a word's document frequency, the more important the word, and the higher its document frequency, the less useful the word, which is obviously not entirely correct. It cannot effectively reflect the importance of a word or how the word is distributed. When searching web documents, for example, text in different parts of the HTML structure reflects the content of the page to different degrees and should carry different weights

The advantage of TF-IDF is that the algorithm is simple and fast to compute

To improve programmability, Lucene extends the rules above by adding a number of programming interfaces for weighting and normalizing different queries, but the core formula is still TF * IDF

The Lucene scoring formula is as follows

score(q,d) = coord(q,d) · queryNorm(q) · ∑ (t in q) ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

    • tf(t in d) = frequency^½, the square root of the number of times term t occurs in document d
    • idf(t) = 1 + log(total number of documents / (number of documents containing t + 1))
    • coord(q,d) is a scoring factor: the more query terms a document contains, the better the document matches. For example, for the query "A B C", a document containing all three terms A, B, and C scores higher than a document containing only A and B (in Lucene's classic similarity the factor is matched terms / query terms, so 3/3 versus 2/3). coord can be turned off in the query
    • queryNorm(q) normalizes the query so that scores from different queries can be compared
    • t.getBoost() and norm(t,d) are programmable hooks that can adjust the weights of fields, documents, and query terms
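
The individual factors listed above can be written out as small functions. This is a sketch of the formulas as stated in this article, not Lucene source code; the Python function names and parameter names (`freq`, `doc_freq`, `num_docs`, `overlap`, `max_overlap`) are assumptions chosen for the example.

```python
import math

def tf(freq):
    """tf(t in d): square root of the term's frequency in the document."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """idf(t) = 1 + log(total documents / (documents containing t + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    """Fraction of the query's terms that the document contains."""
    return overlap / max_overlap

# A term occurring 4 times in a document, found in 9 of 100 documents,
# for a document matching 2 of the 3 query terms.
example_tf = tf(4)
example_idf = idf(9, 100)
example_coord = coord(2, 3)
```

The square root dampens rule 1, so a term that occurs four times counts only twice as much as a term that occurs once.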

These programming hooks look cumbersome and are optional, so the Lucene formula can be simplified to

score(q,d) = coord(q,d) · ∑ (t in q) ( tf(t in d) · idf(t)² )
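
As a rough illustration, the simplified formula can be evaluated over a tiny in-memory corpus. The function `simple_score`, the sample documents, and the way document frequencies are counted are assumptions for this sketch; it leaves out queryNorm, boosts, and norms exactly as the simplified formula does.

```python
import math
from collections import Counter

def simple_score(query_terms, doc_tokens, doc_freq, num_docs):
    """coord(q,d) * sum over matched query terms of tf(t in d) * idf(t)^2."""
    counts = Counter(doc_tokens)
    matched = [t for t in query_terms if counts[t] > 0]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)
    total = 0.0
    for t in matched:
        tf = math.sqrt(counts[t])
        idf = 1.0 + math.log(num_docs / (doc_freq.get(t, 0) + 1))
        total += tf * idf ** 2
    return coord * total

# Hypothetical corpus: d1 contains all of "a b c", d2 only "a" and "b".
docs = {
    "d1": "a b c".split(),
    "d2": "a b x".split(),
    "d3": "x y z".split(),
}
# Document frequency: in how many documents each term appears.
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
scores = {
    name: simple_score(["a", "b", "c"], tokens, doc_freq, len(docs))
    for name, tokens in docs.items()
}
```

With this data d1 ranks above d2, and d3 scores zero because it contains none of the query terms.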

Conclusion
    1. The TF-IDF algorithm operates on terms, and a term is the smallest unit produced by word segmentation. Word segmentation is therefore crucial to statistics-based ranking. If Chinese text is segmented without regard to real word boundaries, all semantic relevance is lost, and search becomes nothing more than an efficient full-text matching method
    2. By rule 1 (the more often a word or phrase appears in an article, the more relevant it is), be sure to remove stop words: they occur so frequently that their TF values are very large and seriously distort the results
    3. TF and IDF are calculated when the index is built: TF is stored alongside each docID in the term's docID list, and IDF = total number of documents / length of the docID list owned by the current term
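
Points 2 and 3 can be sketched with a toy inverted index. This is a simplified illustration, not how Lucene lays out its index: the `STOP_WORDS` set, the `build_index` and `idf` helpers, and the sample documents are all hypothetical, and IDF here reuses the 1 + log form given earlier rather than the plain ratio.

```python
import math
from collections import defaultdict

# Hypothetical stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"the", "a", "is", "of"}

def build_index(docs):
    """Map each term to {doc_id: term frequency}, skipping stop words."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    return postings

def idf(term, postings, num_docs):
    """Derive IDF from the length of the term's docID list."""
    doc_freq = len(postings.get(term, {}))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

docs = {1: "the cat sat", 2: "the cat is a cat", 3: "the dog"}
index = build_index(docs)
cat_postings = index["cat"]  # TF is stored next to each docID
```

Because TF sits in the postings and the document frequency is just the length of the docID list, both values are available at query time without rescanning the documents.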

Original article: http://lutaf.com/210.htm. Reprints are welcome; please include the original link
