Lucene TF-IDF relevance score formula

Source: Internet
Author: User
Tags: idf

Lucene TF-IDF Relevance Formula

For keyword queries, Lucene by default uses the TF-IDF algorithm to calculate how relevant each document is to the query keywords, and sorts the results by that score.

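TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical weighting method commonly used with the Vector Space Model. The name sounds complicated, but it contains only two simple rules:

  • The more often a term appears in a document, the more relevant that document is to the term (TF).
  • The fewer documents in the collection contain a term, the more important that term is (IDF).

So the TF-IDF relevance of a term is equal to TF * IDF.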
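As a minimal sketch of the TF * IDF product, assume a toy corpus of whitespace-tokenized strings and the textbook definitions (raw count for TF, log(N / df) for IDF). The function names `tf`, `idf`, and `tf_idf` are illustrative only and are not Lucene APIs.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: how many times `term` occurs in one document."""
    return doc_tokens.count(term)

def idf(term, corpus):
    """Inverse document frequency: log(N / df), or 0.0 if no document has the term."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    """Relevance of one term to one document: TF * IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (an assumption for illustration): each document is a list of tokens.
corpus = [
    "lucene scores documents with tf idf".split(),
    "lucene is a search library".split(),
    "tf idf is a statistical method".split(),
]

# "scores" occurs in 1 of 3 documents, "lucene" in 2 of 3,
# so "scores" gets the larger weight in the first document.
rare = tf_idf("scores", corpus[0], corpus)
common = tf_idf("lucene", corpus[0], corpus)
```

Both terms occur once in the first document, so the difference between `rare` and `common` comes entirely from the IDF factor.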

These two rules are the core of TF-IDF. The second rule is actually flawed: it simply assumes that words with a lower document frequency are more important and that words with a higher document frequency are useless. This is not entirely correct. It does not effectively reflect the importance of words or the distribution of feature words. For example, when searching web documents, feature words in different HTML elements reflect the article's content to different degrees, so they should carry different weights.
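One way to picture the point about HTML structure is to give each part of a page its own weight. The sketch below is only an illustration: the field names and the values in `FIELD_WEIGHTS` are assumptions chosen for the example and do not come from Lucene.

```python
# Hypothetical weights: a hit in the title counts more than a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}

def weighted_term_frequency(term, fields):
    """Sum the per-field counts of `term`, scaled by each field's weight.

    `fields` maps a field name to its list of tokens.
    Fields without an entry in FIELD_WEIGHTS default to weight 1.0.
    """
    return sum(
        FIELD_WEIGHTS.get(name, 1.0) * tokens.count(term)
        for name, tokens in fields.items()
    )

page = {
    "title": ["lucene", "scoring"],
    "body": ["this", "page", "explains", "lucene", "scoring", "with", "lucene"],
}

# One title hit (weight 3.0) plus two body hits (weight 1.0 each).
score = weighted_term_frequency("lucene", page)
```

A weighted count like this could stand in for the plain term frequency, so that a word in the title contributes more than the same word in the body text.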

The advantage of TF-IDF is that the algorithm is simple and fast to compute.

Lucene extends the preceding rules to make scoring more programmable. It adds several programming interfaces and normalizes weights across different queries. However, the core formula is still TF * IDF.

The Lucene algorithm formula is as follows:

score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

  • tf(t in d) = √frequency, the square root of the number of times term t occurs in document d
  • idf(t) = 1 + log( total number of documents / (number of documents containing t + 1) )
  • coord(q,d) is a score factor: the more query terms a document contains, the higher it scores. For example, for the query "A B C", a document that contains all three terms A, B, and C matches 3 terms, while a document that contains only A and B matches 2 terms. In Lucene's default similarity the factor is the fraction of query terms matched (3/3 and 2/3 here). coord can be disabled.
  • queryNorm(q) is a normalization factor for the query, so that scores from different queries can be compared.
  • t.getBoost() and norm(t,d) are both programmable hooks that let you adjust the weights of fields, documents, and query terms.
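The factors above can be sketched in a few lines of Python. This is a rough illustration, not Lucene's code: it assumes the formulas of Lucene's classic TF-IDF similarity as commonly documented (`queryNorm` as 1 / √(sum of squared term weights), `norm` reduced to a length normalization of 1 / √(number of terms) with no index-time boosts), and every function name here is hypothetical.

```python
import math

def tf(freq):
    return math.sqrt(freq)                      # tf(t in d) = sqrt(frequency)

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    return overlap / max_overlap                # fraction of query terms matched

def query_norm(sum_of_squared_weights):
    return 1.0 / math.sqrt(sum_of_squared_weights)

def length_norm(num_terms):
    # Assumption: norm(t,d) reduced to length normalization only.
    return 1.0 / math.sqrt(num_terms)

def score(query_terms, doc_tokens, corpus, boosts=None):
    """score(q,d) for tokenized documents; `boosts` maps term -> t.getBoost()."""
    if not query_terms:
        return 0.0
    boosts = boosts or {}
    num_docs = len(corpus)
    idfs = {t: idf(sum(1 for d in corpus if t in d), num_docs) for t in query_terms}
    sum_sq = sum((idfs[t] * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    matched = [t for t in query_terms if t in doc_tokens]
    total = sum(
        tf(doc_tokens.count(t)) * idfs[t] ** 2
        * boosts.get(t, 1.0) * length_norm(len(doc_tokens))
        for t in matched
    )
    return coord(len(matched), len(query_terms)) * query_norm(sum_sq) * total

corpus = [["a", "b", "c"], ["a", "b", "x"], ["x", "y", "z"]]
query = ["a", "b", "c"]
scores = [score(query, doc, corpus) for doc in corpus]
```

With the query "A B C" from the list above, the first document matches all three terms, the second matches two, and the third matches none, so the scores come out in that order.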

The various programming hooks are cumbersome and are often left unused, so we can simplify Lucene's score formula to:

score(q,d) = coord(q,d) · ∑ ( tf(t in d) · idf(t)² )
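To see the simplified formula rank documents, here is a small sketch under stated assumptions: documents are lists of tokens, tf is the square root of the term count, idf is 1 + log(N / (df + 1)), and coord is the fraction of query terms matched. The name `simple_score` and the toy corpus are made up for the example.

```python
import math

def simple_score(query_terms, doc_tokens, corpus):
    """coord(q,d) * sum over matched terms of tf(t in d) * idf(t)^2."""
    if not query_terms:
        return 0.0
    matched = [t for t in query_terms if t in doc_tokens]
    total = 0.0
    for t in matched:
        doc_freq = sum(1 for d in corpus if t in d)
        idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))
        total += math.sqrt(doc_tokens.count(t)) * idf ** 2
    return len(matched) / len(query_terms) * total

corpus = [
    "lucene ranks documents by tf idf".split(),
    "lucene is a java search library".split(),
    "cooking with fresh herbs".split(),
]
query = ["lucene", "idf"]

# Document indexes sorted from most to least relevant.
ranked = sorted(
    range(len(corpus)),
    key=lambda i: simple_score(query, corpus[i], corpus),
    reverse=True,
)
```

The first document contains both query terms, the second only "lucene", and the third neither, so `ranked` lists them in that order.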


Original article by lutaver. Reprints are welcome; please attach a link to the original article.
