Lucene TF-IDF Correlation Score Formula

Last Update:2015-04-09 Source: Internet

Author: User

Lucene TF-IDF Correlation Formula

For keyword queries, Lucene by default (in the releases current when this article was written) uses the TF-IDF algorithm to calculate the relevance between the query keywords and each document, and sorts the results by that score.

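TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical weighting method, usually described as part of the vector space model. The name sounds complicated, but it contains only two simple rules:

1. TF: the more often a term occurs in a document, the more relevant that document is to the term.
2. IDF: the fewer documents a term occurs in, the more important the term is.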

The TF-IDF correlation of a term is therefore TF * IDF.
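
As a rough illustration of TF * IDF, the sketch below scores terms in a toy corpus. The corpus, the function names, and the textbook form idf = log(N / df) are assumptions made for this example; they are not Lucene's exact definitions.

```python
import math

# Toy corpus: each document is a list of lowercase terms (hypothetical data).
docs = [
    ["lucene", "is", "a", "search", "library"],
    ["lucene", "scores", "documents", "with", "tf", "idf"],
    ["a", "search", "engine", "ranks", "documents"],
    ["a", "library", "is", "a", "collection", "of", "books"],
]

def tf(term, doc):
    """Term frequency: raw count of the term in one document."""
    return doc.count(term)

def idf(term, docs):
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in a document: TF * IDF."""
    return tf(term, doc) * idf(term, docs)

# "a" occurs twice in the last document but appears in 3 of 4 documents.
common = tf_idf("a", docs[3], docs)
# "books" occurs once in the last document and in no other document.
rare = tf_idf("books", docs[3], docs)
print(common, rare)
```

Although "a" is more frequent inside the document, "books" gets the larger weight because it occurs in fewer documents.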

These two simple rules are the core of TF-IDF. The second rule is flawed, though: it simply assumes that terms with a low document frequency are more important and that terms with a high document frequency are useless. This is clearly not entirely correct, because it does not reflect how important a term really is or how feature terms are distributed. For example, when searching web documents, feature terms in different parts of the HTML structure reflect the content of the article to different degrees, so they should carry different weights.

The advantage of TF-IDF is that the algorithm is simple and fast.

Lucene extends these rules to make scoring more programmable. It adds some programming interfaces and normalizes weights across different queries. However, the core formula is still TF * IDF.

The Lucene algorithm formula is as follows:

score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

tf(t in d) = sqrt(frequency), the square root of the number of times t occurs in d
idf(t) = 1 + log(total number of documents / (number of documents containing t + 1))
coord(q,d) is a score factor: the more query terms a document contains, the better its match. For example, for the query "A B C", a document that contains all three terms A, B, and C gets 3 points, while a document that contains only A and B gets 2 points. coord can be disabled.
queryNorm(q) is a normalization factor for the query, so that the scores of different queries can be compared.
t.getBoost() and norm(t,d) are both programmable hooks that let you adjust the weights of fields, documents, and query terms.
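
The factors above can be sketched in plain Python. This is not Lucene code: the function names and parameters are hypothetical, query_norm and norm are passed in as plain numbers, and coord is written as matched terms divided by total query terms, which ranks documents in the same order as the 3-point and 2-point example.

```python
import math

def tf(freq):
    # tf(t in d): square root of the term's occurrence count in the document
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # idf(t) = 1 + log(total documents / (documents containing t + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    # Assumed form: fraction of the query terms that occur in the document
    return overlap / max_overlap

def score(query_terms, doc_terms, doc_freqs, num_docs,
          query_norm=1.0, boosts=None, norm=1.0):
    """score = coord * queryNorm * sum(tf * idf^2 * boost * norm)."""
    boosts = boosts or {}          # per-term boost, defaults to 1.0
    matched = [t for t in query_terms if t in doc_terms]
    total = 0.0
    for t in matched:
        freq = doc_terms.count(t)
        total += (tf(freq) * idf(num_docs, doc_freqs[t]) ** 2
                  * boosts.get(t, 1.0) * norm)
    return coord(len(matched), len(query_terms)) * query_norm * total

# Hypothetical statistics for a 10-document index.
doc_freqs = {"a": 4, "b": 4, "c": 1}
query = ["a", "b", "c"]
s_full = score(query, ["a", "b", "c"], doc_freqs, num_docs=10)
s_partial = score(query, ["a", "b", "b"], doc_freqs, num_docs=10)
print(s_full, s_partial)
```

The document that contains all three query terms scores higher, both because coord is larger and because the rare term "c" contributes a large idf.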

The various programmable hooks are difficult to use, so we can leave them out and simplify Lucene's score formula:

score(q,d) = coord(q,d) · ∑ ( tf(t in d) · idf(t)² )
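
A minimal sketch of the simplified formula, ranking a toy corpus in plain Python. The corpus, the query, and the function name are made up for illustration, and coord is assumed to be the fraction of query terms found in the document.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
corpus = {
    "d1": "lucene scores documents with tf idf".split(),
    "d2": "lucene is a search library".split(),
    "d3": "a search engine ranks documents".split(),
}

def simplified_score(query, doc, corpus):
    """coord(q,d) * sum over matched terms of tf(t in d) * idf(t)^2."""
    counts = Counter(doc)
    n = len(corpus)
    matched = [t for t in query if counts[t] > 0]
    total = 0.0
    for t in matched:
        df = sum(1 for d in corpus.values() if t in d)  # document frequency
        idf = 1.0 + math.log(n / (df + 1))
        total += math.sqrt(counts[t]) * idf ** 2
    return (len(matched) / len(query)) * total

query = ["lucene", "idf"]
ranking = sorted(
    corpus,
    key=lambda name: simplified_score(query, corpus[name], corpus),
    reverse=True,
)
print(ranking)
```

Here d1 ranks first because it contains both query terms, d2 follows with one match, and d3 scores zero because it contains neither term.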

Conclusion

Original article: http://lutaf.com/210.htm (lutaver). Reprints are welcome; please include a link to the original article.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
