Verifying whether Lucene's relevance ranking depends on how close together the query's keywords appear in the content

Source: Internet
Author: User
Tags: idf

Yesterday I introduced Lucene's relevance scoring formula to colleagues at my company, and everyone raised the same question: it generally feels as though Lucene ranks docs in which the query keywords sit close together nearer the top, yet the scoring formula does not mention this factor. So I decided to test whether the proximity of the query keywords affects the score.

Local test code

Program for adding docs

1. Configure Lucene to store all information about the field, including term positions, payloads, and so on.

FieldType ty = new FieldType();
ty.setIndexed(true);
ty.setStored(true);
ty.setTokenized(true);
ty.setStoreTermVectors(true);
ty.setStoreTermVectorOffsets(true);
ty.setStoreTermVectorPositions(true);
ty.setStoreTermVectorPayloads(true);
IndexOptions value = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
ty.setIndexOptions(value);

2. Use an analyzer that tokenizes the text into single characters; each document has only one field.

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);

Document doc = new Document();
Field f = null;
f = new Field("content", valueString, ty);
doc.add(f);

Program for querying docs

The query text is also tokenized into single characters.

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);

QueryParser parser = new QueryParser(Version.LUCENE_48, field, analyzer);

For ordering, the search uses new Sort(), which sorts by relevance.

TopFieldDocs results = searcher.search(query, 1, new Sort());
ScoreDoc[] hits = results.scoreDocs;

==================================================

I will gradually insert several docs that contain the two single-character terms making up the word "China", and then query for "China" (which the analyzer splits into those two terms).

First, verify whether the absolute position of the query terms in the text affects the score.

I inserted the following three docs. Apart from the position of "China", they are identical in content length, TF, IDF, and the adjacency of the query terms, so that no other factor affects the ranking.

Mobile Online Business Office China
Mobile Online China Business Office
China Mobile Online Business Office

The query results are as follows: all three docs have the same score and are listed in the order in which they were inserted.

This shows that the absolute position of the query terms in the text does not affect the score, and that docs with equal scores are sorted in insertion order.

Doc Hit:3
Content: Mobile Online Business Office China | score:0.314803
Content: Mobile Online China Business Office | score:0.314803
Content: China Mobile Online Business Office | score:0.314803

==================================================

Second, verify whether the proximity of the query terms in the text affects the score.

I inserted 3 more docs, in which the query terms are at different distances from each other.

China Mobile Online Business

China Mobile network in the business Hall

Mobile network in the country office

The query results are as follows: the three new docs and the three old ones all have the same score.

This shows that how close together the query terms are in the text does not affect the score.

Doc Hit:6
Content: Mobile Online Business Office China | score:0.37381613
Content: Mobile Online China Business Office | score:0.37381613
Content: China Mobile Online Business Office | score:0.37381613
Content: China Mobile Online Business | score:0.37381613
Content: China Mobile network in the business Hall | score:0.37381613
Content: Mobile network in the country office | score:0.37381613
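One way to see why this result is plausible: a score built only from term counts and doc length never looks at token order. The following is a toy sketch in plain Java, not Lucene's actual scoring code; the class name, the method, and the tokens "a", "b", "x", "y", "z" are all placeholders made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class BagOfWordsDemo {
    // Toy score: sum of sqrt(term frequency) over the query terms,
    // divided by sqrt(doc length). Only counts are used, so the
    // order and distance of the tokens cannot change the result.
    public static double score(List<String> docTokens, List<String> queryTerms) {
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = 0;
            for (String token : docTokens) {
                if (token.equals(term)) {
                    freq++;
                }
            }
            sum += Math.sqrt(freq);
        }
        return sum / Math.sqrt(docTokens.size());
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("a", "b");
        // Same tokens, different arrangement of the two query terms.
        List<String> adjacent = Arrays.asList("a", "b", "x", "y", "z");
        List<String> apart = Arrays.asList("a", "x", "y", "z", "b");
        System.out.println(score(adjacent, query));
        System.out.println(score(apart, query));
    }
}
```

Both calls print the same value, which mirrors the identical scores in the listing above.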

PS: The docs again share a single score, but it is somewhat higher than last time. This is the effect of IDF, whose formula is as follows:

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

numDocs is the total number of docs, and docFreq is the number of docs that contain the query term. Because every doc in this test contains the term, the two variables are equal. As more docs are added, the value inside the log, numDocs / (numDocs + 1), grows and approaches 1, so the IDF increases (approaching 1 from below) and the score increases with it.
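The effect can be checked with a few lines of plain Java. This is a minimal sketch that only evaluates the formula above with the natural logarithm; the class and method names are made up for illustration, and it does not call Lucene.

```java
public class IdfDemo {
    // idf(t) = 1 + ln(numDocs / (docFreq + 1)), as in the formula above.
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // In this test every doc contains the query terms, so docFreq == numDocs.
        long[] indexSizes = {3, 6, 9};
        for (long n : indexSizes) {
            System.out.println("numDocs=" + n + " idf=" + idf(n, n));
        }
    }
}
```

For 3, 6, and 9 docs this gives roughly 0.712, 0.846, and 0.895. The ratio between the first two values, about 1.187, matches the ratio of the reported scores (0.37381613 / 0.314803), which is consistent with IDF being the only factor that changed between the two runs.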

==================================================

Third, verify whether the TF of the query terms and the doc length affect the score.

I inserted 3 more docs: one with a high TF for the query terms, one with a short length, and one with both a short length and a high TF. Everything else is the same.

China Mobile network in the business Hall

China Mobile Business

China Mobile Chinese Industry Office

The query results are as follows: the doc that is both short and has a high TF scores highest. The doc with only a high TF and the doc with only a short length come next, with scores that differ only slightly. The 6 docs inserted earlier come last.

Doc Hit:9
Content: China Mobile Chinese Industry Office | score:0.5727169
Content: China Mobile network in the business Hall | score:0.47726405
Content: China Mobile Business | score:0.47445422
Content: Mobile Online Business Office China | score:0.3953785
Content: Mobile Online China Business Office | score:0.3953785
Content: China Mobile Online Business Office | score:0.3953785
Content: China Mobile Online Business | score:0.3953785
Content: China Mobile network in the business Hall | score:0.3953785
Content: Mobile network in the country office | score:0.3953785
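The interplay of TF and length can be sketched with the two factors as Lucene 4.x's DefaultSimilarity defines them: tf is the square root of the term frequency, and the length norm is 1 / sqrt(number of terms in the field). The sketch below is plain Java rather than a call into Lucene, it leaves out IDF, query normalization, and the one-byte rounding Lucene applies to stored norms, and the frequencies (1 and 2) and lengths (9 and 7) are illustrative assumptions, not values taken from the test.

```java
public class TfLengthDemo {
    // tf factor: square root of the term's frequency in the doc.
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Length normalization: shorter fields get a larger factor.
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Combined per-term factor; idf and query normalization are left out.
    public static double factor(int freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Hypothetical docs: baseline, high TF, short, short with high TF.
        System.out.println("freq=1 len=9: " + factor(1, 9));
        System.out.println("freq=2 len=9: " + factor(2, 9));
        System.out.println("freq=1 len=7: " + factor(1, 7));
        System.out.println("freq=2 len=7: " + factor(2, 7));
    }
}
```

With these assumed values the short doc with high TF gets the largest factor and the baseline the smallest, while the high-TF-only and short-only docs land in between. Neither factor dominates: which of the two middle docs wins depends on the actual frequencies and lengths.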

Conclusion

The position and proximity of the query terms do not affect the score.

Both content length and TF affect the score, but neither factor takes absolute priority over the other.

The score of a given doc for a given query is not a fixed value, because the IDF of each term depends on the data in the whole index.

