Verifying whether Lucene's relevance ranking depends on how close together the query's keywords appear in the content

Source: Internet
Author: User
Tags: idf

Yesterday I introduced Lucene's relevance scoring formula to my colleagues, and everyone raised the same question. The general impression is that, when sorting by relevance, Lucene ranks documents in which the query keywords sit close together ahead of the others, yet the scoring formula does not mention this factor. So I decided to check whether the proximity of the query terms affects the score.

Local code

Program for adding documents

1. Configure Lucene to save all information about the field, including term positions, payloads, etc.

    // In Lucene 4.8, IndexOptions is org.apache.lucene.index.FieldInfo.IndexOptions
    FieldType ty = new FieldType();
    ty.setIndexed(true);
    ty.setStored(true);
    ty.setTokenized(true);
    ty.setStoreTermVectors(true);
    ty.setStoreTermVectorOffsets(true);
    ty.setStoreTermVectorPositions(true);
    ty.setStoreTermVectorPayloads(true);
    IndexOptions value = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
    ty.setIndexOptions(value);

2. For the tokenizer, choose single-character tokenization; each document has only one field.

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
    Document doc = new Document();
    Field f = null;
    f = new Field("content", valueString, ty);
    doc.add(f);

Program for querying documents

The query text is also tokenized into single characters.

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
    QueryParser parser = new QueryParser(Version.LUCENE_48, field, analyzer);

For the query sort, new Sort() means sorting by relevance.

    TopFieldDocs results = searcher.search(query, 1, new Sort());
    ScoreDoc[] hits = results.scoreDocs;

==================================================

I will insert a few documents that contain the word "China" and then query for "China" (tokenized).

To verify whether the absolute position of the query terms in the text affects the score

I inserted the following three documents. Apart from the position of "China", they are identical in content length, TF, IDF, and the adjacency of the query terms, so that no other factor affects the ranking.

    Mobile Online Business Hall China
    Mobile Online China Business Office
    China Mobile Online Business Hall

The query results are as follows. All three scores are the same, and the documents are listed in insertion order. This shows that the absolute position of the query terms in the text does not affect the score, and that documents with equal scores are sorted in insertion order.

Doc Hit:3
Content: Mobile Online business Office China | score:0.314803
Content: Mobile China Business | score:0.314803
Content: China Mobile Online Business | score:0.314803

==================================================

Second, to verify whether the adjacency of the query terms in the text affects the score

I inserted three more documents in which the query terms are adjacent to different degrees: in one they are next to each other, and in the other two they are separated by other text. They appear as the last three entries in the results below.

The query results are as follows. The three new documents and the three old ones all still receive the same score. This shows that the score is not affected by how closely the query terms are adjacent in the text.

Doc Hit:6
Content: Mobile Online business Office China | score:0.37381613
Content: Mobile China Business | score:0.37381613
Content: China Mobile Online Business | score:0.37381613
Content: China Mobile Online Business | score:0.37381613
Content: China Mobile network in the Business Hall | score:0.37381613
Content: Mobile network in the country | score:0.37381613

PS: The documents again all share one score, but it is somewhat higher than in the previous run. This is the effect of the IDF, whose formula is as follows:
$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$
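To make the IDF effect concrete, here is a minimal sketch that evaluates the formula for the two index sizes used so far (3 and 6 documents). It assumes the logarithm is the natural logarithm and that every document contains the term, so docFreq equals numDocs. The class name `IdfSketch` and its method are made up for illustration and are not part of Lucene.

```java
public class IdfSketch {
    // idf(t) = 1 + log(numDocs / (docFreq + 1)), natural logarithm assumed
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Every test document contains the term, so docFreq == numDocs
        for (long n : new long[] {3, 6}) {
            System.out.printf("numDocs=%d idf=%.6f%n", n, idf(n, n));
        }
        // Compare the growth of idf with the growth of the reported scores
        double idfRatio = idf(6, 6) / idf(3, 3);
        double scoreRatio = 0.37381613 / 0.314803;
        System.out.printf("idf ratio=%.4f score ratio=%.4f%n", idfRatio, scoreRatio);
    }
}
```

Under these assumptions the two ratios come out almost equal, which suggests that the change in score between the two runs is explained by the IDF alone.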


numDocs is the total number of documents and docFreq is the number of documents that contain the query term. Because every document in this test contains the term, the two values are equal, so each added document brings the value inside the logarithm closer to 1. The IDF therefore grows, and the score rises with it.

==================================================

To verify whether the TF of the query terms and the document length affect the score

I inserted three more documents (the hit count rises from 6 to 9): one in which the query terms have a high TF, one with a short length, and one that is both short and has a high TF. Everything else is the same. They appear as the first three entries in the results below.

The query results are as follows. The document that is both short and has a high TF scores highest. The document with a high TF and the document with a short length come next, and their scores differ only slightly. The six documents inserted earlier are ranked last.

Doc Hit:9
Content: China Mobile Industry Hall | score:0.5727169
Content: China Mobile network in the Business Hall | score:0.47726405
Content: China Mobile Office | score:0.47445422
Content: Mobile Online business Office China | score:0.3953785
Content: Mobile China Business | score:0.3953785
Content: China Mobile Online Business | score:0.3953785
Content: China Mobile Online Business | score:0.3953785
Content: China Mobile network in the Business Hall | score:0.3953785
Content: Mobile network in the country | score:0.3953785

Conclusion

The position of the query terms and how closely they are adjacent do not affect the score.

Both the content length and the TF value affect the score, but neither of these two factors takes absolute priority over the other.

The score of the same document under the same query is not a fixed value, because the IDF of each term depends on the data of the whole index.
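The interplay of TF and length can be sketched numerically. The example below assumes the formulas of Lucene's default TF-IDF similarity, tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms), and a query with two terms. The document lengths (9 and 7 terms) and term frequencies are hypothetical values chosen for illustration, not taken from the test data. IDF and query normalization are the same for every document at query time, so they cancel when documents are compared. Lucene also stores the length norm with reduced precision, so real scores will not match these ratios exactly.

```java
public class TfLengthSketch {
    // Assumed default term-frequency factor: sqrt(freq)
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed default length normalization: 1 / sqrt(number of terms in the field)
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Per-document factor for a two-term query; IDF and queryNorm are left out
    // because they are identical for all documents in one search
    static double docFactor(int freq1, int freq2, int numTerms) {
        return (tf(freq1) + tf(freq2)) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        double baseline = docFactor(1, 1, 9); // hypothetical: 9 terms, each query term once
        double highTf = docFactor(2, 1, 9);   // hypothetical: one query term occurs twice
        double shorter = docFactor(1, 1, 7);  // hypothetical: 7 terms
        double both = docFactor(2, 1, 7);     // hypothetical: short and high TF

        System.out.printf("high TF / baseline = %.4f%n", highTf / baseline);
        System.out.printf("short   / baseline = %.4f%n", shorter / baseline);
        System.out.printf("both    / baseline = %.4f%n", both / baseline);
    }
}
```

In this sketch the document that is both short and has a high TF gets the largest factor, and the other two land close to each other above the baseline, which is the same ordering as in the results. The ratio for the high-TF case is also very close to 0.47726405 / 0.3953785 from the listing, which would be consistent with one query term occurring twice in that document, though that is an inference rather than something stated in the test.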
