Four ways to influence Lucene document scoring


This article is reposted from: http://forfuture1978.iteye.com/blog/591804

Document boost and field boost are set at index time and stored in the norms (.nrm) file.

If you want certain documents or certain fields to be more important than others, so that a document containing the query term in such a field scores higher, you can set the document boost and field boost values at index time.

These values are written to the index during indexing, stored in the normalization factor (.nrm) file, and cannot be changed unless the document is deleted.

If not set, document boost and field boost default to 1.

Document boost and field boost are set as follows:

Document doc = new Document();
Field f = new Field("contents", "hello world", Field.Store.NO, Field.Index.ANALYZED);
f.setBoost(100);
doc.add(f);
doc.setBoost(100);

How do the two affect Lucene's document scoring?

Let's first look at Lucene's scoring formula:

score(q, d) = coord(q, d) * queryNorm(q) * SUM over t in q of [ tf(t in d) * idf(t)^2 * t.getBoost() * norm(t, d) ]

Document boost and field boost influence the score through norm(t, d), which is computed as:

norm(t, d) = doc.getBoost() * lengthNorm(field) * product of f.getBoost() over all fields f in d with the same name as t

It is the product of three factors:

· Document boost: the larger the value, the more important the document.

· Field boost: the larger the value, the more important this field.

· lengthNorm(field) = 1.0 / Math.sqrt(numTerms): the more terms a field contains, i.e. the longer the document, the smaller this value; the shorter the document, the larger the value.

The third factor can be overridden in your own Similarity implementation, as discussed below.
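As a rough illustration of the norm(t, d) formula, here is a minimal pure-Java sketch. The class and method names are made up for this example; this is not Lucene's implementation, which additionally encodes the result into a single byte, losing precision:

```java
// Illustrative sketch of how norm(t, d) multiplies its three factors.
public class NormSketch {
    // lengthNorm(field) = 1 / sqrt(numTerms)
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // norm(t, d) = docBoost * fieldBoost * lengthNorm(field)
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // a 3-term field with default boosts: norm = 1/sqrt(3), about 0.577
        System.out.println(norm(1f, 1f, 3));
        // the same field with a document boost of 100 yields a 100x larger norm
        System.out.println(norm(100f, 1f, 3));
    }
}
```

The multiplication makes the three factors independent: boosting a document scales every field's norm, while boosting one field leaves the others untouched.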

Of course, when adding a field you can also specify Field.Index.ANALYZED_NO_NORMS or Field.Index.NOT_ANALYZED_NO_NORMS to skip norms entirely and save space.

According to Lucene's javadoc: "No norms means that index-time field and document boosting and field length normalization are disabled. The benefit is less memory usage as norms take one byte of RAM per indexed field for every document in the index, during searching. Note that once you index a given field with norms enabled, disabling norms will have no effect." In other words, no-norms means the index phase disables document boost, field boost, and field length normalization. The advantage is memory savings: norms would otherwise cost one byte of RAM per indexed field for every document in the index during search. However, norms must be disabled for all documents of a field: once one document has norms enabled for a field, disabling them elsewhere has no effect, and documents indexed without norms simply hold the default norm value. The reason is that, to speed up norm lookups, Lucene locates a document's norm by using the document number times the per-document norm size as an offset; the offset arithmetic would break if some documents in between had no entry. So for a given field, norms are either stored for every document or for none.
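The all-or-nothing behavior can be sketched in plain Java (this is a simulation, not Lucene's code; the default-norm byte value here is illustrative): one byte per document, indexed directly by document number, so documents without an explicit norm still occupy a slot holding the default.

```java
import java.util.Arrays;
import java.util.Map;

// Sketch of a per-field norms array: one byte per document, read by direct
// offset norms[docId]. Documents indexed without norms still get a slot,
// filled with a default value, so the offset arithmetic stays valid.
public class NormsArraySketch {
    static final byte DEFAULT_NORM = 0x7C; // illustrative encoded default

    static byte[] buildNorms(int maxDoc, Map<Integer, Byte> explicit) {
        byte[] norms = new byte[maxDoc];   // one byte per document, no gaps
        Arrays.fill(norms, DEFAULT_NORM);  // no-norms docs hold the default
        for (Map.Entry<Integer, Byte> e : explicit.entrySet()) {
            norms[e.getKey()] = e.getValue();
        }
        return norms;
    }

    public static void main(String[] args) {
        // 5 documents; only doc 2 was indexed with an explicit norm
        byte[] norms = buildNorms(5, Map.of(2, (byte) 0x40));
        System.out.println(norms[2]); // explicit value
        System.out.println(norms[4]); // default slot, still present
    }
}
```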

The following tests can verify the role of norms information:

Test one: the effect of document boost

public void testNormsDocBoost() throws Exception {
	File indexDir = new File("testNormsDocBoost");
	IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
			new StandardAnalyzer(Version.LUCENE_CURRENT), true,
			IndexWriter.MaxFieldLength.LIMITED);
	writer.setUseCompoundFile(false);
	Document doc1 = new Document();
	Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
	doc1.add(f1);
	doc1.setBoost(100);
	writer.addDocument(doc1);
	Document doc2 = new Document();
	Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
	doc2.add(f2);
	writer.addDocument(doc2);
	Document doc3 = new Document();
	Field f3 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
	doc3.add(f3);
	writer.addDocument(doc3);
	writer.close();
	IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
	IndexSearcher searcher = new IndexSearcher(reader);
	TopDocs docs = searcher.search(new TermQuery(new Term("contents", "common")), 10);
	for (ScoreDoc doc : docs.scoreDocs) {
		System.out.println("docid : " + doc.doc + " score : " + doc.score);
	}
}

If the first document's field f1 is also Field.Index.ANALYZED_NO_NORMS, so the document boost is dropped, the results rank as follows:

docid : 2 score : 1.2337708
docid : 1 score : 1.0073696
docid : 0 score : 0.71231794

If the first document's field f1 is Field.Index.ANALYZED, so the document boost takes effect, the ranking becomes:

docid : 0 score : 39.889805
docid : 2 score : 0.6168854
docid : 1 score : 0.5036848

Test two: the effect of field boost

If we consider the title field more important than contents, we can boost it:

public void testNormsFieldBoost() throws Exception {
	File indexDir = new File("testNormsFieldBoost");
	IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
			new StandardAnalyzer(Version.LUCENE_CURRENT), true,
			IndexWriter.MaxFieldLength.LIMITED);
	writer.setUseCompoundFile(false);
	Document doc1 = new Document();
	Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
	f1.setBoost(100);
	doc1.add(f1);
	writer.addDocument(doc1);
	Document doc2 = new Document();
	Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
	doc2.add(f2);
	writer.addDocument(doc2);
	writer.close();
	IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
	IndexSearcher searcher = new IndexSearcher(reader);
	QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
			new StandardAnalyzer(Version.LUCENE_CURRENT));
	Query query = parser.parse("title:common contents:common");
	TopDocs docs = searcher.search(query, 10);
	for (ScoreDoc doc : docs.scoreDocs) {
		System.out.println("docid : " + doc.doc + " score : " + doc.score);
	}
}

If the first document's field f1 is also Field.Index.ANALYZED_NO_NORMS, so the field boost is dropped, the results rank as follows:

docid : 1 score : 0.49999997
docid : 0 score : 0.35355338

If the first document's field f1 is Field.Index.ANALYZED, so the field boost takes effect, the ranking becomes:

docid : 0 score : 19.79899
docid : 1 score : 0.49999997

Test three: the effect of document length in norms on scoring

public void testNormsLength() throws Exception {
	File indexDir = new File("testNormsLength");
	IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
			new StandardAnalyzer(Version.LUCENE_CURRENT), true,
			IndexWriter.MaxFieldLength.LIMITED);
	writer.setUseCompoundFile(false);
	Document doc1 = new Document();
	Field f1 = new Field("contents", "common hello hello", Field.Store.NO,
			Field.Index.ANALYZED_NO_NORMS);
	doc1.add(f1);
	writer.addDocument(doc1);
	Document doc2 = new Document();
	Field f2 = new Field("contents", "common common hello hello hello hello",
			Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
	doc2.add(f2);
	writer.addDocument(doc2);
	writer.close();
	IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
	IndexSearcher searcher = new IndexSearcher(reader);
	QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
			new StandardAnalyzer(Version.LUCENE_CURRENT));
	Query query = parser.parse("title:common contents:common");
	TopDocs docs = searcher.search(query, 10);
	for (ScoreDoc doc : docs.scoreDocs) {
		System.out.println("docid : " + doc.doc + " score : " + doc.score);
	}
}

When norms are disabled, the second document, which contains common twice, scores higher:

docid : 1 score : 0.13928263
docid : 0 score : 0.09848769

When norms are enabled (the fields indexed with Field.Index.ANALYZED), the second document, despite containing common twice, scores lower because it is longer:

docid : 0 score : 0.09848769
docid : 1 score : 0.052230984
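The arithmetic behind the second run can be sketched as follows (illustrative only: the real scores also involve idf, queryNorm, coord, and the lossy one-byte norm encoding, so the printed values will not match the scores above):

```java
// Rough factors behind test three with norms enabled, using the defaults
// tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms).
public class LengthNormDemo {
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // doc1: "common hello hello" has 3 terms, common appears once
        System.out.println(tf(1) + " * " + lengthNorm(3));
        // doc2: "common common hello hello hello hello" has 6 terms,
        // common appears twice: the tf gain is offset by the length penalty
        System.out.println(tf(2) + " * " + lengthNorm(6));
    }
}
```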

Test four: for a given field, norms are either stored for every document or for none

public void testOmitNorms() throws Exception {
	File indexDir = new File("testOmitNorms");
	IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
			new StandardAnalyzer(Version.LUCENE_CURRENT), true,
			IndexWriter.MaxFieldLength.LIMITED);
	writer.setUseCompoundFile(false);
	Document doc1 = new Document();
	Field f1 = new Field("title", "common hello hello", Field.Store.NO,
			Field.Index.ANALYZED);
	doc1.add(f1);
	writer.addDocument(doc1);
	for (int i = 0; i < 10000; i++) {
		Document doc2 = new Document();
		Field f2 = new Field("contents",
				"common common hello hello hello hello", Field.Store.NO,
				Field.Index.ANALYZED_NO_NORMS);
		doc2.add(f2);
		writer.addDocument(doc2);
	}
	writer.close();
}
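A back-of-envelope estimate shows the cost this test illustrates (a sketch, not a measurement of the actual .nrm file, which may carry a small header): one byte per document for every field with norms enabled, regardless of which documents actually contain that field.

```java
// Estimate of the norms footprint: one byte of RAM (and .nrm space) per
// document for every field with norms enabled.
public class NormsFootprint {
    static long normsBytes(int maxDoc, int fieldsWithNorms) {
        return (long) maxDoc * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // 10001 documents; only "title" keeps norms ("contents" used
        // ANALYZED_NO_NORMS), yet the title norms array spans every doc,
        // even though only one document has a title field at all.
        System.out.println(normsBytes(10001, 1)); // about 10 KB for one field
    }
}
```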

Setting the query boost in the search statement

During search, we can specify that some words are more important to us by setting a boost on them:

common^4 hello

This makes documents containing common score higher than documents containing hello.

Because a term in Lucene is identified as field:term, the boost can also weight different fields:

title:common^4 content:common

Documents containing common in the title then score higher than documents containing common in the content.

Example:

public void testQueryBoost() throws Exception {
	File indexDir = new File("testQueryBoost");
	IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
			new StandardAnalyzer(Version.LUCENE_CURRENT), true,
			IndexWriter.MaxFieldLength.LIMITED);
	Document doc1 = new Document();
	Field f1 = new Field("contents", "common1 hello hello", Field.Store.NO, Field.Index.ANALYZED);
	doc1.add(f1);
	writer.addDocument(doc1);
	Document doc2 = new Document();
	Field f2 = new Field("contents", "common2 common2 hello", Field.Store.NO, Field.Index.ANALYZED);
	doc2.add(f2);
	writer.addDocument(doc2);
	writer.close();
	IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
	IndexSearcher searcher = new IndexSearcher(reader);
	QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
			new StandardAnalyzer(Version.LUCENE_CURRENT));
	Query query = parser.parse("common1 common2");
	TopDocs docs = searcher.search(query, 10);
	for (ScoreDoc doc : docs.scoreDocs) {
		System.out.println("docid : " + doc.doc + " score : " + doc.score);
	}
}

By tf/idf, the second document, which contains common2 twice, scores higher:

docid : 1 score : 0.24999999
docid : 0 score : 0.17677669

If we instead issue the query "common1^100 common2", the first document scores higher:

docid : 0 score : 0.2499875
docid : 1 score : 0.0035353568

How does the query boost affect the document scoring?

Look again at Lucene's scoring formula:

score(q, d) = coord(q, d) * queryNorm(q) * SUM over t in q of [ tf(t in d) * idf(t)^2 * t.getBoost() * norm(t, d) ]

The query boost enters through the t.getBoost() factor of each term.

Note: queryNorm also contains a q.getBoost() component, which belongs to the normalization of the query vector (see Vector Space Model and Lucene's Scoring Mechanism [http://forfuture1978.iteye.com/blog/588721]).
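A sketch of queryNorm for the "common1^100 common2" example may help (illustrative, following the default formula queryNorm = 1 / sqrt(sumOfSquaredWeights), where each term contributes (idf * termBoost)^2 and the whole sum is scaled by q.getBoost()^2; the exact idf values here are assumed, not taken from the test):

```java
// Sketch of the default queryNorm computation.
public class QueryNormSketch {
    static float queryNorm(float queryBoost, float[] idfs, float[] termBoosts) {
        float sum = 0f;
        for (int i = 0; i < idfs.length; i++) {
            float w = idfs[i] * termBoosts[i]; // weight of one query term
            sum += w * w;
        }
        sum *= queryBoost * queryBoost;        // q.getBoost() contribution
        return (float) (1.0 / Math.sqrt(sum));
    }

    public static void main(String[] args) {
        float idf = 1.0f; // assumed equal idf for both terms
        // "common1 common2": equal term weights
        System.out.println(queryNorm(1f, new float[]{idf, idf}, new float[]{1f, 1f}));
        // "common1^100 common2": the boosted term dominates the normalization,
        // shrinking the contribution of the unboosted term
        System.out.println(queryNorm(1f, new float[]{idf, idf}, new float[]{100f, 1f}));
    }
}
```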

Inheriting and implementing your own Similarity

Similarity is the most important class in Lucene's score calculation; by implementing many of its interfaces you can intervene in the scoring process:

(1) float computeNorm(String field, FieldInvertState state)

(2) float lengthNorm(String fieldName, int numTokens)

(3) float queryNorm(float sumOfSquaredWeights)

(4) float tf(float freq)

(5) float idf(int docFreq, int numDocs)

(6) float coord(int overlap, int maxOverlap)

(7) float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)

They affect the corresponding parts of Lucene's scoring formula above: tf, idf, and coord map directly to their factors, queryNorm to the query normalization factor, computeNorm and lengthNorm to norm(t, d), and scorePayload to payload-based term scores.

Each is explained below:

(1) float computeNorm(String field, FieldInvertState state)

This affects the computation of the normalization factor. As mentioned above, it mainly combines three parts: document boost, field boost, and document length normalization. The function is generally implemented according to the norm(t, d) formula above.

(2) float lengthNorm(String fieldName, int numTokens)

This mainly computes document length normalization; the default is 1.0 / Math.sqrt(numTerms).

Because document lengths differ within an index, a long document will obviously tend to have a much larger tf for any given term, and would therefore score higher, which is unfair to short documents. As an extreme example, in a ten-million-word masterpiece the word "Lucene" may appear many times, yet that hardly makes the work more relevant to a query for "Lucene" than a short article devoted to the subject; length normalization compensates for this.
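How overriding lengthNorm changes relative ranking can be shown with a pure-Java sketch (these classes mimic the shape of Similarity's hooks for illustration only; they are not the real Lucene API, and the "score" here is just tf times lengthNorm with all other factors dropped):

```java
// Pure-Java sketch: a "default" similarity versus one that overrides
// lengthNorm to remove the length penalty entirely.
public class SimilaritySketch {
    static class DefaultSim {
        float tf(float freq) { return (float) Math.sqrt(freq); }
        float lengthNorm(int numTerms) { return (float) (1.0 / Math.sqrt(numTerms)); }
        float score(float freq, int numTerms) { return tf(freq) * lengthNorm(numTerms); }
    }

    static class NoLengthPenaltySim extends DefaultSim {
        @Override
        float lengthNorm(int numTerms) { return 1.0f; } // ignore document length
    }

    public static void main(String[] args) {
        DefaultSim def = new DefaultSim();
        DefaultSim custom = new NoLengthPenaltySim();
        // short doc: term appears once among 3 terms
        // long doc: term appears twice among 10 terms
        System.out.println(def.score(1, 3) + " vs " + def.score(2, 10));
        System.out.println(custom.score(1, 3) + " vs " + custom.score(2, 10));
    }
}
```

With the default, the long document loses despite its higher tf; with the length penalty removed, it wins. In real Lucene you would subclass Similarity and install it on the IndexWriter and the searcher.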
