SOLR sort and document score calculation

Last Update:2017-07-12 Source: Internet

Author: User



What is a document?

Solr is a document storage and retrieval engine, and every piece of data submitted to Solr is a document. In the Solr schema file we specify the name and type of each field; by defining the schema, we map a document to a collection of fields of specific types. Each field of a document is analyzed according to its field type, and the results of the analysis are stored in the index. This is what allows relevant results to be retrieved when a query is issued.

Inverted index:

In a traditional database model, a document is mapped to its content. Solr does the opposite: its inverted index maps content (terms) to the documents that contain it.
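The term-to-document mapping can be illustrated with a minimal sketch. This is not how Solr stores its index internally; `build_inverted_index` and the sample documents are hypothetical, and whitespace splitting stands in for real field analysis.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercasing plus whitespace splitting replaces real analysis.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "solr is a search engine",
    2: "an inverted index maps terms to documents",
    3: "solr uses an inverted index",
}
index = build_inverted_index(docs)
print(index["solr"])      # [1, 3]
print(index["inverted"])  # [2, 3]
```

Looking up a term is then a single dictionary access, rather than a scan over every document.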

Wildcard query mechanism:

When a wildcard search executes, all terms in the inverted index that match the part of the query term preceding the first wildcard character are collected as candidates. Each candidate term is then checked against the full wildcard pattern in the query.

The more characters you specify before the wildcard, the faster the query: engineer* is cheap to execute, but e* is expensive. In Solr, leading wildcards such as *ing are not recommended, because they can cause serious performance problems.
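The cost difference can be seen in a small sketch of the two-step process described above: narrow the sorted term list by the literal prefix, then check each candidate against the full pattern. The function `wildcard_terms` and the term list are hypothetical, and Python's `fnmatch` stands in for the real pattern matcher (patterns are assumed to contain only `*` and `?`).

```python
import bisect
import fnmatch

def wildcard_terms(sorted_terms, pattern):
    """Return (matching terms, number of candidate terms scanned)."""
    # The literal prefix is everything before the first wildcard character.
    cut = min((i for i, ch in enumerate(pattern) if ch in "*?"), default=len(pattern))
    prefix = pattern[:cut]
    # Binary search for the first term that could start with the prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches, scanned = [], 0
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms are sorted, so the prefix range has ended
        scanned += 1
        if fnmatch.fnmatchcase(term, pattern):
            matches.append(term)
    return matches, scanned

terms = sorted(["eagle", "edit", "engine", "engineer", "engineering", "running", "testing"])
print(wildcard_terms(terms, "engineer*"))  # (['engineer', 'engineering'], 2)
print(wildcard_terms(terms, "e*")[1])      # 5
print(wildcard_terms(terms, "*ing")[1])    # 7
```

With a leading wildcard the prefix is empty, so every term in the index becomes a candidate.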

Default similarity:

Solr's relevance score is computed by a Similarity class. The default implementation and its rationale are as follows:

It is based on the cosine similarity of term vectors: the query and each document are represented as vectors of term weights, and the closer the direction of the query vector is to that of the document vector (that is, the higher their cosine similarity), the more relevant the document is considered.
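As a rough illustration of comparing a query vector with document vectors, the sketch below computes plain cosine similarity over sparse term-weight dictionaries. The weights and names are made up for the example; the actual scoring formula adds further factors on top of this idea.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical term weights for one query and two documents.
query = {"solr": 1.0, "score": 1.0}
doc_a = {"solr": 2.0, "score": 1.0, "index": 1.0}
doc_b = {"lucene": 1.0, "index": 3.0}

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a > sim_b)  # True
```

Here `doc_a` shares both query terms and scores about 0.87, while `doc_b` shares none and scores 0.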

So how are these vectors constructed so that they characterize the query and the documents well?

Term frequency (tf):

The more often a query term appears in a document, the more relevant that document is considered. However, if a term appears 10 times in a document, the relevance should not be 10 times higher, so the square root of the count is taken. This dampens the extra credit given for repeated occurrences of the query term.
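The dampening effect of the square root can be checked with a one-line sketch; the function name `tf` is chosen here for illustration.

```python
import math

def tf(freq):
    """Dampened term frequency: the square root of the raw occurrence count."""
    return math.sqrt(freq)

for freq in (1, 4, 10, 100):
    print(freq, round(tf(freq), 2))
```

Ten occurrences contribute a factor of about 3.16 rather than 10, and one hundred occurrences contribute only 10.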

Inverse document frequency (idf):

In query matching, rarer terms are generally better at distinguishing between documents than common ones, so idf penalizes terms that occur in many documents. (How well this works depends on the actual data.)
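A minimal sketch of this penalty follows. It assumes the form 1 + ln(numDocs / (docFreq + 1)), which is how Lucene's classic TF-IDF similarity has been documented; the exact expression varies between versions, so treat the formula and the sample numbers as illustrative.

```python
import math

def idf(doc_freq, num_docs):
    """Inverse document frequency; assumed form: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
rare = idf(9, num_docs)      # term found in 9 documents
common = idf(499, num_docs)  # term found in 499 documents
print(rare > common)  # True
```

The rare term receives a weight of roughly 5.6, the common one roughly 1.7.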

Term weights:

In practice we do not have to rely entirely on Solr to compute scores. Based on experience, we can adjust the weights (boosts) of individual terms so that the ranking meets our expectations.
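One way to adjust term weights is at query time, using the `term^boost` syntax of Solr's standard query parser. The helper below is a hypothetical sketch that only builds the query string; the field name `title` and the boost values are assumptions, and terms are assumed not to contain characters that need escaping.

```python
def boosted_query(field, term_boosts):
    """Build a query string such as 'title:solr^2.0 title:search^0.5'."""
    parts = []
    for term, boost in term_boosts.items():
        # field:term^boost raises or lowers the weight of that single term.
        parts.append(f"{field}:{term}^{boost}")
    return " ".join(parts)

q = boosted_query("title", {"solr": 2.0, "search": 0.5})
print(q)  # title:solr^2.0 title:search^0.5
```

A boost above 1 makes matches on that term count for more, and a boost below 1 makes them count for less.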

Normalization factors:

Solr's default relevance formula computes three normalization factors: the field norm, the query norm, and the coordination factor.

(1) Field norm:

norm(t, d) = d.getBoost() * lengthNorm(f) * f.getBoost()

where d.getBoost() is the weight (boost) of the document,

f.getBoost() is the weight (boost) of the field, and

lengthNorm(f) is the length normalization of the field, equal to 1 divided by the square root of the number of terms in the field. Its purpose is to cancel out the advantage a longer document has simply because a given term is likely to appear in it more often.
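The field norm can be sketched as the product of the three factors named above. This sketch assumes the length norm is the reciprocal of the square root of the term count, as in Lucene's classic TF-IDF similarity; the function names and sample values are hypothetical.

```python
import math

def length_norm(num_terms):
    """Assumed form: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

def field_norm(doc_boost, field_boost, num_terms):
    """Product of document boost, field boost, and length norm."""
    return doc_boost * field_boost * length_norm(num_terms)

short_field = field_norm(1.0, 1.0, 4)    # 4 terms in the field
long_field = field_norm(1.0, 1.0, 100)   # 100 terms in the field
print(short_field, long_field)  # 0.5 0.1
```

With equal boosts, a match in the 4-term field is weighted five times as heavily as a match in the 100-term field.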

(2) Query norm:

queryNorm is applied equally to all documents, so it does not affect the relative ordering of results. It serves only as a normalization factor that makes scores comparable between queries.
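The sketch below illustrates why a factor applied to every document leaves the ranking unchanged. It assumes the form 1 / sqrt(sum of squared term weights), as documented for Lucene's classic TF-IDF similarity; the raw scores and term weights are invented for the example.

```python
import math

def query_norm(term_weights):
    """Assumed form: 1 / sqrt(sum of squared query term weights)."""
    sum_sq = sum(w * w for w in term_weights)
    return 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 1.0

# Hypothetical unnormalized scores for three documents.
raw_scores = {"doc1": 4.0, "doc2": 2.5, "doc3": 0.5}
qn = query_norm([2.0, 1.5])
normalized = {doc: score * qn for doc, score in raw_scores.items()}

ranking_raw = sorted(raw_scores, key=raw_scores.get, reverse=True)
ranking_norm = sorted(normalized, key=normalized.get, reverse=True)
print(ranking_raw == ranking_norm)  # True
```

Every score is multiplied by the same positive constant, so only the scale of the scores changes, not their order.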

(3) Coordination factor:

It measures how many of the query terms each document matches. If the query has 4 terms and a document matches all 4, the coordination factor is 4/4; if it matches 3, the factor is 3/4; and so on.
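The 4/4 and 3/4 examples above can be reproduced with a short sketch. The function `coord` and the sample terms are hypothetical, and documents are represented simply as lists of terms.

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that occur in the document."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)

query_terms = ["solr", "score", "sort", "index"]
print(coord(query_terms, ["solr", "score", "sort", "index", "tuning"]))  # 1.0
print(coord(query_terms, ["solr", "score", "sort"]))                     # 0.75
```

Documents that match more of the query terms are rewarded with a factor closer to 1.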


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
