SOLR sort and document score calculation

Source: Internet
Author: User

What is a document?

SOLR is a document storage and retrieval engine, and every piece of data submitted to SOLR is a document. In the SOLR schema file we specify the name and type of each field; by defining the schema, we map a document to a collection of typed fields. Each field of the document is analyzed according to its field type, and the results of the analysis are saved in the index. This is what allows relevant results to be retrieved when you run a query.

Inverted index:

In a traditional database model, each document is mapped to its content. SOLR's inverted index does the reverse: it maps content (terms) to the documents that contain it.
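The term-to-document mapping can be illustrated with a minimal sketch. The function name build_inverted_index and the whitespace tokenizer are assumptions made for illustration only; a real SOLR index is built by the analyzer configured for each field type.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace tokenization stands in for a real field analyzer.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "solr is a search engine",
    2: "lucene powers solr",
    3: "a search library",
}
index = build_inverted_index(docs)
print(index["solr"])    # [1, 2]
print(index["search"])  # [1, 3]
```

Looking up a term is then a single dictionary access, instead of a scan over every document.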

Wildcard query mechanism:

When a wildcard search executes, SOLR first finds all terms in the inverted index that match the part of the query term preceding the first wildcard character. It then checks each of these candidate terms against the full wildcard pattern in the query.

The more characters you specify before the wildcard, the faster the query: engineer* is cheap to execute, but e* is expensive. Leading wildcards such as *ing are not recommended in SOLR, because they can cause serious performance problems.
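The cost difference can be seen in a small simulation of the two steps described above: narrow the sorted term list by literal prefix, then test each candidate against the pattern. This is only a sketch under simplifying assumptions (a sorted Python list as the term dictionary, and the hypothetical helper wildcard_candidates); Lucene's actual term enumeration is more sophisticated.

```python
import bisect
import fnmatch

def wildcard_candidates(sorted_terms, pattern):
    """Return (terms_scanned, matching_terms) for a pattern using * and ?."""
    # The literal prefix is everything before the first wildcard character.
    cut = min((i for i, ch in enumerate(pattern) if ch in "*?"), default=len(pattern))
    prefix = pattern[:cut]
    # Binary search to the first term >= prefix, then scan while the prefix holds.
    start = bisect.bisect_left(sorted_terms, prefix)
    scanned, matches = 0, []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        scanned += 1
        if fnmatch.fnmatchcase(term, pattern):
            matches.append(term)
    return scanned, matches

terms = sorted(["east", "engine", "engineer", "engineering", "entry", "sing", "solr"])
print(wildcard_candidates(terms, "engineer*"))  # scans 2 terms
print(wildcard_candidates(terms, "e*"))         # scans 5 terms
print(wildcard_candidates(terms, "*ing"))       # empty prefix: scans all 7 terms
```

With a leading wildcard the prefix is empty, so every term in the dictionary becomes a candidate; on a real index that can mean millions of terms.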

Default similarity:

SOLR's relevance score is computed by a similarity class. The default similarity implementation and its rationale are as follows:

It computes the cosine similarity between the query's term vector and the document's term vector. The closer the two vectors are, that is, the higher the cosine of the angle between them, the more similar the document is considered to be to the query.
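As a rough illustration of cosine similarity between a query vector and a document vector, here is a minimal sketch that represents each vector as a dictionary of term weights. The function name and the unit weights are assumptions for illustration; in practice the weights come from tf, idf and the normalization factors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is not similar to anything
    return dot / (norm_a * norm_b)

query = {"solr": 1.0, "search": 1.0}
doc_a = {"solr": 1.0, "search": 1.0}    # same direction as the query
doc_b = {"solr": 1.0, "lucene": 1.0}    # shares one of two terms
doc_c = {"lucene": 1.0}                 # shares no terms

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, round(cosine_similarity(query, doc), 3))
```

The document that points in the same direction as the query scores close to 1, the partial match scores about 0.5, and the document with no shared terms scores 0.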

So how are these vectors constructed, and what determines the weight of each component?

Term frequency (tf):

The more often a query term appears in a document, the more relevant the document is considered to be. However, a term that appears 10 times should not make the document 10 times as relevant, so the square root of the count is taken to dampen the extra credit given for repeated occurrences.
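The dampening effect of the square root is easy to see numerically. The function name tf below is chosen for illustration:

```python
import math

def tf(freq):
    """Term-frequency factor: the square root of the raw occurrence count."""
    return math.sqrt(freq)

for freq in (1, 4, 10, 100):
    print(freq, round(tf(freq), 3))
```

Ten occurrences yield a factor of about 3.16 rather than 10, and it takes 100 occurrences to reach a factor of 10.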

Inverse document frequency (idf):

In query matching, rarer terms generally discriminate between documents better than common ones, so idf penalizes terms that occur in many documents. (How well this works depends on the actual data.)
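A minimal sketch of this penalty, assuming the classic Lucene TF-IDF formula idf = 1 + ln(numDocs / (docFreq + 1)). The text above does not state the formula, and the exact expression varies between Lucene versions, so treat it as an assumption:

```python
import math

def idf(doc_freq, num_docs):
    """Inverse document frequency: higher for terms found in fewer documents."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
print(round(idf(9, num_docs), 3))     # rare term: appears in 9 documents
print(round(idf(499, num_docs), 3))   # common term: appears in 499 documents
print(round(idf(999, num_docs), 3))   # near-universal term
```

The rare term receives a much larger factor than the common one, and a term present in nearly every document contributes almost nothing beyond the constant 1.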

Term weights:

In a real search application we do not have to rely entirely on SOLR to calculate scores. Based on our own experience, we can adjust the weights (boosts) of individual terms so that the results match our expectations.
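In the standard Lucene/SOLR query syntax, a term boost is written as term^factor. The helper below, boosted_query, is a hypothetical name used only to show how such a query string might be assembled; the boost values themselves are made up for the example.

```python
def boosted_query(term_boosts):
    """Join terms into a query string, appending ^boost where the boost is not 1."""
    parts = []
    for term, boost in term_boosts.items():
        parts.append(term if boost == 1 else f"{term}^{boost}")
    return " ".join(parts)

# Treat "solr" as twice as important as "tutorial".
print(boosted_query({"solr": 2, "tutorial": 1}))  # solr^2 tutorial
```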

Normalization factors:

SOLR's default relevance formula uses three normalization factors: the field norm, the query norm, and the coordination factor.

(1) Field norm:

norm(t, d) = d.getBoost() * f.getBoost() * lengthNorm(f)

where d.getBoost() is the boost (weight) of the document,

f.getBoost() is the boost of the field, and

lengthNorm(f) is the length normalization factor of the field, equal to the reciprocal of the square root of the number of terms in the field. It offsets the advantage that a term gains simply by appearing more often in a longer document.
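The field norm can be sketched as the product of the three factors just listed, assuming the classic Lucene definition lengthNorm = 1 / sqrt(number of terms in the field). The Python function names are illustrative, not SOLR API calls.

```python
import math

def length_norm(num_terms):
    """Shorter fields get a larger factor: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)

def field_norm(doc_boost, field_boost, num_terms):
    """Product of document boost, field boost and length normalization."""
    return doc_boost * field_boost * length_norm(num_terms)

print(length_norm(4))            # 0.5
print(length_norm(100))          # 0.1
print(field_norm(1.0, 2.0, 4))   # 1.0
```

A match in a 4-term field is weighted five times as heavily as the same match in a 100-term field, before any boosts are applied.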

(2) Query norm:

queryNorm is the same for every document in a given query, so it does not affect the relative ordering of the results. It serves only as a normalization factor that makes scores more comparable across different queries.
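The following sketch shows why a per-query constant cannot change the ranking. It assumes the classic Lucene form queryNorm = 1 / sqrt(sum of squared term weights); the raw scores and term weights are made-up numbers.

```python
import math

def query_norm(term_weights):
    """1 / sqrt(sum of squared term weights); 1.0 for an empty or all-zero query."""
    sum_of_squares = sum(w * w for w in term_weights)
    return 1.0 / math.sqrt(sum_of_squares) if sum_of_squares > 0 else 1.0

raw_scores = {"doc1": 3.0, "doc2": 1.5, "doc3": 2.0}
norm = query_norm([3.0, 4.0])   # 1 / sqrt(9 + 16) = 0.2
normalized = {doc: score * norm for doc, score in raw_scores.items()}

rank_before = sorted(raw_scores, key=raw_scores.get, reverse=True)
rank_after = sorted(normalized, key=normalized.get, reverse=True)
print(rank_before == rank_after)  # True
```

Every score is multiplied by the same positive number, so the order of the documents is unchanged.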

(3) Coordination factor:

It measures how many of the query terms each document matches. If the query has 4 terms and a document matches all 4, the coordination factor is 4/4; if it matches 3, the factor is 3/4; and so on.
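The ratio described above can be sketched as follows, treating the query and the document as sets of terms (the function name coord and the sample terms are illustrative):

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that occur in the document."""
    query_terms = set(query_terms)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_terms)) / len(query_terms)

query = ["solr", "sort", "score", "calculation"]
print(coord(query, ["solr", "sort", "score", "calculation", "index"]))  # 1.0
print(coord(query, ["solr", "sort", "score"]))                          # 0.75
print(coord(query, ["lucene"]))                                         # 0.0
```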
