Sphinx search: field weight settings

Source: Internet
Author: User

There is a table with three fields: industry, region, and position (named industry, area, and expr in the index).
I need to query these three fields and sort the results by which fields were hit. No matter how many times a field matches, a field that is hit counts once and contributes the weight assigned to that field.
For example, an industry hit contributes weight 1, a region hit contributes weight 2, and a position hit contributes weight 2:
Total weight = (industry hit * 1) + (region hit * 2) + (position hit * 2)
The results are then sorted by total weight.
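
The scoring rule above can be sketched in a few lines of Python. This only illustrates the arithmetic and is not Sphinx code: the function name `total_weight`, the `FIELD_WEIGHTS` table, and the sample rows are assumptions invented for this example.

```python
# Hypothetical per-field weights, taken from the example in the text.
FIELD_WEIGHTS = {"industry": 1, "region": 2, "position": 2}


def total_weight(hit_fields):
    """Sum the weight of every field that was hit at least once."""
    # Converting to a set makes a field count once, however often it matched.
    return sum(FIELD_WEIGHTS[field] for field in set(hit_fields))


# Made-up rows: (row id, fields in which the query matched).
rows = [
    (1, ["industry"]),
    (2, ["region", "position"]),
    (3, ["industry", "region", "position"]),
]

# Sort by total weight, highest first.
ranked = sorted(rows, key=lambda row: total_weight(row[1]), reverse=True)
for row_id, hits in ranked:
    print(row_id, total_weight(hits))
```

A row that matches in all three fields ends up first, and a row that matches only in industry ends up last.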

Error:

The first thing I thought of was the SPH_SORT_EXPR sort mode:
cl.SetSortMode(SphinxClient.SPH_SORT_EXPR, "@industry * 1 + @area * 2 + @expr * 2");
SphinxResult res = cl.Query("(@industry building materials) | (@area Beijing) | (@expr manager)", "idx_test");
However, mathematical operations cannot be performed on full-text fields, so the following error is reported:
Error: Index idx_test: Unknown identifier '@industry' (not an attribute, not a function)
SPH_SORT_EXPR can only operate on numeric values.

Correct:

Use $cl->SetRankingMode(SPH_RANK_PROXIMITY) to set the ranking mode, so that weights are calculated from field hits:
$cl->SetMatchMode(SPH_MATCH_EXTENDED); // set the matching mode
$cl->SetRankingMode(SPH_RANK_PROXIMITY); // set the ranking (scoring) mode
$cl->SetFieldWeights(array('industry' => 1, 'area' => 2, 'expr' => 2)); // set the weight of each field; a hit in area has weight 2
$cl->SetSortMode(SPH_SORT_EXPR, '@weight'); // sort by weight

The code above solved my problem.

However, the SPH_RANK_PROXIMITY ranking mode is not available in the Java API. With the PHP API, I use version 3.2.13.

The weighting function (currently) depends on the matching mode.

Weights are calculated from the following two parts:

    1. Phrase score,
    2. Statistical score.

The phrase score is based on the length of the longest common subsequence (LCS) of search words between the document body and the query phrase. Therefore, if a document contains an exact match for the query phrase, its phrase score takes the maximum possible value, which is the number of words in the query.
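
As a rough illustration of the phrase score, the sketch below computes a textbook longest common subsequence over word lists. It is an assumption that this approximates what Sphinx does internally; the function name `lcs_length` and the sample texts are invented for the example.

```python
def lcs_length(query_words, doc_words):
    """Length of the longest common subsequence of two word lists."""
    # prev[j] holds the LCS length of the query words processed so far
    # against the first j document words.
    prev = [0] * (len(doc_words) + 1)
    for q in query_words:
        cur = [0]
        for j, d in enumerate(doc_words, 1):
            if q == d:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


query = "building materials".split()

# The document contains the phrase exactly: the score equals the query length.
print(lcs_length(query, "cheap building materials store".split()))

# Same words in the wrong order: only one word can be aligned.
print(lcs_length(query, "materials for building".split()))
```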

The statistical score is based on the classic BM25 function, which only considers word frequencies. If a word is rare in the entire database (that is, it has a low frequency across the document collection) or is mentioned often in a specific document (that is, it has a high frequency within the matching document), it receives a higher weight. The final BM25 weight is a floating-point number between 0 and 1.
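
The two effects described above can be seen in a textbook BM25 term score. The sketch below is not Sphinx's own formula: it ignores document length, it is not scaled to the 0 to 1 range, and the function name `bm25_term_score` and the parameter value `k1=1.2` are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf, df, total_docs, k1=1.2):
    """Textbook BM25 score of one term, ignoring document length.

    tf: occurrences of the term in the document
    df: number of documents that contain the term
    """
    # Rare terms (small df) get a larger inverse document frequency.
    idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Repeated occurrences (large tf) help, with diminishing returns.
    return idf * (tf * (k1 + 1.0)) / (tf + k1)


# A rare word scores higher than a common one at the same term frequency.
print(bm25_term_score(tf=1, df=10, total_docs=1000) > bm25_term_score(tf=1, df=500, total_docs=1000))

# A word mentioned more often in the document scores higher.
print(bm25_term_score(tf=3, df=10, total_docs=1000) > bm25_term_score(tf=1, df=10, total_docs=1000))
```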

In all modes, the weighted phrase score of each field is its LCS multiplied by the field weight specified by the user. Field weights are integers, default to 1, and cannot be set lower than 1.

In SPH_MATCH_BOOLEAN mode, no weighting is performed. The weight of every match is 1.

In the SPH_MATCH_ALL and SPH_MATCH_PHRASE modes, the final weight is the sum of the weighted phrase scores.
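
A minimal sketch of that sum, assuming the per-field LCS values are already known. The function name `match_all_weight` and the sample LCS values are hypothetical; the field names and weights are the ones used earlier in this article.

```python
def match_all_weight(lcs_by_field, field_weights):
    """Sum LCS * field weight over all fields (unlisted fields weigh 1)."""
    return sum(
        lcs * field_weights.get(field, 1)
        for field, lcs in lcs_by_field.items()
    )


field_weights = {"industry": 1, "area": 2, "expr": 2}

# One keyword matched in industry and one in area, nothing in expr.
print(match_all_weight({"industry": 1, "area": 1, "expr": 0}, field_weights))
```

With one matching keyword per field, each LCS is 1, so the result is simply the sum of the weights of the fields that were hit.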

In SPH_MATCH_ANY mode, the idea is essentially the same as in the two modes above, but the number of matching words in each field is also added. Before that, the weighted phrase scores are multiplied by a value large enough to guarantee that a higher phrase score in any field ranks the whole match higher, even if that field's weight is relatively low.

In SPH_MATCH_EXTENDED mode, the final weight is the sum of the weighted phrase scores and the BM25 weight, multiplied by 1000 and rounded to an integer.
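
Read literally, that description gives the arithmetic below. This is a sketch of the sentence as written, not of Sphinx's actual implementation; the function name `extended_weight` and the sample values are assumptions.

```python
def extended_weight(weighted_phrase_scores, bm25):
    """Combine phrase scores and a 0..1 BM25 value as the text describes."""
    if not 0.0 <= bm25 <= 1.0:
        raise ValueError("bm25 is expected to lie between 0 and 1")
    return round((sum(weighted_phrase_scores) + bm25) * 1000)


# Two documents with the same phrase scores but different BM25 values.
print(extended_weight([1, 2], 0.25))
print(extended_weight([1, 2], 0.5))
```

Because the BM25 part is at most one phrase-score unit, it mainly orders documents whose phrase scores tie.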

This behavior will change in the future so that the MATCH_ALL and MATCH_ANY modes also use BM25 weights. This will improve the ordering of results whose phrase scores are equal, which is particularly useful for single-word queries.

The key idea (for all modes except Boolean) is that better sub-phrase matches score higher, and exact matches (of the entire phrase) score highest. The author's experience is that this phrase-proximity-based scoring provides significantly higher search quality than any purely statistical model alone (such as BM25, which is widely used in other search engines).
