Controlling the number of hits for multi-term queries in Solr

Source: Internet
Author: User
Tags: solr

1. Which two concepts of full-text search does this article deal with?

2. What determines the sort order when a query returns many results?

3. How can the number and quality of hits in Solr queries be reasonably controlled?

Everyday articles and other texts contain high-frequency words, and when such words take part in a query they often cause a very large result set to be hit.
What does this mean? Suppose we are building a hotel search, and our index has a name field in which most values look like "XXX Hotel". A search for "XXX Hotel" is tokenized into:
XXX
Hotel
If "XXX" hits only 10 results while "Hotel" hits 200,000, the total may exceed 200,000 results. Such a large number of hits shows how rich the information is, but it may also confuse users.
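To make the arithmetic concrete, here is a minimal sketch in Python of how combining the two terms with OR inflates the hit count. The posting lists are made up to match the counts in the example, and the overlap of 5 documents between the two terms is an assumption for illustration only; this is a toy model, not how Lucene stores its index.

```python
# Toy inverted index: term -> set of matching document IDs.
# The sizes mirror the example (10 and 200,000); the overlap of
# 5 documents is an assumption made only for illustration.
postings = {
    "xxx": set(range(0, 10)),          # 10 documents
    "hotel": set(range(5, 200_005)),   # 200,000 documents
}


def or_hits(terms):
    """Documents matching at least one term (OR semantics)."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result


def and_hits(terms):
    """Documents matching every term (AND semantics)."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


query = ["xxx", "hotel"]
print(len(or_hits(query)))   # size of the union of both posting lists
print(len(and_hits(query)))  # size of their intersection
```

The union is dominated by the high-frequency term, which is why a single common word is enough to flood the result set.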
Let us look at two important concepts in full-text search: precision and recall. In Lucene, Solr, and Elasticsearch, the results of an ordinary tokenized query strike a balance between the two, and the default relevance scoring rule is:
    • The most relevant results are ranked first, which reflects precision
    • The less relevant results are ranked later, which reflects recall
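For readers who want the two terms pinned down, the following is a small sketch of the standard definitions: precision is the share of returned documents that are relevant, and recall is the share of relevant documents that are returned. The document IDs are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}          # documents the user actually wants
narrow = {1, 2, 5}               # a strict query: few results
broad = set(range(1, 11))        # a loose query: many results

print(precision_recall(narrow, relevant))
print(precision_recall(broad, relevant))
```

The loose query finds every relevant document but buries them among irrelevant ones, while the strict query returns cleaner results and misses some. That trade-off is what the rest of the article tries to tune.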

Of course, this ranking rule does not hold 100% of the time. Because of Lucene's underlying design, some strange effects can occur, such as the most exact match not being ranked first; this happens with a probability of about 10%. We can avoid the problem by indexing two fields, one tokenized and one untokenized, and querying both fields together. Back to the hotel example, suppose we now want to search for:
Beijing Lane Ditch North Village ten Li Xiang Hotel. After tokenization the terms are as follows:
Lane
Gou
Bei Li
Xiaozhuang
Ten miles
Incense
Hotel
Note that most of the data in the index contains the two words "Beijing" and "Hotel", so this query matches almost everything in the index. The ranking of the top results may be acceptable, but the hit count is far too large: beyond page 4, nearly every result is just "Beijing xxxx Hotel" and has nothing to do with the subject of the search. So we can adopt some strategies to avoid this situation:
By default, Solr combines the terms produced by tokenization with OR and returns the whole result set. If we change this to AND, every term must match, which amounts to exact matching. The drawback is that when users enter incomplete words, exact matching defeats the purpose of full-text search. So we need a combined strategy that protects both precision and recall. How can that be achieved?

The full-text search framework cannot do this directly. One good idea is to extract the backbone of the query sentence and require those backbone terms to be hit; a record that misses them is dropped even if it shares other words with the query, because its relevance is poor. This method works well, but how do we accurately extract the backbone terms from large-scale data? With machine learning or text mining? It can certainly be done, but it needs a separate design. It is the best way to solve the problem of too many hits.

The other approach is a stopgap that is easier to implement: limit how many of the tokenized terms must match. For example, given
Lane
Gou
Bei Li
Xiaozhuang
Ten miles
Incense
Hotel
a record must hit 3 or more terms to be considered relevant, or we set a percentage so that a record counts as a good match only if at least 80% of the terms hit. This can be done with Solr's edismax. There are two solutions, as follows:
One: use edismax. Write in q:
name:Beijing xxxxx Hotel
Then write in the Raw Query Parameters field:
defType=edismax&mm=80%25
Then run the query. mm is the minimum number of terms that must match; it can be a fixed value or a percentage (%25 is the URL-encoded form of %).
Two: in Solr's schema.xml, set the defaultOperator of solrQueryParser to AND.
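As a rough illustration of the first solution, the sketch below builds the request URL in Python and works out how many terms an mm percentage requires. The host, the core name hotels, and the use of qf=name to target the name field are assumptions for the example, not taken from the article. The rounding-down rule follows the Solr documentation for the mm parameter.

```python
from urllib.parse import urlencode


def required_matches(num_terms, mm_percent):
    """Minimum number of terms that must match for a percentage mm.

    The computed value is rounded down, so integer division is used.
    """
    return (num_terms * mm_percent) // 100


def build_select_url(base_url, user_query, mm="80%"):
    """Build a select URL that uses edismax with a minimum-match rule."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": "name",  # assumption: search only the name field
        "mm": mm,
    }
    # urlencode escapes "%" as "%25", which is why the raw
    # parameter string in the article reads mm=80%25.
    return base_url + "?" + urlencode(params)


# Hypothetical host and core name, used only for illustration.
url = build_select_url(
    "http://localhost:8983/solr/hotels/select", "Beijing xxxxx Hotel"
)
print(url)
print(required_matches(7, 80))  # terms required out of the 7 listed above
```

For the second solution, depending on the Solr version, the same effect can usually be had per request by passing q.op=AND instead of editing the schema.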


Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission.
