How to reasonably control the number of hits in Solr queries?

Last Update:2015-07-13 Source: Internet

Author: User

Tags solr


How can we reasonably control the number of hits in Solr?

Everyday articles and listings contain high-frequency words, and when these words take part in a query they often produce a very large result set.
What does that mean? Suppose we are building a restaurant search and the index has a name field in which most values look like "XXX Restaurant". A search for "XXX Restaurant" is tokenized into:
xxx
restaurant
The term xxx then matches only 10 documents, while restaurant matches 200,000, so the combined result can exceed 200,000 documents. A large hit count shows how rich the data is, but it can also overwhelm and confuse users.

Let us look at two important concepts in full-text search: precision and recall.

In Lucene, Solr, and Elasticsearch, the default handling of a tokenized query tries to balance these two measures, and the default relevance ranking works like this:

The most relevant results are ranked first, which reflects precision.
The less relevant results are ranked later, which reflects recall.

Of course, this is not always true. Because of how the Lucene layer scores matches, odd results can occur in which the most exact match is not ranked first; this happens in roughly 10% of cases. We can avoid the problem by indexing the value in two fields, one tokenized and one not tokenized, and querying both fields together at search time.
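The two-field idea above can be sketched as a small query builder. This is only an illustration: the field names `name` (tokenized) and `name_exact` (not tokenized), the helper name, and the boost value are assumptions, not part of the original setup. It relies only on standard Lucene query syntax (`field:(...)`, quoted phrases, `OR`, and `^` boosts).

```python
def build_two_field_query(user_input,
                          text_field="name",          # assumed tokenized field
                          exact_field="name_exact",   # assumed non-tokenized field
                          exact_boost=10):            # assumed boost value
    """Combine a tokenized-field clause with a boosted exact-field clause."""
    # Escape backslashes and double quotes so the quoted phrase stays valid.
    # Other query-syntax characters in user_input are not handled in this sketch.
    phrase = user_input.replace("\\", "\\\\").replace('"', '\\"')
    tokenized_clause = f"{text_field}:({user_input})"
    exact_clause = f'{exact_field}:"{phrase}"^{exact_boost}'
    return f"{tokenized_clause} OR {exact_clause}"


query = build_two_field_query("xxx restaurant")
print(query)
```

A document whose untokenized name equals the input matches both clauses and picks up the boost, which should push it ahead of documents that match only individual terms.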

Back to the restaurant example. Suppose a user now wants to search for:
Beijing Chedaogou Beili Xiaozhuang Shilixiang Restaurant, which after tokenization looks like this:

车道沟北里小庄十里香饭店

Note that most documents in the index contain both "Beijing" and "restaurant", so this query matches almost everything in the index. The ranking of the top results may still be acceptable, but the hit count is far too large: beyond page 4, almost every result is just some "Beijing xxxx Restaurant" that has nothing to do with what the user searched for. We can adopt a few strategies to avoid this.
Solr's default search strategy combines the terms produced by tokenization with OR and returns the entire result set. If we switch to AND, we get exact matching, but then a user who types an incomplete or partial name gets nothing, and full-text search loses its point. So we need a combined strategy that preserves both precision and recall. How can we achieve that?
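The trade-off between OR and AND can be seen on a toy inverted index. The sketch below is not how Solr is implemented; the index contents and the `search` helper are made up purely to show why OR inflates the hit count and why AND fails on imperfect input.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
index = {
    "xxx": {1, 2},
    "restaurant": {1, 2, 3, 4, 5, 6, 7, 8},
}


def search(terms, operator="OR"):
    """Return the ids of documents matching the terms under OR or AND."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    if operator == "AND":
        # Every term must be present in the document.
        return set.intersection(*postings)
    # Any single term is enough for a document to match.
    return set.union(*postings)


or_hits = search(["xxx", "restaurant"], "OR")
and_hits = search(["xxx", "restaurant"], "AND")
typo_hits = search(["xxx", "restaurnt"], "AND")  # one misspelled term

print(len(or_hits), len(and_hits), len(typo_hits))  # prints: 8 2 0
```

With OR, the frequent term drags in every document; with AND, a single term missing from the index empties the result.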

The full-text search framework cannot do this on its own. One good idea is to extract the core terms (the "backbone") of the query and require that they match: if a document does not match the core terms, it is not very relevant even if it shares other words with the query. This approach works well, but how do you extract the core terms accurately on large-scale data? With machine learning or text mining? It can certainly be done, but it needs a separate design; it is the best solution to the problem of too many hits.

There is also a stopgap that treats the symptom rather than the cause and is much easier to implement: require a minimum number of matching terms after tokenization. For example, for a query like

车道沟北里小庄十里香饭店
we could require a document to match 3 or more terms before we consider it relevant, or use a percentage and require that at least 80% of the terms match before the document counts as a hit. Solr's edismax query parser supports this, as follows:

Using edismax, write the query in q:
name:Beijing xxxxx Restaurant
Then enter the following in the Raw Query Parameters field:
defType=edismax&mm=80%25

Then run the query. mm is the minimum number of terms that must match ("minimum should match"); it can be a fixed value or a percentage. Because the query here is issued from the Solr admin page, the % character has to be URL-encoded as %25 so that it is sent to the Solr server correctly. For details, see:

Edismax Function Introduction
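To make the mm setting more concrete, here is a minimal sketch of two things: how a positive percentage such as 80% translates into a required term count, and how the `%` ends up as `%25` when the parameters are URL-encoded. The helper names are hypothetical. The rounding model is a simplification based on Solr's documented behavior of rounding a percentage down; it ignores negative and conditional mm specifications.

```python
from urllib.parse import urlencode


def min_should_match(num_terms, percent):
    """Simplified model of a positive percentage mm: round down."""
    return (num_terms * percent) // 100


def build_query_string(q, mm):
    """URL-encode the parameters used in the example above."""
    return urlencode({"q": q, "defType": "edismax", "mm": mm})


for n in (3, 5, 6):
    print(n, "terms ->", min_should_match(n, 80), "must match")

# The literal % in "80%" is encoded as %25.
print(build_query_string("name:xxx restaurant", "80%"))
```

Under this model a 3-term query with mm=80% needs only 2 matching terms, so very short queries stay fairly permissive while long ones are filtered more strictly.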
