How to reasonably control the number of hits in Solr?
Everyday articles and other text are full of high-frequency words, and when these words take part in a query they often cause a huge number of documents to be hit.
What does that mean? Suppose we are building a restaurant search, and the name field in our index mostly holds values like "xxx 饭店" (xxx restaurant). If you search for an xxx restaurant, the query is tokenized into:
xxx
饭店 (restaurant)
Suppose "xxx" hits only 10 documents while "饭店" hits 200,000; the total result set can then exceed 200,000 records. Such a massive hit count shows the richness of the data on one hand, but on the other hand it can thoroughly confuse users.
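As a rough illustration (assuming the field is called name, as in the example above), the default query parser joins the tokens with OR, so the hit count is the union of the two term sets:

```
q=name:(xxx 饭店)
  # parsed with the default OR operator, roughly as:
  #   name:xxx OR name:饭店
  # → ~10 docs ∪ ~200,000 docs ≈ 200,000+ hits
```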
Let us first analyze two important concepts in full-text search: precision (查准率) and recall (查全率).
In Lucene, Solr, and Elasticsearch, the query results are by default ranked so as to balance these two rates for the best overall effect, and the default relevance scoring rule is:
- Documents with the highest relevance scores come first, which is where precision shows itself.
- Documents with low relevance come later, which is where recall shows itself.
Of course, the above is not one hundred percent true: because of how the Lucene layer is designed, you can occasionally see odd effects where the most exact match is not ranked first. In my experience this happens with roughly 10% probability. We can avoid the problem by indexing two fields, one tokenized and one untokenized, and querying both fields together at search time.
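A minimal schema sketch of that two-field approach (the field names name/name_exact and the type text_cn are illustrative, not from the article): copy the tokenized field into an untokenized string field, then boost the exact field at query time.

```
<!-- schema.xml: one tokenized field, one exact untokenized field -->
<!-- text_cn stands in for whatever Chinese-tokenizing field type you use -->
<field name="name"       type="text_cn" indexed="true" stored="true"/>
<field name="name_exact" type="string"  indexed="true" stored="false"/>
<copyField source="name" dest="name_exact"/>
```

A query can then search both fields and weight the exact one higher, e.g. defType=edismax&qf=name name_exact^10, so that a document whose untokenized name equals the whole query string is pulled to the top.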
Back to the restaurant question. Suppose we now want to search for:
北京车道沟北里小庄十里香饭店 (a Beijing restaurant name), which after tokenization looks like this:
北京
车道
沟
北里
小庄
十里
香
饭店
Note that most of the documents in the entire index contain the two terms 北京 (Beijing) and 饭店 (restaurant), so this query will pull out almost every document in the index. The ranking may still be reasonable, but the hit volume is far too large: after about page 4 the results are almost all "Beijing xxxx restaurant" entries that have nothing to do with the topic of the search. So we can adopt some strategies to avoid this situation:
Solr's default search strategy joins the tokenized terms with OR and returns the entire union as the result set. If we change the operator to AND we get exact matching instead, but then, whenever the user types an incomplete phrase, nothing matches and full-text search loses its meaning. So we need a combined strategy that guarantees precision while still preserving recall. How can this be achieved?
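For concreteness, here is what the two extremes look like as request parameters (a sketch; name is still the example field):

```
# default operator OR: maximum recall, enormous hit count
q=name:(北京 车道 沟 北里 小庄 十里 香 饭店)&q.op=OR

# operator AND: every term must match – precise, but an incomplete
# or slightly wrong query may return nothing at all
q=name:(北京 车道 沟 北里 小庄 十里 香 饭店)&q.op=AND
```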
This cannot be achieved directly with the full-text search framework alone. One good idea is to extract the backbone of the query sentence, i.e. its key terms, and require the backbone to match: if even the backbone terms do not hit, then no matter how many terms a document shares with the query, its relevance cannot be very good. This method is not bad, but how do you accurately extract such backbone terms at large scale? With machine learning or text mining? It can certainly be done; it just requires a separate design, and it is the most thorough solution to the too-many-hits problem. There is another, palliative approach that is much easier to implement: limit the minimum number of terms that must match. For example, given the tokens
车道
沟
北里
小庄
十里
香
饭店
we require that at least 3 of these terms hit before we consider the relevance high enough, or we set a percentage limit, e.g. a record only qualifies if more than 80% of the terms hit. This can be done with Solr's edismax query parser, as follows:
Using edismax, write the query in q:
name:北京xxxxx饭店
and write in the Raw Query Parameters box:
defType=edismax&mm=80%25
Then run the query. mm is the minimum-should-match parameter; it can be a fixed number of terms or a percentage. Because the query here is issued from the Solr admin page, the % character has to be URL-encoded as %25 so that it reaches the Solr server intact. For details, see:
Edismax Function Introduction
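Putting it all together, a complete edismax request might look like the sketch below (host, core name, and field name are placeholders, not from the article); note that mm also accepts fixed counts and conditional expressions.

```
# at least 80% of the query terms must match (%25 = URL-encoded "%")
http://localhost:8983/solr/restaurant/select?q=北京车道沟北里小庄十里香饭店
    &defType=edismax&qf=name&mm=80%25

# other forms mm accepts:
#   mm=3        at least 3 terms must match
#   mm=2<-25%   for queries with more than 2 terms, up to 25% may be missing
```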