Background
Keyword suggestion is a standard feature of search applications. Its main purpose is to keep users from entering wrong search terms and to guide them toward suitable keywords, which improves the search experience.
Our CRM system holds millions of merchants. To let users find the target merchant quickly, we implemented the merchant search module on SolrCloud. Users search mainly by merchant name and merchant address, so to improve the search experience and input efficiency, this article implements keyword suggestion based on Solr prefix queries.
Requirements analysis
- Support prefix matching
When the user types the Chinese word for "seabed" (pinyin "haidi") in the search box, the box shows search terms that begin with it, such as "Haidilao", "Haidilao hot pot", and "Underwater World", all of which start with the same two characters in Chinese. Typing "Wanda" suggests "Wanda IMAX", "Wanda Plaza", "Wanda department store", and similar terms.
- Support both Chinese and pinyin input
Because of the way Chinese is typed, suggestions that also work on pinyin are much more convenient, since the user does not have to switch input methods. For example, typing the pinyin "haidi" should produce the same suggestions as typing the Chinese word for "seabed", and typing the pinyin "wanda" the same suggestions as typing "Wanda" in Chinese characters.
- Support polyphone input
The first character of "Chongqing" can be read "chong" or "zhong", so typing either "chongqing" or "zhongqing" should suggest "Chongqing hotpot", "Chongqing Grilled Fish", and "Chongqing Little Swan".
- Support pinyin abbreviation input
For longer keywords, pinyin abbreviations are needed to improve input efficiency. For example, typing "hd" should suggest the keywords that match "haidi", and typing "wd" should suggest the "Wanda" keywords.
- Sort by keyword popularity based on users' historical search behavior
To improve the accuracy of the suggested keywords, the final results are sorted by how often users have searched for each keyword. For example, typing any of ["Chongqing" in Chinese characters, chongqing, cq, zhongqing, zq] returns "Chongqing hotpot" (f1), "Chongqing Grilled Fish" (f2), and "Chongqing Little Swan" (f3), in that order, where the query frequencies satisfy f1 > f2 > f3.
Solutions
- Keyword collection
When the user types a prefix, there are many candidate suggestions. Which should be chosen, and which should appear first? This is a question of search popularity. Users enter a large number of keywords when looking for merchants, and each entry counts as one vote for that keyword: the more often a keyword is entered, the more popular the corresponding query. The queried keywords therefore need to be recorded and the frequency of each keyword counted, so that suggestions can be sorted by frequency. The search engine writes every query string a user submits to a log file; each query string is 1 to 255 bytes long.
- Chinese characters to pinyin
The keywords users enter may contain Chinese characters, digits, English, pinyin, special characters, and so on. To support pinyin suggestions, Chinese characters must be converted to pinyin. In Java, the pinyin4j library can be used for the conversion.
- Pinyin abbreviation extraction
Because pinyin abbreviations must be supported, the abbreviations are extracted during the conversion from Chinese characters to pinyin. For example, "Chongqing" yields the full spellings "chongqing" and "zhongqing" and the abbreviations "cq" and "zq".
- Polyphone permutations
To support polyphone suggestions, once the query string has been converted to pinyin, every combination of the per-character readings must be generated. The permutation algorithm is as follows:
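    // Needs java.util.ArrayList, java.util.Collections, java.util.List.
    // CollectionUtils is assumed to be org.apache.commons.collections.CollectionUtils;
    // the original text does not say which library it comes from.
    // termArrays holds one list of candidate readings per character.
    public static List<String> getPermutationSentence(List<List<String>> termArrays, int start) {
        if (CollectionUtils.isEmpty(termArrays)) {
            return Collections.emptyList();
        }
        int size = termArrays.size();
        if (start < 0 || start >= size) {
            return Collections.emptyList();
        }
        // Last position: its readings are the only possible suffixes.
        if (start == size - 1) {
            return termArrays.get(start);
        }
        List<String> strings = termArrays.get(start);
        // All combinations for the positions after start.
        List<String> permutationSentences = getPermutationSentence(termArrays, start + 1);
        if (CollectionUtils.isEmpty(strings)) {
            return permutationSentences;
        }
        if (CollectionUtils.isEmpty(permutationSentences)) {
            return strings;
        }
        // Prepend every reading at this position to every suffix.
        List<String> result = new ArrayList<String>();
        for (String pre : strings) {
            for (String suffix : permutationSentences) {
                result.add(pre + suffix);
            }
        }
        return result;
    }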
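The conversion, abbreviation, and permutation steps above can be sketched end to end as follows. This is a minimal illustration rather than the article's implementation: the `READINGS` table is a hypothetical, hard-coded stand-in for a pinyin library, the class and method names are invented for the example, and the combination step is written iteratively instead of recursively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PinyinKeys {
    // Hypothetical readings for two characters only; a real system would
    // obtain these from a pinyin library instead of a hard-coded table.
    static final Map<Character, List<String>> READINGS = new LinkedHashMap<>();
    static {
        READINGS.put('\u91cd', Arrays.asList("chong", "zhong")); // polyphonic character
        READINGS.put('\u5e86', Arrays.asList("qing"));
    }

    // Cartesian product of the per-character alternatives, concatenated in order.
    static List<String> combine(List<List<String>> parts) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (List<String> alternatives : parts) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                for (String alt : alternatives) {
                    next.add(prefix + alt);
                }
            }
            result = next;
        }
        return result;
    }

    // Full-pinyin spellings of a keyword, one per combination of readings.
    static List<String> fullPinyin(String keyword) {
        List<List<String>> parts = new ArrayList<>();
        for (char c : keyword.toCharArray()) {
            List<String> readings = READINGS.get(c);
            // Characters without a known reading (digits, Latin letters) pass through.
            parts.add(readings != null ? readings : Arrays.asList(String.valueOf(c)));
        }
        return combine(parts);
    }

    // Abbreviations: the first letter of each reading, de-duplicated per character.
    static List<String> abbreviations(String keyword) {
        List<List<String>> parts = new ArrayList<>();
        for (char c : keyword.toCharArray()) {
            List<String> readings = READINGS.get(c);
            List<String> initials = new ArrayList<>();
            if (readings == null) {
                initials.add(String.valueOf(c));
            } else {
                for (String reading : readings) {
                    String initial = reading.substring(0, 1);
                    if (!initials.contains(initial)) {
                        initials.add(initial);
                    }
                }
            }
            parts.add(initials);
        }
        return combine(parts);
    }

    public static void main(String[] args) {
        String keyword = "\u91cd\u5e86"; // "Chongqing" in Chinese characters
        System.out.println(fullPinyin(keyword));
        System.out.println(abbreviations(keyword));
    }
}
```

For the two-character keyword in `main`, the sketch produces the two full spellings and the two abbreviations mentioned in the list above.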
Solution 1: Trie + top-K algorithm
A trie, also known as a dictionary tree, word search tree, or key tree, is a tree structure and a variant of the hash tree. It is typically used to count and sort large numbers of strings (though not only strings), so search engines often use it for word-frequency statistics. Its advantage is that it minimizes unnecessary string comparisons, and its query efficiency is higher than that of a hash table. A trie stores multiple strings in one tree. The edge between two adjacent nodes represents one character, so each branch of the tree represents a substring and a leaf node represents a complete string. Unlike an ordinary tree, strings with the same prefix share the same branch. For example, given the words inn, int, at, age, adv, and ant, we get the following trie:
[Figure: trie built from the words inn, int, at, age, adv, ant]
As the figure shows, when the user types the prefix "i", the search box can show keywords that begin with "i", such as "in", "inn", and "int"; when the user types the prefix "a", the search box can suggest keywords that begin with "a", such as "ate". The first step in implementing search suggestions is therefore clear: store the large set of strings in a trie and, for a given prefix, keep the relatively popular suffixes.
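To make the prefix lookup concrete, here is a minimal trie sketch using the example words from the text. The class and method names are illustrative, and one detail is an assumption of this sketch: an `isWord` flag marks the end of a word, so a word that is itself a prefix of another word can still be stored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Trie {
    private static final class Node {
        // TreeMap keeps children sorted, which gives a stable output order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Walks down the tree one character at a time, creating nodes as needed.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns every stored word that starts with the given prefix.
    List<String> withPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out; // no stored word has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first traversal of the subtree below the prefix node.
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        for (String w : new String[] {"inn", "int", "at", "age", "adv", "ant"}) {
            trie.insert(w);
        }
        System.out.println(trie.withPrefix("in")); // [inn, int]
        System.out.println(trie.withPrefix("a"));  // [adv, age, ant, at]
    }
}
```

This sketch returns matches in alphabetical order; ranking them by popularity is the job of the top-K step.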
The top-K algorithm solves the problem of finding the most popular keywords. It combines two steps: hash-map counting and heap-based selection.
Hash-map counting: First preprocess the full batch of queries. Maintain a hash table whose key is the query string and whose value is the number of times that query has occurred, that is, hash_map(query, value). For each query read, if the string is not in the table, add it with a value of 1; if it is already in the table, increment its count by one. The counting finishes in O(N) time.
Heap selection: A heap finds the top K in N' log K time, because lookup and adjustment within a heap take logarithmic time. Maintain a min-heap of size K (here 10), then traverse the 3 million distinct queries, comparing each with the root element and replacing the root when the query's count is larger. The final time complexity is O(N) + N' * O(log K), where N is 10 million and N' is 3 million.
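The two steps can be sketched with the standard library as follows. This is an illustration under simple assumptions (all queries fit in memory, and the class and method names are invented for the example), not the article's production code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopK {
    // Step 1: count the occurrences of each query string in O(N).
    static Map<String, Integer> countFrequencies(List<String> queries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String q : queries) {
            counts.merge(q, 1, Integer::sum);
        }
        return counts;
    }

    // Step 2: keep the k most frequent entries in a min-heap of size k.
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<String> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();   // evict the least frequent of the current top k
                heap.offer(e);
            }
        }
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey()); // ascending by count
        }
        Collections.reverse(result);          // most frequent first
        return result;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(topK(countFrequencies(log), 2)); // [a, b]
    }
}
```

The root of the min-heap is always the weakest of the current top K, so each distinct query needs only one comparison against the root plus, at most, one O(log K) adjustment.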
The problems with this solution are:
- Chinese characters must be converted to pinyin both when building the index and when querying, the pinyin must be converted back to Chinese characters after the query, and digits and special characters need separate handling.
- Two tries must be maintained, one for full pinyin and one for abbreviations.
Solution 2: Solr's built-in Suggester
Solr, a widely used search engine, has a built-in suggestion feature called the Suggest module. The module can produce suggestions from a text file of suggestion terms, and it can also build its dictionary from a field in the index. (See the Solr wiki page http://wiki.apache.org/solr/Suggester)
The problems with this solution are:
- The returned results are sorted by the term frequency of the field in the index, not by how often users search for the keywords, so popular keywords cannot be ranked first.
- Pinyin suggestions, polyphones, and abbreviations still require additional indexed fields.
Solution 3: A separate SolrCloud collection queried with Solr prefix queries
As described above, both of the previous solutions have problems. The trie + top-K approach does not handle suggestions for Chinese characters elegantly, and it needs two tries, which makes the implementation complex. Solr's built-in Suggest component sorts by term frequency, so the returned results depend entirely on how often the terms occur in the index and ignore how often users search for them, which means popular keywords cannot be ranked near the top. We therefore kept looking for a more elegant solution.
We decided to build an index collection dedicated to keywords and to query it with Solr prefix queries. Solr's copyField lets a single query cover several fields at once (the Chinese keyword, pinyin, and abbre), and setting a field's multiValued property to true accommodates the multiple polyphone readings of the same keyword. The configuration is as follows:
schema.xml:

    <field name="kw" type="string" indexed="true" stored="true"/>
    <field name="pinyin" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="abbre" type="string" indexed="true" stored="false" multiValued="true"/>
    <field name="kwfreq" type="int" indexed="true" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="suggest" type="suggest_text" indexed="true" stored="false" multiValued="true"/>
    <!-- multiValued="true" indicates that the field holds multiple values -->

    <uniqueKey>kw</uniqueKey>
    <defaultSearchField>suggest</defaultSearchField>

    <!-- copy every searchable form of the keyword into the suggest field -->
    <copyField source="kw" dest="suggest"/>
    <copyField source="pinyin" dest="suggest"/>
    <copyField source="abbre" dest="suggest"/>

    <!-- suggest_text -->
    <fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      </analyzer>
    </fieldType>

Description:
- kw is the original keyword.
- pinyin and abbre have multiValued="true". When the index is built with SolrJ, they are defined as collection types. For example, for the keyword "Chongqing" the pinyin field is {chongqing, zhongqing} and the abbre field is {cq, zq}.
- kwfreq is the frequency with which users search for the keyword. It is used to sort query results.
KeywordTokenizerFactory: This tokenizer performs no tokenization at all. The entire character stream becomes a single token. The string field type has a similar effect, but it cannot be configured with other text-analysis components, such as case conversion. Any field used for sorting, and for most faceting functions, must produce only a single token from its original value.
Building the prefix query:
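    // Needs org.apache.solr.client.solrj.SolrQuery.
    private SolrQuery getSuggestQuery(String prefix, Integer limit) {
        SolrQuery solrQuery = new SolrQuery();
        StringBuilder sb = new StringBuilder();
        // Prefix query against the combined suggest field, e.g. suggest:cq*
        // The prefix is used as-is; escape query syntax characters first if it is raw user input.
        sb.append("suggest:").append(prefix).append("*");
        solrQuery.setQuery(sb.toString());
        // Return only the keyword and its search frequency.
        solrQuery.addField("kw");
        solrQuery.addField("kwfreq");
        // Most frequently searched keywords first.
        solrQuery.addSort("kwfreq", SolrQuery.ORDER.desc);
        solrQuery.setStart(0);
        solrQuery.setRows(limit);
        return solrQuery;
    }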
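To show what this query is expected to return, here is a small in-memory sketch of the same semantics: match a prefix against every value copied into the suggest field, then sort by kwfreq in descending order. It does not use Solr at all. The class names, the pinyin keys, and the frequencies are hypothetical sample data, and the English keyword names stand in for the Chinese originals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class SuggestIndexSketch {
    // One "document": the keyword, its search frequency, and every value that
    // copyField would place in the suggest field (keyword, pinyin, abbreviations).
    static final class Doc {
        final String kw;
        final int kwfreq;
        final List<String> suggest;

        Doc(String kw, int kwfreq, String... suggest) {
            this.kw = kw;
            this.kwfreq = kwfreq;
            this.suggest = Arrays.asList(suggest);
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    void add(Doc doc) {
        docs.add(doc);
    }

    // Mirrors "suggest:<prefix>*" sorted by kwfreq desc, limited to `limit` rows.
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            for (String key : d.suggest) {
                if (key.toLowerCase(Locale.ROOT).startsWith(p)) {
                    hits.add(d);
                    break; // a document matches once, however many keys match
                }
            }
        }
        hits.sort(Comparator.comparingInt((Doc d) -> d.kwfreq).reversed());
        int rows = Math.max(0, Math.min(limit, hits.size()));
        List<String> out = new ArrayList<>();
        for (Doc d : hits.subList(0, rows)) {
            out.add(d.kw);
        }
        return out;
    }

    public static void main(String[] args) {
        SuggestIndexSketch index = new SuggestIndexSketch();
        // Hypothetical frequencies and keys, for illustration only.
        index.add(new Doc("Chongqing Grilled Fish", 20,
                "chongqingkaoyu", "zhongqingkaoyu", "cqky", "zqky"));
        index.add(new Doc("Chongqing hotpot", 30,
                "chongqinghuoguo", "zhongqinghuoguo", "cqhg", "zqhg"));
        index.add(new Doc("Wanda Plaza", 50, "wandaguangchang", "wdgc"));
        System.out.println(index.suggest("zq", 10));
        System.out.println(index.suggest("wd", 10));
    }
}
```

Because every spelling of a keyword lives in the same multi-valued field of one document, a single prefix match covers full pinyin, polyphone variants, and abbreviations, and the sort needs only the stored frequency.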
The effect is as follows:
[Figure: screenshot of the resulting keyword suggestions]
Original address: http://tech.meituan.com/pinyin-suggest.html
An implementation of intelligent tips for search engine keywords