Search recommendation system: sorting by user search frequency (hot search)

Source: Internet
Author: User
Tags: solr

The ternary tree written previously was a bit too simple and does not meet the needs of a real project. This post briefly analyzes the core algorithm of the search suggestion system in Solr.

The Solr wiki describes the suggest feature in detail, but the core algorithm can only be understood from the source code. I translated the wiki page earlier; following that document, I studied the source code in detail and present the core ideas first.

The basic flow is as follows: when the user types a search-term prefix, the front end invokes Solr's suggest handler, which locates the Suggester object. The Suggester reads terms from the main index for the configured field and builds a Dictionary. Because terms in the main index are already merged and sorted, this skips the sorting step (done with the pinyin4j component) that building a ternary tree from raw text would otherwise require. Next the dictionary is balanced to yield a self-balancing ternary search tree, improving retrieval efficiency. Once the tree is built, a prefix-match query finds all matching words, which are added to a priority queue: a fixed-capacity heap whose top element is the minimum. The reason Lucene wrote its own PriorityQueue&lt;T&gt; rather than using the JDK's is that the JDK's PriorityQueue&lt;T&gt; grows its capacity, so it would hold every matched word before outputting the top N, which is an obvious waste of memory. A previous blog post on finding the top N items in massive data already elaborated this heap idea, so it is not repeated here. The results are finally output through the priority queue.
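The fixed-capacity heap idea can be sketched with the JDK's own PriorityQueue (a minimal illustration of the principle, not Lucene's actual PriorityQueue&lt;T&gt;; the class and method names here are mine): keep at most N elements, and once the heap is full, admit a new element only if it beats the current minimum at the heap top.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNHeap {
    // Keep the N largest values seen so far; the heap top is always the smallest retained value.
    static List<Integer> topN(int[] values, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(n); // natural order: min-heap
        for (int v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) { // beats the current minimum: replace it
                heap.poll();
                heap.offer(v);
            }                             // otherwise discard, keeping memory bounded at n
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // descending, like the suggest output
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topN(new int[]{78, 100, 84, 98}, 3)); // prints [100, 98, 84]
    }
}
```

Memory stays bounded at N entries no matter how many candidates stream through, which is exactly the property the JDK's growable queue lacks.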

Suggester -> Lookup -> Lookup implementations (TSTLookup, JaspellLookup, FSTLookup); TSTLookup was studied earlier. The core idea of the sorting is: after the dictionary is constructed, obtain the Dictionary object, get an InputIterator from it, and scan the dictionary, reading two variables per entry: the term itself, and the term's weight, which is used for sorting. After the scan completes, the terms and weights are loaded into two lists and inserted into the ternary tree. The design of the tree's node object is therefore very important; it encapsulates the following properties: storedChar, val (the weight), and token (the term, stored in the word's last node). The insertion logic, improved over the ternary tree I wrote last time, is as follows:

package chinese.utility.ternaryTree;

/**
 * Ternary tree node
 * @author Tongxueqiang
 * @date 2016/03/12
 * @since JDK 1.7
 */
public class TernaryNode {
    public char storedChar;  // the single character stored by this node
    public String token;     // the full term (word), stored only on its last node
    public TernaryNode leftNode, centerNode, rightNode; // child nodes

    public TernaryNode(char storedChar) {
        this.storedChar = storedChar;
    }
}

package chinese.utility.ternaryTree;

import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

/**
 * Custom ternary tree
 *
 * @author Tongxueqiang
 * @date 2016/03/12
 * @since JDK 1.7
 */
public class TernaryTree {
    // Root node; it stores no meaningful character.
    private static TernaryNode root = new TernaryNode('+');

    /**
     * Insert a word into the tree.
     *
     * @param word
     */
    public void insert(String word) {
        root = insert(root, word, 0);
    }

    public TernaryNode insert(TernaryNode currentTernaryNode, String word, int index) {
        if (word == null || word.length() <= index) {
            return currentTernaryNode;
        }
        char[] charArray = word.toCharArray();

        if (currentTernaryNode == null) {
            currentTernaryNode = new TernaryNode(charArray[index]);
        }

        if (currentTernaryNode.storedChar > charArray[index]) {
            currentTernaryNode.leftNode = insert(currentTernaryNode.leftNode, word, index);
        } else if (currentTernaryNode.storedChar < charArray[index]) {
            currentTernaryNode.rightNode = insert(currentTernaryNode.rightNode, word, index);
        } else {
            if (index != word.length() - 1) {
                currentTernaryNode.centerNode = insert(currentTernaryNode.centerNode, word, index + 1);
            } else {
                currentTernaryNode.token = word;
            }
        }

        return currentTernaryNode;
    }

    /**
     * Find all strings that start with the given prefix.
     *
     * @param prefix
     * @return
     */
    public List<TernaryNode> prefixCompletion(String prefix) {
        // First find the node of the prefix's last character.
        TernaryNode ternaryNode = findNode(prefix);
        return prefixCompletion(ternaryNode);
    }

    public List<TernaryNode> prefixCompletion(TernaryNode p) {
        List<TernaryNode> suggest = new ArrayList<TernaryNode>();

        if (p == null) return suggest;
        if ((p.centerNode == null) && (p.token == null)) return suggest;
        if ((p.centerNode == null) && (p.token != null)) {
            suggest.add(p);
            return suggest;
        }

        if (p.token != null) {
            suggest.add(p);
        }

        p = p.centerNode;

        Stack<TernaryNode> s = new Stack<TernaryNode>();
        s.push(p);
        while (!s.isEmpty()) {
            TernaryNode top = s.pop();

            if (top.token != null) {
                suggest.add(top);
            }
            if (top.centerNode != null) {
                s.push(top.centerNode);
            }
            if (top.leftNode != null) {
                s.push(top.leftNode);
            }
            if (top.rightNode != null) {
                s.push(top.rightNode);
            }
        }
        return suggest;
    }

    /**
     * Find the node of the prefix's last character, which becomes the
     * starting node of the prefix search.
     *
     * @param prefix
     * @return
     */
    public TernaryNode findNode(String prefix) {
        return findNode(root, prefix, 0);
    }

    private TernaryNode findNode(TernaryNode ternaryNode, String prefix, int index) {
        if (prefix == null || prefix.equals("")) {
            return null;
        }
        if (ternaryNode == null) {
            return null;
        }
        char[] charArray = prefix.toCharArray();
        // If the current character is less than this node's character, descend left.
        if (charArray[index] < ternaryNode.storedChar) {
            return findNode(ternaryNode.leftNode, prefix, index);
        }
        // If the current character is greater than this node's character, descend right.
        else if (charArray[index] > ternaryNode.storedChar) {
            return findNode(ternaryNode.rightNode, prefix, index);
        } else { // equal
            // Recursion terminates once the last prefix character is reached.
            if (index != charArray.length - 1) {
                return findNode(ternaryNode.centerNode, prefix, ++index);
            }
        }
        return ternaryNode;
    }
}
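For a quick sanity check, the two classes above can be condensed into a single self-contained file (the names `TernaryTreeDemo` and `Node` are mine, and word strings stand in for full node objects; the insert and lookup logic mirrors the code above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

public class TernaryTreeDemo {
    // Node folded in as a static nested class, mirroring TernaryNode above.
    static class Node {
        char storedChar;
        String token; // full word, set only on the word's last node
        Node left, center, right;
        Node(char c) { storedChar = c; }
    }

    private Node root = new Node('+'); // root stores a dummy character

    public void insert(String word) { root = insert(root, word, 0); }

    private Node insert(Node n, String word, int index) {
        if (word == null || word.length() <= index) return n;
        char c = word.charAt(index);
        if (n == null) n = new Node(c);
        if (n.storedChar > c) {
            n.left = insert(n.left, word, index);
        } else if (n.storedChar < c) {
            n.right = insert(n.right, word, index);
        } else if (index != word.length() - 1) {
            n.center = insert(n.center, word, index + 1);
        } else {
            n.token = word;
        }
        return n;
    }

    // Collect every stored word below the node that ends the prefix.
    public List<String> prefixCompletion(String prefix) {
        List<String> suggest = new ArrayList<>();
        Node p = findNode(root, prefix, 0);
        if (p == null) return suggest;
        if (p.token != null) suggest.add(p.token);
        if (p.center == null) return suggest;
        Stack<Node> s = new Stack<>();
        s.push(p.center);
        while (!s.isEmpty()) {
            Node top = s.pop();
            if (top.token != null) suggest.add(top.token);
            if (top.center != null) s.push(top.center);
            if (top.left != null) s.push(top.left);
            if (top.right != null) s.push(top.right);
        }
        return suggest;
    }

    private Node findNode(Node n, String prefix, int index) {
        if (prefix == null || prefix.isEmpty() || n == null) return null;
        char c = prefix.charAt(index);
        if (c < n.storedChar) return findNode(n.left, prefix, index);
        if (c > n.storedChar) return findNode(n.right, prefix, index);
        if (index != prefix.length() - 1) return findNode(n.center, prefix, index + 1);
        return n;
    }

    public static void main(String[] args) {
        TernaryTreeDemo t = new TernaryTreeDemo();
        for (String w : new String[]{"chong", "chongqing", "china", "cat"}) t.insert(w);
        System.out.println(t.prefixCompletion("ch")); // every stored word starting with "ch"
    }
}
```

Running the demo shows that "cat" is correctly excluded from the "ch" completions while "chong", "chongqing", and "china" are returned.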

With the improved ternary tree, a prefix-match lookup no longer returns plain strings but node objects, each encapsulating the matched term and its weight. Next, the resulting suggestions are added to the priority queue, which yields them in ascending order. The test code is as follows:

package chinese.utility.test;

import java.util.Stack;

import chinese.utility.utils.PriorityQueue;

/**
 * Outputs the top N using a Lucene-style PriorityQueue. The idea resembles the
 * heap sort written earlier: the queue's capacity is N, and the heap is adjusted
 * on each insert so that its top element is the minimum. Once the capacity is
 * exceeded, an inserted element replaces the heap top if it is larger and is
 * discarded if it is smaller. Finally the top N are popped in ascending order.
 * The lessThan method must be implemented to define the comparison, for example
 * comparing tokens by weight.
 *
 * @author Hadoop
 * @date 2016/03/11
 */
public class MyPriorityQueue extends PriorityQueue<Object> {

    public MyPriorityQueue(int num) {
        super(num);
    }

    @Override
    protected boolean lessThan(Object a, Object b) {
        return (Integer) a < (Integer) b;
    }

    public static void main(String[] args) {
        MyPriorityQueue priQueue = new MyPriorityQueue(3);
        priQueue.insertWithOverflow(100);
        priQueue.insertWithOverflow(98);
        priQueue.insertWithOverflow(84);
        priQueue.insertWithOverflow(78);
        // Print the results: the three largest values, in descending order.
        Stack<Integer> stack = new Stack<Integer>();
        Integer v = (Integer) priQueue.pop();

        while (v != null) {
            stack.push(v);
            v = (Integer) priQueue.pop();
        }

        Integer num = stack.pop();
        while (num != null) {
            System.out.println(num);
            if (stack.isEmpty()) break;
            num = stack.pop();
        }
    }
}

The result is:

100
98
84

However, the ranking above is based on term frequency in the thesaurus, not on the users' search frequency, so it does not implement hot search; another approach is needed. A previous blog post posed this problem: when users search, the search engine logs every query, recording each user's search trail. Suppose such a log file has tens of millions of lines, perhaps three million after removing duplicates; how do you find the 10 most frequent search strings? Back then I only wrote down the idea without a concrete implementation. A real project can borrow that idea, reorganized as follows:

Step 1: sort and deduplicate the text and update the file, which requires the pinyin4j package;

Step 2: load the updated dictionary into the ternary tree as a balanced tree; the custom ternary tree gains a per-node occurrence-count variable so that word frequencies can be tallied;

Step 3: traverse the dictionary, query the ternary tree for each word read to get its frequency, then write the word and its frequency to another file, separated by a space, in a key-value-like form;

Step 4: as in the earlier algorithm (Java) for finding the first K minimum or maximum values in massive data, find the top 10 values by frequency;

Step 5: take the minimum frequency value at the heap top and look up the corresponding words in the new file, adding them to a set, since many words, not just one, may share the same frequency.
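Assuming the deduplicated log fits in memory, the steps above collapse into a hash-map count followed by a capacity-K min-heap (a simplified sketch: the names `HotSearchTopK` and `topK` are illustrative, and the file-based pinyin4j preprocessing is omitted):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class HotSearchTopK {
    // One pass to count each query string, then a capacity-k min-heap on frequency.
    static List<String> topK(List<String> queryLog, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String q : queryLog) freq.merge(q, 1, Integer::sum);

        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(k, Map.Entry.comparingByValue()); // heap top = lowest frequency
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) { // beats the minimum: replace
                heap.poll();
                heap.offer(e);
            }
        }

        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(0, heap.poll().getKey()); // highest frequency first
        return result;
    }

    public static void main(String[] args) {
        List<String> log = List.of("cq", "cq", "bj", "cq", "bj", "sh");
        System.out.println(topK(log, 2)); // prints [cq, bj]
    }
}
```

Ties at the cutoff frequency are broken arbitrarily here; step 5's set-based handling is what gathers all words that share the minimum frequency.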

In a real project, how is the user's search frequency actually recorded? The ideas above are only a reference; let's look at how Meituan does it:

Consider building an index collection specifically for keywords and implementing it with Solr's prefix query. Solr's copyField solves the need to index several fields (Chinese characters, pinyin, abbreviation) at once, and setting a field's multiValued attribute to true handles keywords composed of polyphonic characters. The configuration is as follows:

schema.xml:

<field name="kw" type="string" indexed="true" stored="true"/>
<field name="pinyin" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="abbre" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="kwfreq" type="int" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="suggest" type="suggest_text" indexed="true" stored="false" multiValued="true"/>

<uniqueKey>kw</uniqueKey>
<defaultSearchField>suggest</defaultSearchField>

<copyField source="kw" dest="suggest"/>
<copyField source="pinyin" dest="suggest"/>
<copyField source="abbre" dest="suggest"/>

<fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    </analyzer>
</fieldType>

Notes: multiValued marks a field as multi-valued. kw is the original keyword. pinyin and abbre are multiValued=true; when building this index with SolrJ they are defined as collection types, so for the keyword "Chongqing" the pinyin field is {chongqing, zhongqing} and the abbre field is {cq, zq}. kwfreq is the frequency with which users have searched the keyword, used for sorting queries. The copyField entries copy kw, pinyin, and abbre into the suggest field.

KeywordTokenizerFactory: this tokenizer performs no tokenization at all! The entire character stream becomes a single token. The string field type has a similar effect, but it cannot be configured with other text-analysis components, such as case conversion. Any field used for sorting, and most faceting functions, require the indexed field to contain only a single token for the original value.

The prefix query is constructed as follows:

private SolrQuery getSuggestQuery(String prefix, Integer limit) {
    SolrQuery solrQuery = new SolrQuery();
    StringBuilder sb = new StringBuilder();
    sb.append("suggest:").append(prefix).append("*");
    solrQuery.setQuery(sb.toString());
    solrQuery.addField("kw");
    solrQuery.addField("kwfreq");
    solrQuery.addSort("kwfreq", SolrQuery.ORDER.desc);
    solrQuery.setStart(0);
    solrQuery.setRows(limit);
    return solrQuery;
}
