Please credit the source when reprinting: http://blog.csdn.net/weijonathan/article/details/40504029
Today's post is about word-segmentation-based query.
Take a look at the picture below.
Many search systems involve this feature: Chinese abbreviation search. When you type the pinyin initials of a Chinese phrase, the system suggests the matching Chinese phrases.
There are plenty of articles about this online, but few of them really explain how to implement it. Building on earlier work, I have organized here my own approach and its results.
First, the plugins involved:

Jcseg
- Git repository: http://git.oschina.net/lionsoul/jcseg
- Project site: https://code.google.com/p/jcseg/

pinyinTokenFilter
- Plugin address: https://github.com/Jonathan-Wei/pinyinTokenFilter

Jcseg is a Chinese word segmenter, and pinyinTokenFilter is a pinyin filter.
1. Download Jcseg, unzip it, and copy the jar packages in the jcseg\output directory into the Solr installation directory.
2. Put the Jcseg lexicon (word base) into the apache-tomcat-7.0.53\webapps\solr\WEB-INF\classes directory and configure the lexicon.path:
   1) By default, lexicon.path points to the lib directory of the project Jcseg is deployed in (here, the lib directory of Solr), so you can simply copy the lexicon directory from the downloaded archive into that lib directory.
   2) Alternatively, set lexicon.path to any lexicon directory you specify.
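As a sketch of step 2, assuming Jcseg's usual jcseg.properties layout (the property names and the absolute path below are my assumptions for this Tomcat deployment, not taken from the original post):

```properties
# jcseg.properties, placed on the classpath (e.g. WEB-INF/classes).
# Point lexicon.path at the directory holding the lex-*.lex files;
# this path is only an example -- adjust it to your own deployment.
lexicon.path = /usr/local/apache-tomcat-7.0.53/webapps/solr/WEB-INF/classes/lexicon
```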
Question:
You may ask why I don't use IK. After Solr 4.0 was released, the BaseTokenizerFactory interface was removed in favor of the standard Lucene Analyzer interface. Accordingly, the IK analyzer (FF version) dropped the org.wltea.analyzer.solr.IKTokenizerFactory class, so IK can no longer be configured as a tokenizer with filter nodes; once you configure it, Tomcat throws an error on startup. You can try the specific errors yourself; I won't go into them here.
3. Download pinyinTokenFilter. (I made a small modification to this plugin: the author's current code has a bug. The changes are minor; you can see them in the commits on my GitHub.) Then add the following configuration to schema.xml:
<fieldType name="text_pinyin" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.lionsoul.jcseg.solr.JcsegTokenizerFactory" mode="complex"/>
    <filter class="me.dowen.solr.analyzers.PinyinTransformTokenFilterFactory" isOutChinese="true" firstChar="true" minTermLength="1"/>
    <!-- <filter class="me.dowen.solr.analyzers.PinyinTransformTokenFilterFactory" isOutChinese="true" firstChar="false" minTermLength="1"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.lionsoul.jcseg.solr.JcsegTokenizerFactory" mode="complex"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
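To actually use this field type, you still need a field of that type and, typically, a copyField feeding it. A minimal sketch, where `content` and `content_pinyin` are hypothetical names of my own choosing, not from the original post:

```xml
<!-- Hypothetical field names: adjust to your schema -->
<field name="content" type="text_general" indexed="true" stored="true"/>
<!-- Pinyin side-field: indexed for abbreviation search, not stored -->
<field name="content_pinyin" type="text_pinyin" indexed="true" stored="false"/>
<!-- Copy the original Chinese text into the pinyin-analyzed field -->
<copyField source="content" dest="content_pinyin"/>
```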
Note that this plugin supports both full pinyin spelling and initials-only abbreviations, but configuring both at the same time does not seem to make much difference in practice. So I have only configured the abbreviation filter here, to let everyone see the effect.
A few notes on the parameters:
- isOutChinese: whether to keep the original Chinese token in the output. Optional values: true (default) / false.
- firstChar: whether to output the abbreviated spelling instead of the full pinyin. The abbreviated output is composed of the first letter of each character's pinyin in the original Chinese token. Optional values: true (default) / false.
- minTermLength: only Chinese tokens whose length is greater than or equal to minTermLength produce pinyin output. The default value is 2.
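Given those descriptions, the filter's behavior on a two-character token can be sketched as follows. This is my own worked example (the token 中国, pinyin zhong guo), not output quoted from the plugin docs:

```
input token: 中国  (pinyin: zhong guo)

isOutChinese="true",  firstChar="true"   →  中国, zg        (initials only)
isOutChinese="true",  firstChar="false"  →  中国, zhongguo  (full pinyin)
isOutChinese="false", firstChar="true"   →  zg              (original token dropped)
minTermLength="3"                        →  中国             (length 2 < 3, no pinyin emitted)
```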
Next, look at the analysis results in Solr:
Here everyone can see that the abbreviation of "猩球崛起" (Rise of the Planet of the Apes) is split into x, xq, xqj, xqjq; when you type any of these four abbreviations, the system will suggest content containing "猩球崛起".
Next, look at our query results:
You can see that typing xqj finds the "猩球崛起" content. "猩球崛起" is treated as a single word here: Jcseg supports custom lexicons, so I configured a simple lexicon of my own.
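As an illustration, a query against such a field might look like the following URL. The core name (collection1) and field name (content_pinyin) are assumptions for this sketch, not from the original post:

```
http://localhost:8983/solr/collection1/select?q=content_pinyin:xqj&wt=json&indent=true
```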
If pinyinTokenFilter is used without another word segmenter, for example tested as described in the plugin's README, you will see that your input is split into individual characters. The README uses:

<tokenizer class="solr.StandardTokenizerFactory"/>
pinyinTokenFilter still has some rough edges, so if you run into any problems while using it, please report them and I will follow up. Thank you.
[Apache Solr Series] Jcseg and pinyinTokenFilter for Chinese Abbreviation Search