"Apache SOLR Series" Jcseg and Pinyintokenfilter for Chinese abbreviation search

Source: Internet
Author: User

If you reprint this post, please credit the source: http://blog.csdn.net/weijonathan/article/details/40504029

Today's post is about queries that rely on word segmentation.

Let's take a look at the picture below.


This is a topic that many search systems run into: Chinese abbreviation search. When you type the pinyin abbreviation of a Chinese phrase, the system suggests the matching Chinese phrases.

There are plenty of articles about this on the Internet, but few of them really explain how to implement it. After studying earlier work, I have written up the results of my own experiments here.

First, the plugins involved and where to get them:

Jcseg download address and official website

Download address: http://git.oschina.net/lionsoul/jcseg

Website address: https://code.google.com/p/jcseg/

pinyinTokenFilter plugin address:

https://github.com/Jonathan-Wei/pinyinTokenFilter

Jcseg is a Chinese word segmenter (tokenizer), and pinyinTokenFilter is a token filter that converts Chinese tokens to pinyin.

1. Download and unpack Jcseg, then copy the jar files from the Jcseg\output directory into the Solr installation directory.

2. Put the Jcseg lexicon into the apache-tomcat-7.0.53\webapps\solr\WEB-INF\classes directory and configure the lexicon.path setting. There are two ways to do this:

1) By default, lexicon.path points to the location of the Jcseg jar, which is the lib directory of the Solr web application once it is deployed. So you can simply copy the lexicon directory from the downloaded archive into that lib directory.

2) Alternatively, set lexicon.path to a lexicon directory of your own choosing.

Questions:

You may ask why I don't use IK. After Solr 4.0 was released, the BaseTokenizerFactory interface was removed in favor of the standard Lucene Analyzer interface. As a result, the FF version of the IK segmenter also dropped the org.wltea.analyzer.solr.IKTokenizerFactory class, so with IK you cannot configure filter nodes. If you configure one anyway, Tomcat reports an error on startup. You can try this yourself to see the exact error; I won't go into it here.

3. Download pinyinTokenFilter and add the following configuration to schema.xml. (I made a small modification to this plugin because the original author's project currently has a bug. The changes are minor, and you can see them in the commits on my GitHub.)

<fieldType name="text_pinyin" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.lionsoul.jcseg.solr.JcsegTokenizerFactory" mode="complex"/>
    <filter class="me.dowen.solr.analyzers.PinyinTransformTokenFilterFactory" isOutChinese="true" firstChar="true" minTermLength="1"/>
    <!-- <filter class="me.dowen.solr.analyzers.PinyinTransformTokenFilterFactory" isOutChinese="true" firstChar="false" minTermLength="1"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.lionsoul.jcseg.solr.JcsegTokenizerFactory" mode="complex"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Note that this plugin supports both full pinyin and abbreviations, but when the abbreviation filter is configured together with the full-pinyin filter, it does not seem to make much difference. So I have configured only the abbreviation filter here (the full-pinyin filter is commented out), so that everyone can see the effect.

Here is what the parameters mean:

isOutChinese: whether to keep the original Chinese token in the output. Allowed values: true (default) / false.

firstChar: whether to output the abbreviated pinyin (true) or the full pinyin (false). The abbreviated form is made up of the first letter of the pinyin of each character in the original Chinese token. Allowed values: true (default) / false.

minTermLength: pinyin is output only for Chinese tokens whose length is greater than or equal to minTermLength. The default value is 2.
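To make the three parameters concrete, here is a toy Python model of how they interact. It is only a sketch of the parameter descriptions above, not the plugin's Java code: the function name `pinyin_terms` and the tiny `PINYIN` table are made up for illustration, and it ignores any further expansion the real filter may apply to its output.

```python
# Hypothetical lookup table: a real filter would use a complete pinyin dictionary.
PINYIN = {"猩": "xing", "球": "qiu", "崛": "jue", "起": "qi"}


def pinyin_terms(token, is_out_chinese=True, first_char=True, min_term_length=2):
    """Toy model of the three filter parameters for a single Chinese token."""
    terms = []
    if is_out_chinese:
        # isOutChinese=true keeps the original Chinese token in the output.
        terms.append(token)
    if len(token) >= min_term_length:
        # Only characters present in the toy table are supported here.
        syllables = [PINYIN[ch] for ch in token]
        if first_char:
            # firstChar=true: abbreviated form, first letter of each syllable.
            terms.append("".join(s[0] for s in syllables))
        else:
            # firstChar=false: full pinyin of every character.
            terms.append("".join(syllables))
    return terms


print(pinyin_terms("猩球崛起"))                    # ['猩球崛起', 'xqjq']
print(pinyin_terms("猩球崛起", first_char=False))  # ['猩球崛起', 'xingqiujueqi']
print(pinyin_terms("球", is_out_chinese=False))    # [] (shorter than min_term_length)
```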

Next, look at the analysis results in Solr:


As you can see, the abbreviation of "Monkey ball rise" (猩球崛起, pinyin xing qiu jue qi) is split into x, xq, xqj, and xqjq. Entering any of these four abbreviations brings up suggestions containing "Monkey ball rise".
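The four terms in the analysis result are the prefixes of the first-letter abbreviation. A minimal Python sketch of that expansion follows. It assumes the pinyin syllables are already known (xing, qiu, jue, qi for this phrase), and the helper name `abbreviation_prefixes` is hypothetical, not something taken from the plugin.

```python
def abbreviation_prefixes(syllables):
    """Return every prefix of the first-letter abbreviation of the syllables."""
    initials = "".join(s[0] for s in syllables)
    return [initials[:i] for i in range(1, len(initials) + 1)]


# Pinyin syllables of the example phrase, supplied by hand for illustration.
indexed_terms = abbreviation_prefixes(["xing", "qiu", "jue", "qi"])
print(indexed_terms)           # ['x', 'xq', 'xqj', 'xqjq']
print("xqj" in indexed_terms)  # True
print("qj" in indexed_terms)   # False
```

With each prefix stored as its own term, an input such as xqj can presumably be matched as a whole term, while a fragment from the middle such as qj would not match.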

Next, look at our query results:

As you can see, querying with xqj does find the "Monkey ball rise" entry. Here "Monkey ball rise" is treated as a single word: Jcseg supports custom lexicons, so I configured a simple lexicon of my own.
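A query like this can also be sent to Solr's select handler over HTTP. The Python snippet below only builds the request URL. The host, port, core name (`collection1`) and field name (`title_pinyin`) are assumptions for illustration and do not come from this setup, so replace them with your own values.

```python
from urllib.parse import urlencode

# Assumed values for illustration: adjust to your own Solr host, core and field.
SOLR_SELECT = "http://localhost:8080/solr/collection1/select"
FIELD = "title_pinyin"  # hypothetical field declared with the text_pinyin type


def build_query_url(abbreviation):
    """Build a select URL that searches FIELD for the given pinyin abbreviation."""
    params = {"q": f"{FIELD}:{abbreviation}", "wt": "json"}
    return f"{SOLR_SELECT}?{urlencode(params)}"


print(build_query_url("xqj"))
```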

If pinyinTokenFilter is not combined with a separate word segmenter, and you test it the way the plugin's README describes, you will see that the text you enter is split into single characters. The README uses:

<tokenizer class="solr.StandardTokenizerFactory"/>

pinyinTokenFilter still has some rough edges, so if you run into any problems while using it, feel free to raise them and I will follow up. Thank you.
