Please credit the source when reprinting: http://blog.csdn.net/weijonathan/article/details/40504029
Today's post is about word-segmentation-based query.
Take a look at the picture below.
Many search systems involve this feature: Chinese abbreviation search. When you type the pinyin initials of a Chinese phrase, the system suggests the matching Chinese phrases.
There are plenty of articles about this online, but few of them really explain how to implement it. Building on earlier work, I have organized here my own approach and its results.
First, the plugins involved:

Jcseg
- Git repository: http://git.oschina.net/lionsoul/jcseg
- Project site: https://code.google.com/p/jcseg/

pinyinTokenFilter
- Plugin address: https://github.com/Jonathan-Wei/pinyinTokenFilter

Jcseg is a Chinese word segmenter, and pinyinTokenFilter is a pinyin filter.
1. Download Jcseg, unzip it, and copy the jar packages in the jcseg\output directory into the Solr installation directory.
2. Put the Jcseg lexicon (word base) into the apache-tomcat-7.0.53\webapps\solr\WEB-INF\classes directory and configure the lexicon.path:
   1) By default, lexicon.path points to the lib directory of the project Jcseg is deployed in (here, the lib directory of Solr), so you can simply copy the lexicon directory from the downloaded archive into that lib directory.
   2) Alternatively, set lexicon.path to any lexicon directory you specify.
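As a sketch of step 2, assuming Jcseg's usual jcseg.properties layout (the property names and the absolute path below are my assumptions for this Tomcat deployment, not taken from the original post):

```properties
# jcseg.properties, placed on the classpath (e.g. WEB-INF/classes).
# Point lexicon.path at the directory holding the lex-*.lex files;
# this path is only an example -- adjust it to your own deployment.
lexicon.path = /usr/local/apache-tomcat-7.0.53/webapps/solr/WEB-INF/classes/lexicon
```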
Question:
You may ask why I don't use IK. After Solr 4.0 was released, the BaseTokenizerFactory interface was removed in favor of the standard Lucene Analyzer interface. Accordingly, the IK analyzer (FF version) dropped the org.wltea.analyzer.solr.IKTokenizerFactory class, so IK can no longer be configured as a tokenizer with filter nodes; once you configure it, Tomcat throws an error on startup. You can try the specific errors yourself; I won't go into them here.
3. Download pinyinTokenFilter. (I made a small modification to this plugin: the author's current code has a bug. The changes are minor; you can see them in the commits on my GitHub.) Then add the following configuration to schema.xml:
<fieldType name="text_pinyin" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.lionsoul.jcseg.solr.JcsegTokenizerFactory" mode="complex"/>
    <filter class="me.dowen.solr.analyzers.PinyinTransformTokenFilterFactory" isOutChinese="true" firstChar="true" minTermLength="1"/>
    <!-- <filter class="me.dowen.solr.analyzers.PinyinTransformTokenFilterFactory" isOutChinese="true" firstChar="false" minTermLength="1"/> -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.lionsoul.jcseg.solr.JcsegTokenizerFactory" mode="complex"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
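To actually use this field type, you still need a field of that type and, typically, a copyField feeding it. A minimal sketch, where `content` and `content_pinyin` are hypothetical names of my own choosing, not from the original post:

```xml
<!-- Hypothetical field names: adjust to your schema -->
<field name="content" type="text_general" indexed="true" stored="true"/>
<!-- Pinyin side-field: indexed for abbreviation search, not stored -->
<field name="content_pinyin" type="text_pinyin" indexed="true" stored="false"/>
<!-- Copy the original Chinese text into the pinyin-analyzed field -->
<copyField source="content" dest="content_pinyin"/>
```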
Note that this plugin supports both full pinyin spelling and initials-only abbreviations, but configuring both at the same time does not seem to make much difference in practice. So I have only configured the abbreviation filter here, to let everyone see the effect.
A few notes on the parameters:
- isOutChinese: whether to keep the original Chinese token in the output. Optional values: true (default) / false.
- firstChar: whether to output the abbreviated spelling instead of the full pinyin. The abbreviated output is composed of the first letter of each character's pinyin in the original Chinese token. Optional values: true (default) / false.
- minTermLength: only Chinese tokens whose length is greater than or equal to minTermLength produce pinyin output. The default value is 2.
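Given those descriptions, the filter's behavior on a two-character token can be sketched as follows. This is my own worked example (the token 中国, pinyin zhong guo), not output quoted from the plugin docs:

```
input token: 中国  (pinyin: zhong guo)

isOutChinese="true",  firstChar="true"   →  中国, zg        (initials only)
isOutChinese="true",  firstChar="false"  →  中国, zhongguo  (full pinyin)
isOutChinese="false", firstChar="true"   →  zg              (original token dropped)
minTermLength="3"                        →  中国             (length 2 < 3, no pinyin emitted)
```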
Next, look at the analysis results in Solr:
Here everyone can see that the abbreviation of "猩球崛起" (Rise of the Planet of the Apes) is split into x, xq, xqj, xqjq; when you type any of these four abbreviations, the system will suggest content containing "猩球崛起".
Next, look at our query results:
You can see that typing xqj finds the "猩球崛起" content. "猩球崛起" is treated as a single word here: Jcseg supports custom lexicons, so I configured a simple lexicon of my own.
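As an illustration, a query against such a field might look like the following URL. The core name (collection1) and field name (content_pinyin) are assumptions for this sketch, not from the original post:

```
http://localhost:8983/solr/collection1/select?q=content_pinyin:xqj&wt=json&indent=true
```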
If pinyinTokenFilter is used without another word segmenter, for example tested as described in the plugin's README, you will see that your input is split into individual characters. The README uses:

<tokenizer class="solr.StandardTokenizerFactory"/>
pinyinTokenFilter still has some rough edges, so if you run into any problems while using it, please report them and I will follow up. Thank you.
[Apache Solr Series] Jcseg and pinyinTokenFilter for Chinese Abbreviation Search