Configuring a Chinese word breaker for SOLR

Source: Internet
Author: User
Tags: solr, tomcat, advantage
Solr's Chinese word breakers
Chinese word segmentation is not enabled in Solr by default, so a Chinese word breaker has to be configured. The word breakers currently available include smartcn, IK, Jeasy, and Paoding. They fall into two main groups. The first is based on the Hidden Markov Model (HMM) algorithm of ICTCLAS from the Chinese Academy of Sciences, for example smartcn and ictclas4j; its advantage is high segmentation accuracy, and its disadvantage is that users cannot supply a custom dictionary. The second is based on maximum matching, for example IK, Jeasy, and Paoding; its advantage is that the dictionary can be customized and new words added, and its disadvantage is that it produces more junk tokens. Each has its advantages and disadvantages.
Installation methods for word breakers of both kinds are given below; either one will do. The first is recommended, because smartcn already ships in contrib/analysis-extras/lucene-libs/ of the Solr release package as lucene-analyzers-smartcn-4.2.0.jar. First, add a line to solrconfig.xml that references the analysis-extras libraries, so that the word breaker we add is loaded by Solr.
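The solrconfig.xml reference mentioned above might look like the sketch below. The `<lib>` directive with its `dir` and `regex` attributes is standard Solr configuration, but the relative path is an assumption: it is resolved against the core's instance directory, and here the core is assumed to sit directly under the Solr home folder next to contrib.

```xml
<!-- Goes inside the <config> element of solrconfig.xml. -->
<!-- Assumed layout: <solr home>/<core>/conf/solrconfig.xml and <solr home>/contrib/analysis-extras/lib -->
<lib dir="../contrib/analysis-extras/lib" regex=".*\.jar" />
```

If the jar is left in the release package's own contrib/analysis-extras/lucene-libs folder instead, point `dir` at that folder.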
Installing the smartcn word breaker
1. First, copy contrib/analysis-extras/lucene-libs/lucene-analyzers-smartcn-4.2.0.jar from the release package to \solr\contrib\analysis-extras\lib under the Solr home folder.
2. Open /ims_advertiesr_core/conf/schema.xml and add a text field type: find the section where the fieldType elements are defined and add the following block below them.
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
      <analyzer type="index">
        <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
        <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
        <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
      </analyzer>
</fieldType>
To search on a field, also declare that field in the field section further down in schema.xml, with text_smartcn as its type; that is what applies Chinese word segmentation to it. For example, to enable Chinese search on the text field, configure it as follows:
<field name="text" type="text_smartcn" indexed="true" stored="false" multiValued="true"/>
Installing the IK word breaker
IKAnalyzer2012FF_u1.jar       // Word breaker jar package
IKAnalyzer.cfg.xml            // Word breaker configuration file
stopword.dic                  // Word breaker stop-word dictionary; entries can be added to customize it
Add IKAnalyzer2012FF_u1.jar to C:\apache-tomcat-7.0.57\webapps\solr\WEB-INF\lib. Create a new classes folder under C:\apache-tomcat-7.0.57\webapps\solr\WEB-INF and copy IKAnalyzer.cfg.xml, keyword.dic, and stopword.dic into it. Then configure schema.xml in the same way as for smartcn.
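For reference, a minimal IKAnalyzer.cfg.xml might look like the sketch below. It uses the Java properties XML format with IK's ext_dict and ext_stopwords keys. Pointing them at keyword.dic and stopword.dic is an assumption based on the files named above, so adjust the values to your own dictionaries.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Assumed: custom word dictionary, looked up on the classpath (WEB-INF/classes) -->
    <entry key="ext_dict">keyword.dic;</entry>
    <!-- Assumed: custom stop-word dictionary, same location -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```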
<!-- IK word breaker configuration. positionIncrementGap="100" follows Solr's example schema; adjust if needed. -->
<fieldType name="text_ik" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" isMaxWordLength="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>

    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" isMaxWordLength="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
Installing the mmseg4j word breaker

mmseg4j-solr-2.3.0 supports Solr 5.3.
1. Copy the two jar packages into the lib folder of the Solr web application in Tomcat.

2. Configure schema.xml in the Solr home folder.
Find the following tag:

<fieldType name="currency" class="solr.CurrencyField" precisionStep="8" defaultCurrency="USD" currencyConfig="currency.xml"/>

Then add these field types next to it:

<!-- positionIncrementGap="100" follows Solr's example schema; adjust if needed. -->
<fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
  </analyzer>
</fieldType>

<fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word"/>
  </analyzer>
</fieldType>

<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="n:/custom/path/to/my_dic"/>
  </analyzer>
</fieldType>
Restart Tomcat and test the word segmentation.

Define a field in schema.xml:

<field name="content_test" type="textMaxWord" indexed="true" stored="true" multiValued="true"/>

Then test the field, for example on the Analysis screen of the Solr admin UI.
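One way to get test data into the field is to post a small document in Solr's XML update format to the core's update handler. The sketch below is hypothetical: the id field assumes the schema's uniqueKey is named id, the sample text is arbitrary, and the content field name must match the field defined in schema.xml.

```xml
<!-- Hypothetical test document for the mmseg4j field. -->
<add>
  <doc>
    <!-- Assumed: the schema's uniqueKey field is "id" -->
    <field name="id">mmseg4j-test-1</field>
    <!-- Must match the field name declared in schema.xml -->
    <field name="content_test">研究生命起源</field>
  </doc>
</add>
```

After a commit, querying the field for one of the expected words shows whether the text was segmented as intended.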
