Full-text search engine Solr series: integrating the Chinese word segmenter IK Analyzer

Source: Internet
Author: User
Tags solr

IK Analyzer is a Chinese word segmenter based on string matching that combines dictionary lookup with grammar analysis. It supports user-defined extension dictionaries and offers both fine-grained and intelligent (smart) segmentation. For example, given the sentence:

张三说的确实在理

The result of intelligent segmentation is:

张三 |  说的 |  确实 |  在理

The result of fine-grained segmentation is:

张三 |  三 |  说的 |  的确 |  的 |  确实 |  实在 |  在理
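
The difference between the two modes can be illustrated with a toy dictionary matcher. This is only a sketch under stated assumptions: the dictionary below is a hypothetical eight-word list covering the example sentence, and forward maximum matching stands in for the smart mode. IK Analyzer's real smart mode resolves ambiguity with its own arbitration logic, which is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy dictionary-based segmenter; NOT IK Analyzer's actual algorithm. */
final class SegmentationSketch {

    // Hypothetical mini-dictionary that only covers the example sentence.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "张三", "三", "说的", "的确", "的", "确实", "实在", "在理"));

    // Length of the longest dictionary entry, in characters.
    static final int MAX_WORD_LEN = 2;

    /** Fine-grained: emit every dictionary word found at every start position. */
    static List<String> fineGrained(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int maxEnd = Math.min(text.length(), start + MAX_WORD_LEN);
            // Try longer candidates first so "的确" is listed before "的".
            for (int end = maxEnd; end > start; end--) {
                String candidate = text.substring(start, end);
                if (DICT.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    /** Coarse: forward maximum matching, taking the longest word at each position. */
    static List<String> forwardMaxMatch(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + MAX_WORD_LEN);
            // Shrink the window until it is a dictionary word or a single character.
            while (end > start + 1 && !DICT.contains(text.substring(start, end))) {
                end--;
            }
            out.add(text.substring(start, end));
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "张三说的确实在理";
        System.out.println(String.join(" | ", forwardMaxMatch(text)));
        System.out.println(String.join(" | ", fineGrained(text)));
    }
}
```

With this particular dictionary, the first line printed is the four-token split 张三 | 说的 | 确实 | 在理 and the second is the eight-token fine-grained list shown above. Fine-grained output produces more index terms and therefore more ways to match, while the coarse split gives fewer, longer tokens.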

Integrating IK Analyzer is much simpler than integrating mmseg4j. Download the package, extract IKAnalyzer2012FF_u1.jar into the directory E:\solr-4.8.0\example\solr-webapp\webapp\WEB-INF\lib, then edit the configuration file schema.xml and add:

<field name="content" type="text_ik" indexed="true" stored="true"/>
<fieldType name="text_ik" class="solr.TextField">
    <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

Queries use IK's own maximum-word-length segmentation, while indexing uses its fine-grained segmentation.

At this point the configuration is complete. Restart the service with java -jar start.jar. To see how IK Analyzer segments text, open the Solr admin interface and click the Analysis page on the left.
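
The same check can also be made over HTTP instead of through the admin page. The sketch below assumes the defaults of the Solr 4.x example setup, which the article does not state: port 8983, a core named collection1, and a field analysis handler registered at /analysis/field with the parameters analysis.fieldtype and analysis.fieldvalue. Adjust these if your solrconfig.xml differs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

/** Builds and sends a field-analysis request; host, core and handler path are assumptions. */
final class FieldAnalysisRequest {

    /** Returns the analysis URL for the given core URL, field type and sample text. */
    static String buildUrl(String coreUrl, String fieldType, String text) {
        try {
            return coreUrl + "/analysis/field"
                    + "?analysis.fieldtype=" + URLEncoder.encode(fieldType, "UTF-8")
                    + "&analysis.fieldvalue=" + URLEncoder.encode(text, "UTF-8")
                    + "&wt=json&indent=true";
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults of the Solr 4.x example: port 8983, core "collection1".
        String url = buildUrl("http://localhost:8983/solr/collection1",
                "text_ik", "张三说的确实在理");
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON listing the tokens per analysis stage
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Running main requires a started Solr instance; buildUrl alone can be used to produce a URL to paste into a browser.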

By default, the analyzer performs fine-grained segmentation. IK Analyzer also reads an IKAnalyzer.cfg.xml file that lets you add your own extension dictionaries and stop-word dictionaries (filter dictionaries). Place IKAnalyzer.cfg.xml in the classes directory and point it at your own dictionary, for example mydict.dic:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">/mydict.dic; /com/mycompany/dic/mydict2.dic;</entry>

    <!-- Users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">/ext_stopword.dic</entry>
</properties>
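
A dictionary file such as mydict.dic can be produced with a few lines of Java. This sketch assumes the format described in IK Analyzer's bundled documentation: UTF-8 text without a byte-order mark, one word per line. CRLF line endings are used here on the same assumption. The class name and the sample words are hypothetical, so verify the format against the release you use.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Writes an extension dictionary: one word per line, UTF-8 without BOM (assumed format). */
final class DictionaryWriter {

    /** Encodes the words as dictionary file content, skipping blank entries. */
    static byte[] encode(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                sb.append(trimmed).append("\r\n");
            }
        }
        // String.getBytes(UTF_8) never prepends a byte-order mark.
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical entries; replace them with your own vocabulary.
        List<String> words = Arrays.asList("张三", "在理");
        Files.write(Paths.get("mydict.dic"), encode(words));
    }
}
```

The resulting file still has to be copied to the location named in ext_dict, which in the configuration above is the root of the classpath.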

The fieldType configuration above is in fact problematic. With the latest IK release, IK Analyzer 2012FF_hf1.zip, configuring fine-grained segmentation for indexing and maximum-word-length (smart) segmentation for querying does not actually take effect.

According to the author, Linliangyi, this was fixed in version 2012FF_hf1, but in testing it still did not work; see this post for details.

Workaround: re-implement IKAnalyzerSolrFactory

package org.wltea.analyzer.lucene;

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
// Lucene 4.8 and earlier:
// import org.apache.lucene.util.AttributeSource.AttributeFactory;
// Lucene 4.9:
import org.apache.lucene.util.AttributeFactory;

/** Tokenizer factory that lets schema.xml choose IK's segmentation mode per analyzer. */
public class IKAnalyzerSolrFactory extends TokenizerFactory {

    // true = smart segmentation, false = fine-grained segmentation
    private boolean useSmart;

    public boolean useSmart() {
        return useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public IKAnalyzerSolrFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        // Null-safe: a missing useSmart attribute falls back to fine-grained mode
        // instead of throwing a NullPointerException.
        this.setUseSmart("true".equals(args.get("useSmart")));
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        return new IKTokenizer(input, this.useSmart);
    }
}

After recompiling, update the jar file, then update schema.xml:

<fieldType name="text_ik" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="false"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="true"/>
    </analyzer>
</fieldType>
