Full-Text Search Engine Solr Series: Integrating the Chinese Word Segmenter IK Analyzer

Last Update:2016-01-11 Source: Internet

Author: User

Tags solr


IK Analyzer is a Chinese word segmenter based on string matching that combines dictionary lookup with grammar analysis. It supports user-defined dictionary extensions and offers two modes: fine-grained segmentation and intelligent (smart) segmentation. For example, take the sentence:

张三说的确实在理

The result of intelligent segmentation is:

张三 | 说的 | 确实 | 在理

The result of finest-grained segmentation is:

张三 |  三 |  说的 |  的确 |  的 |  确实 |  实在 |  在理
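To make the difference between the two modes concrete, here is a minimal Java sketch of dictionary-based string matching. It is not IK Analyzer's actual implementation: the class name ToySegmenter and the eight-word dictionary are hypothetical, and plain forward maximum matching stands in for IK's smart mode, whose real disambiguation logic is more involved. Fine-grained mode is approximated by emitting every dictionary word found at every position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy dictionary segmenter; illustrative only, not IK Analyzer's implementation. */
final class ToySegmenter {
    // Hypothetical mini-dictionary that covers only the example sentence.
    private static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "张三", "三", "说的", "的确", "的", "确实", "实在", "在理"));

    // Longest dictionary entry, used to bound the lookup window.
    private static final int MAX_WORD_LEN =
            DICT.stream().mapToInt(String::length).max().orElse(1);

    /** Fine-grained: emit every dictionary word at every start position, longest first. */
    static List<String> fineGrained(String text) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = Math.min(MAX_WORD_LEN, text.length() - start); len >= 1; len--) {
                String candidate = text.substring(start, start + len);
                if (DICT.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    /** Coarse: forward maximum matching, taking the longest word and skipping past it. */
    static List<String> forwardMaxMatch(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int matched = 1; // fall back to a single character when nothing matches
            for (int len = Math.min(MAX_WORD_LEN, text.length() - start); len >= 1; len--) {
                if (DICT.contains(text.substring(start, start + len))) {
                    matched = len;
                    break;
                }
            }
            out.add(text.substring(start, start + matched));
            start += matched;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "张三说的确实在理";
        System.out.println(String.join(" | ", fineGrained(text)));
        System.out.println(String.join(" | ", forwardMaxMatch(text)));
    }
}
```

With this toy dictionary, the first line printed reproduces the fine-grained result above and the second reproduces the four-word intelligent result. On other sentences the sketch and IK Analyzer may well disagree.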

Integrating IK Analyzer is much simpler than integrating mmseg4j. Download the package and extract IKAnalyzer2012FF_u1.jar into the directory E:\solr-4.8.0\example\solr-webapp\webapp\WEB-INF\lib, then edit the configuration file schema.xml and add the following:

<field name="content" type="text_ik" indexed="true" stored="true"/>

<fieldType name="text_ik" class="solr.TextField">
    <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

Here, queries use IK's own maximum-word-length segmentation, while indexing uses its fine-grained segmentation.

At this point the configuration is complete. Restart the service with java -jar start.jar. To see how IK Analyzer segments text, open the Solr admin interface and click the Analysis page on the left.
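The same check can also be scripted instead of clicking through the admin page. The sketch below only builds the request URL. It assumes that the stock example core collection1 is running on port 8983 and that its solrconfig.xml registers the field analysis handler at /analysis/field (as the Solr 4.x example configuration does); the class name AnalysisUrl is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a request URL for Solr's field analysis handler (a sketch, not an official client). */
final class AnalysisUrl {
    /**
     * @param coreUrl   base URL of the core (assumed: http://localhost:8983/solr/collection1)
     * @param fieldType field type to analyze with, e.g. text_ik
     * @param text      text to run through the index-time analyzer
     */
    static String build(String coreUrl, String fieldType, String text) {
        try {
            return coreUrl + "/analysis/field"
                    + "?analysis.fieldtype=" + URLEncoder.encode(fieldType, "UTF-8")
                    + "&analysis.fieldvalue=" + URLEncoder.encode(text, "UTF-8")
                    + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr/collection1",
                "text_ik", "张三说的确实在理"));
    }
}
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the token stream produced by the text_ik index analyzer, provided the handler is enabled in your configuration.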

By default, the analyzer performs finest-grained segmentation. IK Analyzer also reads an IKAnalyzer.cfg.xml file that lets you add your own extension dictionary and stopword (filter) dictionary. Just put IKAnalyzer.cfg.xml in the classes directory and point it at your own dictionary, for example mydict.dic:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Extension dictionaries, separated by semicolons -->
    <entry key="ext_dict">/mydict.dic; /com/mycompany/dic/mydict2.dic;</entry>
    <!-- Extension stopword dictionary -->
    <entry key="ext_stopwords">/ext_stopword.dic</entry>
</properties>
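The file above uses the standard Java properties XML format, so it can be read with java.util.Properties.loadFromXML. The following sketch shows how the semicolon-separated ext_dict value could be split into individual paths. It is an illustration only: the class name IkConfigReader and its method are hypothetical, and IK Analyzer's own configuration loader may differ in detail.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Reads path lists from a properties XML document (a sketch, not IK's own loader). */
final class IkConfigReader {
    // Same entries as the configuration file above, without the comment element.
    static final String SAMPLE_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <entry key=\"ext_dict\">/mydict.dic; /com/mycompany/dic/mydict2.dic;</entry>\n"
            + "  <entry key=\"ext_stopwords\">/ext_stopword.dic</entry>\n"
            + "</properties>\n";

    /** Returns the trimmed, non-empty paths stored under the given key. */
    static List<String> readPaths(String xml, String key) {
        Properties props = new Properties();
        try {
            // loadFromXML expects the properties.dtd DOCTYPE shown in the file.
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> paths = new ArrayList<>();
        String raw = props.getProperty(key);
        if (raw != null) {
            for (String part : raw.split(";")) {
                String path = part.trim();
                if (!path.isEmpty()) {
                    paths.add(path); // skips the empty piece left by a trailing ';'
                }
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(readPaths(SAMPLE_XML, "ext_dict"));
        System.out.println(readPaths(SAMPLE_XML, "ext_stopwords"));
    }
}
```

Note that the trailing semicolon and the space after the first semicolon in ext_dict are harmless under this parsing, which is why the value can be written loosely in the file.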

However, the fieldType configuration shown earlier is actually problematic. With the latest IK release, IK Analyzer 2012FF_hf1.zip, configuring finest-grained segmentation for indexing and maximum-word-length (intelligent) segmentation for queries does not actually take effect.

According to the author, linliangyi, this was fixed in version 2012FF_hf1, but in my tests it still does not work; see this post for details.

Workaround: re-implement IKAnalyzerSolrFactory

package org.wltea.analyzer.lucene;

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
// Lucene 4.8 and earlier:
// import org.apache.lucene.util.AttributeSource.AttributeFactory;
// Lucene 4.9:
import org.apache.lucene.util.AttributeFactory;

public class IKAnalyzerSolrFactory extends TokenizerFactory {

    // true = intelligent (smart) segmentation, false = finest-grained segmentation
    private boolean useSmart;

    public boolean useSmart() {
        return useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public IKAnalyzerSolrFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        // Null-safe: a missing useSmart attribute falls back to fine-grained mode.
        this.setUseSmart("true".equals(args.get("useSmart")));
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        return new IKTokenizer(input, this.useSmart);
    }
}

After recompiling, replace the JAR file with the new build, then update schema.xml:

<fieldType name="text_ik" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="false"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="true"/>
    </analyzer>
</fieldType>


