Integrating the IK analyzer (word breaker) with Solr 4.7: synonyms, custom dictionary words, stop words

Last Update: 2016-06-04 Source: Internet

Author: User

Tags: solr


Suppose the IK word breaker is configured like this:

<fieldType name="text_ik" class="solr.TextField">
    <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

In my tests the text is segmented correctly, but synonyms and the extension dictionary are not applied.

Various sources online say that the IK word breaker has a bug and that you have to modify the jar file yourself. So I got the IK source code. The only analyzer class in it is IKAnalyzer, whose source code is as follows:

package org.wltea.analyzer.lucene;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

/**
 * IK word breaker, Lucene Analyzer interface implementation
 * Compatible with Lucene 4.0
 */
public final class IKAnalyzer extends Analyzer {

    private boolean useSmart;

    public boolean useSmart() {
        return useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    /**
     * IK word breaker, Lucene Analyzer interface implementation class
     *
     * Defaults to the fine-grained segmentation algorithm
     */
    public IKAnalyzer() {
        this(false);
    }

    /**
     * IK word breaker, Lucene Analyzer interface implementation class
     *
     * @param useSmart when true, the word breaker uses smart segmentation
     */
    public IKAnalyzer(boolean useSmart) {
        super();
        this.useSmart = useSmart;
    }

    /**
     * Overrides the Analyzer method that builds the tokenization components
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName, final Reader in) {
        Tokenizer _IKTokenizer = new IKTokenizer(in, this.useSmart());
        return new TokenStreamComponents(_IKTokenizer);
    }
}

I added an IKAnalyzerSolrFactory class; the code is as follows:

package org.wltea.analyzer.lucene;

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

/**
 * Tokenizer factory that lets schema.xml use IKTokenizer in a
 * <tokenizer> element, so that filters can be chained after it.
 */
public class IKAnalyzerSolrFactory extends TokenizerFactory {

    private boolean useSmart;

    public boolean useSmart() {
        return useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public IKAnalyzerSolrFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        // "true".equals(...) avoids a NullPointerException when the
        // useSmart attribute is missing from the <tokenizer> element.
        this.setUseSmart("true".equals(args.get("useSmart")));
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        Tokenizer _IKTokenizer = new IKTokenizer(input, this.useSmart);
        return _IKTokenizer;
    }
}
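The factory constructor reads useSmart from the attribute map that Solr builds from the <tokenizer> element. Map.get returns null when the attribute is absent, so the read has to tolerate a missing value. The following is a minimal standalone sketch of such a read. UseSmartArg and parse are hypothetical names used only for illustration and are not part of IK, Lucene, or Solr. Treating a missing attribute as false (fine-grained mode, matching the default IKAnalyzer constructor) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of IK or Solr: reads the useSmart attribute defensively. */
final class UseSmartArg {

    private UseSmartArg() {}

    /**
     * Returns true only when the attribute is present and equals "true"
     * (ignoring case and surrounding spaces). A missing attribute yields
     * false, which is assumed here to mean fine-grained segmentation.
     */
    static boolean parse(Map<String, String> args) {
        String raw = args.get("useSmart");
        return raw != null && Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        System.out.println(parse(attrs)); // attribute absent: false

        attrs.put("useSmart", "true");
        System.out.println(parse(attrs)); // attribute set: true
    }
}
```

With a read like this, a <tokenizer> element that omits useSmart falls back to the default instead of failing while the schema is loaded.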

This allows you to configure IKAnalyzerSolrFactory as a tokenizer in the configuration file.

Here are the specific configuration descriptions:

1. Modify the IK jar file to add IKAnalyzerSolrFactory (if you cannot make the change yourself, contact me on QQ: 632132852).

2. Modify the solrconfig.xml file to add

<lib dir="/contrib/analysis-extras/lib" regex=".*\.jar"/>

3. Modify the schema.xml file to add

<!-- IK word breaker -->
<fieldType name="text_ik" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="true"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKAnalyzerSolrFactory" useSmart="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
</fieldType>

4. In the classes directory under Solr's WEB-INF (create the directory if it does not exist), add the files from the IK archive.

5. Configure the custom dictionary in ext.dic: words that should not be split go here. Synonyms go in synonyms.txt, one group per line, in the format: notification, announcement

Note that every change to the dictionary or to the synonyms requires a restart of the service.
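To make the synonyms.txt format more concrete: with expand="true", a comma-separated line declares its terms as equivalent, so each term is expanded to every term on the line. The sketch below imitates that behaviour for a single line. It is an illustration only, not Solr's parser: SynonymLineDemo and expand are hypothetical names, and explicit "=>" mappings, escaping, comments, and ignoreCase are all left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: imitates how one comma-separated synonyms line expands when expand="true". */
final class SynonymLineDemo {

    private SynonymLineDemo() {}

    /** Maps every term on the line to the full list of terms on that line. */
    static Map<String, List<String>> expand(String line) {
        List<String> terms = new ArrayList<>();
        for (String part : line.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        // LinkedHashMap keeps the terms in the order they appear on the line.
        Map<String, List<String>> expanded = new LinkedHashMap<>();
        for (String term : terms) {
            expanded.put(term, new ArrayList<>(terms));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // prints {notification=[notification, announcement], announcement=[notification, announcement]}
        System.out.println(expand("notification, announcement"));
    }
}
```

In the schema shown earlier the synonym filter sits only in the query analyzer, so this expansion happens at query time: a search for either term also matches documents that contain the other.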

Original address: http://www.cnblogs.com/wudi521/p/5558880.html
