Solr 6.5: Configuring the IKAnalyzer Chinese Word Segmenter and the pinyinAnalyzer Pinyin Segmenter (2)
The previous article, Installing and Configuring Solr 6.5 on CentOS 6 (1), covered installing Solr 6.5. This article explains how to create a Solr core and configure IKAnalyzer Chinese word segmentation and pinyin search.
1. Create a Core:
1. First, create a mycore directory under solrhome (the solrhome path was configured in Solr's web.xml; see Installing and Configuring Solr 6.5 on CentOS 6 (1)):
[root@localhost down]# mkdir /down/apache-tomcat-8.5.12/solrhome/mycore
[root@localhost down]# cd /down/apache-tomcat-8.5.12/solrhome/mycore
[root@localhost mycore]#
2. Copy all files under solr-6.5.0/example/example-DIH/solr/solr to the /down/apache-tomcat-8.5.12/solrhome/mycore directory:
[root@localhost mycore]# cp -R /down/solr-6.5.0/example/example-DIH/solr/solr/* ./
[root@localhost mycore]# ls
conf  core.properties
[root@localhost mycore]#
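The copied core.properties file controls the core's name and paths. If you want the core to appear in the admin UI as mycore, you can set the name explicitly; this is a minimal sketch, and the file copied from example-DIH may contain different values:

```
# core.properties - core discovery file for this core
# 'name' is optional; if omitted, Solr uses the directory name (mycore)
name=mycore
```

Solr discovers any directory containing a core.properties file under solrhome as a core at startup.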
3. Restart Tomcat:
[root@localhost down]# /down/apache-tomcat-8.5.12/bin/shutdown.sh
[root@localhost down]# /down/apache-tomcat-8.5.12/bin/startup.sh
4. Enter http://localhost:8080/solr/index.html in the browser to open the Solr admin interface.
2. Configure Solr's built-in Chinese word segmentation:
1. To enable Solr 6.5's built-in Chinese word segmentation, copy solr-6.5.0/contrib/analysis-extras/lucene-libs/lucene-analyzers-smartcn-6.5.0.jar to the apache-tomcat-8.5.12/webapps/solr/WEB-INF/lib/ directory:
[root@localhost down]# cp /down/solr-6.5.0/contrib/analysis-extras/lucene-libs/lucene-analyzers-smartcn-6.5.0.jar /down/apache-tomcat-8.5.12/webapps/solr/WEB-INF/lib/
2. Add Chinese word segmentation support to the core by editing the managed-schema file under conf in mycore:
[root@localhost conf]# cd /down/apache-tomcat-8.5.12/solrhome/mycore/conf
[root@localhost conf]# vi managed-schema
Add the following fieldType definition:
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
Restart Tomcat and enter http://localhost:8080/solr/index.html#/mycore/analysis in the browser.
Enter some Chinese text in the Field Value (Index) box, select text_smartcn under Analyse Fieldname/FieldType, and view the segmentation result.
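Defining a fieldType only registers the analyzer; to index documents with it, a field must reference the type. A minimal sketch for managed-schema, assuming a hypothetical field name content_smartcn:

```
<!-- hypothetical field using the text_smartcn type defined above -->
<field name="content_smartcn" type="text_smartcn" indexed="true" stored="true"/>
```

Any document field named content_smartcn will then be tokenized by the smartcn analyzer at index and query time.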
3. Configure the Chinese word segmentation of IKAnalyzer:
1. Download IKAnalyzer (a build that works with Solr 6.5).
After decompression, there are four files:
[root@localhost ikanalyzer-solr5]# ls
ext.dic  IKAnalyzer.cfg.xml  ik-analyzer-solr5-5.x.jar  stopword.dic
ext.dic is the extended dictionary, stopword.dic is the stop-word dictionary, IKAnalyzer.cfg.xml is the configuration file, and ik-analyzer-solr5-5.x.jar is the word segmentation jar package.
2. Copy the three files IKAnalyzer.cfg.xml, ext.dic, and stopword.dic to the webapps/solr/WEB-INF/classes directory, then edit IKAnalyzer.cfg.xml:
[root@localhost ikanalyzer-solr5]# cp ext.dic IKAnalyzer.cfg.xml stopword.dic /down/apache-tomcat-8.5.12/webapps/solr/WEB-INF/classes/
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer Extension Configuration</comment>
  <!-- Configure your own extended dictionary here -->
  <entry key="ext_dict">ext.dic;</entry>
  <!-- Configure your own extended stop-word dictionary here -->
  <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
3. Add your own custom terms to ext.dic.
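ext.dic is a plain-text file with one term per line, saved as UTF-8. A hypothetical example adding two domain terms so IK treats them as single tokens:

```
云计算
区块链
```

Terms listed here will no longer be split into their component characters or sub-words during analysis.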
4. Copy ik-analyzer-solr5-5.x.jar to the /down/apache-tomcat-8.5.12/webapps/solr/WEB-INF/lib/ directory:
[root@localhost down]# cp /down/ikanalyzer-solr5/ik-analyzer-solr5-5.x.jar /down/apache-tomcat-8.5.12/webapps/solr/WEB-INF/lib/
5. Add the following configuration to solrhome/mycore/conf/managed-schema, before the closing </schema> tag:
<!-- IK Chinese word segmentation -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
Note: Remember to save stopword.dic and ext.dic as UTF-8 without BOM.
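As with text_smartcn, the new type only takes effect on fields that reference it. A sketch for managed-schema, assuming a hypothetical field name content_ik:

```
<!-- hypothetical field analyzed by IKAnalyzer at index and query time -->
<field name="content_ik" type="text_ik" indexed="true" stored="true"/>
```

You can then compare text_smartcn and text_ik side by side on the same input in the Analysis page.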
Restart Tomcat and check the word segmentation result on the Analysis page.
4. Configure pinyin search:
1. Preparation: you need two jar packages, pinyin4j-2.5.0.jar and pinyinAnalyzer.jar.
2. Copy pinyin4j-2.5.0.jar and pinyinAnalyzer.jar to the /down/apache-tomcat-8.5.12/webapps/solr/WEB-INF/lib/ directory:
[root@localhost down]# cp pinyin4j-2.5.0.jar pinyinAnalyzer4.3.1.jar /down/apache-tomcat-8.5.12/webapps/solr/WEB-INF/lib/
3. Add the following configuration to solrhome/mycore/conf/managed-schema, before the closing </schema> tag:
<fieldType name="text_pinyin" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
    <filter class="com.shentong.search.analyzers.PinyinTransformTokenFilterFactory" minTermLenght="2"/>
    <filter class="com.shentong.search.analyzers.PinyinNGramTokenFilterFactory" minGram="1" maxGram="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
    <filter class="com.shentong.search.analyzers.PinyinTransformTokenFilterFactory" minTermLenght="2"/>
    <filter class="com.shentong.search.analyzers.PinyinNGramTokenFilterFactory" minGram="1" maxGram="20"/>
  </analyzer>
</fieldType>
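To make pinyin search work against an existing Chinese field, text_pinyin is typically applied to a separate field that is populated automatically with copyField. A sketch with hypothetical field names:

```
<!-- hypothetical source field holding the original Chinese text -->
<field name="content" type="text_ik" indexed="true" stored="true"/>
<!-- pinyin shadow field; stored="false" since queries return the source field -->
<field name="content_pinyin" type="text_pinyin" indexed="true" stored="false"/>
<!-- copy the Chinese text into the pinyin field at index time -->
<copyField source="content" dest="content_pinyin"/>
```

Queries can then match content_pinyin with pinyin input (full spellings or initials, thanks to the n-gram filter) while displaying the original Chinese from content.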
Restart tomcat to view the pinyin search results.
Here we used Solr's built-in Chinese word segmentation (smartcn) together with pinyin4j.
Related files:
ikanalyzer-solr5.zip
pinyin.zip