Solr Chinese tokenizer configuration explained (IKAnalyzer and mmseg4j)

Last Update:2018-08-01 Source: Internet

Author: User

Tags solr


1 IKAnalyzer tokenizer configuration.

1.1 Copy IKAnalyzer2012_u6\IKAnalyzer2012_u6.jar to the C:\apache-tomcat-6.0.32\webapps\solr\WEB-INF\lib folder.

1.2 Create a new classes folder under C:\apache-tomcat-6.0.32\webapps\solr\WEB-INF, copy IKAnalyzer2012_u6\IKAnalyzer.cfg.xml and IKAnalyzer2012_u6\stopword.dic into it, then edit IKAnalyzer.cfg.xml to add the dictionary entries.

Also create a new ext.dic file under classes. ext.dic holds the extension words you want to add, and stopword.dic holds your own stop words: some words carry no meaning, such as "a", "ah" and "oh", so they should be filtered out. After editing, save the files in UTF-8 encoding, otherwise the changes do not take effect.
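For reference, an IKAnalyzer.cfg.xml that registers both dictionaries might look like the sketch below. It assumes the Java properties-XML layout and the ext_dict and ext_stopwords keys used by the configuration file shipped with IK Analyzer 2012, so check them against the copy in your own distribution. The file names are resolved from the classpath root, which here is the classes folder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Extension dictionary: extra words to keep as single tokens (key name assumed) -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- Stop-word dictionary: words removed from the token stream (key name assumed) -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```

Both .dic files are plain text with one word per line and, like this file, should be saved as UTF-8.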

1.3 Modify the C:\solr\apache-solr-3.4.0\example\multicore\core0\conf\schema.xml file: add a new field type text_ik and change the type of the title_search field to text_ik.

<!-- the IK tokenizer field type I added -->

</fieldType>
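The fragment above has lost its opening tag. A complete definition could look like the following sketch, which assumes the analyzer class name from the IK Analyzer 2012 package (org.wltea.analyzer.lucene.IKAnalyzer) and uses illustrative indexed and stored settings for the field; neither is confirmed by the original article.

```xml
<!-- Field type backed by IK Analyzer (class name assumed from the IK Analyzer 2012 jar) -->
<fieldType name="text_ik" class="solr.TextField">
    <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

<!-- Field switched to the new type; indexed/stored values are illustrative -->
<field name="title_search" type="text_ik" indexed="true" stored="true"/>
```

Declaring the analyzer directly on the fieldType applies the same analysis at index time and at query time, which is usually what you want for a first test.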

1.4 After reindexing the Solr data, you can check the segmentation result by running a query.

1.5 Search for "football": the matching record is returned, so the segmentation succeeded.
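One way to reindex is to post documents to the core in Solr's XML update format. The sketch below is only an illustration: it assumes the core's unique key field is named id, and the document values are made up. Only the title_search field name comes from the steps above.

```xml
<add>
    <doc>
        <!-- Hypothetical unique key value -->
        <field name="id">demo-1</field>
        <!-- Sample Chinese text containing the word for "football" -->
        <field name="title_search">中国足球</field>
    </doc>
</add>
<!-- Afterwards, send <commit/> as a separate update message so the document becomes searchable -->
```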

2 mmseg4j tokenizer configuration.

2.1 Copy all the jar files under mmseg4j-1.8.5\dist to the C:\apache-tomcat-6.0.32\webapps\solr\WEB-INF\lib folder.

2.2 Copy the data folder into C:\solr\apache-solr-3.4.0\example\multicore (at the same level as the core folders) and rename it dic.

2.2.1 chars.dic holds single characters and their frequencies, one pair per line: the character first, the frequency after it, separated by a space. This file is used in complex mode, where the frequency information feeds the last filtering rule. It has been packaged into the jar since version 1.5 and normally needs no attention. However, you can override it by placing a file with the same name in the dictionary directory.

2.2.2 units.dic holds unit characters, such as minute, second and year. This file was added in mmseg4j 1.6 and also has one entry per line. Its main purpose is to split off the unit that follows a number, so that it is not confused with the words in words.dic. It is also packaged into the jar and is still experimental; if you do not like it, you can override it with an empty file of the same name in the dictionary directory.

2.2.3 words.dic is the core dictionary file, with one word per line and no other data (such as word length) required. Version 1.0 used the dictionary from rmmseg (a Ruby implementation of mmseg). Since version 1.5, mmseg4j uses the Sogou dictionary, which can be downloaded from http://www.sogou.com/labs/dl/w.html. The frequency information was then stripped from it and the file was converted to UTF-8 encoding.

2.2.4 words-my.dic is a custom dictionary file (mmseg4j can in fact read words from multiple files). This feature was added in version 1.6. Its format is the same as words.dic, and the part of the file name between "words" and ".dic" is a name you choose yourself, such as data/words-my.dic in the source package. Note: a custom dictionary file name must have "words" as its prefix and ".dic" as its suffix.

2.2.5 After editing, save the files in UTF-8 encoding, otherwise the changes do not take effect.

2.3 Modify the C:\solr\apache-solr-3.4.0\example\multicore\core0\conf\schema.xml file: add a new field type text_mmseg4j and change the type of the title_sort field to text_mmseg4j.

<!-- mmseg4j tokenizer -->

<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="./dic"/><!-- dicPath is where the tokenizer dictionary is located -->

</analyzer>

<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="./dic"/><!-- dicPath is where the tokenizer dictionary is located -->

</analyzer>

</fieldType>
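The fragment above is missing its opening tags. Put together, the definition might look like the sketch below. The two analyzer types (index and query), the positionIncrementGap value and the field's indexed and stored settings are assumptions, not taken from the original article. The dicPath value is copied from the fragment; adjust it so that it resolves to the dic folder created in step 2.2 (an absolute path is an option if the relative one does not resolve).

```xml
<!-- mmseg4j tokenizer field type -->
<fieldType name="text_mmseg4j" class="solr.TextField" positionIncrementGap="100">
    <!-- Analyzer applied when documents are indexed (type value assumed) -->
    <analyzer type="index">
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="./dic"/>
    </analyzer>
    <!-- Analyzer applied to query strings (type value assumed) -->
    <analyzer type="query">
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="./dic"/>
    </analyzer>
</fieldType>

<!-- Field switched to the new type; indexed/stored values are illustrative -->
<field name="title_sort" type="text_mmseg4j" indexed="true" stored="true"/>
```

Using the same tokenizer and mode on both sides keeps query terms aligned with the terms stored in the index.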

2.4 After reindexing the Solr data, you can check the segmentation result by running a query.

2.5 Search for "football": the matching record is returned, so the segmentation succeeded.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
