Solr Chinese tokenizer configuration explained (IKAnalyzer and mmseg4j)

Source: Internet
Author: User
Tags solr

1 IKAnalyzer tokenizer configuration.

1.1 Copy IKAnalyzer2012_u6\IKAnalyzer2012_u6.jar to the C:\apache-tomcat-6.0.32\webapps\solr\WEB-INF\lib folder.

1.2 Create a new classes folder under C:\apache-tomcat-6.0.32\webapps\solr\WEB-INF, copy IKAnalyzer2012_u6\IKAnalyzer.cfg.xml and IKAnalyzer2012_u6\stopword.dic into it, then edit IKAnalyzer.cfg.xml and add:

<entry key="ext_dict">ext.dic;</entry>

Create a new ext.dic file in the classes folder. ext.dic holds the extension words you want to add, and stopword.dic holds your own stop words. Some words carry no meaning (for example "a", "ah" and "oh"), so they should be filtered out. After editing, save the files in UTF-8 encoding, otherwise the changes will not take effect.
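As a minimal sketch of the step above, the following Python snippet writes an extension dictionary as UTF-8 without a byte order mark. The helper name write_dictionary, the sample words, the temporary target folder, and the one-word-per-line layout are all assumptions made for illustration; point the path at your own classes folder when adapting it.

```python
from pathlib import Path
import tempfile


def write_dictionary(path, words):
    """Write one word per line as UTF-8 without a BOM (assumed layout)."""
    text = "\n".join(words) + "\n"
    # encoding="utf-8" (not "utf-8-sig") keeps a byte order mark out of the file
    Path(path).write_text(text, encoding="utf-8")


# Hypothetical target; in the article this would be ...\WEB-INF\classes\ext.dic
demo_dir = Path(tempfile.mkdtemp())
ext_dic = demo_dir / "ext.dic"
write_dictionary(ext_dic, ["足球", "世界杯"])  # sample words, not from the article
```

The same helper could be reused for stopword.dic, assuming it follows the same layout.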

1.3 Modify the C:\solr\apache-solr-3.4.0\example\multicore\core0\conf\schema.xml file: add a new field type text_ik and change the type of the title_search field to text_ik.

<!-- the IK tokenizer I added -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

<field name="title_search" type="text_ik" indexed="true" stored="true"/>

1.4 After reindexing the Solr data, you can check the word segmentation result by running a query.

1.5 Search for "football"; if the matching data is returned, word segmentation is working.
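The check in step 1.5 can also be run from a script. The sketch below only builds the query URL; the host and port (localhost:8080), the /solr/core0/select path, the wt=json parameter, and the Chinese search term 足球 ("football") are assumptions about a typical Tomcat setup, not values taken from the article.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, field, term):
    """Build a Solr select URL that searches one field for one term."""
    # urlencode percent-encodes the term as UTF-8
    params = urlencode({"q": f"{field}:{term}", "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


# Assumed deployment: Tomcat on localhost:8080 with the multicore core0
url = build_select_url("http://localhost:8080/solr", "core0", "title_search", "足球")
print(url)
```

Opening the printed URL in a browser, or fetching it with urllib.request.urlopen, should return the documents whose title_search field contains the term once the index has been rebuilt.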

2 mmseg4j tokenizer configuration.

2.1 Copy all the jar files under mmseg4j-1.8.5\dist to the C:\apache-tomcat-6.0.32\webapps\solr\WEB-INF\lib folder.

2.2 Copy the data folder into C:\solr\apache-solr-3.4.0\example\multicore (at the same level as the core folders) and rename it to dic.

2.2.1 chars.dic holds single characters and their frequencies, one pair per line: the character first, the frequency after it, separated by a space. This file is used in complex mode, where the frequency information is applied by the last filtering rule. Since version 1.5 it has been packaged into the jar, so you normally do not need to care about it. However, you can override it by placing a file with the same name in the dictionary directory.
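To make the chars.dic layout concrete, here is a minimal sketch that reads "character, space, frequency" lines into a dictionary. The function name parse_chars_dic and the sample frequencies are made up for illustration; the real file shipped with mmseg4j may contain details this sketch does not handle.

```python
def parse_chars_dic(lines):
    """Map each character to its frequency from 'char freq' lines."""
    freq = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        char, count = line.split(maxsplit=1)
        freq[char] = int(count)
    return freq


# Made-up sample values, only to show the expected line format
sample = ["的 7922684", "一 3050722", ""]
table = parse_chars_dic(sample)
```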

2.2.2 units.dic holds unit words, such as minute, second, and year, one per line. This file was added in mmseg4j 1.6. Its main purpose is to segment the unit that follows a number correctly, without mixing it up with the words in words.dic. It is also packaged into the jar and is still experimental; if you do not want it, you can override it with an empty file in the dictionary directory.

2.2.3 words.dic is the core dictionary file, one word per line, with no other data (such as word length) required. Version 1.0 used the dictionary of rmmseg (a Ruby implementation of mmseg). Since version 1.5, mmseg4j uses the Sogou dictionary, which can be downloaded from http://www.sogou.com/labs/dl/w.html. The frequency information was then removed from it and the file was converted to UTF-8 encoding.

2.2.4 words-my.dic is a custom dictionary file (mmseg4j can in fact read words from multiple files). This feature was added in version 1.6. Its format is the same as words.dic. The file name takes the form wordsxxx.dic, where the xxx part is a name you choose yourself, such as data/words-my.dic in the source package. Note: a custom dictionary file name must use "words" as the prefix and ".dic" as the suffix.
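The naming rule above is easy to get wrong, so a quick check can help before restarting Solr. This is a minimal sketch of the rule as stated in the text; the function name is hypothetical, and treating the match as case-sensitive is an assumption.

```python
from pathlib import PurePath


def is_custom_dictionary_name(filename):
    """Return True if the file name starts with 'words' and ends with '.dic'."""
    name = PurePath(filename).name  # ignore any leading directories
    return name.startswith("words") and name.endswith(".dic")


print(is_custom_dictionary_name("data/words-my.dic"))
print(is_custom_dictionary_name("my-words.dic"))
```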

2.2.5 After editing, save the files in UTF-8 encoding, otherwise the changes will not take effect.
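Because a wrongly encoded dictionary fails silently, it can be worth checking the dictionary folder before reindexing. The sketch below lists .dic files that do not decode as UTF-8; the function name, the temporary demo folder, and the two sample files are assumptions for illustration.

```python
from pathlib import Path
import tempfile


def find_non_utf8(dic_dir):
    """Return names of .dic files in dic_dir that do not decode as UTF-8."""
    bad = []
    for path in sorted(Path(dic_dir).glob("*.dic")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad


# Demo folder with one UTF-8 file and one GBK file (hypothetical names)
demo = Path(tempfile.mkdtemp())
(demo / "words-ok.dic").write_bytes("足球\n".encode("utf-8"))
(demo / "words-bad.dic").write_bytes("足球\n".encode("gbk"))
print(find_non_utf8(demo))
```

Any file the function reports would need to be re-saved as UTF-8 in an editor before the tokenizer can use it.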

2.3 Modify the C:\solr\apache-solr-3.4.0\example\multicore\core0\conf\schema.xml file: add a new field type text_mmseg4j and change the type of the title_sort field to text_mmseg4j.

<!-- mmseg4j tokenizer -->
<fieldType name="text_mmseg4j" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="./dic"/><!-- this is where the tokenizer dictionary is located -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="./dic"/><!-- this is where the tokenizer dictionary is located -->
  </analyzer>
</fieldType>

<field name="title_sort" type="text_mmseg4j" indexed="true" stored="true"/>

2.4 After reindexing the Solr data, you can check the word segmentation result by running a query.

2.5 Search for "football"; if the matching data is returned, word segmentation is working.

 
