Solr Chinese Word Segmentation


I tried the following three open-source Chinese word segmenters with Solr. Two of them did not work because my Solr version was too new; I finally decompiled the jar packages and found the reason. Below is a brief description of the three open-source Chinese segmenters.

 

Paoding (庖丁解牛): the last code commit on Google Code was in June 2008. The project is not very active, but many people are using it.

mmseg4j: the last code commit on Google Code was in December 2010, so it is fairly active. It uses the MMSEG algorithm and offers two segmentation methods: simple and complex.

IKAnalyzer: it has been very active recently; a new version was committed on Google Code in March 2011.

 

Lucene released version 3.2 in May this year, and Solr moved to version 3.2 along with it. A downside of the newer versions is that the open-source Chinese segmenters cannot keep up with the release pace. I am using version 3.1, and if you add the Paoding segmenter or the latest IKAnalyzer to it, an error is reported.

 

The cause of the error is as follows (taking IKAnalyzer as an example):

Whether it is Paoding or IKAnalyzer, to plug a segmenter into Solr you need to extend Solr's BaseTokenizerFactory class:

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class ChineseTokenizerFactory extends BaseTokenizerFactory {

    @Override
    public TokenStream create(Reader reader) {
        return new IKAnalyzer().tokenStream("text", reader);
    }
}

BaseTokenizerFactory implements the TokenizerFactory interface, which defines create, but with a return type of Tokenizer. In Solr 3.1, Tokenizer extends TokenStream, so a forced cast is needed to avoid the error. Paoding is not that simple: you need to modify its source code. Paoding currently only supports Solr 1.4.
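For reference, a corrected factory along the lines described above would look roughly like the following. This is only a minimal sketch of the fix, assuming that the TokenStream produced by IKAnalyzer really is a Tokenizer at run time; the class and field name are carried over from the snippet above.

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class ChineseTokenizerFactory extends BaseTokenizerFactory {

    @Override
    public Tokenizer create(Reader reader) {
        // Solr 3.1 declares create() as returning Tokenizer, while
        // IKAnalyzer.tokenStream() is declared to return TokenStream,
        // hence the forced cast. This assumes the concrete object is
        // in fact a Tokenizer; otherwise a ClassCastException results.
        return (Tokenizer) new IKAnalyzer().tokenStream("text", reader);
    }
}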

 

In addition, Paoding cannot be used directly with Lucene 3.1 either. The code shows no errors, but an error is reported at run time. I am not sure why; my guess is that the cause is the same as above and the source code needs to be modified. If you know, please let me know.

 

mmseg4j must be the latest version, otherwise an error is reported. The specific configuration is as follows:

Put mmseg4j-all-1.8.4.jar into Tomcat/webapps/solr/lib, extract the dictionary shipped in the mmseg4j 1.8.4 package into the solr.home/data directory, and modify the Solr configuration file:

<fieldType name="textcomplex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textmaxword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textsimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

mmseg4j mainly supports two parameters in Solr: mode and dicPath. mode selects the segmentation mode, and dicPath is the dictionary directory, which is supposed to default to the current data directory. In my tests the default did not work and I had to give an absolute path manually; perhaps it is a problem with the newer version, or perhaps I configured something wrong. You can then view mmseg4j's segmentation results at http://localhost:8080/solr/admin/analysis.jsp: in the Field drop-down select type and enter textcomplex. It is especially worth comparing the output against CJK segmentation; CJK is Solr's officially provided tokenizer for Chinese, Japanese, and Korean, and it segments Chinese into overlapping two-character (bigram) tokens.
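For completeness, the field types above still have to be attached to actual fields before documents can be indexed with them. Here is a minimal sketch of what that might look like in schema.xml; the field names content and content_simple are made up for illustration:

<!-- Hypothetical field declarations binding the mmseg4j field types
     defined above to concrete fields; adjust names to your schema. -->
<field name="content" type="textcomplex" indexed="true" stored="true"/>
<field name="content_simple" type="textsimple" indexed="true" stored="false"/>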

 

 

In fact, Chinese word segmentation has long been a subject of study for many people: the goals are to improve segmentation efficiency and matching accuracy, and the algorithms behind them are the core. Understanding them completely and thoroughly could probably fill a whole paper; with limited time, I only got a rough feel for them. Solr/Lucene search efficiency and index optimization is another topic worth studying.

 

References:

1. http://blog.chenlb.com/2009/04/solr-chinese-segment-mmseg4j-use-demo.html

2. http://lianj-lee.iteye.com/blog/464364

3. http://www.blogjava.net/RongHao/archive/2007/11/06/158621.html

4. http://www.iteye.com/news/9637

5. http://blog.csdn.net/foamflower/archive/2010/07/09/5723361.aspx

I just discovered that IKAnalyzer has supported configuration through Solr's TokenizerFactory interface since version 3.1.5. For details, see the following article:

http://linliangyi2007.iteye.com/blog/501228
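Going by that article, the configuration should look roughly like the following. Note that the class name org.wltea.analyzer.solr.IKTokenizerFactory and the isMaxWordLength parameter are taken from my reading of the IKAnalyzer 3.1.5 release notes, so verify them against the version you actually deploy:

<!-- Sketch of a field type using IKAnalyzer's own TokenizerFactory
     (IKAnalyzer 3.1.5+). Class name and parameter should be checked
     against the actual release. -->
<fieldType name="text_ik" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>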

 

Lucene 3.0.2 is also supported, but the source code needs to be modified:

http://blog.csdn.net/foamflower/archive/2010/07/09/5723361.aspx
