I tried the following three open-source Chinese word segmenters in Solr. Two of them did not work because my Solr version was too new; I eventually decompiled the jar packages and found the reason. Below is a brief introduction to the three segmenters.
Paoding (庖丁解牛): the last code commit on Google Code was in June 2008. Not very active, but used by many people.
mmseg4j: the last code commit on Google Code was in December 2010, so it is reasonably active. It implements the MMSeg algorithm and provides two segmentation modes: simple and complex.
IKAnalyzer: very active recently; a version was published on Google Code in March 2011.
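The "simple" mode mentioned above is essentially forward maximum matching against a dictionary: at each position, take the longest dictionary word that matches, falling back to a single character. The following toy sketch is my own illustration of the idea, not the mmseg4j implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy forward-maximum-matching segmenter, a rough illustration of
// mmseg's "simple" mode. Not the mmseg4j code.
public class MaxMatchDemo {
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String token = text.substring(i, i + 1); // fallback: single character
            // try the longest candidate first
            for (int j = end; j > i + 1; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) {
                    token = cand;
                    break;
                }
            }
            out.add(token);
            i += token.length();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("中华", "中华人民", "共和国"));
        System.out.println(segment("中华人民共和国", dict, 4)); // prints "[中华人民, 共和国]"
    }
}
```

The "complex" mode builds on this by scoring three-word chunks with disambiguation rules, which is why it generally segments ambiguous text better.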
Lucene 3.2 was released in May this year, and Solr has a matching 3.2 release. The downside of the newer versions is that the open-source Chinese segmenters cannot keep up with the release pace. I am using version 3.1; adding the Paoding segmenter or the latest IKAnalyzer release to it produces an error.
The cause of the error is as follows (taking IKAnalyzer as an example):
Whether you use Paoding or IKAnalyzer, plugging a segmenter into Solr requires extending Solr's BaseTokenizerFactory class:
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class ChineseTokenizerFactory extends BaseTokenizerFactory {

    @Override
    public Tokenizer create(Reader reader) {
        // create() must return Tokenizer in Solr 3.1, but tokenStream()
        // is declared as TokenStream, so a downcast is needed
        return (Tokenizer) new IKAnalyzer().tokenStream("text", reader);
    }
}
This base class implements the TokenizerFactory interface, where create() is declared with a return type of Tokenizer. In Solr 3.1, Tokenizer extends TokenStream, so an explicit downcast is required to avoid the error. Paoding is not that simple: its source code has to be modified, because Paoding currently only supports Solr 1.4.
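The type relationship can be illustrated with a self-contained sketch. These are simplified stand-in classes of my own, not the real Lucene types:

```java
// Minimal stand-ins mimicking the Lucene 3.x hierarchy (illustration only):
// Tokenizer is the narrower type, TokenStream the broader one.
class TokenStream {}

class Tokenizer extends TokenStream {}

public class CastDemo {
    // Mirrors the shape of IKAnalyzer.tokenStream(): declared to return
    // the broader type, even though the runtime object is a Tokenizer
    static TokenStream makeStream() {
        return new Tokenizer();
    }

    public static void main(String[] args) {
        // A factory whose create() must return Tokenizer needs an explicit
        // downcast; it succeeds only because the runtime type really is one
        Tokenizer t = (Tokenizer) makeStream();
        System.out.println(t instanceof TokenStream); // prints "true"
    }
}
```

If the object returned at runtime were a plain TokenStream rather than a Tokenizer, the same cast would throw a ClassCastException, which is presumably the kind of failure you see with segmenters built against older Solr versions.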
In addition, Paoding cannot be used directly with Solr 3.1: the code compiles without complaint, but an error is reported at runtime. I don't know why; I suspect the cause is the same as above and the source code needs to be modified. If you know the reason, please let me know.
mmseg4j must be the latest version, otherwise an error is reported. The configuration is as follows:
Put mmseg4j-all-1.8.4.jar into tomcat/webapps/solr/lib, extract the dictionary from the mmseg4j 1.8.4 package into the solr.home/data directory, and modify the Solr configuration file:
<fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
mmseg4j mainly supports two parameters in Solr: mode and dicPath. mode selects the segmentation mode, and dicPath is the dictionary directory, which by default should be found under the current data directory. In my tests that did not work, however, and an absolute path had to be given manually; maybe it is a problem with the newer version, or maybe I configured something wrong. You can then view mmseg4j's segmentation results at http://localhost:8080/solr/admin/analysis.jsp: select "type" in the Field drop-down and enter textComplex. It is especially interesting to compare against CJK segmentation: CJKTokenizer is the tokenizer Solr ships with for Chinese, Japanese, and Korean, and it segments Chinese using binary (overlapping bigram) segmentation.
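The binary segmentation that CJKTokenizer applies to Chinese simply emits every pair of adjacent characters as a token. A rough self-contained sketch of the idea, not the real CJKTokenizer code:

```java
import java.util.ArrayList;
import java.util.List;

// Rough illustration of CJK-style binary (bigram) segmentation:
// every pair of adjacent characters becomes one token.
public class BigramDemo {
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中华人民")); // prints "[中华, 华人, 人民]"
    }
}
```

This is why bigram segmentation never misses a match but produces many meaningless tokens (here "华人" spans a word boundary), whereas a dictionary-based segmenter like mmseg4j produces fewer, more meaningful terms.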
Chinese word segmentation has long been an active research area, the goals being better segmentation efficiency and matching accuracy. The algorithms are the core of it, and understanding them thoroughly could probably fill a whole paper; with limited time I only got a rough feel for them. Solr/Lucene search efficiency and index optimization is another topic worth studying.
References:
1. http://blog.chenlb.com/2009/04/solr-chinese-segment-mmseg4j-use-demo.html
2. http://lianj-lee.iteye.com/blog/464364
3. http://www.blogjava.net/RongHao/archive/2007/11/06/158621.html
4. http://www.iteye.com/news/9637
5. http://blog.csdn.net/foamflower/archive/2010/07/09/5723361.aspx
Update: it turns out that IKAnalyzer already implemented support for Solr's TokenizerFactory configuration interface in version 3.1.5. For details, see the following article:
http://linliangyi2007.iteye.com/blog/501228
Version 3.0.2 is also supported if the source code is modified:
http://blog.csdn.net/foamflower/archive/2010/07/09/5723361.aspx