I tried the following three open-source Chinese word segmenters in Solr. Two of them did not work because my Solr version was too new; I eventually decompiled the jar packages and found the reason. Below is a brief introduction to the three segmenters.
Paoding (庖丁解牛): the last code commit on Google Code was in June 2008. Not very active, but used by many people.
mmseg4j: the last code commit on Google Code was in December 2010, so it is reasonably active. It implements the MMSeg algorithm and provides two segmentation modes: simple and complex.
IKAnalyzer: very active recently; a version was published on Google Code in March 2011.
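The "simple" mode mentioned above is essentially forward maximum matching against a dictionary: at each position, take the longest dictionary word that matches, falling back to a single character. The following toy sketch is my own illustration of the idea, not the mmseg4j implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy forward-maximum-matching segmenter, a rough illustration of
// mmseg's "simple" mode. Not the mmseg4j code.
public class MaxMatchDemo {
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String token = text.substring(i, i + 1); // fallback: single character
            // try the longest candidate first
            for (int j = end; j > i + 1; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) {
                    token = cand;
                    break;
                }
            }
            out.add(token);
            i += token.length();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("中华", "中华人民", "共和国"));
        System.out.println(segment("中华人民共和国", dict, 4)); // prints "[中华人民, 共和国]"
    }
}
```

The "complex" mode builds on this by scoring three-word chunks with disambiguation rules, which is why it generally segments ambiguous text better.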
Lucene 3.2 was released in May this year, and Solr has a matching 3.2 release. The downside of the newer versions is that the open-source Chinese segmenters cannot keep up with the release pace. I am using version 3.1; adding the Paoding segmenter or the latest IKAnalyzer release to it produces an error.
The cause of the error is as follows (taking IKAnalyzer as an example):
Whether you use Paoding or IKAnalyzer, plugging a segmenter into Solr requires extending Solr's BaseTokenizerFactory class:
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class ChineseTokenizerFactory extends BaseTokenizerFactory {

    @Override
    public Tokenizer create(Reader reader) {
        // create() must return Tokenizer in Solr 3.1, but tokenStream()
        // is declared as TokenStream, so a downcast is needed
        return (Tokenizer) new IKAnalyzer().tokenStream("text", reader);
    }
}
This base class implements the TokenizerFactory interface, where create() is declared with a return type of Tokenizer. In Solr 3.1, Tokenizer extends TokenStream, so an explicit downcast is required to avoid the error. Paoding is not that simple: its source code has to be modified, because Paoding currently only supports Solr 1.4.
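The type relationship can be illustrated with a self-contained sketch. These are simplified stand-in classes of my own, not the real Lucene types:

```java
// Minimal stand-ins mimicking the Lucene 3.x hierarchy (illustration only):
// Tokenizer is the narrower type, TokenStream the broader one.
class TokenStream {}

class Tokenizer extends TokenStream {}

public class CastDemo {
    // Mirrors the shape of IKAnalyzer.tokenStream(): declared to return
    // the broader type, even though the runtime object is a Tokenizer
    static TokenStream makeStream() {
        return new Tokenizer();
    }

    public static void main(String[] args) {
        // A factory whose create() must return Tokenizer needs an explicit
        // downcast; it succeeds only because the runtime type really is one
        Tokenizer t = (Tokenizer) makeStream();
        System.out.println(t instanceof TokenStream); // prints "true"
    }
}
```

If the object returned at runtime were a plain TokenStream rather than a Tokenizer, the same cast would throw a ClassCastException, which is presumably the kind of failure you see with segmenters built against older Solr versions.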
In addition, Paoding cannot be used directly with Solr 3.1: the code compiles without complaint, but an error is reported at runtime. I don't know why; I suspect the cause is the same as above and the source code needs to be modified. If you know the reason, please let me know.
mmseg4j must be the latest version, otherwise an error is reported. The configuration is as follows:
Put mmseg4j-all-1.8.4.jar into tomcat/webapps/solr/lib, extract the dictionary from the mmseg4j 1.8.4 package into the solr.home/data directory, and modify the Solr configuration file:
<fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="C:/Apache/apache-solr-3.1.0/example/solr/data"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
mmseg4j mainly supports two parameters in Solr: mode and dicPath. mode selects the segmentation mode, and dicPath is the dictionary directory, which by default should be found under the current data directory. In my tests that did not work, however, and an absolute path had to be given manually; maybe it is a problem with the newer version, or maybe I configured something wrong. You can then view mmseg4j's segmentation results at http://localhost:8080/solr/admin/analysis.jsp: select "type" in the Field drop-down and enter textComplex. It is especially interesting to compare against CJK segmentation: CJKTokenizer is the tokenizer Solr ships with for Chinese, Japanese, and Korean, and it segments Chinese using binary (overlapping bigram) segmentation.
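The binary segmentation that CJKTokenizer applies to Chinese simply emits every pair of adjacent characters as a token. A rough self-contained sketch of the idea, not the real CJKTokenizer code:

```java
import java.util.ArrayList;
import java.util.List;

// Rough illustration of CJK-style binary (bigram) segmentation:
// every pair of adjacent characters becomes one token.
public class BigramDemo {
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中华人民")); // prints "[中华, 华人, 人民]"
    }
}
```

This is why bigram segmentation never misses a match but produces many meaningless tokens (here "华人" spans a word boundary), whereas a dictionary-based segmenter like mmseg4j produces fewer, more meaningful terms.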
Chinese word segmentation has long been an active research area, the goals being better segmentation efficiency and matching accuracy. The algorithms are the core of it, and understanding them thoroughly could probably fill a whole paper; with limited time I only got a rough feel for them. Solr/Lucene search efficiency and index optimization is another topic worth studying.
References:
1. http://blog.chenlb.com/2009/04/solr-chinese-segment-mmseg4j-use-demo.html
2. http://lianj-lee.iteye.com/blog/464364
3. http://www.blogjava.net/RongHao/archive/2007/11/06/158621.html
4. http://www.iteye.com/news/9637
5. http://blog.csdn.net/foamflower/archive/2010/07/09/5723361.aspx
Update: it turns out that IKAnalyzer already implemented support for Solr's TokenizerFactory configuration interface in version 3.1.5. For details, see the following article:
http://linliangyi2007.iteye.com/blog/501228
Version 3.0.2 is also supported if the source code is modified:
http://blog.csdn.net/foamflower/archive/2010/07/09/5723361.aspx