Lucene's word-breaker support for English is very good. The general analysis pipeline is:
1) Tokenization (splitting on keywords)
2) Stop-word removal
3) Lowercasing of English terms
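The three steps above can be sketched in plain Java. This is a minimal illustration only, with a tiny made-up stop-word set; it is not IK's or Lucene's actual implementation:

```java
import java.util.*;
import java.util.stream.*;

public class SimplePipeline {
    // tiny illustrative stop-word list (an assumption, not the real stopword.dic)
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    // 1) tokenize on whitespace, 2) drop stop words, 3) lowercase the surviving terms
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(t -> !STOP.contains(t.toLowerCase(Locale.ROOT)))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("The IK Analyzer is a toolkit"));
        // prints: [ik, analyzer, toolkit]
    }
}
```

Note that step 1 is trivial for English because words are separated by spaces; Chinese text has no such separators, which is exactly why Chinese segmentation is the hard part.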
But word breakers written by non-Chinese developers generally segment Chinese text character by character, and the segmentation results are poor.
The IK Analyzer, written by the Chinese developer Lin Liangyi, is one of the best Lucene word breakers for Chinese, and it is updated alongside Lucene; the current release is IK Analyzer 2012.
IK Analyzer is an open-source, lightweight Chinese word-segmentation toolkit developed in Java. By now, IK has evolved into a general-purpose Java segmenter, independent of the Lucene project, while still providing a default optimized integration for Lucene. The 2012 version implements a simple algorithm for eliminating segmentation ambiguity, which marks IK's evolution from plain dictionary-based segmentation toward semantics-aware segmentation.
Tested in an ordinary PC environment (Core2/i7 3.4 GHz dual core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK 2012 processes about 1.6 million characters per second (about 3000 KB/s).
In addition, the 2012 version's dictionaries support mixed Chinese, English, and numeric words.
Word segmentation examples for IK Analyzer 2012:
The IK Analyzer 2012 version supports both fine-grained segmentation and smart segmentation.
Let's look at two demo examples:
1) Original Text 1:
IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit based on the Java language. Since the release of version 1.0 in December 2006, IKAnalyzer has had 3 major versions.
Smart segmentation results:
IKAnalyzer | is | an | open source | of | based on | java | language | development | of | lightweight | of | Chinese | word segmentation | toolkit | from | 2006 | December | launch | version 1.0 | start | IKAnalyzer | already | launched | 3 | major | versions
Maximum-granularity segmentation results:
IKAnalyzer | is | an | one | a | open source | of | based on | java | language | development | of | lightweight | weight level | of | Chinese | word segmentation | toolkit | tool | kit | from | 2006 | year | 12 | month | launch | 1.0 | edition | start | IKAnalyzer | already | launch | ed | 3 | a | major | versions
2) Original Text 2:
What Zhang San said is indeed reasonable.
Smart segmentation results:
Zhang San | said | indeed | reasonable
Maximum-granularity segmentation results (note the overlapping tokens produced by the ambiguous characters):
Zhang San | three | said | indeed | truly | really | reasonable
Using IKAnalyzer
1) Download:
GoogleCode open-source project: http://code.google.com/p/ik-analyzer/
GoogleCode downloads: http://code.google.com/p/ik-analyzer/downloads/list
2) Compatibility:
IKAnalyzer version 2012 is compatible with Lucene 3.3 and later.
3) Installation Deployment:
Very simple: just add IKAnalyzer2012.jar to the project. The release ships with a stop-word dictionary, stopword.dic, containing common English stop words such as "a", "an", and "the". Copy stopword.dic and IKAnalyzer.cfg.xml to the classpath root to enable the stop-word feature and to extend your own dictionaries.
4) Test Examples:
Create a new Java project, add the jars required by Lucene plus IKAnalyzer2012.jar, copy stopword.dic and IKAnalyzer.cfg.xml to the classpath root, then create an extended dictionary ext.dic and a Chinese stop-word dictionary chinese_stopword.dic.
The IKAnalyzer2012 release ships with a stopword.dic that contains only English stop words, so we create a new chinese_stopword.dic to hold Chinese stop words. chinese_stopword.dic must be saved in UTF-8 encoding, with each word on its own line.
chinese_stopword.dic content format:
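The release does not bundle a Chinese stop-word list, so its contents are up to you. A minimal sketch, assuming a few common Chinese stop words (的, 了, and 年, the "year" character that the test later in this article shows being filtered), one UTF-8 word per line:

```
的
了
年
```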
IKAnalyzer.cfg.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- users can configure their own extension stop word dictionaries here -->
    <entry key="ext_stopwords">stopword.dic;chinese_stopword.dic</entry>
</properties>
Multiple dictionary files can be configured, separated by ";". The file paths are relative to the classpath root.
The extended dictionary ext.dic must also be UTF-8 encoded.
ext.dic content:
I added "2012" as one word and "European Championship Four" (the Euro semifinal four) as one word.
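Based on that description, ext.dic would contain two entries, one per line in UTF-8. This is a reconstruction: the second entry is the Chinese phrase 欧洲杯四强, which the text above renders as "European Championship Four":

```
2012
欧洲杯四强
```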
Test code for the segmentation:

package com.cndatacom.lucene.test;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerTest {

    @Test
    public void testIKAnalyzer() throws Exception {
        String keyWord = "2012 European Championship Four";
        IKAnalyzer analyzer = new IKAnalyzer();
        // use smart segmentation
        analyzer.setUseSmart(true);
        // print the segmentation results
        printAnalysisResult(analyzer, keyWord);
    }

    private void printAnalysisResult(Analyzer analyzer, String keyWord) throws Exception {
        System.out.println("Word breaker currently in use: " + analyzer.getClass().getSimpleName());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(keyWord));
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            // toString() returns only the current term; buffer() may contain stale trailing characters
            System.out.println(charTermAttribute.toString());
        }
    }
}
The printed segmentation results show that "2012" is treated as one word and "European Championship Four" as one word, and that the stop word "year" (年) has been filtered out.