Lucene's Chinese Tokenizer IKAnalyzer


Lucene's built-in analyzers already handle English text very well.

The general analysis pipeline for English text is as follows (a minimal sketch follows the list):

1) Tokenize the text into terms

2) Remove stop words

3) Convert English terms to lowercase
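As a concrete illustration of those three steps, here is a minimal sketch using Lucene's standard English components (assuming the Lucene 3.3-era API that IK 2012 targets; the sample text and class name are illustrative):

import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class EnglishAnalysisDemo {

    public static void main(String[] args) throws Exception {
        // 1) tokenize the raw text into terms
        TokenStream ts = new StandardTokenizer(Version.LUCENE_33,
                new StringReader("the quick Brown Fox jumps"));
        // 2) remove stop words ("the")
        ts = new StopFilter(Version.LUCENE_33, ts, StandardAnalyzer.STOP_WORDS_SET);
        // 3) lowercase the remaining English terms
        ts = new LowerCaseFilter(Version.LUCENE_33, ts);

        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            // prints: quick, brown, fox, jumps
            System.out.println(new String(term.buffer(), 0, term.length()));
        }
    }
}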

Tokenizers written by non-Chinese developers, however, generally fall back to single-character segmentation for Chinese, and the results are poor.

IK Analyzer, written by Lin Liangyi, is arguably one of the best Chinese tokenizers for Lucene, and it is kept in step with Lucene releases; the current release is IK Analyzer 2012.

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. It has since grown into a general-purpose Java segmenter independent of the Lucene project, while still providing a default implementation optimized for Lucene. The 2012 version implements a simple algorithm for resolving segmentation ambiguities, marking IK's evolution from pure dictionary-based segmentation toward simulated semantic segmentation.

In a normal PC environment (Core2 i7 3.4 GHz dual-core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6.0_29 64-bit), IK 2012 achieves a throughput of about 1.6 million characters per second (3000 KB/s).

In addition, the 2012 version's dictionaries support mixed Chinese, English, and numeric words.

Examples of the segmentation results of IK Analyzer 2012:

IK Analyzer 2012 supports two segmentation modes: fine-grained segmentation and smart (intelligent) segmentation.

Let's take a look at two demo examples:

1) Original Text 1:

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through three major releases.

Smart segmentation results (the tokens below are English renderings of the Chinese output):

IKAnalyzer | is | a | open source | of | based on | java | language | development | of | lightweight | of | Chinese | word segmentation | toolkit | from | 2006 | December | launch | version 1.0 | start | IKAnalyzer | already | push | out | 3 | big | versions

Fine-grained (maximum granularity) segmentation results, which include overlapping sub-words:

IKAnalyzer | is | a | one | a | open source | of | based on | java | language | development | of | lightweight | weight class | of | Chinese | word segmentation | toolkit | tool | kit | from | 2006 | year | 12 | month | launch | 1.0 | edition | start | IKAnalyzer | already | launch | out | 3 | (measure word) | big | version

2) Original Text 2:

What Zhang San said is indeed reasonable.

Smart segmentation results:

Zhang San | said | indeed | reasonable

Fine-grained (maximum granularity) segmentation results:

Zhang San | San | said | indeed | of | indeed | in fact | reasonable

Using IKAnalyzer

1) Download address:

Google Code open-source project: http://code.google.com/p/ik-analyzer/

Google Code downloads: http://code.google.com/p/ik-analyzer/downloads/list

2) Compatibility:

IKAnalyzer 2012 is compatible with Lucene 3.3 and later.

3) Installation and deployment:

Deployment is very simple: just add IKAnalyzer2012.jar to the project. The package ships with a stop-word dictionary, stopword.dic. Copy stopword.dic and IKAnalyzer.cfg.xml to the root of the classpath to enable stop-word filtering and to extend IK with your own dictionaries.

4) Test Examples:

Create a new Java project, add the jars required by Lucene plus IKAnalyzer2012.jar, copy stopword.dic and IKAnalyzer.cfg.xml to the classpath root, and create an extension dictionary ext.dic and a Chinese stop-word dictionary chinese_stopword.dic; a possible layout is sketched below.
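A possible project layout for this setup (folder names and jar versions are illustrative assumptions; the point is that the .dic files and IKAnalyzer.cfg.xml end up on the classpath root):

src/
    com/cndatacom/lucene/test/IKAnalyzerTest.java
    IKAnalyzer.cfg.xml
    stopword.dic
    chinese_stopword.dic
    ext.dic
lib/
    lucene-core-3.x.jar
    IKAnalyzer2012.jar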

The stopword.dic shipped with the IKAnalyzer 2012 release contains only English stop words, so we create a new chinese_stopword.dic to hold Chinese stop words. chinese_stopword.dic must be UTF-8 encoded, with one stop word per line.

chinese_stopword.dic content format:
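As an illustration (these entries are assumptions, not the article's actual list), chinese_stopword.dic could start with a few common Chinese function words such as 的 ("of"), 了 (aspect particle), 和 ("and"), 是 ("is"), and 在 ("at"), one word per line:

的
了
和
是
在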

IKAnalyzer.cfg.xml:

<?xml version= "1.0" encoding= "UTF-8"?> <! DOCTYPE Properties SYSTEM "Http://java.sun.com/dtd/properties.dtd" > <properties> <comment>ik Analyz ER extended configuration </comment> <!--users can configure their own extension dictionaries here--<entry key= "Ext_dict" >ext.dic;</entry> &L t;! --user can configure their extension stop word dictionary here--<entry key= "Ext_stopwords" >stopword.dic;chinese_stopword.dic</entry> &lt ;/properties>

Multiple dictionary files can be configured in one entry, separated by semicolons (";"). The file paths are relative to the classpath root.

The extension dictionary ext.dic must also be UTF-8 encoded.

ext.dic content:

Here "2012" is added as one word, and "European Championship Four" is added as one word.

Test code for the tokenizer:

package com.cndatacom.lucene.test;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerTest {

    @Test
    public void testIKAnalyzer() throws Exception {
        // the original article used the equivalent Chinese phrase here
        String keyWord = "2012 European Championship Four";
        IKAnalyzer analyzer = new IKAnalyzer();
        // use smart (intelligent) segmentation
        analyzer.setUseSmart(true);
        // print the segmentation results
        printAnalysisResult(analyzer, keyWord);
    }

    private void printAnalysisResult(Analyzer analyzer, String keyWord) throws Exception {
        System.out.println("Analyzer currently in use: " + analyzer.getClass().getSimpleName());
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(keyWord));
        tokenStream.addAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
            // use the attribute's length so no stale buffer characters are printed
            System.out.println(new String(charTermAttribute.buffer(), 0, charTermAttribute.length()));
        }
    }
}

The printed segmentation results:

Can see "2012" as a word, "UEFA Cup Four" also as a word, the discontinuation of the word "year" has been filtered out.
