Lucene Chinese word segmentation with IK Analyzer

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since the release of version 1.0 in December 2006, IK Analyzer has gone through four major versions. It began as a Chinese segmentation component for the open-source Lucene project, combining dictionary-based segmentation with grammar analysis. From version 3.0 onward, IK evolved into a general-purpose Java word segmenter independent of the Lucene project, while still providing a default optimized integration for Lucene. The 2012 version adds a simple segmentation-ambiguity elimination algorithm, marking IK's shift from pure dictionary-based segmentation toward simulated semantic segmentation.
IK Analyzer 2012 Features:
1. A unique "forward iterative finest-granularity segmentation algorithm", supporting two segmentation modes: fine-grained and smart (see the sketch after this list);
2. High throughput: on an ordinary PC (Core2 i7 3.4 GHz dual core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK 2012 processes about 1.6 million characters per second (3000 KB/s);
3. The 2012 smart segmentation mode supports simple ambiguity elimination and merged output of numeral-quantifier compounds;
4. A multi-subprocessor analysis mode that handles English letters, numerals, and Chinese words, and is compatible with Korean and Japanese characters;
5. Optimized dictionary storage with a smaller memory footprint. User-defined dictionary extensions are supported; in particular, the 2012 dictionaries allow mixed Chinese, English, and numeric words.
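
To make the two modes concrete, here is a minimal sketch that uses the IKSegmenter API also shown in the test code later in this article; the sample sentence, class name, and helper method are illustrative only.

import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class SegmentationModeDemo {

    // Segments one sentence in the given mode and prints the tokens separated by " | ".
    static void segment(String text, boolean useSmart) throws Exception {
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        StringBuilder out = new StringBuilder();
        Lexeme lexeme;
        while ((lexeme = ik.next()) != null) {
            out.append(lexeme.getLexemeText()).append(" | ");
        }
        System.out.println((useSmart ? "smart        : " : "fine-grained : ") + out);
    }

    public static void main(String[] args) throws Exception {
        String text = "中华人民共和国国歌"; // illustrative sample sentence ("National Anthem of the PRC")
        segment(text, false); // fine-grained mode: emits every dictionary word, including overlapping ones
        segment(text, true);  // smart mode: longer matches with simple ambiguity elimination
    }
}

Fine-grained mode outputs the overlapping sub-words of longer terms, while smart mode keeps only the disambiguated, longest matches.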
Project websites:
Google Code: https://code.google.com/p/ik-analyzer

GitHub: https://github.com/wks/ik-analyzer

GitHub: https://github.com/linvar/IKAnalyzer
Because IK Analyzer is not compatible with the latest Lucene releases out of the box, the IK Analyzer 2012FF hotfix 1 bundle adds support for Lucene 4.0 / Solr 4.0. Below we demonstrate Chinese word segmentation using Lucene 4 together with this latest IK Analyzer release.
Download the latest release from https://ik-analyzer.googlecode.com/files/IK%20Analyzer%202012FF_hf1.zip and place the following files on the classpath:
stopword.dic — the stop-word dictionary
ext.dic — a user-defined extension dictionary
IKAnalyzer.cfg.xml — the segmenter's extension configuration file (used mainly to register stopword.dic and ext.dic), for example:
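
A minimal IKAnalyzer.cfg.xml might look like the sketch below; the ext_dict / ext_stopwords entry keys and the semicolon-separated file lists reflect IK Analyzer 2012's usual configuration format, so verify them against the sample file shipped in the downloaded bundle.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- user extension dictionaries, semicolon-separated, resolved from the classpath -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- user extension stop-word dictionaries -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>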

Add IKAnalyzer2012FF_u1.jar to the project's lib directory and declare the dependencies in the Maven POM:

<properties>
    <lucene.version>4.10.4</lucene.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-highlighter</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.wltea.ik-analyzer</groupId>
        <artifactId>IKAnalyzer2012FF_u1</artifactId>
        <version>1.0.0</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/lib/IKAnalyzer2012FF_u1.jar</systemPath>
    </dependency>
</dependencies>
Test code:

package cn.slimsmart.lucene.demo.example;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.wltea.analyzer.lucene.IKAnalyzer;

@SuppressWarnings("resource")
public class IKAnalyzerTest {

    public static void main(String[] args) throws Exception {
        demo();
    }

    // Segmentation test through the Lucene Analyzer interface
    public static void demo() throws Exception {
        IKAnalyzer analyzer = new IKAnalyzer(true);
        System.out.println("Analyzer in use: " + analyzer.getClass().getSimpleName());
        TokenStream tokenStream = analyzer.tokenStream("content", "Harry Lazy, sycophant, followed John Doe, also live off the Day");
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // reset() must be called before incrementToken()
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        tokenStream.end();
        tokenStream.close();
    }

    // Standalone segmentation: use IKSegmenter, the core class that performs the actual segmentation, directly
    public static void demo1() throws Exception {
        StringReader reader = new StringReader("Harry Lazy, sycophant, follow John Doe, also live a well-off day");
        IKSegmenter ik = new IKSegmenter(reader, true); // true enables smart (maximum word length) segmentation
        Lexeme lexeme = null;
        try {
            while ((lexeme = ik.next()) != null) {
                System.out.println(lexeme.getLexemeText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            reader.close();
        }
    }
}
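
Beyond standalone tokenization, the analyzer can be plugged directly into Lucene indexing and search. The following is a minimal sketch against the Lucene 4.10.4 dependencies declared above; the in-memory RAMDirectory, the field name "content", and the sample text and query are illustrative assumptions, not part of the original example.

package cn.slimsmart.lucene.demo.example;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerIndexDemo {

    public static void main(String[] args) throws Exception {
        IKAnalyzer analyzer = new IKAnalyzer(true); // smart segmentation mode
        Directory dir = new RAMDirectory();         // in-memory index, for demonstration only

        // Index a single document whose "content" field is analyzed by IKAnalyzer.
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(dir, config);
        Document doc = new Document();
        doc.add(new TextField("content", "IK Analyzer 是一个开源的中文分词工具包", Field.Store.YES)); // illustrative text
        writer.addDocument(doc);
        writer.close();

        // Search the index; the query string is analyzed with the same analyzer.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser(Version.LUCENE_4_10_4, "content", analyzer).parse("中文分词"); // illustrative query
        for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("content"));
        }
        reader.close();
    }
}

Using the same analyzer for indexing and for query parsing keeps the query terms consistent with the indexed tokens.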

Note: you can extend your own dictionaries and stop words to suit your project's needs. Reference articles:
1. IKAnalyzer2012 Chinese word segmentation and stop-word removal: a beginner's primer
2. Lucene's Chinese word segmenter IKAnalyzer
3. IKAnalyzer used with Lucene and standalone, with a simple performance test
4. Lucene learning: IKAnalyzer Chinese word segmentation
5. Lucene 3.6 Chinese word segmentation, paged queries, highlighting, etc.
6. IKAnalyzer Chinese word segmentation and sentence-similarity calculation
