Lucene Chinese word segmentation with IK Analyzer

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since the release of version 1.0 in December 2006, IK Analyzer has gone through four major versions. It began as a Chinese segmentation component for the open-source Lucene project, combining dictionary-based segmentation with grammar analysis. From version 3.0 onward, IK evolved into a general-purpose Java word segmenter independent of the Lucene project, while still providing a default optimized integration for Lucene. The 2012 version adds a simple segmentation-ambiguity elimination algorithm, marking IK's shift from pure dictionary-based segmentation toward simulated semantic segmentation.
IK Analyzer 2012 Features:
1. A unique "forward iterative finest-granularity segmentation algorithm", supporting two segmentation modes: fine-grained and smart (see the sketch after this list);
2. High throughput: on an ordinary PC (Core2 i7 3.4 GHz dual core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK 2012 processes about 1.6 million characters per second (3000 KB/s);
3. The 2012 smart segmentation mode supports simple ambiguity elimination and merged output of numeral-quantifier compounds;
4. A multi-subprocessor analysis mode that handles English letters, numerals, and Chinese words, and is compatible with Korean and Japanese characters;
5. Optimized dictionary storage with a smaller memory footprint. User-defined dictionary extensions are supported; in particular, the 2012 dictionaries allow mixed Chinese, English, and numeric words.
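
To make the two modes concrete, here is a minimal sketch that uses the IKSegmenter API also shown in the test code later in this article; the sample sentence, class name, and helper method are illustrative only.

import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class SegmentationModeDemo {

    // Segments one sentence in the given mode and prints the tokens separated by " | ".
    static void segment(String text, boolean useSmart) throws Exception {
        IKSegmenter ik = new IKSegmenter(new StringReader(text), useSmart);
        StringBuilder out = new StringBuilder();
        Lexeme lexeme;
        while ((lexeme = ik.next()) != null) {
            out.append(lexeme.getLexemeText()).append(" | ");
        }
        System.out.println((useSmart ? "smart        : " : "fine-grained : ") + out);
    }

    public static void main(String[] args) throws Exception {
        String text = "中华人民共和国国歌"; // illustrative sample sentence ("National Anthem of the PRC")
        segment(text, false); // fine-grained mode: emits every dictionary word, including overlapping ones
        segment(text, true);  // smart mode: longer matches with simple ambiguity elimination
    }
}

Fine-grained mode outputs the overlapping sub-words of longer terms, while smart mode keeps only the disambiguated, longest matches.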
Project websites:
Google Code: https://code.google.com/p/ik-analyzer

GitHub: https://github.com/wks/ik-analyzer

GitHub: https://github.com/linvar/IKAnalyzer
Because IK Analyzer is not compatible with the latest Lucene releases out of the box, the IK Analyzer 2012FF hotfix 1 bundle adds support for Lucene 4.0 / Solr 4.0. Below we demonstrate Chinese word segmentation using Lucene 4 together with this latest IK Analyzer release.
Download the latest release from https://ik-analyzer.googlecode.com/files/IK%20Analyzer%202012FF_hf1.zip and place the following files on the classpath:
stopword.dic — the stop-word dictionary
ext.dic — a user-defined extension dictionary
IKAnalyzer.cfg.xml — the segmenter's extension configuration file (used mainly to register stopword.dic and ext.dic), for example:
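
A minimal IKAnalyzer.cfg.xml might look like the sketch below; the ext_dict / ext_stopwords entry keys and the semicolon-separated file lists reflect IK Analyzer 2012's usual configuration format, so verify them against the sample file shipped in the downloaded bundle.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- user extension dictionaries, semicolon-separated, resolved from the classpath -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- user extension stop-word dictionaries -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>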

Add IKAnalyzer2012FF_u1.jar to the project's lib directory and declare the dependencies in the Maven POM:

<properties>
    <lucene.version>4.10.4</lucene.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-highlighter</artifactId>
        <version>${lucene.version}</version>
    </dependency>
    <dependency>
        <groupId>org.wltea.ik-analyzer</groupId>
        <artifactId>IKAnalyzer2012FF_u1</artifactId>
        <version>1.0.0</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/lib/IKAnalyzer2012FF_u1.jar</systemPath>
    </dependency>
</dependencies>
Test code:

package cn.slimsmart.lucene.demo.example;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.wltea.analyzer.lucene.IKAnalyzer;

@SuppressWarnings("resource")
public class IKAnalyzerTest {

    public static void main(String[] args) throws Exception {
        demo();
    }

    // Segmentation test through the Lucene Analyzer interface
    public static void demo() throws Exception {
        IKAnalyzer analyzer = new IKAnalyzer(true);
        System.out.println("Analyzer in use: " + analyzer.getClass().getSimpleName());
        TokenStream tokenStream = analyzer.tokenStream("content", "Harry Lazy, sycophant, followed John Doe, also live off the Day");
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // reset() must be called before incrementToken()
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        tokenStream.end();
        tokenStream.close();
    }

    // Standalone segmentation: use IKSegmenter, the core class that performs the actual segmentation, directly
    public static void demo1() throws Exception {
        StringReader reader = new StringReader("Harry Lazy, sycophant, follow John Doe, also live a well-off day");
        IKSegmenter ik = new IKSegmenter(reader, true); // true enables smart (maximum word length) segmentation
        Lexeme lexeme = null;
        try {
            while ((lexeme = ik.next()) != null) {
                System.out.println(lexeme.getLexemeText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            reader.close();
        }
    }
}
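
Beyond standalone tokenization, the analyzer can be plugged directly into Lucene indexing and search. The following is a minimal sketch against the Lucene 4.10.4 dependencies declared above; the in-memory RAMDirectory, the field name "content", and the sample text and query are illustrative assumptions, not part of the original example.

package cn.slimsmart.lucene.demo.example;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerIndexDemo {

    public static void main(String[] args) throws Exception {
        IKAnalyzer analyzer = new IKAnalyzer(true); // smart segmentation mode
        Directory dir = new RAMDirectory();         // in-memory index, for demonstration only

        // Index a single document whose "content" field is analyzed by IKAnalyzer.
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(dir, config);
        Document doc = new Document();
        doc.add(new TextField("content", "IK Analyzer 是一个开源的中文分词工具包", Field.Store.YES)); // illustrative text
        writer.addDocument(doc);
        writer.close();

        // Search the index; the query string is analyzed with the same analyzer.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser(Version.LUCENE_4_10_4, "content", analyzer).parse("中文分词"); // illustrative query
        for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("content"));
        }
        reader.close();
    }
}

Using the same analyzer for indexing and for query parsing keeps the query terms consistent with the indexed tokens.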

Note: you can extend your own dictionaries and stop words to suit your project's needs. Reference articles:
1. IKAnalyzer2012 Chinese word segmentation and stop-word removal: a beginner's primer
2. Lucene's Chinese word segmenter IKAnalyzer
3. IKAnalyzer used with Lucene and standalone, with a simple performance test
4. Lucene learning: IKAnalyzer Chinese word segmentation
5. Lucene 3.6 Chinese word segmentation, paged queries, highlighting, etc.
6. IKAnalyzer Chinese word segmentation and sentence-similarity calculation
