IKAnalyzer Chinese Word Segmenter

Last Update: 2014-09-10 Source: Internet

Author: User

1. IKAnalyzer 3.0 Introduction

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. It has gone through three major versions since 1.0. Initially, it was a Chinese word segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with text analysis algorithms. The new IKAnalyzer 3.0 has evolved into a general-purpose word segmentation component for Java that is independent of the Lucene project, while still providing an optimized default implementation for Lucene.

1.1 IKAnalyzer 3.0 features

It uses a distinctive "forward-iteration, fine-grained segmentation algorithm" and can process about 500,000 words per second.

Multiple sub-processor analysis mode, supporting: English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese words (including person-name and place-name handling).

Optimized dictionary storage for a smaller memory footprint. User-defined dictionary extensions are supported.

IKQueryParser (recommended by the author), a query parser optimized for Lucene full-text search, uses an ambiguity analysis algorithm to optimize the permutations and combinations of search keywords, which can greatly improve the hit rate of Lucene searches.
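To make the idea of fine-grained, forward-iterating segmentation more concrete, here is a minimal sketch. It is not IKAnalyzer's actual implementation: the class FineGrainedSketch, the method segment, the maxWordLength parameter, and the tiny dictionary are all hypothetical names chosen for illustration. The sketch walks the text from left to right and, at each start position, emits every dictionary word that begins there rather than only the longest one. An unspaced English string stands in for Chinese text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration only; NOT IKAnalyzer's actual algorithm or API. */
public class FineGrainedSketch {

    /**
     * Scans the text left to right. At every start offset it emits every
     * dictionary word beginning there, so overlapping words are all kept.
     */
    public static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int limit = Math.min(text.length(), start + maxWordLength);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical mini-dictionary, invented for this example.
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("open", "opensource", "source", "tool", "toolkit", "kit"));
        // Prints: [open, opensource, source, tool, toolkit, kit]
        System.out.println(segment("opensourcetoolkit", dictionary, 10));
    }
}
```

Keeping both the long word and the shorter words inside it ("toolkit" as well as "tool" and "kit") is what makes the output fine-grained: a search for either form can then match the indexed text.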

1.2 Example of word segmentation results

Original Text 1:

IKAnalyzer is an open-source lightweight Chinese word segmentation toolkit developed based on the Java language. IKAnalyzer has released three major versions since 1.0.

Word splitting result:

3 | Large | Versions | version

Original Text 2:

Yonghe fashion jewelry Co., Ltd.

Word splitting result:

Original Text 3:

Author's blog: linliangyi2007.javaeye.com Email address: [email protected]

Word splitting result:

Address | [email protected] | 2005

2. User Guide

2.1 Download

Google Code open-source project: http://code.google.com/p/ik-analyzer/

Google Code SVN download: http://ik-analyzer.googlecode.com/svn/trunk/

2.2 Installation and deployment

The IKAnalyzer installation package includes:

IKAnalyzer3.0GA.jar
IKAnalyzer.cfg.xml

Installation and deployment are simple. Place IKAnalyzer3.0GA.jar in the project's lib directory, and place IKAnalyzer.cfg.xml in the root of the code directory (for web projects, this is usually WEB-INF/classes, alongside the Hibernate, log4j, and other configuration files).
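Since the configuration file has to sit at the root of the classpath, a quick check can confirm that it was deployed where the deployment note above says. The sketch below uses only the standard Java class loader API. The class name ConfigLocationCheck and the method locate are hypothetical, and the article does not describe how IKAnalyzer itself loads the file, so this only verifies classpath visibility.

```java
import java.net.URL;

/** Sketch: checks whether a resource is visible at the classpath root. */
public class ConfigLocationCheck {

    /** Returns the URL of the resource, or null if it is not on the classpath. */
    public static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName);
    }

    public static void main(String[] args) {
        // File name taken from the installation package list.
        String name = "IKAnalyzer.cfg.xml";
        URL url = locate(name);
        if (url != null) {
            System.out.println("Found: " + url);
        } else {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

In a web project, running this check from inside the application (for example, from a startup listener) is more telling than running it from the command line, because the container's class loader is what sees WEB-INF/classes.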

2.3 Lucene Quick Start

Sample code

IKAnalyzerDemo

    /**
     * IKAnalyzer demo
     */
    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.LockObtainFailedException;
    import org.apache.lucene.store.RAMDirectory;

    // Reference the IKAnalyzer 3.0 classes
    import org.wltea.analyzer.lucene.IKAnalyzer;
    import org.wltea.analyzer.lucene.IKQueryParser;
    import org.wltea.analyzer.lucene.IKSimilarity;

    /**
     * @author linly
     */
    public class IKAnalyzerDemo {

        public static void main(String[] args) {
            // Name of the Lucene Document field
            String fieldName = "text";
            // Content to index and search
            String text = "ikanalyzer is an open-source Chinese Word Segmentation toolkit that combines dictionary word segmentation and grammar word segmentation. It uses a new fine-grained Splitting Algorithm for forward iteration.";

            // Instantiate the IKAnalyzer word segmenter
            Analyzer analyzer = new IKAnalyzer();

            Directory directory = null;
            IndexWriter iwriter = null;
            IndexSearcher isearcher = null;
            try {
                // Create an in-memory index
                directory = new RAMDirectory();
                iwriter = new IndexWriter(directory, analyzer, true,
                        IndexWriter.MaxFieldLength.LIMITED);
                Document doc = new Document();
                doc.add(new Field(fieldName, text, Field.Store.YES,
                        Field.Index.ANALYZED));
                iwriter.addDocument(doc);
                iwriter.close();

                // Instantiate the searcher
                isearcher = new IndexSearcher(directory);
                // Use the IKSimilarity similarity evaluator in the searcher
                isearcher.setSimilarity(new IKSimilarity());

                String keyword = "Chinese Word Segmentation toolkit";

                // Use the IKQueryParser query analyzer to construct a Query object
                Query query = IKQueryParser.parse(fieldName, keyword);

                // Retrieve the five records with the highest similarity
                TopDocs topDocs = isearcher.search(query, 5);
                System.out.println("Hit: " + topDocs.totalHits);

                // Output the results. Iterate over the returned array, not
                // totalHits, which can exceed the five records requested.
                ScoreDoc[] scoreDocs = topDocs.scoreDocs;
                for (int i = 0; i < scoreDocs.length; i++) {
                    Document targetDoc = isearcher.doc(scoreDocs[i].doc);
                    System.out.println("Content: " + targetDoc.toString());
                }
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (LockObtainFailedException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                if (isearcher != null) {
                    try {
                        isearcher.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                if (directory != null) {
                    try {
                        directory.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

Execution result:

Content: Document<stored/uncompressed,indexed,tokenized<text:ikanalyzer is an open-source Chinese Word Segmentation toolkit that combines dictionary word segmentation and grammar word segmentation. It uses a new fine-grained Splitting Algorithm for forward iteration.>>

This article is an English version of an article originally published in Chinese on aliyun.com.
