IKAnalyzer Chinese Word Segmenter


1. IKAnalyzer 3.0 Introduction

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. It has gone through three major versions since 1.0. It started out as a Chinese word segmentation component built on the open-source Lucene project, combining dictionary-based segmentation with text analysis algorithms. The new version, IKAnalyzer 3.0, has grown into a general-purpose Java word segmentation component that is independent of the Lucene project while still providing a default Lucene-optimized implementation.

1.1 IKAnalyzer 3.0 Features

It adopts a unique "forward-iteration, finest-granularity" segmentation algorithm and offers high-speed processing of roughly 500,000 words per second.

A multi-subprocessor analysis mode that supports segmentation of English text (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese words (including personal-name and place-name processing).

Optimized dictionary storage with a smaller memory footprint, plus support for user-defined extension dictionaries.

IKQueryParser (recommended by the author), a query analyzer optimized for Lucene full-text search, uses an ambiguity analysis algorithm to optimize the arrangement and combination of query keywords, which can greatly improve the hit rate of Lucene searches.

1.2 Word Segmentation Examples

Original Text 1:

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. Since version 1.0 was released in December 2006, IKAnalyzer has published three major versions.

Word splitting result:

ikanalyzer | is | a | open-source | based on | java | language | developed | lightweight | weight | Chinese | word segmentation | toolkit | tool | since | 2006 | year | 12 | month | released | 1.0 | version | start | ikanalyzer | already | released | 3 | big | versions | version

(The tokens are English glosses of the original Chinese segmentation output; overlapping tokens such as "lightweight | weight" and "toolkit | tool" reflect the finest-granularity splitting.)

Original Text 2:

Yonghe Fashion Accessories Co., Ltd. (永和服装饰品有限公司)

Word splitting result:

Yonghe | kimono | clothing | ornaments | decoration | accessories | limited | company

Original Text 3:

Author's blog: linliangyi2007.javaeye.com    Email: [email protected]

Word splitting result:

author | blog | linliangyi2007.javaeye.com | 2007 | email | address | [email protected] | 2005


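Segmentation output like the examples above can also be produced directly from Java, without going through Lucene at all. The following is a minimal sketch using the IKSegmentation/Lexeme classes that IKAnalyzer 3.x exposes as its standalone entry point; the class and method names are quoted from memory of that API and may vary slightly between 3.x releases, so treat it as illustrative rather than copy-paste code.

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;

public class TokenizeDemo {

    public static void main(String[] args) throws IOException {
        String text = "IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java.";

        // IKSegmentation is IKAnalyzer's standalone segmenter, independent of Lucene.
        IKSegmentation segmenter = new IKSegmentation(new StringReader(text));

        // next() returns one Lexeme (token) at a time, or null when the input is exhausted.
        Lexeme lexeme;
        while ((lexeme = segmenter.next()) != null) {
            System.out.print(lexeme.getLexemeText() + " | ");
        }
        System.out.println();
    }
}

Printing the lexeme text of each token joined with " | " reproduces the kind of segmentation listings shown above.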

2. User Guide

2.1 Download

Google Code open-source project: http://code.google.com/p/ik-analyzer/

Google Code SVN download: http://ik-analyzer.googlecode.com/svn/trunk/

2.2 Installation and Deployment

The IKAnalyzer installation package includes:

  1. IKAnalyzer3.0GA.jar

  2. IKAnalyzer.cfg.xml

Installation and deployment are straightforward: put IKAnalyzer3.0GA.jar in the project's lib directory, and place IKAnalyzer.cfg.xml in the classpath root (for a web project this is usually the WEB-INF/classes directory, the same location as configuration files such as those for Hibernate and log4j).
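The user-dictionary extension mentioned in section 1.1 is configured through IKAnalyzer.cfg.xml. As a rough sketch only (the exact entry keys and file layout depend on the IKAnalyzer release, so check the copy bundled with your download), a configuration that registers an extension dictionary and a stop-word list might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- user-defined extension dictionaries on the classpath; multiple files separated by ';' -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- user-defined stop-word dictionaries -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>

The dictionary files referenced here (ext.dic and stopword.dic are placeholder names) are typically plain UTF-8 text files with one term per line.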

2.3 Lucene Quick Start

Sample code: IKAnalyzerDemo

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
// classes from IKAnalyzer 3.0
import org.wltea.analyzer.lucene.IKAnalyzer;
import org.wltea.analyzer.lucene.IKQueryParser;
import org.wltea.analyzer.lucene.IKSimilarity;

/**
 * IKAnalyzerDemo
 * @author linly
 */
public class IKAnalyzerDemo {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // field name of the Lucene Document
        String fieldName = "text";
        // content to be indexed
        String text = "IKAnalyzer is an open-source Chinese word segmentation toolkit that combines "
                + "dictionary-based and grammar-based segmentation. It uses a new forward-iteration, "
                + "finest-granularity segmentation algorithm.";

        // instantiate the IKAnalyzer segmenter
        Analyzer analyzer = new IKAnalyzer();

        Directory directory = null;
        IndexWriter iwriter = null;
        IndexSearcher isearcher = null;
        try {
            // create an in-memory index
            directory = new RAMDirectory();
            iwriter = new IndexWriter(directory, analyzer, true,
                    IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            doc.add(new Field(fieldName, text, Field.Store.YES, Field.Index.ANALYZED));
            iwriter.addDocument(doc);
            iwriter.close();

            // instantiate the searcher
            isearcher = new IndexSearcher(directory);
            // use the IKSimilarity similarity evaluator in the searcher
            isearcher.setSimilarity(new IKSimilarity());

            String keyword = "Chinese word segmentation toolkit";
            // build a Query object with the IKQueryParser query analyzer
            Query query = IKQueryParser.parse(fieldName, keyword);

            // fetch the 5 records with the highest similarity
            TopDocs topDocs = isearcher.search(query, 5);
            System.out.println("Hit: " + topDocs.totalHits);
            // print the results
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (int i = 0; i < scoreDocs.length; i++) {
                Document targetDoc = isearcher.doc(scoreDocs[i].doc);
                System.out.println("Content: " + targetDoc.toString());
            }

        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (isearcher != null) {
                try {
                    isearcher.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (directory != null) {
                try {
                    directory.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

Execution result:

Hit: 1

Content: Document<stored/uncompressed,indexed,tokenized<text:IKAnalyzer is an open-source Chinese word segmentation toolkit that combines dictionary-based and grammar-based segmentation. It uses a new forward-iteration, finest-granularity segmentation algorithm.>>
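A note on versions: the demo above is written against the older Lucene 2.x/3.x API that IKAnalyzer 3.0 targets (the IndexWriter constructor taking MaxFieldLength, Field.Index.ANALYZED, and new IndexSearcher(Directory) were all removed in later Lucene releases), so it should be compiled against the Lucene version recommended for IKAnalyzer 3.0 rather than a current one.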








