1. IKAnalyzer 3.0 Introduction
IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Three major versions have been released since 1.0. It began as a Chinese word segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with text analysis algorithms. IKAnalyzer 3.0 has become a general-purpose word segmentation component for Java that is independent of the Lucene project, while still providing a default implementation optimized for Lucene.
1.1 IKAnalyzer 3.0 features
It uses its own "forward-iteration fine-grained segmentation algorithm" and can process about 500,000 words per second.
Multi-sub-processor analysis mode, supporting English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese words (including handling of personal names and place names).
Optimized dictionary storage for a smaller memory footprint. User-defined extension dictionaries are supported.
IKQueryParser, a query parser optimized for Lucene full-text search (recommended by the author), uses an ambiguity analysis algorithm to optimize the permutations and combinations of search keywords, which can greatly improve the hit rate of Lucene searches.
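The analysis mode listed above, with separate handling for letters, numbers, and Chinese words, can be pictured as a dispatcher that classifies each character and hands runs of the same type to a dedicated handler. The following is only an illustrative sketch: CharTypeDemo, CharType, classify, and splitByType are hypothetical names rather than IKAnalyzer classes, and the real sub-processors also recognize composite tokens such as IP addresses, email addresses, and quantifiers, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of IKAnalyzer: groups text into runs by character type.
public class CharTypeDemo {
    enum CharType { LETTER, NUMBER, CJK, OTHER }

    // Classify a single character into one of the coarse types.
    static CharType classify(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return CharType.LETTER;
        if (c >= '0' && c <= '9') return CharType.NUMBER;
        if (Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN) return CharType.CJK;
        return CharType.OTHER;
    }

    // Split the text into maximal runs of one type; runs of OTHER (punctuation, spaces) are dropped.
    static List<String> splitByType(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            CharType type = classify(text.charAt(start));
            int end = start + 1;
            while (end < text.length() && classify(text.charAt(end)) == type) {
                end++;
            }
            if (type != CharType.OTHER) {
                runs.add(type + ":" + text.substring(start, end));
            }
            start = end;
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(splitByType("IKAnalyzer3.0中文分词"));
    }
}
```

In a real analyzer each run type would be passed to its own sub-processor; here the type is simply attached to the run as a label.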
1.2 Example segmentation results
Original Text 1:
IK-Analyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. Since version 1.0 was released in December 2006, IKAnalyzer has released three major versions.
Word splitting result:
IK-Analyzer | yes | one | open-source | based on | Java | language | development | lightweight | level | magnitude | Chinese | word segmentation | toolkit | tool | from | 2006 | year | 12 | month | available | 1.0 | version | start | IKAnalyzer | available | out | 3 | large | versions | version
Original Text 2:
Yonghe Fashion Jewelry Co., Ltd.
Word splitting result:
Yonghe | kimono | clothing | ornament | decoration | ornament | limited | company
Original Text 3:
Author's blog: linliangyi2007.javaeye.com Email address: [email protected]
Word splitting result:
author | blog | linliangyi2007.javaeye.com | 2007 | email | address | [email protected] | 2005
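The overlapping tokens in the result for Original Text 2 (for example "Yonghe", "kimono", and "clothing" sharing characters) are what a fine-grained forward pass produces: at every start position, every dictionary word beginning there is emitted. Below is a minimal sketch of that idea, not IKAnalyzer's actual implementation. It assumes a toy in-memory dictionary and assumes the untranslated input was 永和服装饰品有限公司; FineGrainedDemo, DICT, MAX_WORD_LEN, and segment are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of fine-grained forward segmentation over a toy dictionary.
public class FineGrainedDemo {
    // Toy dictionary (assumption); a real analyzer loads a large dictionary file.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "永和", "和服", "服装", "装饰品", "装饰", "饰品", "有限", "公司"));

    // Longest word in the toy dictionary, used to bound the inner scan.
    static final int MAX_WORD_LEN = 3;

    // Walk forward through the text; at each start position emit every
    // dictionary word that begins there, longest candidate first.
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int maxEnd = Math.min(text.length(), start + MAX_WORD_LEN);
            for (int end = maxEnd; end > start; end--) {
                String candidate = text.substring(start, end);
                if (DICT.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", segment("永和服装饰品有限公司")));
    }
}
```

Emitting every overlapping word favors recall at index time, since a later query for any of the sub-words can still match the document.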
2. User Guide
2.1 Download
Google Code open-source project: http://code.google.com/p/ik-analyzer/
Google Code SVN download: http://ik-analyzer.googlecode.com/svn/trunk/
2.2 Installation and deployment
The IKAnalyzer installation package includes:
IKAnalyzer3.0GA.jar
IKAnalyzer.cfg.xml
Installation and deployment are simple. Place IKAnalyzer3.0GA.jar in the project's lib directory, and place IKAnalyzer.cfg.xml in the root directory of the code (for web projects this is usually WEB-INF/classes, the same place as the Hibernate, log4j, and other configuration files).
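The location of the configuration file matters because files in the code root (WEB-INF/classes for a web project) are visible to the class loader, which is the usual way such files are found. The sketch below illustrates that lookup under stated assumptions: it assumes the file uses the standard Java XML properties format and that ext_dict is the key naming a user dictionary, neither of which this guide confirms. ConfigLookupDemo, SAMPLE_XML, and /mydict.dic are hypothetical, and the in-memory fallback exists only so the example runs without the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical sketch: locate a config file on the classpath and read it as XML properties.
public class ConfigLookupDemo {
    // Assumed file layout, used only when no real file is found on the classpath.
    static final String SAMPLE_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "  <entry key=\"ext_dict\">/mydict.dic</entry>\n"
          + "</properties>\n";

    static Properties load(String resourceName) {
        // Resources in the classpath root (e.g. WEB-INF/classes) are found here.
        InputStream in = ConfigLookupDemo.class.getClassLoader().getResourceAsStream(resourceName);
        if (in == null) {
            // Fallback for the demo: parse the assumed sample instead.
            in = new ByteArrayInputStream(SAMPLE_XML.getBytes(StandardCharsets.UTF_8));
        }
        Properties props = new Properties();
        try (InputStream stream = in) {
            props.loadFromXML(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load("IKAnalyzer.cfg.xml");
        System.out.println("ext_dict = " + props.getProperty("ext_dict"));
    }
}
```

If getResourceAsStream returns null in a real deployment, the file is most likely outside the classpath root, which is the first thing to check when a custom dictionary is not picked up.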
2.3 Lucene Quick Start
Sample code: IKAnalyzerDemo

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
// IKAnalyzer 3.0 classes
import org.wltea.analyzer.lucene.IKAnalyzer;
import org.wltea.analyzer.lucene.IKQueryParser;
import org.wltea.analyzer.lucene.IKSimilarity;

/**
 * IKAnalyzer demo
 * @author linly
 */
public class IKAnalyzerDemo {
    public static void main(String[] args) {
        // Name of the Lucene Document field
        String fieldName = "text";
        // Content to index and search
        String text = "IKAnalyzer is an open-source Chinese word segmentation toolkit that combines dictionary word segmentation and grammar word segmentation. It uses a new fine-grained splitting algorithm for forward iteration.";
        // Instantiate the IKAnalyzer tokenizer
        Analyzer analyzer = new IKAnalyzer();
        Directory directory = null;
        IndexWriter iwriter = null;
        IndexSearcher isearcher = null;
        try {
            // Create an in-memory index
            directory = new RAMDirectory();
            iwriter = new IndexWriter(directory, analyzer, true,
                    IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            doc.add(new Field(fieldName, text, Field.Store.YES,
                    Field.Index.ANALYZED));
            iwriter.addDocument(doc);
            iwriter.close();
            // Instantiate the searcher
            isearcher = new IndexSearcher(directory);
            // Use the IKSimilarity similarity evaluator in the searcher
            isearcher.setSimilarity(new IKSimilarity());
            String keyword = "Chinese word segmentation toolkit";
            // Use the IKQueryParser query parser to build the Query object
            Query query = IKQueryParser.parse(fieldName, keyword);
            // Retrieve the five records with the highest similarity
            TopDocs topDocs = isearcher.search(query, 5);
            System.out.println("Hit: " + topDocs.totalHits);
            // Output the results; iterate over the returned array, because
            // totalHits can exceed the five documents actually returned
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (int i = 0; i < scoreDocs.length; i++) {
                Document targetDoc = isearcher.doc(scoreDocs[i].doc);
                System.out.println("Content: " + targetDoc.toString());
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (isearcher != null) {
                try {
                    isearcher.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (directory != null) {
                try {
                    directory.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Execution result:
Hit: 1
Content: Document<stored/uncompressed,indexed,tokenized<text:IKAnalyzer is an open-source Chinese word segmentation toolkit that combines dictionary word segmentation and grammar word segmentation. It uses a new fine-grained splitting algorithm for forward iteration.>>