I have recently been learning about word segmentation for search engines and trying out several segmenter plugins, so I'd like to share them here with fellow developers.
This article introduces four word segmenters (ICTCLAS, IKAnalyzer, ANSJ, Jcseg), a way to implement your own algorithm, and some word library recommendations.
I. ICTCLAS
1.1. Introduction
Chinese lexical analysis is the foundation and key of Chinese information processing. Building on years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).
Its main functions include Chinese word segmentation, POS tagging, named entity recognition, new word recognition, and user dictionary support.
Carefully built over five years with six kernel upgrades, it has now reached ICTCLAS 3.0. ICTCLAS 3.0 segments at 996 KB/s on a single machine with a segmentation precision of 98.45%; the API is under 200 KB and the compressed dictionary data under 3 MB. In the most recent evaluation by the 973 expert group it achieved a precision of 97.58%, its role-based tagging recognizes unknown words with a recall above 90% (recall for Chinese person names is close to 98%), and combined segmentation plus POS tagging runs at 31.5 KB/s. ICTCLAS and 14 other freely released results from the Institute have been widely covered by Chinese and foreign media, and many free domestic Chinese segmentation modules borrow from the ICTCLAS code to some extent. It is an excellent Chinese lexical analyzer.
1.2. Example
The blogger's machine runs 64-bit Windows. If you are on a 32-bit system, refer to this article: http://blog.sina.com.cn/s/blog_64ecfc2f0102v1jp.html, which gives the 32-bit Windows ICTCLAS download URL and a detailed example.
On 64-bit Windows, you can follow the steps below.
(1) Download ICTCLAS50_Windows_64: http://download.csdn.net/detail/u013142781/9494942
(2) Create an ordinary Java project in Eclipse.
(3) After unpacking ICTCLAS50_Windows_64_JNI, copy the ICTCLAS folder and ICTCLAS_I3S_AC_ICTCLAS50.h from the API directory into the project's src directory.
(4) Copy everything else in the API directory (all files and folders other than the ICTCLAS folder and ICTCLAS_I3S_AC_ICTCLAS50.h) into the root directory of the Java project.
(5) Create a test class with the following code:
package com.luo.test;

import java.io.UnsupportedEncodingException;

import ICTCLAS.I3S.AC.ICTCLAS50;

public class Test {
    public static void main(String[] args) {
        ICTCLAS50 testICTCLAS50 = new ICTCLAS50();
        String argu = ".";  // directory where Configure.xml and the Data folder are stored
        // initialize
        try {
            if (testICTCLAS50.ICTCLAS_Init(argu.getBytes("GB2312")) == false) {
                System.out.println("Init Fail!");
                throw new Exception("Initialization error");
            }
        } catch (UnsupportedEncodingException e1) {
            e1.printStackTrace();
        } catch (Exception e1) {
            e1.printStackTrace();
        }

        String s = "Chinese lexical analysis is the basis and key of Chinese information processing";
        // a user dictionary could be imported here, before segmenting
        byte nativeBytes[];
        try {
            // segment the paragraph
            nativeBytes = testICTCLAS50.ICTCLAS_ParagraphProcess(s.getBytes("GB2312"), 0, 0);
            String nativeStr = new String(nativeBytes, 0, nativeBytes.length, "GB2312");
            String[] wordStrings = nativeStr.split(" ");
            for (String string : wordStrings) {
                System.out.println(string);
            }
        } catch (UnsupportedEncodingException e1) {
            e1.printStackTrace();
        }
    }
}
(6) Run result:
II. IKAnalyzer
2.1. Introduction
IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java.
Since the 1.0 release in December 2006, IKAnalyzer has gone through three major versions. Initially it was a Chinese segmentation component built on the open-source Lucene project, combining dictionary-based segmentation with grammar analysis. Since version 3.0, IKAnalyzer has evolved into a general-purpose Java segmenter, independent of Lucene, while still providing default optimizations for Lucene.
IK Analyzer 2012 features:
1. A unique "forward-iteration finest-grained segmentation algorithm", supporting two segmentation modes: fine-grained and smart;
2. High throughput: on an ordinary PC (Core i7 3.4 GHz dual-core, 4 GB RAM, 64-bit Windows 7, 64-bit Sun JDK 1.6_29), IK2012 processes about 1.6 million characters per second (3000 KB/s);
3. The 2012 smart mode supports simple disambiguation and merged output of numerals with measure words;
4. A multi-subprocessor analysis mode that handles English letters, numerals, and Chinese words, and is compatible with Korean and Japanese characters;
5. Optimized dictionary storage with a smaller memory footprint; user dictionary extension is supported, and in the 2012 version dictionaries may contain mixed Chinese, English, and numeric entries.
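To make feature 1 concrete, here is a toy sketch of the "forward iteration at the finest granularity" idea; this is my own simplification for illustration, not IK's actual code: scan the text left to right and emit every dictionary word that starts at each position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FinestGrainedDemo {
    // Emit every dictionary word that starts at each position (finest-grained),
    // scanning the text from left to right ("forward iteration").
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int len = 1; len <= maxLen && i + len <= text.length(); len++) {
                String cand = text.substring(i, i + len);
                if (dict.contains(cand)) out.add(cand);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "中华", "中华人民", "中华人民共和国", "人民", "共和国"));
        // prints [中华, 中华人民, 中华人民共和国, 人民, 共和国]
        System.out.println(segment("中华人民共和国", dict, 7));
    }
}
```

A real segmenter adds ambiguity resolution on top of this (that is roughly what IK's "smart" mode does); the fine-grained mode simply keeps all overlapping matches.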
Source: https://code.google.com/archive/p/ik-analyzer/downloads, where you can download the source code and the "IKAnalyzer Chinese Segmenter V2012 User Manual".
I highly recommend reading the IKAnalyzer V2012 user manual PDF in full; afterwards you should have a fairly complete picture of IKAnalyzer.
2.2. Example
Instance steps:
(1) Download IKAnalyzer2012.jar (http://download.csdn.net/detail/u013142781/9494963) and add it to the Java project.
(2) New test class:
package luo.test;

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKTest {
    public static void main(String[] args) throws IOException {
        String text = "IK Analyzer is an open source toolkit that combines dictionary-based and grammar-based segmentation. It uses a new forward-iteration finest-grained segmentation algorithm.";
        // stand-alone (Lucene-independent) usage
        StringReader re = new StringReader(text);
        IKSegmenter ik = new IKSegmenter(re, true);  // true enables smart mode
        Lexeme lex = null;
        try {
            while ((lex = ik.next()) != null) {
                System.out.print(lex.getLexemeText() + "|");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
(3) Run result:
III. ANSJ
3.1. Introduction
ANSJ Chinese word segmentation
This is a Java implementation of ICTCLAS. It essentially rewrites all the data structures and algorithms. The dictionary comes from the open-source version of ICTCLAS, with some manual optimizations added.
In-memory segmentation: about 1 million characters per second (faster than ICTCLAS)
Segmentation while reading from file: about 300,000 characters per second
Accuracy above 96%
Currently implemented: Chinese word segmentation, Chinese name recognition, and user-defined dictionaries
It can be applied to natural language processing and similar tasks, and suits any project that needs good segmentation results.
3.2. Example
(1) Download ansj_seg-20130808.jar (http://download.csdn.net/detail/u013142781/9494969) and add it to the Java project.
(2) Create a test class:
package com.luo.test;

import java.io.IOException;
import java.io.StringReader;

import org.ansj.domain.Term;
import org.ansj.splitWord.Analysis;
import org.ansj.splitWord.analysis.ToAnalysis;

public class Test {
    public static void main(String[] args) throws IOException {
        Analysis udf = new ToAnalysis(new StringReader(
                "ANSJ Chinese word segmentation is a true Java implementation of ICTCLAS. It adds some of its own data structures and algorithms. A perfect combination of efficiency and high accuracy!"));
        Term term = null;
        while ((term = udf.next()) != null) {
            System.out.print(term.getName() + " ");
        }
    }
}
(3) Run result:
IV. Jcseg
4.1. Introduction
Jcseg is an open-source Chinese word segmenter developed in Java, based on the MMSEG algorithm. Its segmentation accuracy reaches 98.4%, and it supports Chinese name recognition, synonym matching, stop-word filtering, and more; see the official Jcseg homepage for details.
Official homepage: https://code.google.com/p/jcseg/
Download: https://code.google.com/p/jcseg/downloads/list
Detailed Jcseg features (feel free to skim; this is mainly useful for seeing what changed in newer versions):
1. Current latest version: jcseg-1.9.2, compatible with the latest Lucene 4.x and the latest Solr 4.x.
2. Four MMSEG filtering rules, with segmentation accuracy reaching 98.41%.
3. Custom word libraries: under the lexicon folder you can freely add/remove/edit lexicon files and their contents, and the files are organized by category. See below for how to add new words to Jcseg.
4. (NEW) Multi-directory lexicon loading: separate multiple lexicon directories with ';' in the lexicon.path configuration entry.
5. (NEW) Lexicons are split into simplified/traditional/mixed sets: you can segment simplified-only, traditional-only, or mixed text, and together with the synonym support below implement simplified-traditional cross search. Jcseg also ships two small lexicon tools for simplified-traditional conversion and lexicon merging.
6. Chinese and English synonym appending/matching plus pinyin appending. The lexicon integrates entries from the Modern Chinese Dictionary and CC-CEDICT, with pinyin tagged from CC-CEDICT and synonyms tagged from the Chinese Synonym Dictionary (not yet complete). By editing the jcseg.properties configuration you can append pinyin and synonyms to the segmentation results.
7. Recognition of Chinese numerals and Chinese fractions, e.g. "one hundred and fifty" and "one fortieth" in "one hundred and fifty people came, one fortieth of them..."; Jcseg automatically converts them to Arabic form in the segmentation output, e.g. 150 and 1/40.
More features are described in the "Jcseg Development Help Document" PDF, downloadable from the official site.
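As a toy illustration of the numeral conversion in feature 7, here is a minimal Chinese-numeral-to-Arabic converter. This is my own simplified sketch, not Jcseg's implementation; it only handles plain numerals up to the 万 (10,000) scale and ignores fractions.

```java
public class ChineseNumberDemo {
    // Convert a simple Chinese numeral to an int, e.g. 一百五十 -> 150.
    static int toArabic(String s) {
        int result = 0;   // completed 万-scale sections
        int section = 0;  // current section below 万
        int digit = 0;    // pending digit awaiting a unit
        String digits = "零一二三四五六七八九";
        for (char c : s.toCharArray()) {
            int d = digits.indexOf(c);
            if (d >= 0) { digit = d; continue; }
            switch (c) {
                case '十': section += (digit == 0 ? 1 : digit) * 10; digit = 0; break;
                case '百': section += digit * 100; digit = 0; break;
                case '千': section += digit * 1000; digit = 0; break;
                case '万': result += (section + digit) * 10000; section = 0; digit = 0; break;
            }
        }
        return result + section + digit;
    }

    public static void main(String[] args) {
        System.out.println(toArabic("一百五十"));  // prints 150
        System.out.println(toArabic("四十"));      // prints 40
    }
}
```

A production segmenter like Jcseg has to do this while tokenizing, and also handle fractions such as 四十分之一 (1/40), but the digit-and-unit accumulation above is the core idea.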
4.2. Example
Download Jcseg: https://code.google.com/archive/p/jcseg/downloads; the blogger downloaded jcseg-1.9.2-src-jar-dict.
After unpacking, under the jcseg-1.9.2-src-jar-dict\jcseg-1.9.2 directory we need three things: lexicon (the word library used for segmentation), jcseg-core-1.9.2.jar, and jcseg.properties.
Instance steps:
(1) Introduce Jcseg-core-1.9.2.jar to the corresponding Java project.
(2) In jcseg.properties, set lexicon.path to the location of the lexicon folder (the word library), as below; remember to use forward slashes:
(3) Create a test class, paying attention to the paths used in the code:
package com.luo;

import java.io.IOException;
import java.io.StringReader;

import org.lionsoul.jcseg.ASegment;
import org.lionsoul.jcseg.core.ADictionary;
import org.lionsoul.jcseg.core.DictionaryFactory;
import org.lionsoul.jcseg.core.IWord;
import org.lionsoul.jcseg.core.JcsegException;
import org.lionsoul.jcseg.core.JcsegTaskConfig;
import org.lionsoul.jcseg.core.SegmentFactory;

public class Test {
    public static void main(String[] args) throws IOException, JcsegException {
        // create a JcsegTaskConfig task instance, initialized from the jcseg.properties config file
        JcsegTaskConfig config = new JcsegTaskConfig("D:/notworddevsoftware/eclipseworkspace/jcseg_test/jcseg.properties");
        config.setAppendCJKPinyin(true);
        // create the default dictionary and load it according to the given JcsegTaskConfig
        ADictionary dic = DictionaryFactory.createDefaultDictionary(config, true);
        // this is the unpacked location of jcseg-1.9.4-src-jar-dict.zip; find lex-main.lex under the lexicon folder yourself
        dic.loadFromLexiconFile("D:/notworddevsoftware/eclipseworkspace/jcseg_test/lexicon/lex-main.lex");
        // create the ISegment: usually via SegmentFactory#createJcseg from the given ADictionary and JcsegTaskConfig,
        // passing config and dic as an Object array;
        // JcsegTaskConfig.COMPLEX_MODE creates a ComplexSeg (complex) ISegment object,
        // JcsegTaskConfig.SIMPLE_MODE creates a SimpleSeg (simple) ISegment object
        ASegment seg = (ASegment) SegmentFactory.createJcseg(JcsegTaskConfig.COMPLEX_MODE, new Object[]{config, dic});
        // set the content to segment
        String str = "Jcseg is an open-source Chinese word segmenter developed in Java, based on the MMSEG algorithm. Its segmentation accuracy reaches 98.4%.";
        seg.reset(new StringReader(str));
        // fetch the segmentation results
        IWord word = null;
        while ((word = seg.next()) != null) {
            System.out.print(word.getValue() + "|");
        }
    }
}
(4) Run result:
V. Implementing your own algorithm
IKAnalyzer, ANSJ, and Jcseg above are all open-source Java projects, so you can modify their source code to fit your own needs.
Of course, you can also write your own segmentation algorithm from scratch. Here is an article the blogger read earlier, with a very clear and detailed approach: "Analysis of Baidu's word segmentation algorithm".
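If you do roll your own, dictionary-based forward maximum matching is the classic baseline to start from. The sketch below is a minimal version of that general technique (my own illustration, not the Baidu algorithm from the article): at each position, greedily take the longest dictionary word, falling back to a single character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FmmDemo {
    // Forward maximum matching: at each position take the longest dictionary word;
    // fall back to a single character when nothing matches.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = text.substring(i, i + 1);  // single-character fallback
            for (int j = end; j > i + 1; j--) {       // try longest candidates first
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            out.add(match);
            i += match.length();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("商品", "搜索", "引擎", "搜索引擎"));
        // prints [商品, 搜索引擎]
        System.out.println(segment("商品搜索引擎", dict, 4));
    }
}
```

Real segmenters improve on this greedy baseline with backward matching, MMSEG-style ambiguity rules, or statistical models, but it already shows why the word library (next section) matters so much: segmentation quality is bounded by dictionary coverage.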
VI. Word library recommendation
Word segmentation is built on a word library, so here is a recommendation: the Sogou input method cell lexicons. They are comprehensive and already organized by category; for example, if you are building a product search engine, you can fetch the related lexicons to help improve accuracy: http://pinyin.sogou.com/dict/cate/index/394
The downloaded lexicons are in .scel format; you can convert them with the "Deep Blue cell lexicon scel-to-TXT converter".
Product Search Engine --- Word Segmentation (plugin introduction and getting-started examples)