Here are several open source Chinese word segmentation systems.
1. ICTCLAS – the world's most popular Chinese word segmentation system
Chinese lexical analysis is the basis and key of Chinese information processing. Building on years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). Its main functions include Chinese word segmentation, POS tagging, named entity recognition, and new word recognition; it also supports user dictionaries, traditional Chinese, and encodings such as GBK, UTF-8, UTF-7, and Unicode. After five years of careful development and six core upgrades, the system has reached ICTCLAS 3.0. According to its developers, ICTCLAS 3.0 segments at 996 KB/s on a single machine with a segmentation precision of 98.45%, the API is no larger than 200 KB, and the dictionary data is under 3 MB after compression. Its developers describe it as currently the world's best Chinese lexical analyzer.
System platform: Windows
Development languages: C++, Java, C#
How to use: DLL call
Demo URL: http://ictclas.org/test.html
Open Source website: http://ictclas.org
Clear Maple Note: ICTCLAS comes in a shared edition, a business edition, and an industry edition, and it supports Linux, but it is not open source. ICTCLAS is now a commercial product with a wide range of applications, so I believe its segmentation efficiency is good.
2. HTTPCWS – open source Chinese word segmentation system based on the HTTP protocol
HTTPCWS is an open source Chinese word segmentation system based on the HTTP protocol; at present it supports only Linux. HTTPCWS uses the API of the "ICTCLAS 3.0 2009 shared edition Chinese word segmentation algorithm" to segment text and return the segmentation result.
ICTCLAS is a Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences on the basis of many years of research and built on a multilayer hidden Markov model. Its main functions include Chinese word segmentation, POS tagging, named entity recognition, new word recognition, and user dictionary support. After five years of careful development and six core upgrades, it has reached ICTCLAS 3.0, with a segmentation precision of 98.45% and dictionary data under 3 MB after compression. ICTCLAS took first place in the evaluation organized by the domestic 973 expert group and won several first places in the first international evaluation held by SIGHAN, the Chinese language processing research organization; it is described as the best Chinese lexical analyzer in the world at present.
The commercial edition of ICTCLAS 3.0 is paid, and the free ICTCLAS 3.0 shared edition is not open source. Its dictionary is built from a one-month People's Daily corpus, so many words are missing. I therefore added a custom dictionary of 190,000 words and merged it with the ICTCLAS segmentation results to produce the final segmentation output.
Because the ICTCLAS 3.0 2009 shared edition supports only GBK encoding, a UTF-8 string must first be converted to GBK (for example with the iconv function), then segmented with HTTPCWS, and finally converted back to UTF-8.
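The round trip described above can be sketched in Python. This is only an illustration under stated assumptions, not HTTPCWS client code: `segment_gbk` is a hypothetical stand-in for whatever call hands GBK bytes to the segmenter, and here it returns its input unchanged so that the example stays self-contained.

```python
def segment_gbk(gbk_bytes: bytes) -> bytes:
    """Hypothetical placeholder for the GBK-only segmenter call.

    A real client would send gbk_bytes to the segmentation service and
    return the GBK-encoded result; this stub returns the input unchanged.
    """
    return gbk_bytes


def segment_utf8(utf8_bytes: bytes) -> bytes:
    """Convert UTF-8 input to GBK, segment it, and convert the result back."""
    text = utf8_bytes.decode("utf-8")
    # errors="replace" is an assumption: characters outside GBK become "?".
    gbk_in = text.encode("gbk", errors="replace")
    gbk_out = segment_gbk(gbk_in)
    return gbk_out.decode("gbk").encode("utf-8")


if __name__ == "__main__":
    sample = "中文分词系统".encode("utf-8")
    print(segment_utf8(sample).decode("utf-8"))
```

Characters that GBK cannot represent are lost in the first conversion, so the result may not match the input exactly for text outside the GBK repertoire.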
The HTTPCWS software itself (including the httpcws.cpp source file and the dict/httpcws_dict.txt custom dictionary) is released under the New BSD open source license and can be modified freely. The ICTCLAS shared edition API and the corpus in the dict/data/ directory used by HTTPCWS are copyrighted by the Institute of Computing Technology of the Chinese Academy of Sciences and ictclas.org, and their use must follow the relevant license.
System platform: Linux
Development language: C++
How to use: HTTP service
Demo URL: http://blog.s135.com/demo/httpcws/
Open Source website: http://blog.s135.com/httpcws_v100/
Clear Maple Note: It is based on ICTCLAS, adds a 190,000-word dictionary extension, and is packaged as an HTTP service, which makes it more convenient to use.
3. SCWS – Simple Chinese Word Segmentation system
SCWS is not conceptually novel. It uses a self-collected word frequency dictionary, supplemented by rule sets for proper nouns, personal names, place names, numbers and years, and so on. In small-scale tests its accuracy is roughly 90% to 95%, which is enough for small and medium-sized search engines, keyword extraction, and similar uses. SCWS is written in pure C, targets Unix-like operating systems as its main platform, and provides a shared library so that it can be embedded into existing software. It also supports GBK, UTF-8, BIG5, and other encodings, and it segments efficiently.
System platform: Windows/Unix
Development language: C
How to use: PHP extension
Demo URL: http://www.ftphp.com/scws/demo.php
Open Source website: http://www.ftphp.com/scws/
Clear Maple Note: As a PHP extension, it integrates easily with existing PHP-based web systems, which is a big advantage.
4. PHPAnalysis – component-free PHP word segmentation system
The PHPAnalysis word segmentation system is based on string-matching segmentation, also called mechanical segmentation. Following a given strategy, it matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified). By scanning direction, string-matching segmentation divides into forward matching and reverse matching. By length preference, it divides into maximum (longest) matching and minimum (shortest) matching. By whether it is combined with POS tagging, it divides into pure segmentation and integrated segmentation and tagging.
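As an illustration of the forward and reverse maximum matching strategies described above, here is a minimal Python sketch. It is not PHPAnalysis code; the function names, the `max_len` limit, and the toy dictionary are assumptions chosen for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


if __name__ == "__main__":
    toy_dict = {"研究", "研究生", "生命", "起源", "的"}  # illustrative only
    sentence = "研究生命的起源"
    print(forward_max_match(sentence, toy_dict))  # ['研究生', '命', '的', '起源']
    print(reverse_max_match(sentence, toy_dict))  # ['研究', '生命', '的', '起源']
```

With this toy dictionary the two scanning directions disagree on the same sentence, which shows why the choice of direction and length preference matters.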
System platform: PHP environment
Development language: PHP
How to use: HTTP service
Demo URL: http://www.itgrass.com/phpanalysis/
Open Source website: http://www.itgrass.com/phpanalysis/
Clear Maple Note: It is simple to implement and easy to use, and it is suitable for simple applications, but on large amounts of data it is less efficient than the previous systems.
I tried several of these systems. The basic word segmentation works without problems in all of them, but they differ somewhat in how they split certain words and in how they assign parts of speech.
5. mmseg4j
mmseg4j is a Java-based open source Chinese word segmentation component that provides Lucene and Solr interfaces.
1) mmseg4j uses Chih-Hao Tsai's MMSEG algorithm to implement the Chinese segmenter, and it implements a Lucene Analyzer and a Solr TokenizerFactory for convenient use in Lucene and Solr.
2) The MMSEG algorithm has two segmentation methods, simple and complex, both based on forward maximum matching. The complex method adds four filtering rules. The official figure for correct word identification is 98.41%. mmseg4j implements both methods.
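The complex method can be sketched as follows. This is a simplified Python illustration based on my understanding of Tsai's published description, not mmseg4j's code: at each position it enumerates chunks of up to three words and filters them with four rules (largest total length, largest average word length, smallest variance of word lengths, largest sum of log frequencies of single-character words). The function names, the toy dictionary, and the optional `char_freq` table are assumptions.

```python
import math


def candidate_words(text, start, dictionary, max_len=4):
    """Words that may begin at `start`: the single character plus dictionary matches."""
    found = [text[start:start + 1]]
    for size in range(2, min(max_len, len(text) - start) + 1):
        piece = text[start:start + size]
        if piece in dictionary:
            found.append(piece)
    return found


def chunks_from(text, start, dictionary, depth=3):
    """All chunks of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, dictionary):
        for rest in chunks_from(text, start + len(word), dictionary, depth - 1):
            result.append([word] + rest)
    return result


def chunk_score(chunk, char_freq):
    """Tuple compared left to right, so each rule only breaks ties of the previous one."""
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    mean = total / len(lengths)
    # With total and word count tied, a smaller sum of squares means a smaller variance.
    squares = sum(n * n for n in lengths)
    # Frequencies must be positive; unknown characters default to 1 (log 1 = 0).
    freedom = sum(math.log(char_freq.get(w, 1)) for w in chunk if len(w) == 1)
    return (total, mean, -squares, freedom)


def mmseg_complex(text, dictionary, char_freq=None):
    """Pick the best chunk at each position and emit only its first word."""
    char_freq = char_freq or {}
    words, i = [], 0
    while i < len(text):
        best = max(chunks_from(text, i, dictionary),
                   key=lambda c: chunk_score(c, char_freq))
        words.append(best[0])
        i += len(best[0])
    return words


if __name__ == "__main__":
    toy_dict = {"研究", "研究生", "生命", "起源", "的"}  # illustrative only
    print(mmseg_complex("研究生命的起源", toy_dict))  # ['研究', '生命', '的', '起源']
```

Plain forward maximum matching would start with "研究生" on this sentence; here the variance rule prefers the chunk with more evenly sized words, so the first word emitted is "研究".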
6. Pangu word segmentation
Pangu is an open source Chinese word segmentation component for the .NET platform, providing interfaces for Lucene (.NET version) and HubbleDotNet.
Efficient: on a Core Duo 1.8 GHz, single-threaded segmentation speed is 390K characters per second.
Accurate: Pangu combines dictionary-based and statistical segmentation algorithms, which gives higher segmentation accuracy.
Functions: Pangu provides Chinese personal name recognition, mixed simplified/traditional segmentation, multi-gram segmentation, English stemming, forced unigram segmentation, word-frequency-priority segmentation, stop word filtering, English proper name extraction, and other features.
7. IKAnalyzer – open source lightweight Chinese word segmentation toolkit
IKAnalyzer is an open source, lightweight Chinese word segmentation toolkit written in Java. Since the 1.0 release in December 2006, IKAnalyzer has gone through 3 major versions. Initially it was a Chinese segmentation component built mainly for the open source project Lucene, combining dictionary-based segmentation with grammar analysis algorithms. The new IKAnalyzer 3.0 has become a general-purpose segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene.
IKAnalyzer 3.0 features:
It uses a unique "forward iterative finest-grained segmentation algorithm" and processes 600,000 characters per second.
It uses a multi-sub-processor analysis mode that handles English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese vocabulary (personal names and place names). Dictionary storage is optimized for a smaller memory footprint.
It supports user dictionary extensions. For Lucene full-text search it provides the optimized query analyzer IKQueryParser (strongly recommended by the author), which uses an ambiguity analysis algorithm to optimize the permutations and combinations of query keywords and can greatly improve the hit rate of Lucene retrieval.