Chinese word segmentation is the basis for Chinese content retrieval and text analysis, and is used mainly in search engines and data mining. Chinese is written with the character as its basic unit, and words are not separated by spaces as in English; the difficulty of Chinese word segmentation therefore lies in splitting text into words both accurately and quickly. The following describes several open-source Chinese word segmentation systems.
1. ICTCLAS - the world's most popular Chinese word segmentation system
Chinese lexical analysis is the basis and key of Chinese information processing. Drawing on years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). Its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, with support for user dictionaries, traditional Chinese, and the GBK, UTF-8, UTF-7, and Unicode encodings. After five years of careful development and six kernel upgrades, it has now reached ICTCLAS 3.0, which achieves a word segmentation accuracy of 98.45% and compresses its dictionary data to under 3 MB. ICTCLAS 3.0 is billed as the best Chinese lexical analyzer in the world.
System Platform: Windows
Development languages: C/C++, Java, and C#
Usage: DLL call (a calling sketch follows the note below)
Demo URL: http://ictclas.org/test.html
Open source Official Website: http://ictclas.org
Note: ICTCLAS comes in a shared edition, a commercial edition, and an industry edition. It also supports the Linux platform, but is not open source. ICTCLAS has long been available commercially and is widely deployed, and its word segmentation efficiency is believed to be excellent.
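Since ICTCLAS is consumed through a DLL, one common route from Java is a JNA binding. The sketch below is purely illustrative: the library name, the exported function names, and their signatures are assumptions for the example and must be checked against the header files that ship with the ICTCLAS shared edition.

```java
import com.sun.jna.Library;
import com.sun.jna.Native;

public class IctclasDemo {
    // Hypothetical JNA mapping of the ICTCLAS DLL; the exact exported function
    // names and signatures must be taken from the ICTCLAS API headers.
    public interface Ictclas extends Library {
        boolean ICTCLAS_Init(String dataDir);                       // assumed signature
        String  ICTCLAS_ParagraphProcess(String text, int posTag);  // assumed signature
        void    ICTCLAS_Exit();                                     // assumed signature
    }

    public static void main(String[] args) {
        // "ICTCLAS30" is an assumed library name (e.g. ICTCLAS30.dll on Windows).
        Ictclas ictclas = Native.load("ICTCLAS30", Ictclas.class);
        if (!ictclas.ICTCLAS_Init(".")) {
            throw new IllegalStateException("ICTCLAS failed to initialize");
        }
        // Segment a sentence; the second argument toggles part-of-speech tagging.
        System.out.println(ictclas.ICTCLAS_ParagraphProcess("中文分词测试", 1));
        ictclas.ICTCLAS_Exit();
    }
}
```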
2. HTTPCWS - an HTTP-based open-source Chinese word segmentation system
HTTPCWS is an HTTP-based open-source Chinese word segmentation system that currently supports Linux only. It performs segmentation through the API of the ICTCLAS 3.0 2009 shared-edition Chinese word segmentation algorithm and returns the segmentation result.
ICTCLAS is a Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences, based on a multi-layer hidden Markov model. Its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, and it also supports user dictionaries. After five years of careful development and six upgrades, it has now reached ICTCLAS 3.0, with a word segmentation accuracy of 98.45% and dictionary data that compresses to less than 3 MB. ICTCLAS took first place in the evaluation organized by China's 973 expert group and again in the evaluation held by SIGHAN, the leading international body for Chinese language processing research, and is regarded as the world's best Chinese lexical analyzer.
The commercial ICTCLAS 3.0 is a paid product, and the free ICTCLAS 3.0 shared edition is not open source. Its lexicon is derived from only one month of People's Daily corpus, so many words are missing. I therefore added a custom dictionary of about 190,000 words, merged it with the ICTCLAS segmentation output, and produce the final segmentation result from the combination.
Because the ICTCLAS 3.0 2009 shared edition only supports GBK encoding, a UTF-8 string must first be converted to GBK (for example with the iconv function), then segmented with HTTPCWS, and finally converted back to UTF-8.
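As a rough illustration of this round trip, here is a minimal Java sketch of calling an HTTP segmentation service with GBK on the wire; the endpoint address and the query parameter name are assumptions made for the example, not taken from the HTTPCWS documentation.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class HttpcwsClient {
    public static void main(String[] args) throws Exception {
        String text = "这是一个中文分词的例子";           // UTF-8 text inside the JVM

        // Hypothetical endpoint and parameter name; check the HTTPCWS docs for the real ones.
        String base = "http://127.0.0.1:1985/";
        String query = URLEncoder.encode(text, "GBK");  // UTF-8 -> GBK before sending

        HttpURLConnection conn = (HttpURLConnection)
                new URL(base + "?w=" + query).openConnection();
        try (InputStream in = conn.getInputStream()) {
            // The service replies in GBK, so decode as GBK to get a proper Java string.
            String result = new String(in.readAllBytes(), "GBK");
            System.out.println(result);                 // segmented words from the service
        }
    }
}
```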
The HTTPCWS software (the httpcws.cpp source file and the dict/httpcws_dict.txt custom dictionary) is released under the New BSD license and may be freely modified. The ICTCLAS shared-edition API used by HTTPCWS and the corpus under the dict/data/ directory are copyrighted by the Institute of Computing Technology of the Chinese Academy of Sciences and ictclas.org, and must be used in accordance with the relevant agreements.
System Platform: Linux
Development language: C++
Usage: HTTP service
Demo URL: http://blog.s135.com/demo/httpcws/
Open source Official Website: http://blog.s135.com/httpcws_v100/
Qingfeng note: Built on ICTCLAS, it adds an extended dictionary of about 190,000 words and wraps everything in an HTTP service, which makes it easier to use.
3. SCWS - Simple Chinese Word Segmentation system
Conceptually, SCWS contains nothing particularly innovative. It relies on a word frequency dictionary it has collected itself, supplemented by rule sets for proper nouns, person names, place names, and numerals/dates. In small-scale tests its accuracy falls roughly between 90% and 95%, which is generally sufficient for small and medium-sized search engines, keyword extraction, and similar scenarios. SCWS is written in pure C, targets Unix-like operating systems as its main platform, and provides a shared library that can easily be embedded in existing software systems. It also supports GBK, UTF-8, Big5, and other Chinese character encodings, and segments text efficiently.
System Platform: Windows/Unix
Development language: C
Usage: PHP extension
Demo URL: http://www.ftphp.com/scws/demo.php
Open source Official Website: http://www.ftphp.com/scws/
Qingfeng note: Shipping as a PHP extension makes it easy to integrate with existing PHP-based web systems, which is a major advantage.
4. PHPAnalysis - a component-free PHP word segmentation system
The PHPAnalysis word segmentation system uses a string-matching approach, also called the mechanical segmentation method: the Chinese string to be analyzed is matched, according to certain policies, against the entries of a "sufficiently large" machine dictionary, and whenever a string is found in the dictionary the match succeeds and a word is recognized. By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by which length is tried first, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into pure segmentation methods and integrated methods that combine segmentation and tagging.
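To make the forward maximum matching variant concrete, here is a minimal, self-contained Java sketch; the tiny hard-coded dictionary and the maximum word length are illustrative assumptions and are not PHPAnalysis's actual lexicon. Reverse maximum matching works the same way but scans from the end of the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    // Toy dictionary; a real system loads a large lexicon from disk.
    private static final Set<String> DICT = Set.of("中文", "分词", "算法", "中文分词");
    private static final int MAX_WORD_LEN = 4;   // longest entry we ever try to match

    public static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(pos + MAX_WORD_LEN, text.length());
            String word = null;
            // Try the longest candidate first, then shrink until a dictionary hit.
            for (int len = end - pos; len >= 1; len--) {
                String cand = text.substring(pos, pos + len);
                if (DICT.contains(cand) || len == 1) {   // unknown single chars pass through
                    word = cand;
                    break;
                }
            }
            words.add(word);
            pos += word.length();
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("中文分词算法"));   // prints [中文分词, 算法]
    }
}
```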
System Platform: PHP environment
Development language: PHP
Usage: HTTP service
Demo URL: http://www.itgrass.com/phpanalysis/
Open source Official Website: http://www.itgrass.com/phpanalysis/
Qingfeng note: It is simple to set up and easy to use, and is adequate for lightweight applications, but on large volumes of data its efficiency is not as high as that of the systems above.
I tried several of these systems. Basic word segmentation works fine in all of them, but they sometimes split particular words differently, and their part-of-speech decisions also vary from system to system.
5. mmseg4j
mmseg4j is a Java-based open-source Chinese word segmentation component that provides Lucene and Solr interfaces.
1) mmseg4j implements a Chinese word segmenter based on Chih-Hao Tsai's MMSEG algorithm, and provides a Lucene Analyzer and a Solr TokenizerFactory so that it is easy to use in Lucene and Solr.
2) The MMSEG algorithm has two segmentation modes, simple and complex, both based on forward maximum matching; the complex mode adds four disambiguation rules. According to the official figures, the correct word recognition rate reaches 98.41%. mmseg4j implements both modes; a usage sketch follows.
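A minimal sketch of driving mmseg4j through the standard Lucene TokenStream API. It assumes the complex-mode analyzer is available as com.chenlb.mmseg4j.analysis.ComplexAnalyzer (verify the class and package names against the release you use) and a Lucene version whose TokenStream follows the reset/incrementToken/end contract.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;   // assumed class/package name

public class MmsegDemo {
    public static void main(String[] args) throws Exception {
        // ComplexAnalyzer uses the "complex" MMSEG mode (forward maximum matching + 4 rules).
        Analyzer analyzer = new ComplexAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("content", "研究生命起源")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());   // one segmented word per line
            }
            ts.end();
        }
    }
}
```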
6. Pangu word segmentation
Pangu word segmentation is an open-source Chinese word segmentation component for the .NET platform. It provides interfaces for Lucene.Net and HubbleDotNet.
Efficient: on a Core Duo 1.8 GHz, single-threaded segmentation speed reaches 390K characters per second.
Accurate: Pangu combines dictionary-based and statistical segmentation algorithms, which keeps segmentation accuracy high.
Functional: Pangu provides a range of features such as Chinese person name recognition, mixed simplified/traditional segmentation, multi-gram segmentation, English stemming, forced unigram segmentation, frequency-first segmentation, stop word filtering, and English proper name extraction.
7. IKAnalyzer - an open-source lightweight Chinese word segmentation toolkit
IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. It has gone through three major releases since version 1.0. It began as a component of the open-source Lucene project, combining dictionary-based segmentation with grammar analysis algorithms; the new IKAnalyzer 3.0 has grown into a general-purpose Java segmentation component that is independent of Lucene while still providing a default implementation optimized for Lucene.
IKAnalyzer 3.0 features:
It adopts a unique "forward-iteration fine-grained segmentation algorithm" and can process about 600,000 characters per second.
A multi-subprocessor analysis mode that handles English text (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese word segmentation (including person name and place name processing). Dictionary storage is optimized to reduce memory usage.
Supports user-defined dictionary extensions. The query analyzer IKQueryParser (recommended by the author) is optimized for Lucene full-text search: it applies an ambiguity analysis algorithm to optimize the arrangement and combination of query keywords, which greatly improves the hit rate of Lucene searches. A brief indexing sketch follows.
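As a small illustration of plugging IKAnalyzer into Lucene indexing, here is a hedged Java sketch. It assumes the analyzer class lives at org.wltea.analyzer.lucene.IKAnalyzer (as in the IK releases I am aware of) and a Lucene version that accepts an Analyzer-only IndexWriterConfig and still ships RAMDirectory, so verify both against your dependencies.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;   // assumed package name; verify against your IK release

public class IkIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();                               // in-memory index for the demo
        IndexWriterConfig cfg = new IndexWriterConfig(new IKAnalyzer());  // IK segments the indexed text
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new TextField("content", "IKAnalyzer是一个轻量级的中文分词工具包", Field.Store.YES));
            writer.addDocument(doc);                                      // tokenized by IKAnalyzer here
        }
    }
}
```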