A Summary of Chinese Word Segmentation Projects (Open Source / API Interfaces)

1) ICTCLAS

ICTCLAS is one of the earliest open-source Chinese word segmentation projects. Developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences (CAS), it is written in C++ and based on their research into Chinese lexical analysis with multi-layer hidden Markov models. The open-source version is FreeICTCLAS; the latest API-based release is the NLPIR/ICTCLAS2014 segmentation system. (The NLPIR segmentation system was first published in 2000 as the ICTCLAS lexical analysis system; starting in 2009 it was renamed the NLPIR segmentation system, both to distinguish it from earlier releases and to promote the NLPIR natural language processing and information retrieval sharing platform.)
The FreeICTCLAS source code is available at:
https://github.com/hecor/ICTCLAS-2009-free
https://github.com/pierrchen/ictclas_plus (ICTCLAS version 1.0 for Linux)
http://download.csdn.net/detail/shinezlee/1535796
http://www.codeforge.cn/article/106151
The NLPIR/ICTCLAS2014 API can be downloaded from:
http://ictclas.nlpir.org/downloads
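The multi-layer hidden Markov approach behind ICTCLAS ultimately rests on Viterbi decoding of an HMM. A minimal, self-contained sketch of Viterbi decoding for B/E/S character tagging is shown below; all probabilities are invented toy numbers for illustration, not ICTCLAS's actual model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of any path ending in state s at time t,
    #            the predecessor state on that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            V[t][s] = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: V[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

# Toy B (word begin) / E (word end) / S (single) tagging of 中国人.
states = ("B", "E", "S")
start_p = {"B": 0.6, "E": 0.0, "S": 0.4}
trans_p = {"B": {"B": 0.0, "E": 1.0, "S": 0.0},
           "E": {"B": 0.5, "E": 0.0, "S": 0.5},
           "S": {"B": 0.5, "E": 0.0, "S": 0.5}}
emit_p = {"B": {"中": 0.5, "国": 0.2, "人": 0.3},
          "E": {"中": 0.1, "国": 0.6, "人": 0.3},
          "S": {"中": 0.3, "国": 0.1, "人": 0.6}}
print(viterbi("中国人", states, start_p, trans_p, emit_p))
# → ['B', 'E', 'S'], i.e. the segmentation 中国 / 人
```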
Other versions:

(a) A C# version, written by Lu Zhenyu based on the open-source C++ FreeICTCLAS.
The download address is:
https://github.com/smartbooks/SharpICTCLAS (original)
https://github.com/geekfivestart/SharpICTCLAS (with multi-thread support)

(b) For code walkthroughs of the ICTCLAS segmentation system and SharpICTCLAS, see:
http://www.cnblogs.com/zhenyulu/articles/653254.html
http://sewm.pku.edu.cn/QA/reference/ICTCLAS/FreeICTCLAS/codes.html

(c) ICTCLAS4J is an open-source Java word segmentation project by sinboy, based on FreeICTCLAS, which simplifies the complexity of the original segmentation program.
The download address is:
http://sourceforge.net/projects/ictclas4j/
https://code.google.com/p/ictclas4j/

(d) Calling ICTCLAS from Python
For calling NLPIR (ICTCLAS2013) from Python, see:
http://ictclas.nlpir.org/newsDetail?DocId=382
For Python wrappers for ICTCLAS 2015, see:
https://github.com/haobibo/ICTCLAS_Python_Wrapper
https://github.com/tsroten/pynlpir (maintained by an overseas developer; documented at http://pynlpir.rtfd.org)

2) MMSEG

The MMSEG algorithm is due to Chih-Hao Tsai ("A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm"). It provides two segmentation methods: simple (forward maximum matching only) and complex (three-word chunk maximum matching with additional rules to resolve ambiguities). Both are built on forward maximum matching; complex mode applies four disambiguation rules in total (maximum matching, largest average word length, smallest variance of word lengths, and largest sum of degree of morphemic freedom of single-character words).
The source code download address is:
http://technology.chtsai.org/mmseg/
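Forward maximum matching (FMM), the basis of both MMSEG modes, is simple to sketch; the toy dictionary below is illustrative only:

```python
def fmm_segment(text, dictionary, max_len=4):
    # Forward maximum matching: at each position, take the longest
    # dictionary word starting there; fall back to a single character.
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in dictionary:
                words.append(chunk)
                i += size
                break
    return words

dictionary = {"研究", "研究生", "生命", "起源"}
print(fmm_segment("研究生命起源", dictionary))  # → ['研究生', '命', '起源']
```

The output shows the classic FMM failure (the intended reading is 研究 / 生命 / 起源); MMSEG's complex mode exists precisely to catch such cases by scoring three-word chunks under its additional rules.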
Note:

(a) libmmseg is a Chinese word segmentation library developed by coreseek.com for the Sphinx full-text search engine. Released under the GPL, it likewise implements Chih-Hao Tsai's MMSEG algorithm. libmmseg is developed in C++ and supports both Linux and Windows.
The source code download address is:
http://www.coreseek.cn/opensource/mmseg/

(b) Friso is a Chinese word segmenter developed in C, implementing the popular MMSEG algorithm. It supports UTF-8/GBK-encoded input and is bundled with a PHP extension and a Sphinx tokenizer plugin.
It has three segmentation modes: (1) simple mode: the FMM algorithm; (2) complex mode: MMSEG's four filtering rules; (3) detection mode: returns only entries already present in the lexicon.
The source code download address is:
https://code.google.com/p/friso/
http://git.oschina.net/lionsoul/friso

(c) mmseg4j is an open-source Java Chinese word segmenter based on the MMSEG algorithm, providing Lucene and Solr interfaces.
The source code download address is:
https://code.google.com/p/mmseg4j/

(d) rmmseg is an implementation of the MMSEG segmentation algorithm written in pure Ruby, based on the two variants of the maximum matching algorithm.
The source code download address is:
http://rmmseg.rubyforge.org/

(e) rmmseg-cpp is a rewrite of the original rmmseg gem whose core is written in C++, independent of Ruby. It is much faster and consumes far less memory than rmmseg, while its interface is almost identical.
The source code download address is:
http://rmmseg-cpp.rubyforge.org/
https://github.com/pluskid/rmmseg-cpp/

(f) pymmseg-cpp is a Python interface to rmmseg-cpp.
The source code download address is:
https://github.com/pluskid/pymmseg-cpp/
https://code.google.com/p/pymmseg-cpp/

3) IKAnalyzer

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since its 1.0 release in December 2006, IKAnalyzer has gone through three major versions. It began as a Chinese segmentation component for applications built on the open-source Lucene project, combining dictionary-based segmentation with grammar analysis. Version 3.0 introduced a distinctive "forward iterative fine-grained segmentation algorithm" and evolved into a general-purpose Java segmenter independent of the Lucene project, while still providing default optimizations for Lucene.
The source code download address is:
https://code.google.com/p/ik-analyzer/
https://github.com/yozhao/IKAnalyzer

4) FNLP (FudanNLP)

FudanNLP is a toolkit developed mainly for Chinese natural language processing (now renamed FNLP). Its features include information retrieval (text classification, news clustering), Chinese processing (word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, dependency parsing, temporal phrase recognition), and structured learning (online learning, hierarchical classification, clustering). Functionally, FNLP is similar to the well-known Python natural language processing toolkit NLTK, though the latter is weaker at Chinese processing. FNLP is written in Java and runs easily on a variety of platforms.
The source code download address is:
https://github.com/xpqiu/fnlp/

5) NiuParser

NiuParser, a semantic analysis system for Chinese sentences, supports seven major language analysis tasks at the sentence level: automatic word segmentation, part-of-speech tagging, named entity recognition, chunking, constituent parsing, dependency parsing, and semantic role labeling. All code is developed in C++ and contains no third-party open-source code. The NiuParser system is free for research purposes, but commercial use requires a commercial license.
The source code download address is:
http://www.niuparser.com/index.en.html

6) LTP

The Language Technology Platform (LTP) provides rich, efficient, and precise natural language processing, including Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, and semantic role labeling. LTP defines an XML-based representation for language processing results and, on that basis, provides a full suite of bottom-up, rich, and efficient Chinese processing modules (six core technologies covering lexical, syntactic, and semantic analysis). It can be invoked as a dynamic link library (DLL) or used as a web service.
The source code download address is:
https://github.com/HIT-SCIR/ltp

Note:

(a) The LTP word segmenter (LTP-CWS) is built on a structured perceptron algorithm. It supports user-defined dictionaries to suit different users' needs, plus a personalized (incremental) training feature: by labeling a small number of sentences from a new domain (for example, by correcting LTP's segmentation output), users can retrain the segmentation module to better fit that domain and further improve segmentation accuracy there.
The source code download address is:
https://github.com/HIT-SCIR/ltp-cws

(b) pyltp is the Python wrapper for LTP.
The source code download address is:
https://github.com/HIT-SCIR/pyltp

7) Ansj Chinese word segmentation

A Java implementation of Chinese word segmentation based on the Google semantic model plus a conditional random field model, providing Chinese word segmentation, Chinese name recognition, and user-defined dictionaries. Ansj is a Java reimplementation of the ICTCLAS tool: it essentially rewrites all of the data structures and algorithms, uses the dictionary from the open-source ICTCLAS release, and adds some manual optimizations.
The source code download address is:
https://github.com/NLPchina/ansj_seg

8) Jieba Chinese word segmentation

Jieba ("stutter") is a Python Chinese word segmenter supporting three segmentation modes: (a) accurate mode, which tries to cut sentences most precisely and is suited to text analysis; (b) full mode, which scans out every word that can be formed from the sentence, very fast but unable to resolve ambiguity; (c) search-engine mode, which further splits long words on top of accurate mode to improve recall, suited to search-engine indexing. In addition, jieba supports traditional-Chinese segmentation and custom dictionaries.
The algorithm works as follows: a trie-based structure enables efficient word-graph scanning, generating a directed acyclic graph (DAG) of all possible word formations in the sentence; dynamic programming with memoization then finds the maximum-probability path, i.e., the best segmentation by word frequency; for out-of-vocabulary words, an HMM over character positions is decoded with the Viterbi algorithm.
The source code download address is:
https://github.com/fxsjy/jieba
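The DAG plus maximum-probability-path step can be sketched with a toy frequency dictionary (the counts below are invented for illustration; jieba's real dictionary has hundreds of thousands of entries):

```python
import math

FREQ = {"研究": 50, "研究生": 20, "生命": 40, "起源": 30}  # toy word counts
TOTAL = sum(FREQ.values())

def max_prob_segment(sentence):
    n = len(sentence)
    # Build the DAG: dag[i] lists every j such that sentence[i:j] is a word.
    # Single characters are always candidates, as in jieba.
    dag = {i: [i + 1] + [j for j in range(i + 2, n + 1)
                         if sentence[i:j] in FREQ]
           for i in range(n)}
    # Dynamic programming right-to-left over log probabilities;
    # unseen single characters get a tiny smoothed frequency of 1.
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    # Walk the best path to recover the segmentation.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(max_prob_segment("研究生命起源"))  # → ['研究', '生命', '起源']
```

Note how the frequency-weighted path recovers 研究 / 生命 / 起源, the reading that greedy forward maximum matching gets wrong.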

Note:
(a) For model data generation, see:
https://github.com/fxsjy/jieba/issues/7
(b) CppJieba is the C++ version of the "stuttering" segmenter; for details see the code:
https://github.com/yanyiwu/cppjieba
(c) cppjiebapy is a SWIG wrapper for CppJieba. To invoke CppJieba from Python, see:
https://github.com/jannson/cppjiebapy
(d) For jieba study notes, see:
http://segmentfault.com/a/1190000004061791

9) HanLP

HanLP is a Java toolkit for Chinese language processing, consisting of a set of models and algorithms. It provides comprehensive functions such as Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, keyword extraction, automatic summarization, phrase extraction, pinyin conversion, and simplified/traditional conversion. Its CRFSegment component supports custom dictionaries, which take precedence over the core dictionary.
The source code download address is:
http://hanlp.linrunsoft.com/
https://github.com/hankcs/HanLP

10) BosonNLP

BosonNLP is an API/SDK interface provided by a startup. Its features include: tokenization and part-of-speech tagging, named entity recognition, tokenization with word-weight computation, automatic detection of opinions expressed in text, sentence-level grammatical structure analysis, article categorization, and related-word retrieval.
The API download address is: https://github.com/liwenzhu/bosonnlp

11) Pullword online word extraction

Pullword is a permanently free online Chinese word-extraction service based on deep learning.
API bindings for Pullword in Python, R, and other languages are listed at: http://api.pullword.com/

12) Sogou online word segmentation

Sogou's online word segmentation adopts a character-tagging approach, mainly using a linear-chain CRF model. Its POS tagging module is based on a structured linear model.
Online Use Address: http://www.sogou.com/labs/webservice/
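Character-based tagging reduces segmentation to sequence labeling: each character receives one of four tags, B (begin), M (middle), E (end), or S (single-character word), and a linear-chain CRF predicts the tag sequence. Decoding the predicted tags back into words is then mechanical:

```python
def tags_to_words(chars, tags):
    # Merge a B...M...E run into one word; S emits a single-character word.
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = ch
        elif tag == "M":
            buf += ch
        else:  # "E" closes the current word
            words.append(buf + ch)
            buf = ""
    return words

print(tags_to_words("研究生命起源", ["B", "E", "B", "E", "B", "E"]))
# → ['研究', '生命', '起源']
```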

13) THULAC

THULAC (THU Lexical Analyzer for Chinese) is an open-source Chinese lexical analysis toolkit, mainly providing Chinese word segmentation and POS tagging. The toolkit uses a re-ranking method based on word lattices.
The source code download address is: http://thulac.thunlp.org

One last bonus:

(1) CRF training tools for word segmentation:
CRFsuite (http://www.chokkan.org/software/crfsuite/)
CRF++ (http://taku910.github.io/crfpp/)
Wapiti (https://github.com/Jekub/Wapiti or https://wapiti.limsi.fr/)
ChineseSegmentor (https://github.com/fancyerii/chinesesegmentor or http://fancyerii.github.io/sgdcrf/index.html)
CRF Decoder extracts the segmentation component of the CRF++ package, simplifies CRF++'s complex code structure, removes code the segmenter does not need, and thereby greatly improves the readability and comprehensibility of the decoder. Download address: http://sourceforge.net/projects/crfdecoder/

(2) For a comparative evaluation of Chinese word segmenters, see:
https://github.com/ysc/cws_evaluation

(3) CC-CEDICT, an open-source Chinese dictionary project
A Chinese-English dictionary that can be used for Chinese word segmentation and has no copyright issues. The Chinese version of Chrome uses this dictionary for its Chinese word segmentation.
The data and document download address is:
http://www.mdbg.net/chindict/chindict.php?page=cedict
http://cc-cedict.org/wiki/
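Each CC-CEDICT entry is a single line of the form `traditional simplified [pinyin] /gloss/gloss/`. A minimal parser sketch (the field names in the returned dict are my own, not part of the CC-CEDICT spec):

```python
import re

# One entry per line: "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"
CEDICT_RE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

def parse_cedict_line(line):
    m = CEDICT_RE.match(line.strip())
    if m is None:
        return None  # comment line ("# ...") or malformed entry
    trad, simp, pinyin, glosses = m.groups()
    return {"trad": trad, "simp": simp,
            "pinyin": pinyin, "glosses": glosses.split("/")}

entry = parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/")
print(entry["simp"], entry["glosses"])  # → 中国 ['China', 'Middle Kingdom']
```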
