This paper first introduces the basic principles of Chinese word segmentation, and then introduces the Chinese word segmentation tools that are popular in China: Jieba, SnowNLP, THULAC and NLPIR. All of these tools are open source on GitHub, and the corresponding GitHub links are given below for reference.
1. Introduction to the principles of Chinese word segmentation
1.1 Overview of Chinese word segmentation
Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words. Segmentation is the process of recombining a continuous character sequence into a word sequence according to certain specifications.
1.2 Introduction to Chinese word segmentation methods
Existing methods can be divided into three categories: word segmentation based on string matching, word segmentation based on understanding, and word segmentation based on statistics.
1.2.1 Segmentation method based on string matching
The word segmentation method based on string matching is also called the mechanical segmentation method. It matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
According to the scanning direction, string-matching segmentation can be divided into forward matching and reverse matching; according to the length given priority during matching, it can be divided into maximum (longest) matching and minimum (shortest) matching; and according to whether it is combined with POS tagging, it can be divided into pure segmentation and integrated segmentation with POS tagging. Common string-matching methods include the following:
(1) forward maximum matching (scanning from left to right);
(2) reverse maximum matching (scanning from right to left);
(3) minimum segmentation (minimizing the number of words cut out of each sentence);
(4) bidirectional maximum matching (scanning twice, from left to right and from right to left).
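As an illustration of this dictionary-based approach, the following is a minimal sketch of forward maximum matching in Python; the toy dictionary and the max_len window are invented for demonstration and are not taken from any particular tool.

def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching over a (toy) dictionary."""
    words = []
    i = 0
    while i < len(sentence):
        # try the longest candidate first, then shrink the window
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

toy_dict = {u"西湖", u"风景", u"旅游", u"胜地"}
print(forward_max_match(u"西湖风景很好", toy_dict))
# expected output with this toy dictionary: ['西湖', '风景', '很', '好']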
The advantage of this kind of algorithm is its speed: the time complexity can be kept at O(n), the implementation is simple, and the results are acceptable, but it does not handle ambiguity or out-of-vocabulary words well.
1.2.2 Segmentation method based on understanding
The word segmentation method based on understanding makes the computer simulate human understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis at the same time as segmentation, and to use syntactic and semantic information to resolve ambiguity. Such a system usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences in order to judge segmentation ambiguity; in other words, it simulates the process by which humans understand sentences. This kind of method requires a large amount of linguistic knowledge and information. Because of the generality and complexity of Chinese linguistic knowledge, it is difficult to organize the various kinds of linguistic information into a form that machines can read directly, so word segmentation systems based on understanding are still at the experimental stage.
1.2.3 Segmentation method based on statistics
The segmentation method based on statistics uses a statistical machine learning model to learn the regularities of word segmentation from a large amount of already-segmented text, and then applies them to segment previously unseen text. Examples include the maximum probability segmentation method and the maximum entropy segmentation method. With the construction of large-scale corpora and the research and development of statistical machine learning methods, statistical Chinese word segmentation has gradually become the mainstream approach.
The main statistical models are: n-gram language models, the Hidden Markov Model (HMM), the Maximum Entropy model (ME), Conditional Random Fields (CRF), and so on.
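To make the idea of maximum probability segmentation concrete, here is a minimal sketch of unigram maximum probability segmentation using dynamic programming; the toy word probabilities are invented for illustration and do not come from a real corpus.

import math

# toy unigram probabilities, invented for illustration only
word_prob = {u"西湖": 0.02, u"风景": 0.01, u"很": 0.05, u"好": 0.04, u"很好": 0.02}
MIN_PROB = 1e-8  # fallback probability for single characters outside the toy model

def max_prob_segment(sentence, max_len=4):
    n = len(sentence)
    best = [0.0] + [float("-inf")] * n  # best[i]: best log-probability of segmenting sentence[:i]
    prev = [0] * (n + 1)                # prev[i]: start index of the last word ending at position i
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = sentence[j:i]
            prob = word_prob.get(word, MIN_PROB if len(word) == 1 else 0.0)
            if prob > 0 and best[j] + math.log(prob) > best[i]:
                best[i] = best[j] + math.log(prob)
                prev[i] = j
    # backtrack to recover the highest-probability segmentation
    words, i = [], n
    while i > 0:
        words.append(sentence[prev[i]:i])
        i = prev[i]
    return list(reversed(words))

print(max_prob_segment(u"西湖风景很好"))  # ['西湖', '风景', '很好'] under the toy model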
In practical applications, statistical word segmentation systems usually still use a segmentation dictionary for string matching, while applying statistical methods to identify new words. Combining string frequency statistics with string matching retains the speed and efficiency of dictionary matching while also taking advantage of dictionary-free segmentation to recognize new words and resolve ambiguity automatically.
2. Introduction of Chinese word segmentation tools
2.1 Jieba (GitHub stars: 9003)
Jieba is the most widely used Chinese word segmentation tool in China (GitHub link: https://github.com/fxsjy/jieba). Jieba supports three segmentation modes:
(1) precise mode: tries to cut the sentence as precisely as possible, suitable for text analysis;
(2) full mode: scans out all possible words in the sentence; it is very fast, but cannot resolve ambiguity;
(3) search engine mode: on the basis of precise mode, long words are further segmented to improve recall, which is suitable for search engine indexing.
The Jieba segmentation process mainly involves the following algorithms:
(1) efficient word-graph scanning based on a prefix dictionary, building a directed acyclic graph (DAG) of all possible word combinations of the characters in the sentence;
(2) dynamic programming to find the maximum probability path and the best segmentation based on word frequency;
(3) an HMM model based on the word-forming ability of Chinese characters for unregistered (out-of-vocabulary) words, decoded with the Viterbi algorithm;
(4) POS tagging based on the Viterbi algorithm;
(5) keyword extraction based on TF-IDF and TextRank models.
The test code looks like this:
# -*- coding: utf-8 -*-
"""
jieba segmentation test
"""
import jieba

# full mode
test1 = jieba.cut("西湖的风景很好，是一个旅游胜地。", cut_all=True)
print("Full mode: " + " | ".join(test1))

# precise mode
test2 = jieba.cut("西湖的风景很好，是一个旅游胜地。", cut_all=False)
print("Precise mode: " + " | ".join(test2))

# search engine mode
test3 = jieba.cut_for_search("西湖的风景很好，是一个旅游胜地，每年吸引大量游客。")
print("Search engine mode: " + " | ".join(test3))
The test results are shown in the following illustration:
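Items (4) and (5) of the algorithm list above are exposed through the jieba.posseg and jieba.analyse modules; the following is a minimal sketch of how they can be called (the example sentence is the same one used in the test code above):

# -*- coding: utf-8 -*-
import jieba.posseg
import jieba.analyse

sentence = "西湖的风景很好，是一个旅游胜地，每年吸引大量游客。"

# POS tagging: each result is a (word, POS flag) pair
for word, flag in jieba.posseg.cut(sentence):
    print(word, flag)

# keyword extraction based on TF-IDF and on TextRank
print(jieba.analyse.extract_tags(sentence, topK=5))
print(jieba.analyse.textrank(sentence, topK=5))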
2.2 SnowNLP (GitHub stars: 2043)
SnowNLP is a class library written in Python (GitHub link: https://github.com/isnowfy/snownlp) that makes it easy to process Chinese text; its design was inspired by TextBlob. SnowNLP mainly includes the following functions:
(1) Chinese word segmentation (character-based generative model);
(2) POS tagging (3-gram HMM);
(3) sentiment analysis (simple analysis, e.g. for review text);
(4) text classification (Naive Bayes);
(5) conversion to pinyin (maximum matching implemented with a trie);
(6) traditional-to-simplified Chinese conversion (maximum matching implemented with a trie);
(7) text keyword and text summary extraction (TextRank algorithm);
(8) calculation of term frequency (TF) and inverse document frequency (IDF);
(9) tokenization (splitting into sentences);
(10) text similarity calculation (BM25).
The biggest strength of SnowNLP is that it is especially easy to use and can produce many interesting results when processing Chinese text, but many of its functions are fairly simple and need further improvement.
The test code looks like this:
# -*- coding: utf-8 -*-
"""
SnowNLP test
"""
from snownlp import SnowNLP

s = SnowNLP(u'杭州西湖的风景很好，是一个旅游胜地，每年吸引大量游客。')
# word segmentation
print(s.words)
# sentiment analysis
print("Probability that the sentiment of the text is positive: " + str(s.sentiments))

text = u'''
西湖，位于中国浙江省杭州市西面，是中国首批国家重点风景名胜区和中国十大风景名胜之一。
它是中国大陆主要的观赏性淡水湖泊之一，也是现今《世界遗产名录》中少数几个和中国唯一一个湖泊类文化遗产。
西湖三面环山，面积约6.39平方千米，东西宽约2.8千米，南北长约3.2千米，绕湖一周近15千米。
湖中被孤山、白堤、苏堤、杨公堤分隔，按面积大小分为外西湖、西里湖、北里湖、小南湖及岳湖等五片水面，
苏堤、白堤越过湖面，小瀛洲、湖心亭、阮公墩三个小岛鼎立于外西湖湖心，夕照山的雷峰塔与宝石山的保俶塔隔湖相映，
由此形成了“一山、二塔、三岛、三堤、五湖”的基本格局。
'''
s2 = SnowNLP(text)
# text keyword extraction
print(s2.keywords(10))
The test results are shown in the following illustration:
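Several of the other features listed above (conversion to pinyin, traditional-to-simplified conversion, summarization) are exposed as simple attributes and methods; a minimal sketch, reusing the s and s2 objects created in the test code above:

# pinyin conversion of the short sentence
print(s.pinyin)

# extract a three-sentence summary of the West Lake description
print(s2.summary(3))

# traditional-to-simplified conversion
t = SnowNLP(u"「繁體字」轉換測試")
print(t.han)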
2.3 THULAC (GitHub stars: 311)
THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University (GitHub link: https://github.com/thunlp/THULAC-Python), providing Chinese word segmentation and POS tagging. THULAC has the following characteristics:
(1) Strong tagging ability. The model is trained on an integrated, manually segmented and POS-tagged Chinese corpus described as the world's largest (about 58 million words), so its tagging ability is strong.
(2) High accuracy. On the standard dataset Chinese Treebank (CTB5), the toolkit reaches an F1 score of 97.3% for segmentation and 92.9% for POS tagging, on par with the best results on that dataset.
(3) Fast. Simultaneous segmentation and POS tagging runs at about 300 KB/s, roughly 150,000 characters per second; segmentation alone reaches about 1.3 MB/s.
The THULAC POS tag set (common edition) is as follows:
n/noun  np/person name  ns/place name  nz/other proper noun
m/numeral  q/quantifier  mq/quantity word  t/time word  f/locative word  s/place word
v/verb  a/adjective  d/adverb  h/prefix component  k/suffix component  i/idiom
j/abbreviation  r/pronoun  c/conjunction  p/preposition  u/auxiliary  y/modal particle
e/interjection  o/onomatopoeia  g/morpheme  w/punctuation  x/other
The test code (Python version) looks like this:
# -*- coding: utf-8 -*-
"""
thulac segmentation test
"""
import thulac

# default mode: segmentation and POS tagging at the same time
test1 = thulac.thulac()
text1 = test1.cut("西湖的风景很好，是一个旅游胜地。")
print(text1)

# segmentation only, without POS tagging
test2 = thulac.thulac(seg_only=True)
text2 = test2.cut("西湖的风景很好，是一个旅游胜地。")
print(text2)
The test results are shown in the following illustration:
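THULAC can also return the result as a plain string instead of a list, and can segment a whole file at once; a minimal sketch based on the text parameter and the cut_f method of THULAC-Python (the file names below are placeholders):

import thulac

thu = thulac.thulac(seg_only=True)

# return the result as a space-separated string instead of a list
print(thu.cut("西湖的风景很好。", text=True))

# segment an entire file: read input.txt and write the segmented text to output.txt
# (file names are placeholders for illustration)
thu.cut_f("input.txt", "output.txt")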
2.4 NLPIR (GitHub stars: 811)
The NLPIR word segmentation system (formerly known as ICTCLAS, a lexical analysis system first released in 2000; GitHub link: https://github.com/NLPIR-team/NLPIR) is a Chinese word segmentation system developed by Dr. Zhang Huaping of the Beijing Institute of Technology. It has been refined continuously for more than 10 years and has rich functionality and strong performance. NLPIR is a suite of software for processing raw text collections; it provides visual display of intermediate processing results and can be used as a processing tool for small-scale data. Its main functions include Chinese word segmentation, POS tagging, named entity recognition, user dictionaries, new word discovery and keyword extraction. This test uses PyNLPIR (the Python wrapper for NLPIR, GitHub link: https://github.com/tsroten/pynlpir).
The test code looks like this:
# -*- coding: utf-8 -*-
"""
pynlpir segmentation test
"""
import pynlpir

# open the segmenter
pynlpir.open()

text1 = "西湖的风景很好，是一个旅游胜地，每年吸引大量游客。"

# segmentation; POS tagging is enabled by default
test1 = pynlpir.segment(text1)
print('1. Default segmentation mode:\n' + str(test1))

# switch the POS tag language to Chinese
test2 = pynlpir.segment(text1, pos_english=False)
print('2. Chinese POS tag mode:\n' + str(test2))

# disable POS tagging
test3 = pynlpir.segment(text1, pos_tagging=False)
print('3. No POS tagging mode:\n' + str(test3))

# close the segmenter and free its resources
pynlpir.close()
The test results are shown in the following illustration:
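Keyword extraction, listed among NLPIR's main functions above, is also available through PyNLPIR; a minimal sketch using pynlpir.get_key_words (the sentence is the same one used in the test code):

import pynlpir

pynlpir.open()
text = "西湖的风景很好，是一个旅游胜地，每年吸引大量游客。"

# extract up to 5 keywords together with their weights
print(pynlpir.get_key_words(text, max_words=5, weighted=True))

pynlpir.close()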