The principle and realization of Chinese word segmentation


There are three mainstream approaches to Chinese word segmentation: the dictionary-based approach, the rule-based approach, and the statistics-based approach.

1. Methods based on rules or dictionaries

Definition: according to a certain strategy, the string of Chinese characters to be analyzed is matched against the entries in a sufficiently large machine dictionary; if an entry is found in the dictionary, the match succeeds.

    1. By scanning direction: forward matching and reverse matching
    2. By match length preference: maximum matching and minimum matching
1.1 Forward maximum matching (MM)
    1. From left to right, take the first m characters of the sentence as the matching field, where m is the length of the longest entry in the machine dictionary.
    2. Look the matching field up in the dictionary:
      • If the match succeeds, the matching field is cut out as a word.
      • If the match fails, the last character of the matching field is removed, and the remaining string is used as the new matching field and matched again; the process repeats until all words have been cut out.

Here is an example:
Suppose we want to segment the sentence 南京市长江大桥 ("Nanjing Yangtze River Bridge") according to the principle of forward maximum matching:

    1. First take the first five characters of the sentence, 南京市长江, and look them up in the dictionary; no such word exists, so shorten the matching field and take the first four characters, 南京市长 ("Nanjing mayor"), which is in the dictionary, so it is cut out as a word;
    2. Apply maximum matching again to the remaining three characters 江大桥, which are cut into 江 ("Jiang" / "river") and 大桥 ("bridge");
    3. The complete segmentation of the sentence is therefore: 南京市长 / 江 / 大桥 (Nanjing mayor / Jiang / bridge).
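
To make the procedure concrete, here is a minimal Python sketch of forward maximum matching. The toy dictionary, the sentence, and the function name forward_max_match are illustrative assumptions, not something given in this article; unmatched single characters are simply cut out on their own.

def forward_max_match(sentence, dictionary, max_len=5):
    """Forward maximum matching: greedily cut the longest dictionary word from the left."""
    result = []
    i = 0
    while i < len(sentence):
        # start from the longest possible matching field and shrink it
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + j]
            if j == 1 or candidate in dictionary:
                # a single character is cut out even if it is not in the dictionary
                result.append(candidate)
                i += j
                break
    return result

toy_dict = {"南京市", "南京市长", "市长", "长江", "长江大桥", "大桥"}
print(forward_max_match("南京市长江大桥", toy_dict))  # ['南京市长', '江', '大桥']
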
1.2 Reverse maximum matching (RMM)

This algorithm mirrors forward maximum matching: scanning starts from the end of the sentence, and when a match fails, the first character of the matching field is removed. Experiments show that reverse maximum matching generally outperforms forward maximum matching for Chinese.

Using the same example:

    1. Take the last four characters of 南京市长江大桥, namely 长江大桥 ("Yangtze River Bridge"); it is found in the dictionary and cut out;
    2. Segment the remainder, 南京市 ("Nanjing"); the overall result is: 南京市 / 长江大桥 (Nanjing / Yangtze River Bridge).
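
The reverse direction can be sketched the same way, reusing the toy dictionary from the previous sketch (the function name backward_max_match is likewise an assumption):

def backward_max_match(sentence, dictionary, max_len=5):
    """Reverse maximum matching: greedily cut the longest dictionary word from the right."""
    result = []
    i = len(sentence)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            candidate = sentence[i - j:i]
            if j == 1 or candidate in dictionary:
                # when a match fails, the first character of the field is dropped (j shrinks)
                result.insert(0, candidate)
                i -= j
                break
    return result

print(backward_max_match("南京市长江大桥", toy_dict))  # ['南京市', '长江大桥']
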
1.3 Bidirectional maximum matching (Bi-directional Matching Method, BM)

The bidirectional maximum matching method compares the segmentation produced by forward maximum matching with the one produced by reverse maximum matching, and uses the comparison to decide which segmentation is correct.

According to the study by Sun M.S. and Benjamin K.T. (1995), for about 90% of Chinese sentences the forward and reverse maximum matching results coincide exactly and are both correct; for about 9% of sentences the two methods give different results, but one of them is correct (ambiguity detection succeeds); and for less than 1% of sentences either both methods produce the same wrong segmentation, or they produce different segmentations that are both wrong (ambiguity detection fails). This is why the bidirectional maximum matching method is so widely used in practical Chinese processing systems.

Using the same example again:
Bidirectional maximum matching gathers all the longest candidate words produced by both directions; for the sentence above they are: 南京市 (Nanjing), 南京市长 (Nanjing mayor), 长江大桥 (Yangtze River Bridge), 江 (Jiang / river), 大桥 (bridge).
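
The article does not spell out how the two segmentations are compared; a common heuristic, used in the sketch below purely as an assumption, is to prefer the result with fewer words and, on a tie, fewer single-character words. The sketch reuses forward_max_match and backward_max_match from the sketches above.

def bidirectional_max_match(sentence, dictionary, max_len=5):
    fwd = forward_max_match(sentence, dictionary, max_len)
    bwd = backward_max_match(sentence, dictionary, max_len)
    if len(fwd) != len(bwd):
        # fewer words usually means fewer wrong cuts
        return fwd if len(fwd) < len(bwd) else bwd
    # tie-break on the number of single-character words
    singles = lambda seg: sum(1 for w in seg if len(w) == 1)
    return fwd if singles(fwd) <= singles(bwd) else bwd

print(bidirectional_max_match("南京市长江大桥", toy_dict))  # ['南京市', '长江大桥']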

1.4 The segmentation-mark method

Segmentation marks are collected first; before automatic segmentation the text is pre-split at these marks, and MM or RMM is then applied for fine-grained processing.

1.5 Optimal matching method (OM, forward and reverse)

The dictionary is ordered by word frequency, and each entry is annotated with its length, which reduces the time complexity of matching.

Advantages: easy to implement.
Disadvantages: matching is slow; handling out-of-vocabulary (unregistered) words is difficult; there is no self-learning ability.

1.6 Word-by-word traversal method

This approach takes the words in the dictionary, from longest to shortest, and searches for each of them one by one in the text to be processed, until all words have been cut out.
Besides the basic mechanical segmentation methods above, there are also the bidirectional scanning method, the two-pass scanning method, segmentation based on word-frequency statistics, the association-backtracking method, and so on.

2. Statistics-based word segmentation

With the construction of large-scale corpora and the development of statistical machine learning methods, statistics-based Chinese word segmentation has gradually become the mainstream approach.

Main idea: treat each character as the smallest unit of word formation. If adjacent characters co-occur frequently across different texts, this is evidence that they are likely to form a word. We can therefore use the frequency with which characters appear next to each other to measure how credible a candidate word is: count the co-occurrence frequency of adjacent character combinations in the corpus, and when the frequency of a combination exceeds a certain threshold, consider that this character group may constitute a word.
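
A rough sketch of this idea follows; the toy corpus, the threshold value, and the restriction to two-character combinations are all assumptions made for illustration.

from collections import Counter

corpus = ["研究生命的起源", "研究生物学", "他是研究生"]  # toy corpus
pair_counts = Counter()
for sent in corpus:
    for a, b in zip(sent, sent[1:]):
        pair_counts[a + b] += 1  # count adjacent character co-occurrences

threshold = 2
candidates = [pair for pair, count in pair_counts.items() if count >= threshold]
print(candidates)  # ['研究', '究生'] -- combinations frequent enough to be treated as candidate words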

Main statistical models: the n-gram language model, the hidden Markov model (HMM), the maximum entropy model (ME), the conditional random field model (CRF), and so on.

Advantages: in practice, dictionary-based string-matching segmentation is usually combined with statistical segmentation, which is better at recognizing new words. The combined system keeps the speed and efficiency of dictionary matching while fully exploiting the ability of statistical segmentation to recognize words from context and to resolve ambiguity automatically.

2.1 The n-gram model

The model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and not on any other word, and that the probability of the whole sentence is the product of the conditional probabilities of its words.

Suppose we are given a word and have to guess the next one. If I say "photo scandal", what word comes to mind next? Most people will probably think of "Edison Chen", and hardly anyone will think of an unrelated name such as "Chen Zhijie". This is the main idea behind the n-gram model.

For a sentence T, how do we calculate the probability that it appears? Suppose T consists of the word sequence W1, W2, W3, ..., Wn. Then
P(T) = P(W1 W2 W3 ... Wn) = P(W1) P(W2|W1) P(W3|W1 W2) ... P(Wn|W1 W2 ... Wn-1)
However, this method has two fatal defects: the parameter space is too large to be practical, and the data are severely sparse. To solve this, we introduce the Markov assumption: the appearance of a word depends only on a limited number of the words immediately before it. If the appearance of a word depends only on the single word before it, we call the model a bigram model. That is,

P(T) = P(W1 W2 W3 ... Wn) = P(W1) P(W2|W1) P(W3|W1 W2) ... P(Wn|W1 W2 ... Wn-1)
     ≈ P(W1) P(W2|W1) P(W3|W2) ... P(Wn|Wn-1)

If the appearance of a word depends only on the two words before it, we call the model a trigram model.

In practice, bigram and trigram models are used the most, and they work very well. Models of order higher than four are rarely used, because training them requires a much larger corpus, data sparsity becomes severe, the time complexity is high, and the accuracy improves little. For an ordinary small company a bigram model is enough; even a giant like Google only goes up to roughly 4-grams, since the computation and storage requirements beyond that are too large.

In other words, the n-gram model assumes that the probability of the current word depends only on the n-1 words before it.
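
As a minimal sketch of how a bigram model scores a segmented sentence under this assumption (the probability values below are invented purely for illustration):

# toy bigram probabilities P(w | prev); '<s>' marks the start of the sentence
bigram_prob = {
    ("<s>", "中国"): 0.01,
    ("中国", "人民"): 0.2,
    ("人民", "银行"): 0.1,
}

def sentence_prob(words):
    """P(T) ~= P(w1|<s>) * P(w2|w1) * ... * P(wn|wn-1)."""
    p = 1.0
    prev = "<s>"
    for w in words:
        p *= bigram_prob.get((prev, w), 1e-8)  # tiny back-off value for unseen pairs
        prev = w
    return p

print(sentence_prob(["中国", "人民", "银行"]))  # = 0.01 * 0.2 * 0.1

A segmenter can then compare the scores of competing segmentations of the same sentence and keep the highest one, which is essentially what the dynamic-programming implementation in section 2.3 does.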

2.2 HMM and CRF models

Previous segmentation methods, whether rule-based or statistics-based, usually rely on a pre-compiled word list (dictionary), and automatic segmentation makes its decisions using the vocabulary and related information. In contrast, word segmentation based on character tagging (sequence labeling) is a word-construction method: the segmentation process is treated as a tagging problem over the characters of the string.

Since each character occupies a definite word-formation position within a particular word, we can assign every character one of at most four position tags: B (beginning of a word), M (middle of a word), E (end of a word), and S (a single-character word). The segmented sentence in (a) below can then be expressed directly in the character-tagged form shown in (b):

(a) Segmentation result: /上海/计划/到/本/世纪/末/实现/人均/国内/生产/总值/五千美元/
(b) Character-tagged form: 上/B 海/E 计/B 划/E 到/S 本/S 世/B 纪/E 末/S 实/B 现/E 人/B 均/E 国/B 内/E 生/B 产/E 总/B 值/E 五/B 千/M 美/M 元/E 。/S
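
A small sketch of how a segmented sentence is turned into this character-tag form (the helper name to_bmes is an assumption):

def to_bmes(words):
    """Turn a list of segmented words into (character, tag) pairs."""
    tagged = []
    for w in words:
        if len(w) == 1:
            tagged.append((w, "S"))       # single-character word
        else:
            tagged.append((w[0], "B"))    # word-initial character
            for ch in w[1:-1]:
                tagged.append((ch, "M"))  # word-internal characters
            tagged.append((w[-1], "E"))   # word-final character
    return tagged

print(to_bmes(["上海", "计划", "到", "五千美元"]))
# [('上','B'), ('海','E'), ('计','B'), ('划','E'), ('到','S'), ('五','B'), ('千','M'), ('美','M'), ('元','E')]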

It should first be noted that "character" here is not limited to Chinese characters. Since real Chinese text inevitably contains a certain number of non-Chinese characters, "character" in this article also covers foreign letters, Arabic numerals, punctuation marks, and so on. All of these characters are basic word-building units, although Chinese characters still make up the overwhelming majority of the unit set.

An important advantage of the character-tagging approach is that it treats the recognition of in-vocabulary words and out-of-vocabulary words in a uniform, balanced way.

In this approach, both in-vocabulary words and out-of-vocabulary words in the text are handled by the same character-tagging process. Within the learning framework, there is no need to emphasize dictionary word information specifically, nor to design dedicated recognition modules for out-of-vocabulary words such as person names, place names, and organization names, which greatly simplifies the design of the segmenter. During training, all characters are learned against predefined features to obtain a probabilistic model of character positions. Then, on the string to be segmented, the tightness with which characters combine is used to assign a position tag to every character, and the final segmentation is read off directly from the position-tag definitions. In short, within this framework word segmentation becomes a simple process of recombining characters into words.
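
Conversely, once a tagger (an HMM, a CRF, or any other sequence model) has assigned a B/M/E/S label to every character, recovering the words really is just recombination, as in this sketch (from_bmes is again an assumed helper name):

def from_bmes(chars, tags):
    """Recombine characters into words according to their B/M/E/S tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)       # a word on its own
        elif tag == "B":
            buf = ch               # start a new word
        elif tag == "M":
            buf += ch              # extend the current word
        else:                      # "E": close the current word
            words.append(buf + ch)
            buf = ""
    return words

print(from_bmes(list("上海计划"), ["B", "E", "B", "E"]))  # ['上海', '计划']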

In 2001, Lafferty et al., building on the maximum entropy model (MEM) and the hidden Markov model (HMM), proposed the conditional random field (CRF), an undirected graphical model that maximizes the joint probability of the whole label sequence given the observation sequence. It is a statistical model commonly used for segmenting and labeling sequential data. For the theory behind the CRF algorithm, see my other blog post; it is not repeated here.

2.3 Implementing statistics-based word segmentation

Having derived the language model from the full probability formula, how do we actually use it?
Given a statistical language model, the next step is to find the segmentation of the sentence with the highest probability. The most primitive, direct way is to enumerate every possible segmentation of the sentence and pick the combination of words with the highest probability. But such exhaustive enumeration is obviously very expensive, so we need to reach the goal another way.

Think about it carefully: if we treat each character as a node and the connection between every two adjacent characters as an edge, then for the sentence 中国人民万岁 ("Long live the Chinese people") we can construct a segmentation graph over its characters.

Finding the segmentation structure with the largest probability can be viewed as a dynamic programming problem: for the whole sentence to have the maximum-probability structure, each of its substructures must also have the maximum probability.

For the character at any position t in the sentence, we look up all the possible word combinations from the dictionary; for example, the first character 中 may yield three combinations, while the fourth character may yield only 人民 ("the people"). After this step, the segmentation structure can be converted into a graph model over candidate words.

All we have to do is find the path with the greatest probability. Let Ct(k) denote the score of the best path in which the word at position t is k. The state transition equation can then be written as:

Ct(k) = max over l in M of [ C(t - len(k))(l) * P(k|l) ]

where k is a candidate word at the current position, l is a candidate word at the previous position, M is the set of possible values of l, and P(k|l) is the transition probability from l to k. With this state transition relation it is easy to write the recursive dynamic programming code (the equation is in fact the well-known Viterbi algorithm, usually used with hidden Markov models).

#!/usr/bin/python
# coding: utf-8
"""Viterbi-based dynamic-programming word segmentation (Python 2)."""

from lm import LanguageModel


class Node(object):
    """A node in the segmentation graph."""
    def __init__(self, word):
        # best score of any path that ends at this node
        self.max_score = 0.0
        # previous node on the best path
        self.prev_node = None
        # the word this node represents
        self.word = word


class Graph(object):
    """The segmentation graph: a list of hash sets, one per character position."""
    def __init__(self):
        self.sequence = []


class DPSplit(object):
    """Dynamic-programming word segmentation."""
    def __init__(self):
        self.lm = LanguageModel('RenMinData.txt')
        self.dict = {}
        self.words = []
        self.max_len_word = 0
        self.load_dict('dict.txt')
        self.graph = None
        self.viterbi_cache = {}

    def get_key(self, t, k):
        return '_'.join([str(t), str(k)])

    def load_dict(self, file):
        with open(file, 'r') as f:
            for line in f:
                word_list = [w.encode('utf-8') for w in list(line.strip().decode('utf-8'))]
                if len(word_list) > 0:
                    self.dict[''.join(word_list)] = 1
                    if len(word_list) > self.max_len_word:
                        self.max_len_word = len(word_list)

    def create_graph(self):
        """Create the forward graph for the input sentence."""
        self.graph = Graph()
        for i in range(len(self.words)):
            self.graph.sequence.append({})
        word_length = len(self.words)
        # build the set of candidate words for every character position
        for i in range(word_length):
            for j in range(self.max_len_word):
                if i + j + 1 > len(self.words):
                    break
                word = ''.join(self.words[i:i + j + 1])
                if word in self.dict:
                    node = Node(word)
                    # index the word by the position of its last character
                    self.graph.sequence[i + j][word] = node
        # append an empty end node to simplify the computation
        end = Node('#')
        self.graph.sequence.append({'#': end})

    def split(self, sentence):
        self.words = [w.encode('utf-8') for w in list(sentence.decode('utf-8'))]
        self.create_graph()
        # compute the maximum score of every node with the Viterbi DP algorithm
        self.viterbi(len(self.words), '#')
        # trace back from the end node along the best branches
        end = self.graph.sequence[-1]['#']
        node = end.prev_node
        result = []
        while node:
            result.insert(0, node.word)
            node = node.prev_node
        print ''.join(self.words)
        print ' '.join(result)

    def viterbi(self, t, k):
        """Best path probability of the word k ending at position t."""
        if self.get_key(t, k) in self.viterbi_cache:
            return self.viterbi_cache[self.get_key(t, k)]
        node = self.graph.sequence[t][k]
        # t == 0: the first word of the sentence
        if t == 0:
            node.max_score = self.lm.get_init_prop(k)
            self.viterbi_cache[self.get_key(t, k)] = node.max_score
            return node.max_score
        prev_t = t - len(k.decode('utf-8'))
        # the word would start before the beginning of the sentence: no probability to compute
        if prev_t == -1:
            return 1.0
        # all candidate words that can end at the previous position
        pre_words = self.graph.sequence[prev_t].keys()
        for l in pre_words:
            # transition probability from l to k
            state_transfer = self.lm.get_trans_prop(k, l)
            # score of the current state: best score of the previous path times the transition probability
            score = self.viterbi(prev_t, l) * state_transfer
            prev_node = self.graph.sequence[prev_t][l]
            cur_score = score + prev_node.max_score
            if cur_score > node.max_score:
                node.max_score = cur_score
                # remember the best previous node for back-tracing the result
                node.prev_node = self.graph.sequence[prev_t][l]
        self.viterbi_cache[self.get_key(t, k)] = node.max_score
        return node.max_score


def main():
    dp = DPSplit()
    dp.split('中国人民银行')
    dp.split('中华人民共和国今天成立了')
    dp.split('努力提高居民收入')


if __name__ == '__main__':
    main()

Some points that require special attention:
1. The recursive computation must use a cache (memoization); for this class of overlapping-subproblem issues, refer to an introduction to dynamic programming.
2. The previous position relative to the current position must be obtained using the length of the word at the current position.
3. The code above is only an experiment; the principle is sound, but the performance is poor. For production use, an index needs to be built to improve performance.
4. The code ignores English words, unregistered words, and punctuation marks, but the improvements are not complicated and readers can add them at their own discretion.

The output of the code is:

中国人民银行: 中国 人民 银行
中华人民共和国今天成立了: 中华人民共和国 今天 成立 了
努力提高居民收入: 努力 提高 居民 收入
