Chinese Word segmentation algorithm

Source: Internet
Author: User

Main categories of basic Chinese word segmentation algorithms

Dictionary-based methods, statistics-based methods, and rule-based methods (there are said to be understanding-based methods as well, using neural networks and expert systems; they are set aside here).

1. Dictionary-based methods (string matching, mechanical segmentation)

Definition: The string of Chinese characters to be analyzed is matched, following some strategy, against the entries of a large machine dictionary; if a string is found in the dictionary, the match succeeds.

By scan direction: forward matching and reverse matching.

By length: maximum matching and minimum matching.

1.1 Forward maximum matching (MM)

1) From left to right, take the first m characters of the Chinese sentence to be segmented as the matching field, where m is the length of the longest entry in the large machine dictionary.

2) Look up the matching field in the dictionary. If the match succeeds, the matching field is cut out as a word.

If the match fails, the last character of the matching field is removed and the remaining string becomes the new matching field; the process repeats until all words have been cut out.
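The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `forward_max_match` and the toy dictionary are assumptions made for the example, and a single character is accepted as a word when nothing longer matches so that the loop always terminates.

```python
def forward_max_match(sentence, dictionary):
    """Segment `sentence` left to right, trying the longest candidate first."""
    max_len = max(len(word) for word in dictionary)  # m: longest dictionary entry
    words = []
    i = 0
    while i < len(sentence):
        # Take up to m characters as the matching field.
        size = min(max_len, len(sentence) - i)
        # Drop the last character until the field is found in the dictionary.
        # A single character is always accepted (assumption, guarantees progress).
        while size > 1 and sentence[i:i + size] not in dictionary:
            size -= 1
        words.append(sentence[i:i + size])
        i += size
    return words


# Toy dictionary, hypothetical and for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
print(forward_max_match("研究生命的起源", toy_dict))
```

On this sentence the greedy left-to-right scan picks "研究生" first, which shows the kind of error that pure forward matching can make.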

1.2 Reverse maximum matching (RMM)

This algorithm mirrors forward maximum matching: the sentence is scanned from right to left, and when a match fails, the first character of the matching field is removed. Experiments show that reverse maximum matching performs better than forward maximum matching.
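A minimal sketch of the reverse scan follows, under the same assumptions as any toy example: the name `reverse_max_match` and the small dictionary are hypothetical, and a single character is accepted when no longer entry matches.

```python
def reverse_max_match(sentence, dictionary):
    """Segment `sentence` right to left, trying the longest candidate first."""
    max_len = max(len(word) for word in dictionary)
    words = []
    end = len(sentence)
    while end > 0:
        # Take up to m characters ending at `end` as the matching field.
        size = min(max_len, end)
        # Drop the FIRST character of the field until it is in the dictionary.
        while size > 1 and sentence[end - size:end] not in dictionary:
            size -= 1
        words.append(sentence[end - size:end])
        end -= size
    words.reverse()  # words were collected from the end of the sentence
    return words


# Toy dictionary, hypothetical and for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
print(reverse_max_match("研究生命的起源", toy_dict))
```

With this dictionary the right-to-left scan keeps "生命" together and yields 研究 / 生命 / 的 / 起源.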

1.3 Bidirectional maximum matching (Bi-directional Matching method, BM)

Bidirectional maximum matching compares the segmentation produced by forward maximum matching with the one produced by reverse maximum matching and uses the comparison to decide on the correct segmentation. According to the study by Sun M. S. and Benjamin K. T. (1995), for about 90% of Chinese sentences the forward and reverse results coincide completely and are correct. For about 9% of sentences the two results differ, but one of them must be correct (ambiguity detection succeeds). For fewer than 1% of sentences, either the two methods agree on a wrong segmentation, or they give different results that are both wrong (ambiguity detection fails). This is why bidirectional maximum matching is widely used in practical Chinese processing systems.
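A minimal sketch of the comparison step is given below. The text does not say how to choose when the two results differ, so the tie-breaking rule used here (fewer words, then fewer single-character words, otherwise the reverse result) is an assumption, as are all function names and the toy dictionary.

```python
def forward_max_match(sentence, dictionary):
    """Forward maximum matching; a single character is the fallback word."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(sentence):
        size = min(max_len, len(sentence) - i)
        while size > 1 and sentence[i:i + size] not in dictionary:
            size -= 1
        words.append(sentence[i:i + size])
        i += size
    return words


def reverse_max_match(sentence, dictionary):
    """Reverse maximum matching, done by running the forward scan on reversed text."""
    reversed_dict = {w[::-1] for w in dictionary}
    reversed_words = forward_max_match(sentence[::-1], reversed_dict)
    return [w[::-1] for w in reversed(reversed_words)]


def bidirectional_match(sentence, dictionary):
    """Return (segmentation, ambiguity_detected)."""
    fwd = forward_max_match(sentence, dictionary)
    rev = reverse_max_match(sentence, dictionary)
    if fwd == rev:
        return fwd, False

    # Assumed heuristic: prefer fewer words, then fewer single-character words.
    def cost(words):
        return (len(words), sum(len(w) == 1 for w in words))

    best = fwd if cost(fwd) < cost(rev) else rev
    return best, True


# Toy dictionary, hypothetical and for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
print(bidirectional_match("研究生命的起源", toy_dict))
```

The second element of the returned pair only reports that the two scans disagreed; it cannot catch the rare case where both scans agree on a wrong segmentation.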

1.3 Segmentation mark method

Collect segmentation marks. Before automatic segmentation, use the marks to cut the text into shorter pieces, then refine each piece with MM and RMM.
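As a rough sketch of this preprocessing step, the text can first be cut at a set of marks and each remaining chunk handed to a matcher such as MM or RMM. The mark set `MARKS` (common punctuation) and the function names are assumptions for illustration; the source does not list which marks to collect.

```python
import re

# Hypothetical set of segmentation marks: common Chinese and ASCII punctuation.
MARKS = "，。！？；：、,.!?;:"
_mark_pattern = re.compile("([" + re.escape(MARKS) + "])")


def split_on_marks(text):
    """Cut text at segmentation marks, keeping every mark as its own chunk."""
    return [chunk for chunk in _mark_pattern.split(text) if chunk]


def segment_with_marks(text, segment_chunk):
    """Pre-cut at marks, then refine each chunk with `segment_chunk` (e.g. MM or RMM)."""
    words = []
    for chunk in split_on_marks(text):
        if len(chunk) == 1 and chunk in MARKS:
            words.append(chunk)  # a mark is already a finished token
        else:
            words.extend(segment_chunk(chunk))
    return words


print(split_on_marks("今天天气好，我们去公园。"))
# `list` stands in for a real matcher here and simply splits into characters.
print(segment_with_marks("今天，好。", list))
```

Because no word can span a mark, each chunk is shorter than the full sentence, which limits how far a wrong match can propagate.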

1.4 Optimum matching (OM, forward and reverse)

The entries of the dictionary are ordered by word frequency, and the length of each entry is recorded, which reduces time complexity.

Advantages: easy to implement.

Disadvantages: matching is slow; adding unregistered (out-of-vocabulary) words is difficult; there is no self-learning ability.

1.2 Statistics-based segmentation (dictionary-free segmentation)

Main idea: in context, the more often adjacent characters occur together, the more likely they are to form a word. The probability or frequency with which characters co-occur next to each other is therefore a good indicator of how credible a word is.

The main statistical models are the N-gram model and the Hidden Markov Model (HMM).
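To make the idea concrete, the sketch below counts how often each pair of adjacent characters occurs in a tiny corpus. The corpus and the function name `adjacent_pair_counts` are invented for illustration; a real system would use a large corpus and usually a normalized score rather than raw counts.

```python
from collections import Counter


def adjacent_pair_counts(corpus):
    """Count every pair of neighbouring characters in a list of sentences."""
    counts = Counter()
    for line in corpus:
        for left, right in zip(line, line[1:]):
            counts[left + right] += 1
    return counts


# Tiny made-up corpus, for illustration only.
corpus = ["我爱北京", "北京很大", "我在北京"]
counts = adjacent_pair_counts(corpus)
print(counts.most_common(1))
```

The pair "北京" occurs in all three sentences while every other pair occurs once, so it is the strongest word candidate in this corpus.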

1.2.1 The idea of the N-gram model

The model rests on the assumption that the occurrence of the nth word depends only on the preceding n-1 words and on no other word, and that the probability of a whole sentence is the product of the conditional probabilities of its words.

Given a word, guess what the next word will be. When I say "photo scandal", what word comes to mind next? Most people will probably think of "Edison"; hardly anyone will think of "Chen Zhijie". This is the main idea of the N-gram model.

For a sentence T, how do we compute the probability that it occurs? Suppose T consists of the word sequence $w_1, w_2, w_3, \ldots, w_n$; then

$P(T) = P(w_1 w_2 w_3 \cdots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1})$

However, this method has two fatal defects: the parameter space is too large to be practical, and the data are severely sparse.

To solve this problem, we introduce the Markov assumption: the occurrence of a word depends only on a limited number of words (one or a few) that appear before it.

If the occurrence of a word depends only on the one word immediately before it, we call the model a bigram. That is:
$P(T) = P(w_1 w_2 w_3 \cdots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1})$
$\approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$

If the occurrence of a word depends only on the two words immediately before it, we call the model a trigram.

In practice, bigrams and trigrams are used most and work very well. Models of order four or higher are rarely used: training them needs a much larger corpus, data sparsity is severe, time complexity is high, and accuracy improves little.

Let $w_1, w_2, w_3, \ldots, w_n$ be a string of length n and stipulate that any word $w_i$ depends only on the two words before it. This yields the trigram probability model:
$P(T) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_{n-2} w_{n-1})$

In other words, the N-gram model assumes that the probability of the current word depends only on the n-1 words before it.
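The bigram case can be sketched as follows, estimating each conditional probability as a ratio of counts and using the product to compare two candidate segmentations. Several things here are assumptions made for the example: the tiny training corpus, the function names, the use of a start marker `<s>` so that the first word is scored as P(w1 | <s>), and the absence of smoothing (any unseen pair gives probability 0).

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker


def train_bigram(sentences):
    """Count single words and adjacent word pairs in pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = [START] + words
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_probability(words, unigrams, bigrams):
    """P(T) ~ product of P(w_i | w_{i-1}), estimated as count(prev, cur) / count(prev)."""
    prob = 1.0
    padded = [START] + words
    for prev, cur in zip(padded, padded[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history, no smoothing in this sketch
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


# Tiny made-up training corpus, already segmented.
corpus = [
    ["研究", "生命", "的", "起源"],
    ["研究", "生命"],
    ["研究生", "毕业"],
]
unigrams, bigrams = train_bigram(corpus)

candidate_a = ["研究", "生命", "的", "起源"]
candidate_b = ["研究生", "命", "的", "起源"]
p_a = bigram_probability(candidate_a, unigrams, bigrams)
p_b = bigram_probability(candidate_b, unigrams, bigrams)
print(p_a, p_b)
```

The candidate with the higher probability would be chosen; in a realistic setting smoothing is needed so that one unseen pair does not zero out an otherwise plausible sentence.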

1.2.2 The idea of the Hidden Markov model

1.3 Rule-based segmentation (semantics-based)

This approach recognizes words by simulating how people comprehend sentences. The basic idea is to carry out syntactic and semantic analysis and to use the syntactic and semantic information to segment the text. Its advantages are automatic inference and the ability to supplement unregistered words. The approach is not yet mature.

Specific concepts: finite state machine, grammar constraint matrix, feature lexicon.

1.4 Chinese word segmentation based on character tagging

The previous segmentation methods, whether rule-based or statistical, generally rely on a pre-compiled word list (dictionary). Automatic segmentation then makes its decisions using the word list and related information. In contrast, segmentation based on character tagging is in effect a word-building method: segmentation is treated as a labeling problem over the characters of a string. Each character occupies a definite word-formation position (its character position) within a particular word. If each character is allowed at most four positions, B (word-initial), M (word-medial), E (word-final) and S (single-character word), then the segmentation result of sentence (a) below can be expressed directly in the character-by-character tagged form shown in (b):

(a) Segmentation result: /上海/计划/N/本/世纪/末/实现/人均/国内/生产/总值/五千美元/。/ (Shanghai / plans / N / this / century / end / achieve / per capita / domestic / production / gross value / 5,000 USD / .)
(b) Character-tagged form: 上/B 海/E 计/B 划/E N/S 本/S 世/B 纪/E 末/S 实/B 现/E 人/B 均/E 国/B 内/E 生/B 产/E 总/B 值/E 五/B 千/M 美/M 元/E 。/S
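The mapping from a segmentation to B/M/E/S tags is mechanical and can be sketched as below. The function name `words_to_tags` is an assumption, and the sample words are taken to be the Chinese form of the beginning of the example sentence.

```python
def words_to_tags(words):
    """Turn a list of words into (character, tag) pairs using B/M/E/S."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))  # single-character word
        else:
            tagged.append((word[0], "B"))                   # word-initial
            tagged.extend((ch, "M") for ch in word[1:-1])   # word-medial
            tagged.append((word[-1], "E"))                  # word-final
    return tagged


# Assumed Chinese form of the start of the example sentence.
sample = ["上海", "计划", "N", "本", "世纪", "末"]
print(" ".join(f"{ch}/{tag}" for ch, tag in words_to_tags(sample)))
print(words_to_tags(["五千美元"]))
```

Applying this to a segmented corpus is how training data for a character tagger is usually prepared: every character receives exactly one of the four tags.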

Note first that "character" here is not limited to Chinese characters. Since real Chinese text inevitably contains some non-Chinese characters, "character" in this article also covers foreign letters, Arabic numerals and punctuation marks. All of these are basic word-building units. Of course, Chinese characters remain by far the most numerous members of this set.
An important advantage of treating segmentation as character tagging is that it handles in-vocabulary words and unregistered (out-of-vocabulary) words in a balanced way. In this technique, both kinds of words in a text are recognized by the same character-tagging process. In the learning architecture there is no need to give special weight to word-list information, nor to design dedicated recognition modules for particular unregistered words (such as person names, place names and organization names). This greatly simplifies the design of the segmenter. During character tagging, a probabilistic model is learned for all characters from predefined character-position features. Then, on the string to be segmented, a character-position tagging is obtained according to how tightly adjacent characters combine. Finally, the segmentation result follows directly from the definition of the character positions. In short, under this approach segmentation becomes a simple process of recombining characters. Yet the results of this simple process are satisfactory.
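The last step, reading the segmentation off the tags, can be sketched as follows. The tags are simply given by hand here; in a real system they would come from the learned probabilistic model, which this sketch does not implement. The function name `tags_to_words` is an assumption.

```python
def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        # B or S starts a new word; flush anything left unfinished.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # E or S ends the current word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends without E or S
        words.append(current)
    return words


print(tags_to_words("上海计划", "BEBE"))
print(tags_to_words("五千美元。", "BMMES"))
```

Nothing in this step consults a dictionary, so a word never seen in training can still be produced as long as its characters receive a consistent tag sequence.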

2.1 Difficulties in Chinese word segmentation

1. Ambiguity

The hardest and most central problem: segmentation by mechanical matching alone cannot reach high accuracy and cannot meet demanding requirements.

Types: intersection (overlapping) ambiguity, combinatorial ambiguity, true ambiguity.

Ambiguity is resolved by relying on context and semantics.

2. Unregistered (out-of-vocabulary) word recognition
