Chinese Word Segmentation Algorithm note


Main categories of basic Chinese word segmentation algorithms

Dictionary-based methods, statistics-based methods, and rule-based methods. (The original figure also lists understanding-based methods, such as neural networks and expert systems, which are not covered below.)

1. Dictionary-based methods (string matching, also called mechanical word segmentation)

Definition: match the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to some chosen policy. If a string is found in the dictionary, the match succeeds and a word is identified.

Based on scanning direction: forward matching and reverse matching.

Based on preferred match length: maximum matching and minimum matching.

1.1 Forward maximum matching (MM)

1. From left to right, take the first M characters of the sentence to be segmented as the matching field, where M is the length of the longest entry in the machine dictionary.

2. Look the matching field up in the machine dictionary. If the match succeeds, split the field off as a word.

3. If the match fails, remove the last character of the matching field and use the remaining string as the new matching field for re-matching. Repeat the above process until the whole sentence has been split.
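The steps above can be sketched in Python. The tiny vocabulary and the sample sentence are illustrative assumptions, not from the source:

```python
def forward_max_match(text, dictionary, max_len):
    """Forward maximum matching (MM): repeatedly take the longest
    dictionary entry that starts at the current position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest field first, shrinking it from the right on failure.
        for j in range(min(max_len, len(text) - i), 0, -1):
            field = text[i:i + j]
            if j == 1 or field in dictionary:
                # An unmatched single character is emitted as-is.
                words.append(field)
                i += j
                break
    return words

# Toy dictionary for illustration; max_len is the length of the
# longest entry, as the algorithm requires.
vocab = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", vocab, max_len=3))
# → ['研究生', '命', '的', '起源']  (MM greedily takes "研究生")
```

Note that MM mis-segments this classic example: the intended reading is 研究/生命/的/起源 ("study the origin of life"), which is part of the motivation for the reverse and bidirectional variants.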

1.2 Reverse maximum matching (RMM)

This algorithm mirrors forward maximum matching: scanning runs from right to left, and when a match fails the first character of the matching field is removed instead of the last. Experiments show that reverse maximum matching generally outperforms forward maximum matching for Chinese.

1.3 Bidirectional maximum matching (bi-directional matching method, BM)

The bidirectional maximum matching method compares the result of forward maximum matching with that of reverse maximum matching to decide on the correct segmentation. According to the study by Sun M. S. and Benjamin K. T. (1995), about 90.0% of Chinese sentences are segmented identically and correctly by the forward and reverse methods; about 9.0% yield different results from the two methods, but one of the two is correct (ambiguity successfully detected); and in less than 1.0% of sentences either both methods agree on an incorrect segmentation, or they disagree and both are incorrect (ambiguity detection fails). This is why the bidirectional maximum matching method is widely used in practical Chinese information processing systems.
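A hedged sketch of RMM and a bidirectional combiner follows. The selection heuristic (prefer the result with fewer words, then fewer single-character words) is a common convention assumed here, not prescribed by the source; `forward_max_match` is repeated so the snippet is self-contained:

```python
def forward_max_match(text, dictionary, max_len):
    """Forward maximum matching (MM), as sketched earlier."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if j == 1 or text[i:i + j] in dictionary:
                words.append(text[i:i + j])
                i += j
                break
    return words

def reverse_max_match(text, dictionary, max_len):
    """RMM: scan right to left; on failure drop the FIRST character
    of the matching field instead of the last."""
    words, i = [], len(text)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            if j == 1 or text[i - j:i] in dictionary:
                words.insert(0, text[i - j:i])
                i -= j
                break
    return words

def bidirectional_match(text, dictionary, max_len):
    """BM: run MM and RMM and keep the result judged more plausible.
    Heuristic (an assumption): fewer words, then fewer single chars."""
    fwd = forward_max_match(text, dictionary, max_len)
    rev = reverse_max_match(text, dictionary, max_len)
    if len(fwd) != len(rev):
        return min(fwd, rev, key=len)
    singles = lambda ws: sum(len(w) == 1 for w in ws)
    return rev if singles(rev) < singles(fwd) else fwd

vocab = {"研究", "研究生", "生命", "的", "起源"}
print(reverse_max_match("研究生命的起源", vocab, 3))    # ['研究', '生命', '的', '起源']
print(bidirectional_match("研究生命的起源", vocab, 3))  # RMM wins: fewer single chars
```

On this example MM and RMM disagree and RMM is correct, which is exactly the ~9% disagreement case the statistics above describe.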

1.4 Segmentation-mark method

Collect segmentation marks (characters that reliably signal word boundaries), use them to pre-split the text before automatic word segmentation, and then apply MM and RMM for fine-grained processing.

1.5 Optimal matching (OM, with forward and reverse variants)

The word segmentation dictionary is ordered by word frequency, and entry lengths are recorded up front, which reduces the time complexity of matching.

Advantages: easy to implement.

Disadvantages: matching is slow; unregistered (out-of-vocabulary) words are hard to add; there is no self-learning ability.

2. Statistics-based word segmentation (dictionary-free segmentation)

Main idea: the more often adjacent characters co-occur in context, the more likely they are to form a word. The co-occurrence frequency or probability of adjacent characters therefore reflects how credible a candidate word is.

Main statistical models: the n-gram model and the Hidden Markov Model (HMM).

2.1 The n-gram model

The model rests on the assumption that the appearance of the Nth word depends only on the preceding N-1 words and on nothing else; the probability of a whole sentence is then the product of the conditional probabilities of its words.

Suppose we are given a word and must guess the next one. When I say "Yanzhaomen", what do you expect the next word to be? Most people will very likely think of "***", and almost no one of "Chen Zhijie". This is the main idea of the n-gram model.

How do we calculate the probability of a sentence T? Suppose T is the word sequence W1, W2, ..., Wn. By the chain rule:

P(T) = P(W1 W2 ... Wn) = P(W1) P(W2|W1) P(W3|W1 W2) ... P(Wn|W1 W2 ... Wn-1)

However, this method has two fatal defects: the parameter space is too large to be practical, and the data are severely sparse.

To solve this problem, we introduce the Markov assumption: the appearance of a word depends only on the limited number of words that appear immediately before it.

If a word depends only on the single word before it, the model is called a bigram. That is:

P(T) = P(W1 W2 ... Wn) = P(W1) P(W2|W1) P(W3|W1 W2) ... P(Wn|W1 W2 ... Wn-1)
     ≈ P(W1) P(W2|W1) P(W3|W2) ... P(Wn|Wn-1)

If the appearance of a word depends only on the two words before it, the model is called a trigram.

Bigram and trigram models are the most widely used in practice, with very good results. Models of order four and above are rarely used: they require a much larger training corpus, data sparsity worsens, and time complexity grows, while accuracy improves little.

Let W1, W2, ..., Wn be a string of length N. Requiring that any word Wi depends only on the two words before it yields the trigram probability model.

Likewise, the n-gram model assumes that the probability of the current word depends only on the N-1 words before it.
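As a minimal sketch of the bigram case, the conditional probabilities can be estimated by maximum likelihood from counts over a pre-segmented corpus. The toy corpus below is invented for illustration, and real systems would add smoothing against the sparsity problem mentioned above:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) by maximum likelihood from a
    pre-segmented corpus (a list of token lists). No smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent          # sentence-start marker
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    def prob(word, prev):
        # count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

corpus = [["我", "爱", "北京"], ["我", "爱", "上海"]]
p = train_bigram(corpus)
print(p("爱", "我"))    # 1.0  ("爱" always follows "我" in this corpus)
print(p("北京", "爱"))  # 0.5  ("北京" follows "爱" in one of two cases)
```

A sentence probability is then the product of these conditional terms, exactly as in the bigram approximation above.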

2.2 Hidden Markov Model (HMM)
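The source leaves this section empty. As a hedged sketch only: an HMM segmenter treats each character's word-position tag (B/M/E/S, as described in the character-tagging section later in this note) as a hidden state and the characters as observations, decoding with the Viterbi algorithm. All probabilities below are hand-set illustrative assumptions, not trained values:

```python
import math

# Hidden states are word-position tags: B(egin), M(iddle), E(nd), S(ingle).
STATES = "BMES"
NEG_INF = float("-inf")

# Log transition probabilities; pairs not listed are impossible
# (e.g. B cannot follow B). Values here are illustrative assumptions.
TRANS = {
    ("B", "M"): math.log(0.3), ("B", "E"): math.log(0.7),
    ("M", "M"): math.log(0.3), ("M", "E"): math.log(0.7),
    ("E", "B"): math.log(0.5), ("E", "S"): math.log(0.5),
    ("S", "B"): math.log(0.5), ("S", "S"): math.log(0.5),
}
# A sentence can only start inside B or S.
START = {"B": math.log(0.5), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.5)}

def viterbi(chars, emit):
    """emit(state, char) -> log P(char | state). Returns the best tag string."""
    v = [{s: START[s] + emit(s, chars[0]) for s in STATES}]
    back = []
    for c in chars[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + TRANS.get((p, s), NEG_INF))
            row[s] = v[-1][prev] + TRANS.get((prev, s), NEG_INF) + emit(s, c)
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    tag = max("ES", key=lambda s: v[-1][s])  # a word must end in E or S
    tags = [tag]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return "".join(reversed(tags))

def tags_to_words(chars, tags):
    """Read the segmentation off the tag sequence."""
    words, cur = [], ""
    for c, t in zip(chars, tags):
        cur += c
        if t in "ES":
            words.append(cur)
            cur = ""
    return words

# With uniform emissions the decode is driven purely by the transitions.
chars = list("我们来了")
tags = viterbi(chars, lambda s, c: 0.0)
print(tags, tags_to_words(chars, tags))
```

In a real segmenter the start, transition, and emission probabilities are estimated from a tagged corpus, as in the character-tagging approach described later in this note.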

3. Rule-based word segmentation (semantics-based)

These methods simulate a person's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis and use syntactic and semantic information to resolve segmentation. An advantage is the ability to reason automatically and thereby handle unregistered words; however, such methods are still immature.

Key concepts: finite state machines, syntactic constraint matrices, feature dictionaries.

4. Chinese word segmentation based on character tagging

Previous segmentation methods, whether rule-based or statistics-based, generally rely on a pre-compiled word list (dictionary): automatic segmentation makes its decisions using the word list and related information. In contrast, segmentation based on character tagging is a word-building method: the segmentation process is treated as a character-labeling problem over the string. Since each character occupies a definite word-formation position (i.e., a character position) when it forms part of a particular word, and we allow each character at most four positions: B (first character of a word), M (middle), E (last), and S (single-character word), the segmentation result of sentence (a) below can be expressed directly as the character-by-character annotation in (b):

(a) Segmentation result (gloss: Shanghai / plans / to / this / century('s) / end / achieve / per-capita / domestic / production / total value / five thousand US dollars): 上海/计划/到/本/世纪/末/实现/人均/国内/生产/总值/五千美元/
(b) Character-tagging form: 上/B 海/E 计/B 划/E 到/S 本/S 世/B 纪/E 末/S 实/B 现/E 人/B 均/E 国/B 内/E 生/B 产/E 总/B 值/E 五/B 千/M 美/M 元/E 。/S

First, note that the "characters" here are not limited to Chinese characters (hanzi). Since real Chinese text inevitably contains some non-hanzi symbols, "character" in this article also covers foreign letters, Arabic numerals, punctuation marks, and so on. All of these characters are basic units of word formation, though hanzi remain by far the most frequent units in the set.
Treating segmentation as character tagging has an important advantage: it balances the recognition of in-vocabulary and out-of-vocabulary words. In this technique, both dictionary words and unregistered words in the text are recognized through the same unified character-tagging process. In the learning architecture there is no need for special word-list information, nor for dedicated recognition modules for unregistered words (such as person names, place names, and organization names), which greatly simplifies the design of the segmenter. During tagging, a probability model is learned from predefined features over all characters; each character in the string to be segmented is then labeled with a position tag according to how tightly it binds to its neighbors; finally, the segmentation result is read off from the definitions of the position tags. In short, segmentation becomes a simple process of regrouping characters, and the results of this simple process are quite satisfactory.
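The mapping between a segmented sentence and per-character position tags is mechanical. A minimal sketch (the example words are illustrative):

```python
def words_to_tags(words):
    """Convert segmented words to per-character position tags:
    B = first, M = middle, E = last character of a multi-char word,
    S = single-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend("B" + "M" * (len(w) - 2) + "E")
    return tags

print(words_to_tags(["上海", "计划", "实现", "五千美元"]))
# → ['B', 'E', 'B', 'E', 'B', 'E', 'B', 'M', 'M', 'E']
```

A tagger trained on such (character, tag) pairs, with the mapping inverted at decode time, is the whole segmentation pipeline described above.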

5. Difficulties in Chinese word segmentation

5.1 Ambiguity

This is the hardest and most central problem: with mechanical matching alone, segmentation accuracy cannot reach a high standard.

Types: intersection (overlapping) ambiguity, combination ambiguity, and true ambiguity.

These are resolved using context and semantic information.

5.2 Unregistered word recognition

By lvpei.cnblogs.com
