[Repost] MMSEG Chinese Word Segmentation Algorithm

Source: Internet
Author: User
http://leeing.org/2009/11/01/mmseg-chinese-segmentation-algorithm/

Address: http://technology.chtsai.org/mmseg/

MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm

Document updated: 2000-03-12

License: free for non-commercial use

Copyright 1996-2006 Chih-Hao Tsai (Email: hao520@yahoo.com)

Summary

A problem in the computational analysis of Chinese text is that printed Chinese lacks word boundaries. Because words are a basic semantic unit, it is necessary to identify the words in Chinese text before further processing. This paper describes a Chinese word identification system based on two variants of the maximum matching algorithm. The system consists of a dictionary, two matching algorithms, and four types of ambiguity resolution rules. On a sample of 1013 words, the system achieved a correct identification rate of 98.41%. Potential applications of the system are also discussed.

Introduction

As Hung and Tzeng (1981) and DeFrancis (1984) have pointed out, the Chinese writing system maps onto spoken Chinese at the levels of morpheme and syllable. As a result, character boundaries are clearly marked in written Chinese; word boundaries, however, are by convention not marked in Chinese printing and writing.

Difficulties in Word Identification

Since words are a basic unit of language, it is necessary to identify the words in Chinese text before the text can be computed, analyzed, and processed. Word identification, however, faces several difficulties:

First, almost any Chinese character can be a word in itself, and most characters can also combine with other characters to form multi-character words, which leads to a large number of segmentation ambiguities. Second, compounding is the dominant word-formation process in modern Chinese, and it is often hard to tell whether a low-frequency compound is a word or a phrase. Identifying proper names is also a problem. Finally, certain morphological structures such as reduplication and the "A-not-A" construction also need to be considered.

With a few exceptions (e.g., Huang, Ahrens, & Chen, 1993; Sproat & Shih, 1990), most word identification methods share a common approach (e.g., Chen & Liu, 1992; Fan & Tsai, 1988; Yeh & Lee, 1991). The basic strategy is to match the input character string against a large set of words stored in a precompiled dictionary, in order to find all (or some) of the possible segmentations. Since there is usually more than one possible segmentation but only one correct one, the ambiguity must be resolved.

The Maximum Matching Algorithm and Its Variants

Different studies resolve ambiguity in different ways. A simple and effective method is the maximum matching algorithm (Chen & Liu, 1992), which can take several forms.

Simple maximum matching. The basic form resolves the ambiguity of a single word (Yi-Ru Li, personal communication, January 14, 1995). Suppose C1, C2, ... represent the characters of a string. We start at the head of the string and want to know where the first word ends. We first look up _C1_ in the dictionary to see whether it is a one-character word, then look up _C1C2_ to see whether it is a two-character word, and so on, until the longest match in the dictionary is found. That longest match is taken as the most plausible word. We accept the word and repeat the process from the next character, until the last word of the string is identified.
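As a rough illustration, here is a minimal Python sketch of the simple algorithm. The toy dictionary, the assumed maximum word length of eight characters (matching the dictionary described later), and the function name are illustrative choices, not part of the original system.

```python
def simple_maximum_match(text, dictionary, max_word_len=8):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # a lone character is the fallback "word"
        # Try ever-longer substrings and keep the longest dictionary hit.
        for j in range(2, max_word_len + 1):
            candidate = text[i:i + j]
            if len(candidate) < j:
                break  # ran past the end of the string
            if candidate in dictionary:
                match = candidate
        words.append(match)
        i += len(match)
    return words

# Hypothetical toy dictionary and input.
dictionary = {"研究", "研究生", "生命", "起源"}
print(simple_maximum_match("研究生命起源", dictionary))
# -> ['研究生', '命', '起源']
```

Note that the greedy strategy can fail: in the toy example above, the intended reading 研究/生命/起源 is missed because the longer word 研究生 shadows it. This is exactly the kind of ambiguity the complex variant below addresses.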

Complex maximum matching. Another variant of the maximum matching algorithm was proposed by Chen and Liu (1992) and is more complex than the basic form. Their maximum matching rule states that the most plausible segmentation is the three-word chunk with the greatest total length. Again we start from the head of the string and look for a segmentation. If the first word is ambiguous (for example, _C1_ is a word, but _C1C2_ is also a word, and so on), we look ahead two more words to find all possible three-word chunks beginning with _C1_ or _C1C2_. For example, suppose these are the possible three-word chunks:

1. _C1_ _C2_ _C3C4_

2. _C1C2_ _C3C4_ _C5_

3. _C1C2_ _C3C4_ _C5C6_

The chunk with the greatest total length is the third one. Its first word, _C1C2_, is taken to be correct. We accept this word and restart the process from character C3, continuing until the last word of the string is identified. Chen and Liu (1992) claim that this rule achieves 99.69% accuracy and that 93.21% of ambiguities can be resolved by it.
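A sketch of the chunk-building step, under the same illustrative assumptions as above: enumerate every segmentation of up to three words starting at a position, then keep the chunk(s) of greatest total length.

```python
def get_chunks(text, pos, dictionary, max_word_len=8, depth=3):
    """Enumerate all chunks of up to `depth` words starting at `pos`.
    A chunk is represented as a list of word strings."""
    if depth == 0 or pos == len(text):
        return [[]]
    chunks = []
    for j in range(1, max_word_len + 1):
        word = text[pos:pos + j]
        if len(word) < j:
            break  # past the end of the string
        # A lone character is always a candidate word.
        if j == 1 or word in dictionary:
            for rest in get_chunks(text, pos + j, dictionary,
                                   max_word_len, depth - 1):
                chunks.append([word] + rest)
    return chunks

def max_length_chunks(chunks):
    """Keep only the chunks with the greatest total character length."""
    best = max(sum(len(w) for w in c) for c in chunks)
    return [c for c in chunks if sum(len(w) for w in c) == best]
```

Applied to the toy string from the previous sketch, `max_length_chunks(get_chunks("研究生命起源", 0, dictionary))` keeps the correct chunk ['研究', '生命', '起源'] among the survivors, recovering the reading the greedy version missed; the later rules (smallest variance, in this case) then settle the remaining tie.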

Other Ambiguity Resolution Algorithms

Besides the maximum matching algorithm, many other ambiguity resolution algorithms have been developed. They draw on various kinds of information, such as probability and statistics (Chen & Liu, 1992; Fan & Tsai, 1988), syntax (Yeh & Lee, 1991), and word morphology (Chen & Liu, 1992). Most of them require a well-constructed dictionary with character and word frequencies and syntactic categories for words, plus a set of syntactic or morphological rules (for example, the Chinese Knowledge Information Processing group [CKIP], 1993a, 1993b, 1993c).

MMSEG System Overview

The MMSEG system implements both the simple and the complex forms of the maximum matching algorithm discussed above. In addition, to resolve the ambiguities that the complex maximum matching algorithm leaves unresolved, three further ambiguity resolution rules are implemented.

One of these rules was proposed by Chen and Liu (1992); the other two are new. The rules are discussed below. The system has no special rules for handling proper names or special morphological structures such as reduplication and the "A-not-A" construction.

That said, MMSEG is not designed as a professional system aiming at a 100% correct identification rate. Rather, MMSEG should be viewed as a general platform on which new ambiguity resolution algorithms can be tested. Even so, as we shall see, the current version of MMSEG already achieves very high accuracy, comparable to some algorithms published in academic journals.

Dictionary

The first part of the dictionary consists of 124,499 multi-character entries, ranging in length from two to eight characters. (The distribution of word lengths is given in the appendix.) This part of the dictionary is simply an organized list of character strings; no other information is attached to any string. It is based on a list of 137,450 Chinese words maintained by the author (Tsai, 1996c), which was created by merging Chinese word lists available on the Internet (Tsai, 1996a).

The second part of the dictionary consists of 13,060 Chinese characters and their frequencies of occurrence (Tsai, 1996b). The character frequencies are used by the last ambiguity resolution rule.
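A minimal sketch of how such a two-part dictionary might be loaded. The file names and line formats here are assumptions for illustration; they are not the actual distribution files.

```python
def load_dictionary(words_path="words.txt", freq_path="char_freq.txt"):
    """Load the two dictionary parts.

    Assumed formats: one word per line in words.txt;
    "<character> <frequency>" per line in char_freq.txt.
    """
    with open(words_path, encoding="utf-8") as f:
        words = {line.strip() for line in f if line.strip()}
    char_freq = {}
    with open(freq_path, encoding="utf-8") as f:
        for line in f:
            char, freq = line.split()
            char_freq[char] = int(freq)
    return words, char_freq
```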

Matching Algorithms

Simple matching: for a character Cn in a string, match the substring beginning at Cn against the dictionary to find all possible words that begin with Cn.

Complex matching: for a character Cn in a string, look for all possible three-word chunks beginning with Cn, whether or not the chunks themselves are ambiguous. Three-word chunks are formed only when the first word is ambiguous.
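Simple matching then amounts to collecting every dictionary word that begins at a given position. A sketch, again assuming a set-based dictionary:

```python
def match_words(text, pos, dictionary, max_word_len=8):
    """All dictionary words starting at text[pos]; the single
    character itself is always included as a candidate."""
    matches = [text[pos]]
    for j in range(2, max_word_len + 1):
        word = text[pos:pos + j]
        if len(word) == j and word in dictionary:
            matches.append(word)
    return matches
```

There is ambiguity about the first word exactly when this list contains more than one entry; in that case the complex variant builds three-word chunks from each candidate, as in the `get_chunks` sketch above.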

Ambiguity Resolution Rules

Four types of ambiguity resolution rules are used. The maximum matching rule is used by both the simple and the complex matching algorithms to resolve segmentation ambiguity. The remaining three rules are not (and cannot be) applied to the simple matching algorithm.

Rule 1: Maximum matching (Chen & Liu, 1992).

A) Simple maximum matching: take the word of maximum length.

B) Complex maximum matching: take the first word of the chunk with the greatest total length. If more than one chunk has the maximum length, apply the next rule.

Rule 2: Largest average word length (Chen & Liu, 1992). At the end of a string, it is likely that we get chunks containing only one or two words. For example, the following chunks have the same total length and the same variance of word lengths:

1. _C1_ _C2_ _C3_

2. _C1C2C3_
Rule 2 takes the first word of the chunk with the largest average word length. In the example above, it takes _C1C2C3_ from the second chunk. This rule rests on the assumption that multi-character words are more likely to be encountered than one-character words.

Note that this rule is useful only when one or more word positions in the chunks are empty. When the chunks are genuine three-word chunks, the rule cannot help, because three-word chunks with the same total length necessarily have the same average word length; hence another solution is needed.
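As a sketch, Rule 2 can be written as a filter over the chunks that survive Rule 1, reusing the list-of-words chunk representation from the earlier sketches (the function name is illustrative):

```python
def rule2_largest_average_length(chunks):
    """Keep the chunks with the largest average word length.
    Assumes non-empty chunks; only discriminates when some
    chunks contain fewer than three words."""
    def avg(c):
        return sum(len(w) for w in c) / len(c)
    best = max(avg(c) for c in chunks)
    return [c for c in chunks if avg(c) == best]
```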

Rule 3: Smallest variance of word lengths (Chen & Liu, 1992). A few ambiguous cases remain that Rule 1 and Rule 2 cannot resolve. For example, the following two chunks have the same total length and the same average word length:

1. _C1C2_ _C3C4_ _C5C6_

2. _C1C2C3_ _C4_ _C5C6_
Rule 3 takes the first word of the chunk with the smallest variance of word lengths. In the example above, it takes _C1C2_ from the first chunk. This rule is identical to the one proposed by Chen and Liu (1992), except that they apply it immediately after Rule 1. It assumes that word lengths are usually evenly distributed. If more than one chunk has the smallest variance of word lengths, the next rule is applied.
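A corresponding sketch of Rule 3, with the variance computed over the word lengths of each chunk:

```python
def rule3_smallest_variance(chunks):
    """Keep the chunks with the smallest variance of word lengths."""
    def variance(c):
        mean = sum(len(w) for w in c) / len(c)
        return sum((len(w) - mean) ** 2 for w in c) / len(c)
    best = min(variance(c) for c in chunks)
    return [c for c in chunks if variance(c) == best]
```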

Rule 4: Largest sum of degree of morphemic freedom of one-character words. The following example shows two chunks with the same total length, the same variance, and the same average word length:

1. _C1_ _C2_ _C3C4_

2. _C1_ _C2C3_ _C4_

Both chunks contain two one-character words and one two-character word. Which one is more likely to be correct? Here we focus on the one-character words. Chinese characters differ in their degree of morphemic freedom: some characters are rarely used as free morphemes, while others have a larger degree of freedom. The frequency of occurrence of a character can serve as an index of its degree of morphemic freedom: a high-frequency character is more likely to stand alone as a one-character word, and vice versa.

The formula for the sum of degree of morphemic freedom is to sum the logarithms of the frequencies of all one-character words in a chunk. The rationale for the logarithmic transformation is that the same amount of frequency difference does not have a consistent effect across all frequency ranges.

Rule 4 takes the first word of the chunk with the largest sum of log frequencies. Since it is unlikely that two characters will have exactly the same frequency value, no ambiguity should remain after this rule is applied.
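A sketch of Rule 4, using the character-frequency table from the second part of the dictionary; the default frequency of 1 for unseen characters (giving log 1 = 0) is an assumption for illustration:

```python
import math

def rule4_morphemic_freedom(chunks, char_freq):
    """Keep the chunks whose one-character words have the largest
    sum of log(frequency); char_freq maps character -> count."""
    def score(c):
        return sum(math.log(char_freq.get(w, 1))
                   for w in c if len(w) == 1)
    best = max(score(c) for c in chunks)
    return [c for c in chunks if score(c) == best]
```

Chaining the four rules in order (maximum length, then largest average length, then smallest variance, then largest morphemic freedom) and taking the first word of the single surviving chunk gives the complete complex algorithm described above.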

Although this article is quite old, it is still very enlightening for NLP beginners.
