Chapter 4: Word Segmentation Principles and the Chinese Word Segmentation Algorithm
Pre-processing of indexed webpage information includes webpage analysis and building the inverted file index. Automatic Chinese word segmentation is a prerequisite for webpage analysis. A document is composed of index terms called feature items, and webpage analysis is the process of representing a document as a set of feature items. When extracting feature items, Chinese faces problems that English does not. The major difference is that English words are separated by spaces, whereas Chinese text has no natural separator between words. Most Chinese words consist of two or more Chinese characters, and sentences are written continuously. Before Chinese text can be analyzed automatically, each sentence must therefore be cut into small word units. This step is Chinese word segmentation.
For example, a continuously written phrase such as "my notebook" is segmented into three vocabulary units: "me", "of", and "notebook". In retrieval and document classification systems, the speed of the automatic word segmentation component affects the efficiency of the entire system. There are two types of Chinese information retrieval: character-based retrieval and word-based retrieval. A character-based system builds its index on single characters. At retrieval time it fetches the index entry of each character in the query and then applies the appropriate logical operations to obtain the result. A word-based system builds its index on words, so a query word is hit with a single lookup.
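The difference between the two retrieval styles can be sketched in a few lines. This is only an illustration: the documents, the pre-segmented word lists, and the names `build_index` and `search_by_characters` are all hypothetical, and a real character-based system would also need position information to check that the characters are adjacent.

```python
from collections import defaultdict

def build_index(tokenized_docs):
    """Map every index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

# Hypothetical documents, invented for this illustration.
docs = {1: "我的笔记本", 2: "笔记", 3: "我的本子"}

# Character-based index: every single character is an index term.
char_index = build_index({doc_id: list(text) for doc_id, text in docs.items()})

# Word-based index: assumes a segmenter has already produced these words.
segmented = {1: ["我", "的", "笔记本"], 2: ["笔记"], 3: ["我", "的", "本子"]}
word_index = build_index(segmented)

def search_by_characters(index, query):
    """AND together the posting sets of every character in the query."""
    postings = [index.get(ch, set()) for ch in query]
    return set.intersection(*postings) if postings else set()

# Character-based search needs one lookup per character plus an intersection.
char_hits = search_by_characters(char_index, "笔记本")
# Word-based search needs a single lookup.
word_hits = word_index.get("笔记本", set())
```

Both searches return document 1 here, but the character-based one performs three lookups and a set intersection, while the word-based one performs a single lookup.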
1, Algorithm Introduction
The basic methods for automatic word segmentation are string-matching-based word segmentation and statistics-based word segmentation.
String Matching-Based Word Segmentation
This method is also called mechanical word segmentation. It matches the Chinese character string to be analyzed against the entries of a sufficiently large dictionary according to a given strategy. If a string is found in the dictionary, the match succeeds (a word is recognized). Depending on which length is tried first, matching is divided into maximum (longest) matching and minimum (shortest) matching. Depending on whether it is combined with part-of-speech tagging, it is divided into pure word segmentation and integrated methods that combine segmentation with tagging. Several common mechanical word segmentation methods are as follows:
1) Forward maximum matching;
2) Reverse maximum matching;
3) Minimum segmentation (minimizing the number of words cut from each sentence).
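Item 3, minimum segmentation, can be read as a shortest-path problem over the characters of a sentence. The following is a minimal dynamic-programming sketch under that reading, not a method given in the text. The function name, the toy dictionary, and the rule that any single character is accepted as a fallback word are assumptions made for illustration.

```python
def min_word_segmentation(sentence, dictionary, max_len):
    """Cut the sentence into the fewest pieces that are dictionary words
    (single characters are always allowed as a fallback)."""
    n = len(sentence)
    # best[i] = fewest words covering sentence[:i]; back[i] = start of the last word.
    best = [0] + [n + 1] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = sentence[start:end]
            usable = piece in dictionary or end - start == 1
            if usable and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = start
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    words.reverse()
    return words

# Hypothetical toy dictionary.
toy_dictionary = {"我", "的", "笔记", "笔记本"}
result = min_word_segmentation("我的笔记本", toy_dictionary, 3)
```

With this toy dictionary the three-word cut 我 / 的 / 笔记本 is preferred over the four-word cut 我 / 的 / 笔记 / 本.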
You can also combine the methods above. For example, forward maximum matching and reverse maximum matching can be combined into a bidirectional matching method. Because a single character can form a word in Chinese, forward minimum matching and reverse minimum matching are rarely used. In general, reverse matching is slightly more accurate than forward matching and produces fewer ambiguities. Statistical results show an error rate of 1/169 for forward maximum matching and 1/245 for reverse maximum matching (possibly because the head of a Chinese phrase tends to come last). The mechanical word segmentation methods can be described by a general model, ASM(d, a, m), the Automatic Segmentation Model, where:
d: matching direction; + means forward, - means reverse;
a: whether the string length is increased or decreased on each attempt; + means increase, - means decrease;
m: maximum or minimum matching flag; + means maximum matching, - means minimum matching.
For example, ASM(+, -, +) is the forward decreasing maximum matching method (Maximum Match based approach, MM), ASM(-, -, +) is the reverse decreasing maximum matching method (the RMM method), and so on. For modern Chinese, only m = + is practical.
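The two practical members of the model, ASM(+, -, +) and ASM(-, -, +), can be sketched with one function. This is a minimal illustration rather than a reference implementation: the function name, the `direction` parameter, the toy dictionary, and the sample sentence are assumptions, and any unmatched single character is simply emitted as a word.

```python
def asm_decreasing_max(text, dictionary, max_len, direction="+"):
    """Decreasing maximum matching: ASM(+,-,+) for "+", ASM(-,-,+) for "-"."""
    words = []
    if direction == "+":
        start = 0
        while start < len(text):
            end = min(start + max_len, len(text))
            # Drop characters from the right until a word or one character is left.
            while end - start > 1 and text[start:end] not in dictionary:
                end -= 1
            words.append(text[start:end])
            start = end
    else:
        end = len(text)
        while end > 0:
            start = max(end - max_len, 0)
            # Drop characters from the left until a word or one character is left.
            while end - start > 1 and text[start:end] not in dictionary:
                start += 1
            words.append(text[start:end])
            end = start
        words.reverse()
    return words

# Hypothetical toy dictionary and sentence.
dictionary = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"
forward = asm_decreasing_max(sentence, dictionary, 3, "+")
reverse = asm_decreasing_max(sentence, dictionary, 3, "-")
# Bidirectional matching: a disagreement flags a possible ambiguity.
ambiguous = forward != reverse
```

On this sentence the forward pass yields 研究生 / 命 / 的 / 起源 and the reverse pass yields 研究 / 生命 / 的 / 起源, so comparing the two directions exposes the ambiguity.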
2, Statistics-based Word Segmentation Method
In form, a word is a stable combination of characters. The more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects how credible a candidate word is. By counting the frequency of each combination of adjacent characters in a corpus, their mutual information can be computed. The mutual information of Chinese characters X and Y is:

M(X, Y) = log( P(X, Y) / (P(X) P(Y)) )
Here, P(X, Y) is the probability that the Chinese characters X and Y co-occur adjacently, and P(X) and P(Y) are the probabilities that X and Y appear in the corpus. The mutual information reflects how tightly two Chinese characters are bound together. When it exceeds a threshold, the character group is considered likely to form a word. This method only requires frequency statistics on character groups in the corpus and needs no segmentation dictionary, so it is also called dictionary-free word segmentation or statistical word extraction.

The method has limitations. It often extracts character groups that co-occur frequently but are not words, such as "this", "one", "yes", "my", and "many". Its recognition accuracy for common words is also poor, and its time and space overhead is large.

In practice, a statistical word segmentation system uses a basic segmentation dictionary (a dictionary of common words) for string-matching segmentation and uses statistical methods to identify new words. Combining string frequency statistics with string matching keeps the speed and efficiency of matching-based segmentation while using dictionary-free segmentation, together with context, to identify new words and resolve ambiguity automatically. The most important indicator of a word segmentation algorithm is accuracy. Once accuracy is ensured, time complexity should also be considered. The following describes the forward decreasing maximum matching method.
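The mutual information computation described above can be sketched as follows. This is a minimal illustration under stated assumptions: probabilities are estimated as plain relative frequencies, the logarithm is taken in base 2 (the text does not fix a base), and the three-sentence corpus and the function name are hypothetical.

```python
import math
from collections import Counter

def mutual_information(corpus):
    """Return {(x, y): M(x, y)} for every adjacent character pair in the corpus."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)                      # single characters
        pair_counts.update(zip(sentence, sentence[1:]))   # adjacent pairs
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        # Base 2 is an assumption; any base gives the same ranking.
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical toy corpus, far too small for real statistics.
corpus = ["我的笔记本", "笔记本很好", "我的书"]
scores = mutual_information(corpus)
```

In this toy corpus the pair (笔, 记) scores higher than (的, 笔), which is the desired behavior. The pair (我, 的) scores exactly as high as (笔, 记), which illustrates the limitation noted above: frequent groups that are not words can still pass the threshold.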
2, Forward Decreasing Maximum Matching Method
The forward decreasing maximum matching method works as follows. Take a string of the preset length from the Chinese sentence and compare it with the dictionary. If the dictionary contains the string as a word, output it followed by a delimiter; otherwise, shorten the string and look it up in the dictionary again (the dictionary is prepared in advance).
Algorithm requirements:
Input: a Chinese dictionary; the text D to be segmented, in which sentences S1 are separated by punctuation marks; and the maximum word length Maxlen.
Output: each sentence S1 cut into strings no longer than Maxlen and separated by delimiters, giving S2. All S2 joined together form the segmented text of D.
Algorithm idea: Pre-process the webpage into plain text with one sentence per line. For each sentence S1 in D, scan from left to right and take a candidate string W of at most Maxlen characters. If W is in the dictionary, move on to the next candidate string of length Maxlen; otherwise, remove the last character of W and compare it with the dictionary again. Once W matches a word, or W has been reduced to a single character, output it to S2 followed by a separator. Remove W from S1 and continue with the remaining string until S1 has been fully processed.
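The steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names, the "/" delimiter, the set of punctuation marks used to split D into sentences, and the toy dictionary are all assumptions. The names `sentence`, `max_len`, and the returned strings correspond to S1, Maxlen, and S2 in the description.

```python
import re

def forward_max_match(sentence, dictionary, max_len):
    """Cut one sentence S1 with forward decreasing maximum matching."""
    words = []
    start = 0
    while start < len(sentence):
        # Candidate W: at most max_len characters starting at `start`.
        end = min(start + max_len, len(sentence))
        # Remove the last character until W is a word or a single character.
        while end - start > 1 and sentence[start:end] not in dictionary:
            end -= 1
        words.append(sentence[start:end])
        start = end  # remove W from S1 and continue with the rest
    return words

def segment_text(text, dictionary, max_len, delimiter="/"):
    """Split D into sentences on punctuation and return one S2 per sentence."""
    # The punctuation set is an assumption; extend it as needed.
    sentences = [s for s in re.split(r"[，。！？；、\s]+", text) if s]
    return [delimiter.join(forward_max_match(s, dictionary, max_len))
            for s in sentences]

# Hypothetical toy dictionary and text.
toy_dictionary = {"我", "的", "笔记", "笔记本", "很", "好用"}
segmented = segment_text("我的笔记本，很好用。", toy_dictionary, 3)
```

Here `segmented` is `["我/的/笔记本", "很/好用"]`. Storing the dictionary in a Python `set` is a convenience of this sketch; the source does not specify the dictionary's data structure.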
Let the number of sentences in the text be m, the average sentence length be k, and the number of dictionary entries be n. In practice m and k are far smaller than n, so the steps that dominate the complexity of the whole algorithm are those related to n. The time complexity of the algorithm is therefore O(mk log n), which presumably treats Maxlen as a constant and each dictionary lookup as an O(log n) search.
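A log n factor per lookup is what a binary search over a sorted dictionary would give. The text does not say how the dictionary is stored, so the following sketch is an assumption, and `in_sorted_dictionary` is a hypothetical helper name.

```python
from bisect import bisect_left

def in_sorted_dictionary(sorted_words, candidate):
    """Binary search: O(log n) comparisons for a dictionary of n entries."""
    i = bisect_left(sorted_words, candidate)
    return i < len(sorted_words) and sorted_words[i] == candidate

# Hypothetical toy dictionary, sorted once up front.
sorted_words = sorted({"我", "的", "笔记本", "很", "好用"})
found = in_sorted_dictionary(sorted_words, "笔记本")
missing = in_sorted_dictionary(sorted_words, "笔记")
```

Each sentence triggers at most about k × Maxlen such lookups, which gives the O(mk log n) total when Maxlen is a constant. A hash-based dictionary would replace the log n factor with an average-case constant.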