Chinese Word Segmentation Technology I: Concepts

Source: Internet
Author: User

Word segmentation is the technique a search engine uses to process the keyword string a user submits: the string is split into terms, which are then matched using a variety of matching methods. Chinese word segmentation is also widely used in data mining, targeted recommendation, and natural language processing.

Why is Chinese word segmentation needed?

A word is the smallest meaningful unit of language that can be used independently. English words are naturally delimited by spaces, whereas the basic writing unit of Chinese is the character, and nothing in the text marks the boundary between one word and the next. Chinese word segmentation is therefore the foundation of, and the key to, processing Chinese text.

Lucene handles Chinese with automatic single-character segmentation or with bigram (two-character) segmentation. Other approaches include maximum segmentation (forward, backward, and the two combined), minimum segmentation, full segmentation, and so on.
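
A minimal Python sketch of the two simplest schemes mentioned above, single-character and bigram segmentation. The function names are hypothetical, and this is not Lucene's analyzer code; it only illustrates the idea.

```python
def unigram_segment(text):
    """Single-character segmentation: every character is its own token."""
    return list(text)


def bigram_segment(text):
    """Bigram segmentation: every pair of adjacent characters is a token."""
    if len(text) < 2:
        # Too short to form a pair: return the text itself, or nothing if empty.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(unigram_segment("中文分词"))
print(bigram_segment("中文分词"))
```

Neither scheme needs a dictionary, which is why both are easy to apply automatically, at the cost of producing many tokens that are not real words.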

II. Classification of Chinese word segmentation techniques

The segmentation algorithms discussed here fall into three categories: methods based on dictionary (lexicon) matching, methods based on word-frequency statistics, and methods based on character tagging.

The first kind of method segments text using dictionary matching, Chinese lexical rules, or other knowledge of the Chinese language. Examples are the forward maximum matching method, the reverse maximum matching method, and the minimum matching method. These methods are simple and efficient, but because Chinese is so rich in linguistic phenomena, problems such as dictionary completeness and rule consistency make them ill-suited to open, large-scale text processing (for example, handling out-of-vocabulary words).

The second kind, statistics-based segmentation, applies statistical information about characters and words, such as information about adjacent characters, word frequency, and the corresponding co-occurrence information. Because this information is obtained by examining real corpora, statistics-based methods work well in practice.

The third kind, segmentation based on character tagging, is in effect a word-formation method: segmentation is treated as a labeling problem over the characters of a string. Each character occupies a definite word-formation position (its character position) when it forms part of a particular word. If each character can take at most four positions, B (word beginning), M (word middle), E (word end), and S (single-character word), then the segmentation result of sentence (a) below can be written directly in the character-by-character tagged form shown in (b):

(a) Segmentation result: /上海/计划/N/本/世纪/末/实现/人均/国内/生产/总值/五千美元/。 (roughly: Shanghai / plans / N / this / century / end / achieve / per capita / domestic / production / total value / 5,000 US dollars / .)

(b) Character-tagged form: 上/B 海/E 计/B 划/E N/S 本/S 世/B 纪/E 末/S 实/B 现/E 人/B 均/E 国/B 内/E 生/B 产/E 总/B 值/E 五/B 千/M 美/M 元/E 。/S
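
The mapping between a segmentation result and its B/M/E/S tagged form can be sketched in a few lines of Python. The function names `to_bmes` and `from_bmes` and the sample words are illustrative assumptions, not part of any particular tagging toolkit.

```python
def to_bmes(words):
    """Convert a list of segmented words into (character, tag) pairs."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))  # single-character word
            continue
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"               # word beginning
            elif i == len(word) - 1:
                tag = "E"               # word end
            else:
                tag = "M"               # word middle
            tagged.append((ch, tag))
    return tagged


def from_bmes(tagged):
    """Rebuild the word list from (character, tag) pairs."""
    words, current = [], ""
    for ch, tag in tagged:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


sample = ["上海", "计划", "N", "五千美元"]
print(to_bmes(sample))
print(from_bmes(to_bmes(sample)) == sample)
```

Because the two forms are interchangeable, a tagger that predicts one of the four tags per character is enough to produce a segmentation.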

Note first that "character" here is not limited to Chinese characters. Since real Chinese text inevitably contains some non-Chinese characters, "character" in this article also covers foreign letters, Arabic numerals, and punctuation marks. All of these are basic word-formation units, although Chinese characters are still by far the most numerous members of this set.

Here is a brief introduction to several common methods:

1) Word-by-word traversal method

The word-by-word traversal method searches the article for every word in the dictionary, one word at a time, in order from longest to shortest, until the whole article has been processed. This means that no matter how short the article is or how large the dictionary is, the entire dictionary must be traversed. The method is inefficient and is generally not used in larger systems.
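
The following Python sketch is one possible reading of the traversal described above, not a reference implementation. It assumes that a position already claimed by a longer word cannot be claimed again and that unclaimed characters are emitted as single-character tokens; the name `traversal_segment` is hypothetical.

```python
def traversal_segment(text, dictionary):
    """Scan the text once per dictionary word, longest words first."""
    # owner[i] is the start index of the word covering position i, or None.
    owner = [None] * len(text)
    for word in sorted(dictionary, key=len, reverse=True):
        if not word:
            continue  # an empty entry would match everywhere
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if all(o is None for o in owner[start:end]):
                for i in range(start, end):
                    owner[i] = start
                start = text.find(word, end)
            else:
                start = text.find(word, start + 1)
    # Rebuild the token list from the ownership map.
    tokens, i = [], 0
    while i < len(text):
        if owner[i] is None:
            tokens.append(text[i])
            i += 1
        else:
            j = i
            while j < len(text) and owner[j] == owner[i]:
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens


print(traversal_segment("研究生命起源", ["研究生", "研究", "生命", "起源"]))
```

The outer loop runs over the whole dictionary regardless of the text length, which is exactly the inefficiency the paragraph points out.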

2) Segmentation based on dictionary (lexicon) matching (mechanical segmentation)

This method matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy. If a string is found in the dictionary, the match succeeds and a word is recognized. By scanning direction, the method divides into forward matching and reverse matching. By length preference, it divides into maximum (longest) matching and minimum (shortest) matching. By whether it is combined with part-of-speech tagging, it divides into pure segmentation and integrated segmentation and tagging. The commonly used methods are as follows:

A. The forward maximum matching method (Maximum Matching Method), usually called the MM method. The basic idea is as follows. Assume the longest word in the segmentation dictionary has i Chinese characters. The first i characters of the current string in the document being processed are taken as the matching field and looked up in the dictionary. If the dictionary contains such an i-character word, the match succeeds and the matching field is cut out as a word. If it does not, the match fails; the last character of the matching field is removed and the remaining string is looked up again. This continues until a match succeeds (a word is cut out) or the length of the remaining string is zero. That completes one round of matching. The next i-character string is then taken and matched in the same way, until the whole document has been scanned.

The algorithm is described as follows:

(1) Initialize the current position counter to 0.

(2) Starting at the current position counter, take the next 2i bytes (i Chinese characters in a double-byte encoding) as the matching field, or everything up to the end of the document if fewer bytes remain.

(3) If the length of the matching field is not 0, look up the dictionary for an entry equal to the matching field.

    If the match succeeds, then:

        a) cut the matching field out as a word and put it into the segmentation result table;

        b) add the length of the matching field to the current position counter;

        c) jump to step (2).

    Otherwise:

        a) if the last character of the matching field is a Chinese character, then

            ① remove the last Chinese character from the matching field;

            ② reduce the matching field length by 2;

           otherwise

            ① remove the last byte from the matching field;

            ② reduce the matching field length by 1;

        b) jump to step (3).

Otherwise (the length of the matching field is 0):

    a) if the character at the current position is a Chinese character, increment the current position counter by 2; otherwise increment it by 1;

    b) jump to step (2).
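
The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the original implementation: it works on Unicode characters rather than bytes, so the "subtract 2 for a Chinese character, 1 otherwise" bookkeeping disappears, and a character that matches nothing is emitted as a single-character token instead of being skipped. The name `forward_max_match` and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary):
    """Forward maximum matching (MM) over a string of Unicode characters."""
    # Length of the longest dictionary entry; this plays the role of i.
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    pos = 0  # current position counter, step (1)
    while pos < len(text):
        # Step (2): take up to max_len characters as the matching field.
        size = min(max_len, len(text) - pos)
        # Step (3): shrink the field from the right until it is a known word.
        while size > 1 and text[pos:pos + size] not in dictionary:
            size -= 1
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens


toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))
```

With this toy dictionary the greedy left-to-right scan takes 研究生 first and leaves 命 stranded, which illustrates the kind of error the method is prone to.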

B. The reverse maximum matching method (Reverse Maximum Matching Method), usually called the RMM method. Its basic principle is the same as that of the MM method; the differences are that segmentation runs in the opposite direction and that a different dictionary is used. RMM starts scanning from the end of the document being processed, each time taking the last 2i bytes (an i-character string) as the matching field. If the match fails, the first character of the matching field is removed and matching continues. Correspondingly, it uses a reverse-order dictionary, in which every entry is stored in reverse. In practice, the document is first reversed to produce a reversed document, and the reversed document is then processed with forward maximum matching against the reverse-order dictionary.

Note

Because Chinese has many modifier-head structures, matching from back to front can improve accuracy somewhat, so reverse maximum matching makes fewer errors than forward maximum matching. Statistics show that the error rate of pure forward maximum matching is 1/169, while that of reverse maximum matching is 1/245. For example, for the field "硕士研究生产" (master's-level research and production), forward maximum matching yields "硕士研究生 / 产" (master's student / produce), whereas reverse maximum matching, scanning backward, yields the correct segmentation "硕士 / 研究 / 生产" (master's / research / production).
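
A hedged Python sketch of the reversal trick described for RMM: reverse the text, reverse every dictionary entry, run forward maximum matching, then reverse the result back. It operates on Unicode characters rather than bytes, and the function names, the sample string 硕士研究生产, and the toy dictionary are illustrative assumptions.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match scan from left to right."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        size = min(max_len, len(text) - pos)
        while size > 1 and text[pos:pos + size] not in dictionary:
            size -= 1
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens


def reverse_max_match(text, dictionary):
    """RMM implemented as MM over the reversed text and reversed dictionary."""
    reversed_dictionary = {word[::-1] for word in dictionary}
    reversed_tokens = forward_max_match(text[::-1], reversed_dictionary)
    # Undo both reversals: token order and the characters inside each token.
    return [token[::-1] for token in reversed(reversed_tokens)]


toy_dictionary = {"硕士", "研究", "研究生", "生产", "硕士研究生"}
print(forward_max_match("硕士研究生产", toy_dictionary))
print(reverse_max_match("硕士研究生产", toy_dictionary))
```

Scanning from the end lets 生产 be recognized before 研究生 can absorb its first character, which is why the two directions disagree on this input.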

Of course, maximum matching is a mechanical segmentation method built on a segmentation dictionary. It cannot segment according to the semantic features of the document context and depends heavily on the dictionary, so some segmentation errors are inevitable in practice. To improve segmentation accuracy, a scheme that combines forward and reverse maximum matching (the bidirectional matching method) can be used.

C. Minimum segmentation method: choose the segmentation that cuts each sentence into the smallest number of words.
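
One way to find a segmentation with the fewest words is dynamic programming over prefix lengths, sketched below in Python. The sketch assumes that any single character is acceptable as a fallback token, so a segmentation always exists; the name `min_word_segment` and the toy dictionary are hypothetical.

```python
def min_word_segment(text, dictionary):
    """Return a segmentation of text that uses the fewest tokens."""
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=1)
    # best[i] is the fewest tokens needed to cover text[:i].
    best = [0] + [n + 1] * n
    # back[i] is the start index of the last token in that best cover.
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            usable = len(piece) == 1 or piece in dictionary
            if usable and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = start
    # Walk the back pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


toy_dictionary = {"硕士", "研究", "研究生", "生产", "硕士研究生"}
print(min_word_segment("硕士研究生产", toy_dictionary))
```

Fewest words is only a heuristic: on this input it prefers the two-token split over the three-token split that a human reader would choose.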

D. Bidirectional matching method: combines the forward maximum matching method with the reverse maximum matching method. The document is first split coarsely at punctuation into a number of sentences, and each sentence is then segmented with both forward and reverse maximum matching. If the two methods produce the same result, the segmentation is considered correct; otherwise, the result is handled according to the minimum-set rule.
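
A compact Python sketch of bidirectional matching for a single sentence. The text does not spell out the "minimum set" rule, so the tie-breaking used here (fewer tokens, then fewer single-character tokens, then the reverse result) is an assumption, as are the function names and the toy dictionary. Splitting the document into sentences at punctuation is left out.

```python
def max_match(text, dictionary, reverse=False):
    """Maximum matching from the left, or from the right if reverse is True."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            if size == 1 or piece in dictionary:
                break
        tokens.append(piece)
        text = text[:-size] if reverse else text[size:]
    return tokens[::-1] if reverse else tokens


def bidirectional_match(text, dictionary):
    """Run both directions and pick one result when they disagree."""
    forward = max_match(text, dictionary)
    backward = max_match(text, dictionary, reverse=True)
    if forward == backward:
        return forward

    def cost(tokens):
        # Assumed rule: fewer tokens first, then fewer single-character tokens.
        return (len(tokens), sum(1 for t in tokens if len(t) == 1))

    return forward if cost(forward) < cost(backward) else backward


toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(bidirectional_match("研究生命起源", toy_dictionary))
```

Here both directions produce three tokens, and the result with no stranded single character wins.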

3) Segmentation based on full segmentation and word-frequency statistics

Segmentation based on word-frequency statistics is a full-segmentation method. Before discussing it, we need to understand what full segmentation is.

Full segmentation

Full segmentation requires obtaining every acceptable segmentation of the input sequence, whereas partial segmentation obtains only one or a few. Because partial segmentation ignores the other possible segmentations, a method built on it may miss the correct segmentation whatever disambiguation strategy it applies, causing segmentation errors or failures. A method built on full segmentation produces every possible segmentation, so no candidate is omitted, which overcomes this defect of partial segmentation.

A full-segmentation algorithm obtains all possible segmentations, so its sentence coverage and segmentation coverage are both 100%. Even so, full segmentation is not widely used in text processing, for the following reasons:

a) Full segmentation only provides the precondition for a correct segmentation. Because it has no ambiguity detection of its own, the correctness and completeness of the final result depend on a separate disambiguation method; if that method judges wrongly, the result is still wrong.

b) The number of segmentations grows exponentially with sentence length. On the one hand, this floods storage with a huge amount of useless data; on the other hand, once a sentence reaches a certain length, the sheer number of segmentations greatly reduces segmentation efficiency.
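
A small Python sketch that enumerates every segmentation makes the growth concrete. It assumes single characters are always acceptable tokens; the name `all_segmentations` and the toy dictionary are hypothetical. In the worst case, where every substring is a word, a string of n characters has 2^(n-1) segmentations.

```python
def all_segmentations(text, dictionary):
    """Yield every split of text into dictionary words or single characters."""
    if not text:
        yield []
        return
    for size in range(1, len(text) + 1):
        head = text[:size]
        if size == 1 or head in dictionary:
            for rest in all_segmentations(text[size:], dictionary):
                yield [head] + rest


toy_dictionary = {"研究", "研究生", "生命"}
for candidate in all_segmentations("研究生命", toy_dictionary):
    print("/".join(candidate))
```

Every candidate is produced without any ranking, so a separate disambiguation step still has to choose among them.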

Segmentation based on word-frequency statistics:

This is a full-segmentation method. It does not depend on a dictionary; instead it counts how often any two characters occur together in the text, and the more often they co-occur, the more likely they are to form a word. It first cuts out all candidate words that match the lexicon and then uses a statistical language model and a decision algorithm to choose the optimal segmentation. Its advantages are that it can detect every segmentation ambiguity and that it extracts new words easily.
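
The co-occurrence counting step can be sketched in Python as below. This covers only the counting of adjacent character pairs, not the language model or the decision algorithm; the function names, the tiny corpus, and the `min_count` threshold are illustrative assumptions.

```python
from collections import Counter


def adjacent_pair_counts(corpus):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - 1):
            counts[sentence[i:i + 2]] += 1
    return counts


def candidate_words(corpus, min_count=2):
    """Treat pairs seen at least min_count times as candidate words."""
    counts = adjacent_pair_counts(corpus)
    return {pair for pair, count in counts.items() if count >= min_count}


toy_corpus = ["研究生命", "研究方法", "生命科学"]
print(adjacent_pair_counts(toy_corpus).most_common(3))
print(candidate_words(toy_corpus))
```

Pairs that recur across sentences surface as candidates without any dictionary lookup, which is how this family of methods can pick up new words. A real system would use a much larger corpus and a stronger association measure than a raw count.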

4) Segmentation based on knowledge understanding (character tagging)

This method relies mainly on syntactic and grammatical analysis, combined with semantic analysis, and delimits words by analyzing the information that the context provides. It usually consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about the words and sentences and uses it to resolve ambiguity. Methods of this kind try to give the machine a human-like ability to understand, and they need a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize all of it into a form that a machine can read directly. For this reason, knowledge-based segmentation systems are still at the experimental stage.

5) A newer segmentation method

Parallel segmentation method: this method uses a pipeline that contains the segmentation lexicon. Matching proceeds step by step, and at each step a word entering the pipeline is compared with the corresponding words in the lexicon. Because several words are being matched at the same time, segmentation speed can be greatly improved. The method involves multi-level internal-code theory and a pipelined dictionary data structure.
