Summary of Chinese Word Segmentation Algorithms (NLP)

Source: Internet
Author: User

Chinese word segmentation algorithms are generally divided into three categories: string-matching-based, understanding-based, and statistics-based segmentation.
String-matching-based segmentation: also called mechanical segmentation. The string to be segmented is matched against the entries of a sufficiently large machine dictionary. Methods are classified by scan direction (forward or reverse matching), by length priority (maximum or minimum length matching), and by whether segmentation is combined with tagging (pure segmentation versus integrated segmentation and tagging). The commonly used methods are forward maximum matching, reverse maximum matching, the least segmentation method ... In practice, mechanical segmentation serves as a first pass, and other linguistic information is then used to improve segmentation accuracy. To reduce the matching error rate, words with obvious features can be identified first and used as breakpoints, so that the original string is split into smaller strings that are then matched mechanically.
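The forward and reverse maximum matching methods named above can be sketched as follows. This is a minimal illustration rather than any particular system's implementation: the function names, the toy dictionary, and the rule that an unmatched single character is emitted as its own word are all assumptions made for the example.

```python
def forward_max_match(text, dictionary):
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: a single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary):
    """Scan right to left, always taking the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end backwards


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", toy_dictionary))
print(reverse_max_match("研究生命的起源", toy_dictionary))
```

On this sentence the two scan directions disagree (the forward scan greedily takes "研究生"), which is the kind of ambiguity that later disambiguation steps have to resolve.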
Understanding-based segmentation: syntactic and semantic analysis of the sentence is performed while segmenting, so as to simulate human understanding of the sentence. Such a system consists of a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on, and uses it to resolve ambiguity. This approach needs a large amount of linguistic knowledge.
Statistics-based segmentation: the more often adjacent characters appear together, the more likely they are to form a word. The co-occurrence frequency of character combinations is counted over a corpus, so no segmentation dictionary is needed, but the error rate is very high. A possible compromise: segment with a basic dictionary, identify new words with statistical methods, and combine the two.
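The co-occurrence counting idea can be sketched in a few lines. The sketch below only counts adjacent character pairs and keeps those above a threshold; the function names, the tiny corpus, and the `min_count` threshold are assumptions for illustration, and a practical system would use a large corpus and a stronger association measure.

```python
from collections import Counter


def adjacent_pair_counts(corpus):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - 1):
            counts[sentence[i:i + 2]] += 1
    return counts


def candidate_words(corpus, min_count=2):
    """Pairs seen at least `min_count` times are treated as word candidates."""
    counts = adjacent_pair_counts(corpus)
    return {pair for pair, n in counts.items() if n >= min_count}


# Hypothetical three-sentence corpus, for illustration only.
toy_corpus = ["我爱北京", "北京很大", "我在北京"]
print(adjacent_pair_counts(toy_corpus).most_common(3))
print(candidate_words(toy_corpus))
```

Raw frequency alone also promotes frequent but meaningless pairs, which is one reason for the high error rate mentioned above.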
Segmentation based on semantic understanding: addresses polysemy (one word with several meanings) and builds a personalized database for each user.
Open problems in Chinese word segmentation: the definition of ambiguity cannot be unified between computers and humans; the recognition rate for words not included in the dictionary is low.
Evaluation criteria for word segmentation systems (still to be settled): ambiguity recognition, new word (unregistered word) recognition ...

Word segmentation models:
N-gram model: with N=1, the product of the frequencies of all words in a clause gives the relative frequency of the sentence;
with N=2, a transition matrix gives the probability of each word being followed by another word, which is a first-order Markov chain. In general, N = 2, 3, ... corresponds to a Markov chain of order N-1.
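As a rough sketch of the N=1 and N=2 cases, the code below estimates word and word-pair frequencies from a pre-segmented corpus and scores a candidate segmentation. The function names and the toy corpus are assumptions for illustration; the estimates are unsmoothed, so any unseen word or word pair drives the score to zero.

```python
from collections import Counter


def train(segmented_corpus):
    """Count single words and adjacent word pairs in a segmented corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in segmented_corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def unigram_score(words, unigrams):
    """N=1: product of the relative frequencies of the words."""
    total = sum(unigrams.values())
    score = 1.0
    for w in words:
        score *= unigrams[w] / total
    return score


def bigram_score(words, unigrams, bigrams):
    """N=2: P(w1) times the product of P(w_i | w_{i-1}) (first-order Markov chain)."""
    if not words:
        return 1.0
    total = sum(unigrams.values())
    score = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        # Transition probability estimated as count(prev, cur) / count(prev).
        score *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return score


# Hypothetical pre-segmented corpus, for illustration only.
toy_corpus = [["我", "爱", "北京"], ["我", "在", "北京"]]
uni, bi = train(toy_corpus)
print(unigram_score(["我", "爱", "北京"], uni))
print(bigram_score(["我", "爱", "北京"], uni, bi))
```

A segmenter built on this would score every candidate segmentation of a sentence and keep the highest-scoring one.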
Maximum entropy hidden Markov model: on the basis of this model, feature parameters of each character are used to decide whether the character is a single-character word, the left edge of a word, the right edge of a word, or the middle of a word, so that word segmentation becomes a character tagging process.
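The reduction from segmentation to character tagging can be illustrated without any model at all. The sketch below converts between a word list and one tag per character; the tag letters S (single-character word), B (left edge), M (middle), and E (right edge) are a common convention assumed here, not something the text above prescribes, and the function names are hypothetical.

```python
def words_to_tags(words):
    """Turn a segmentation into one position tag per character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            # left edge, zero or more middle characters, right edge
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(text, tags):
    """Rebuild the segmentation from the characters and their tags."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("S", "E"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


tags = words_to_tags(["研究", "生命", "的", "起源"])
print(tags)
print(tags_to_words("研究生命的起源", tags))
```

A tagging model then only has to predict the tag sequence for the characters; the second function recovers the words.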
(Used in the Chinese Academy of Sciences system, in combination with a directed graph) Multi-layer hidden Markov model: an extension of the hidden Markov model that can be applied to atomic segmentation, unregistered word recognition, hidden Markov segmentation, and so on; the lower-level models help the higher-level models with disambiguation.
The model is not the main factor that determines the quality of a word segmentation system; models, rules, and unregistered word recognition need to be used in combination.
Matching-based segmentation:
Combine forward and backward maximum matching with minimum matching, disambiguate according to POS tagging, and establish rules to handle the problems that the dictionary cannot solve.
First segment with a matching method. When an ambiguity is found, look ahead two words and apply heuristic disambiguation rules (longest match, word length, morpheme, probability, etc.) to choose the best segmentation of the current word.
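One way to read the look-ahead procedure is sketched below: at each position, every sequence of up to three dictionary words (the current word plus two ahead) is enumerated, the sequences are ranked by heuristic rules, and only the first word of the winner is emitted. The three rules used here (greatest total length, greatest average word length, smallest variance of word lengths) are assumptions in the style of the well-known MMSEG heuristics; the morpheme and probability rules mentioned above are left out, and all names are hypothetical.

```python
def candidate_words(text, start, dictionary, max_len):
    """Dictionary words starting at `start`, plus the single character as fallback."""
    found = [text[start:start + 1]]
    for size in range(2, min(max_len, len(text) - start) + 1):
        piece = text[start:start + size]
        if piece in dictionary:
            found.append(piece)
    return found


def chunks(text, start, dictionary, max_len, depth=3):
    """Enumerate sequences of up to `depth` words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, dictionary, max_len):
        for rest in chunks(text, start + len(word), dictionary, max_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_score(chunk):
    """Rank by total length, then average word length, then low length variance."""
    total = sum(len(w) for w in chunk)
    average = total / len(chunk)
    variance = sum((len(w) - average) ** 2 for w in chunk) / len(chunk)
    return (total, average, -variance)


def lookahead_segment(text, dictionary):
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        best = max(chunks(text, i, dictionary, max_len), key=chunk_score)
        words.append(best[0])  # commit only the current word, then re-evaluate
        i += len(best[0])
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
print(lookahead_segment("研究生命的起源", toy_dictionary))
```

Committing one word at a time keeps the search small while still letting the next two words influence the choice.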
Based on decision trees and directed graphs:
(This system tested well; it is the Microsoft Research entry in the ACL SIGHAN competition.) Every word, whether from the word list or unregistered, is added as a node to a decision tree, and a parser or dynamic programming over the structure of the tree is used to find a better segmentation. Each non-leaf node has a corresponding parameter that determines whether its child nodes are output as one word or as several words. The advantage: when a word is recognized, the rules used to recognize it are also preserved as a history tree.
(Chinese Academy of Sciences ACL SIGHAN entry) Words are treated as nodes in a directed graph, each edge and each node is given a weight, and with a hidden Markov model the segmentation process is transformed into a shortest-path problem.
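The shortest-path view can be sketched with a simplified graph. Note the assumptions: in this sketch the nodes are the positions between characters and each candidate word is an edge weighted by its negative log probability, which is a common formulation and not necessarily the one used by the Chinese Academy of Sciences system; the probabilities, the fallback probability for unknown single characters, and the function name are all made up for illustration.

```python
import math


def shortest_path_segment(text, word_probs, unknown_prob=1e-8):
    """Dynamic programming over positions 0..len(text); words are weighted edges."""
    n = len(text)
    max_len = max((len(w) for w in word_probs), default=1)
    best_cost = [math.inf] * (n + 1)  # cheapest cost to reach each position
    back = [0] * (n + 1)              # start position of the last word on that path
    best_cost[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs:
                cost = -math.log(word_probs[word])
            elif end - start == 1:
                # Assumption: unknown single characters get a tiny probability.
                cost = -math.log(unknown_prob)
            else:
                continue
            if best_cost[start] + cost < best_cost[end]:
                best_cost[end] = best_cost[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Hypothetical word probabilities, for illustration only.
toy_probs = {"研究": 0.2, "研究生": 0.05, "生命": 0.2, "命": 0.01, "的": 0.3, "起源": 0.1}
print(shortest_path_segment("研究生命的起源", toy_probs))
```

Because costs are negative log probabilities, the shortest path is the segmentation with the highest product of word probabilities.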
Google's word segmentation technology is provided by http://www.basistech.com/; Baidu develops its own word segmentation.
Zhongsou's word segmentation technology is provided by http://www.hylanda.com (Hylanda Technology).
Available Chinese word segmentation systems:
CDWS (Modern Written Chinese Word Segmentation System)
CASS: Beihang University
SEG, SEGTAG: Tsinghua University
Fudan word segmentation system: Fudan University
HIT word segmentation system: a pure word segmentation system based on statistical methods, which tries to combine string frequency statistics with word matching
MM system: Hangzhou University (improved MM algorithm)
Peking University word segmentation system: Peking University Institute of Computational Linguistics
ICTCLAS: Chinese Academy of Sciences (one of the better systems at present)
The automatic word segmentation system in the Microsoft Chinese syntactic analyzer
