Chinese word segmentation is the mechanical decomposition of a sentence or phrase into words, following everyday reading habits. English text is made up of words separated by spaces, whereas Chinese text is made up of characters, and all the characters in a sentence run together to express a meaning. For example, segmenting the sentence "I like search engines" gives: I | like | search engines. Dividing a sequence of Chinese characters into meaningful words is called Chinese word segmentation; some people also call it word cutting.
Any Chinese character can be used directly as a word, and nothing marks the boundaries between words, which makes the language highly variable. That variability makes expression flexible, but it is a very difficult problem for search engines to solve. Chinese word segmentation faces three kinds of ambiguity.
1. Intersection ambiguity
Suppose "ABC" consists of three Chinese characters A, B, and C. If "AB" and "BC" are both words, the computer can segment "ABC" as "AB/C" or as "A/BC". This kind of segmentation ambiguity is called intersection ambiguity.
2. Combination ambiguity
If "AB" is a word and "ABC" is also a word, the resulting segmentation ambiguity is called combination ambiguity.
3. Mixed ambiguity
Mixed ambiguity is a segmentation ambiguity that involves both the intersection type and the combination type.
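To make the ambiguity types concrete, here is a minimal Python sketch that lists every possible segmentation of a string. The function name all_segmentations and the toy dictionaries are assumptions made for illustration only; single characters are always accepted as words, in line with the description above.

```python
def all_segmentations(text, dictionary):
    """Return every split of text into dictionary words or single characters."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        # A single character is always allowed; longer pieces must be dictionary words.
        if end == 1 or piece in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([piece] + rest)
    return results


# Intersection ambiguity: both "AB" and "BC" are words.
for seg in all_segmentations("ABC", {"AB", "BC"}):
    print("/".join(seg))
# prints A/B/C, A/BC and AB/C, one per line

# Combination ambiguity: both "AB" and "ABC" are words.
for seg in all_segmentations("ABC", {"AB", "ABC"}):
    print("/".join(seg))
# prints A/B/C, AB/C and ABC, one per line
```

Whenever more than one line is printed, the segmenter has to choose between candidates, and that choice is where the ambiguity lies.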
At present, these problems are solved mainly by means of dictionaries and statistics.
First, we'll talk about dictionary segmentation. Dictionaries are generally stored as a prefix tree or a suffix tree. What does a prefix tree do? We scan the sentence once from left to right: strings found in the dictionary are recognized as words, where several dictionary words are possible (as with compound words) the longest match is taken, and strings that are not recognized are split into single characters. That completes a simple segmentation. With a suffix tree, the sentence is scanned from right to left.
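The left-to-right and right-to-left scans can be sketched as follows. This is only an illustration under stated assumptions: the prefix tree is a plain nested dict, the right-to-left scan is implemented by running the same routine on reversed text and reversed words, and all function names and the toy dictionary are hypothetical rather than taken from any real segmenter.

```python
END = "$end"  # marker key: a dictionary word ends at this trie node


def build_trie(words):
    """Build a nested-dict prefix tree from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start):
    """Length of the longest dictionary word starting at text[start], or 0."""
    node, best = trie, 0
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            best = i - start + 1
    return best


def forward_max_match(text, words):
    """Scan left to right, always taking the longest dictionary word."""
    trie = build_trie(words)
    result, pos = [], 0
    while pos < len(text):
        # Unknown strings fall back to a single character.
        length = longest_match(trie, text, pos) or 1
        result.append(text[pos:pos + length])
        pos += length
    return result


def backward_max_match(text, words):
    """Scan right to left by running the forward scan on reversed input."""
    reversed_words = [w[::-1] for w in words]
    pieces = forward_max_match(text[::-1], reversed_words)
    return [p[::-1] for p in reversed(pieces)]


# Toy dictionary, chosen here only to show an intersection ambiguity.
toy_words = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_words))   # ['研究生', '命', '起源']
print(backward_max_match("研究生命起源", toy_words))  # ['研究', '生命', '起源']
```

The two scan directions disagree on this input, which is one common way to detect that a sentence contains an ambiguity the dictionary alone cannot resolve.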
Next, statistical methods. Dictionary segmentation solves many segmentation problems, but it still struggles when it meets new words. Statistical segmentation is based on probability and information theory. The basic principle is to look for characters that often appear together: characters that are always found next to each other are likely to form a word. This requires analyzing a large amount of text. Chinese word segmentation is still evolving, and no single segmentation method can completely solve all of these problems.
Readers who are interested in Chinese word segmentation can read the following papers:
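As a rough illustration of the statistical idea, the sketch below scores each pair of adjacent characters by pointwise mutual information (PMI), which is one possible measure of how strongly two characters stick together. The article does not name a specific measure, so the choice of PMI, the function name pair_scores, and the tiny toy corpus are all assumptions; a real system would need a large corpus, because PMI is unreliable for rare pairs.

```python
import math
from collections import Counter


def pair_scores(corpus):
    """Score each adjacent character pair by pointwise mutual information."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    total_chars = sum(chars.values())
    total_pairs = sum(pairs.values())
    scores = {}
    for (a, b), count in pairs.items():
        p_ab = count / total_pairs
        p_a = chars[a] / total_chars
        p_b = chars[b] / total_chars
        # High PMI: the pair occurs together more often than chance predicts.
        scores[a + b] = math.log(p_ab / (p_a * p_b))
    return scores


# Toy corpus (hypothetical): "AB" occurs together in three of four sentences.
scores = pair_scores(["ABC", "ABD", "CAB", "DC"])
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs near the top of the list are candidate words; pairs near the bottom are more likely to straddle a word boundary.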
1. Liang Nanyuan
An automatic word segmentation system for written Chinese
http://www.touchwrite.com/demo/LiangNanyuan-JCIP-1987.pdf
2. Guo Jin
Statistical language models and some new results in Chinese pinyin-to-character conversion
http://www.touchwrite.com/demo/GuoJin-JCIP-1993.pdf
3. Guo Jin
Critical Tokenization and its Properties
http://acl.ldc.upenn.edu/J/J97/J97-4004.pdf
4. Sun Maosong
Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=980775
This article was first published on Qining Network Marketing Planning (www.qi-ning.com). If you reprint it, please credit the author. Thank you!
Qining MSN: i@qining.org