While studying SEO recently, I came across a new term: word segmentation technology. Below is a brief discussion of it for fellow webmasters.
Chinese word segmentation is the mechanical splitting of a sentence or phrase into words, in line with everyday reading habits. English text is already segmented: the word is the unit, and words are separated by spaces. In Chinese the character is the unit, and the characters of a sentence run together without separators to express a meaning. For example, for the sentence "I like search engine", the segmentation result is: I | like | search engine. Dividing a sequence of Chinese characters into meaningful words is Chinese word segmentation, which some people also call word cutting.
Every Chinese character can stand on its own as a word, and the text carries no word boundaries. This makes the language flexible in expression but also highly variable, which is a very hard problem for search engines to solve. Chinese word segmentation runs into three types of ambiguity.
1. Intersection ambiguity
Suppose "ABC" stands for three Chinese characters A, B and C. If "AB" and "BC" are both words, then the computer can segment "ABC" as "AB/C" or as "A/BC". This kind of segmentation conflict is called intersection ambiguity.
2. Combination ambiguity
If "AB" is a word and "ABC" is also a word, the resulting segmentation conflict is called combination ambiguity.
3. Mixed ambiguity
Mixed ambiguity is a case in which intersection ambiguity and combination ambiguity occur together.
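To make the ambiguity concrete, here is a minimal Python sketch that lists every way a string can be cut into dictionary words. The function name `all_segmentations` and the toy dictionary are assumptions made up for illustration; they do not come from any search engine.

```python
def all_segmentations(text, dictionary):
    """Return every way to split text into words found in dictionary."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical dictionary: the single characters plus the words "AB" and "BC".
toy_dictionary = {"A", "B", "C", "AB", "BC"}

for segmentation in all_segmentations("ABC", toy_dictionary):
    print("/".join(segmentation))
```

Both "AB/C" and "A/BC" show up among the results, which is exactly the intersection ambiguity described above. The dictionary alone cannot tell which cut is the intended one.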
At present, these problems are mainly addressed with dictionary-based and statistical methods.
First, dictionary-based segmentation. Dictionaries are generally stored in prefix-tree and suffix-tree data structures. What does the prefix tree do? We scan the sentence once from left to right: strings found in the dictionary are recognized as words, for compound words the longest matching word is taken, and an unrecognized string is split into single characters. That completes a simple segmentation. With the suffix tree, the scan runs from right to left.
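The left-to-right and right-to-left scans can be sketched in Python as follows. This is only an illustration under stated assumptions: the prefix tree is a plain nested dict, the right-to-left scan is imitated by matching reversed text against a tree of reversed words (a stand-in for the suffix tree mentioned above), and all names (`build_trie`, `longest_match`, `forward_max_match`, `backward_max_match`, `END`) are hypothetical.

```python
END = "$end"  # hypothetical marker key meaning "a word ends at this node"


def build_trie(words):
    """Build a prefix tree as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start):
    """Length of the longest dictionary word starting at text[start], or 0."""
    node = trie
    best = 0
    for offset in range(start, len(text)):
        node = node.get(text[offset])
        if node is None:
            break
        if END in node:
            best = offset - start + 1
    return best


def forward_max_match(text, words):
    """Scan left to right, always taking the longest dictionary word."""
    trie = build_trie(words)
    result = []
    i = 0
    while i < len(text):
        # An unknown string falls back to a single character.
        size = longest_match(trie, text, i) or 1
        result.append(text[i:i + size])
        i += size
    return result


def backward_max_match(text, words):
    """Scan right to left by matching reversed text against reversed words."""
    reversed_result = forward_max_match(text[::-1], [w[::-1] for w in words])
    return [w[::-1] for w in reversed_result][::-1]


# Assumed sample data, not taken from any real dictionary.
print(forward_max_match("ABC", ["AB", "BC"]))
print(backward_max_match("ABC", ["AB", "BC"]))
print(forward_max_match("我喜欢搜索引擎", ["喜欢", "搜索", "引擎", "搜索引擎"]))
```

With the toy words "AB" and "BC", the two scan directions cut "ABC" differently, so comparing both results is one simple way to notice an intersection ambiguity.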
Next, statistical methods. Dictionary-based segmentation solves many segmentation problems, but it struggles with the many new words that keep appearing. Statistical segmentation draws on probability and information theory. The basic principle is to look for characters that frequently appear together: characters that are consistently next to each other are likely to form a word.
Word segmentation requires analyzing a large amount of content. Chinese word segmentation is still evolving even now, and no single method completely solves every problem.
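One common way to measure "often appear together" is pointwise mutual information (PMI) between adjacent characters. The text above does not say which statistic is used, so treat the sketch below as an assumption. The function name `adjacent_pair_scores` and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter


def adjacent_pair_scores(corpus):
    """Score each adjacent character pair by pointwise mutual information."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        p_pair = count / total_pairs
        p_first = char_counts[pair[0]] / total_chars
        p_second = char_counts[pair[1]] / total_chars
        # PMI > 0 means the pair occurs more often than chance would predict.
        scores[pair] = math.log(p_pair / (p_first * p_second))
    return scores


# Hypothetical corpus in which "A" and "B" keep appearing side by side.
toy_corpus = ["ABXY", "ABZ", "CAB", "XZ"]
scores = adjacent_pair_scores(toy_corpus)
print(scores["AB"], scores["BX"])
```

A pair such as "AB", which is always seen together, scores higher than an incidental neighbor such as "BX". PMI tends to overrate pairs that occur only once or twice, so in practice it is usually combined with a minimum frequency threshold.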