Search Engine Word Segmentation Technology

Source: Internet
Author: User


While learning SEO recently, I came across a new term: word segmentation. Below is a brief discussion of it for fellow webmasters.

Chinese word segmentation mechanically breaks a sentence or phrase into words, following the way people naturally read it. English is written word by word, with words separated by spaces, whereas Chinese is written character by character, and all the characters in a sentence run together to express a meaning. For example, the sentence "I like search engines" is segmented as: I | like | search engines. Dividing a sequence of Chinese characters into meaningful words is Chinese word segmentation; some people also call it word cutting.

Each character in Chinese can stand directly as a word, and the text has no word breaks, which makes the language changeable. That makes expression flexible, but it is a very difficult problem for search engines to solve. Chinese word segmentation faces three kinds of ambiguity.

1. Intersection ambiguity

Suppose "ABC" is a string of three Chinese characters A, B, and C. If "AB" and "BC" are both words, then the computer can segment "ABC" as "AB/C" or as "A/BC". This kind of segmentation ambiguity is called intersection ambiguity.

2. Combinatorial ambiguity

If "AB" is a word and "ABC" is also a word, the segmenter has to choose between keeping "ABC" whole and splitting off "AB". The resulting segmentation ambiguity is called combinatorial ambiguity.

3. Mixed ambiguity

Mixed ambiguity is segmentation ambiguity that combines the intersection and combinatorial types.
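To make the intersection case concrete, the following is a minimal Python sketch that lists every possible segmentation of a string against a small dictionary. The function name `all_segmentations` and the toy dictionary are assumptions made for illustration, and the sketch assumes any single character may stand alone as a word.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into dictionary words.

    Assumption: a single character is always allowed as a word,
    since each Chinese character can stand alone.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if end == 1 or piece in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([piece] + rest)
    return results


# Hypothetical dictionary in which "AB" and "BC" are both words.
segmentations = all_segmentations("ABC", {"AB", "BC"})
for seg in segmentations:
    print("/".join(seg))
```

Both "AB/C" and "A/BC" appear in the result (alongside the character-by-character split "A/B/C"), and nothing in the dictionary alone says which one is right.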

At present, these problems are solved mainly by means of dictionaries and statistics.

First, dictionary-based segmentation. Dictionaries are generally stored as a prefix tree or a suffix tree. How is a prefix tree used? The sentence is scanned once from left to right: strings found in the dictionary are recognized as words, compound words are resolved by taking the longest matching word, and unrecognized strings are split into single characters. That completes a simple segmentation. With the suffix tree, the scan runs from right to left instead.
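The left-to-right and right-to-left scans described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not any search engine's actual implementation: the prefix tree is a plain nested dict, the right-to-left scan is simulated by reversing both the dictionary words and the text, and all names (`build_trie`, `forward_max_match`, `backward_max_match`) and the sample dictionaries are hypothetical.

```python
END = "$end"  # marker key: a dictionary word ends at this node


def build_trie(words):
    """Build a prefix tree (trie) as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start):
    """Length of the longest dictionary word starting at `start` (0 if none)."""
    node, best = trie, 0
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            best = i - start + 1
    return best


def forward_max_match(text, trie):
    """Scan left to right, always taking the longest dictionary word."""
    words, pos = [], 0
    while pos < len(text):
        # An unknown string falls back to a single character.
        length = longest_match(trie, text, pos) or 1
        words.append(text[pos:pos + length])
        pos += length
    return words


def backward_max_match(text, reversed_trie):
    """Scan right to left by matching the reversed text against reversed words."""
    reversed_words = forward_max_match(text[::-1], reversed_trie)
    return [w[::-1] for w in reversed(reversed_words)]


# Hypothetical dictionaries, for illustration only.
abc_words = ["AB", "BC"]
forward = forward_max_match("ABC", build_trie(abc_words))
backward = backward_max_match("ABC", build_trie(w[::-1] for w in abc_words))

zh_words = ["喜欢", "搜索", "引擎", "搜索引擎"]
zh_result = forward_max_match("我喜欢搜索引擎", build_trie(zh_words))
print(forward, backward, zh_result)
```

On the "ABC" example the two scan directions disagree ("AB/C" versus "A/BC"), which is one simple way an intersection ambiguity can show up in practice. In the Chinese example, "搜索引擎" wins over "搜索" because the longest match is preferred, and "我" is emitted as a single character because it is not in the toy dictionary.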

Next, statistical methods. Dictionary-based segmentation solves many segmentation problems, but it struggles with the many new words that keep appearing. Statistical segmentation is based on probability and information theory. The basic principle is to look for characters that often appear together: characters that consistently occur next to each other are likely to form a word.
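One common way to quantify "often appear together" is pointwise mutual information (PMI) between adjacent characters. The article does not name a specific measure, so PMI is an assumption here, as are the function name `adjacent_pair_scores` and the toy corpus. A minimal sketch:

```python
import math
from collections import Counter


def adjacent_pair_scores(corpus):
    """Score each adjacent character pair by pointwise mutual information.

    PMI(xy) = log( P(xy) / (P(x) * P(y)) ). A high score means the two
    characters occur together more often than chance would predict.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        scores[x + y] = math.log(p_xy / (p_x * p_y))
    return scores


# Toy corpus (hypothetical): "A" and "B" always appear side by side.
scores = adjacent_pair_scores(["ABC", "ABD", "EAB", "CDE"])
best_pair = max(scores, key=scores.get)
print(best_pair)
```

In this toy corpus the pair "AB" gets the highest score, so it is the strongest candidate for being a word. A real system would need a large corpus and a threshold, which is why this approach depends on analyzing a lot of content.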

Word segmentation requires analyzing a large amount of content. Chinese word segmentation is still evolving, and no single method can completely solve every problem.
