Shanhan: A brief talk on the Chinese word segmentation technology of the Baidu search engine

Source: Internet
Author: User

Understanding how search engines segment words matters a great deal for our SEO work: both keyword layout and link structure are closely tied to word segmentation. Here Shanhan talks about Baidu's Chinese word segmentation (not limited to Baidu, of course; other search engines work in a similar way). This article has two parts: the first excerpts existing explanations of word segmentation, and the second adds my own ideas that extend them.

  What is Chinese word segmentation?

We all know that English sentences are made up of individual words separated by spaces, so segmenting them is much more convenient. Chinese, however, is written as an unbroken run of characters, so it is considerably more complex. Chinese word segmentation is the process of cutting a sequence of Chinese characters into individual words, that is, recombining the characters into a sequence of words according to certain rules. It is also called "Chinese word cutting".

Word segmentation plays a very large role in a search engine. It is the basis of text mining and helps the program automatically identify the meaning of a sentence so that search results match the query well; the quality of segmentation directly affects the accuracy of the search results. At present, search engines segment text mainly by two methods: dictionary matching and statistics.

 First, the method of segmentation based on dictionary matching

This method first requires a very large dictionary, that is, the segmentation index library. The string to be segmented is then matched against the entries in the library according to a certain rule; if an entry is found, the match succeeds. Matching is done mainly in the following four ways:

1, forward maximum matching (scanning from left to right);

2, reverse maximum matching (scanning from right to left);

3, minimum segmentation (making the number of words cut from each sentence as small as possible);

4, bidirectional maximum matching (two scans, one from left to right and one from right to left).
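The forward, reverse, and bidirectional methods can be illustrated with a minimal Python sketch. This is not Baidu's implementation: the four-word dictionary, the function names, and the rule used to choose between the two scans (fewer words, then fewer single characters, ties going to the reverse scan) are all assumptions made for this example.

```python
# Toy dictionary (hypothetical); a real segmentation index library is far larger.
DICTIONARY = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def forward_max_match(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is the fallback when nothing in the dictionary matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def bidirectional_max_match(text):
    """Run both scans and keep the result that looks less fragmented."""
    fwd = forward_max_match(text)
    rev = reverse_max_match(text)

    def cost(words):
        # Assumed heuristic: fewer words first, then fewer single characters.
        return (len(words), sum(1 for word in words if len(word) == 1))

    # Ties go to the reverse scan (an assumption of this sketch).
    return fwd if cost(fwd) < cost(rev) else rev
```

For the string "研究生命起源", the forward scan gives 研究生/命/起源 while the reverse scan gives 研究/生命/起源. The two scans disagree, and with the heuristic assumed here the bidirectional function keeps the reverse result because it leaves no stray single character.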

In general, search engines use several of these methods in combination. But this approach also brings difficulties for the search engine, such as handling ambiguity (the Chinese language is broad and profound, after all). To improve matching accuracy, search engines also simulate human understanding of a sentence in order to recognize words. The basic idea is to carry out syntactic and semantic analysis at the same time as segmentation, and to use syntactic and semantic information to resolve ambiguity. Such a system usually has three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Coordinated by the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge ambiguous segmentations; in other words, it simulates the way a person understands a sentence. This method needs a large amount of linguistic knowledge and information, and of course search engines keep improving on it.

 Second, the method of segmentation based on statistics

Although a segmentation dictionary solves many problems, it is far from enough. A search engine must also be able to discover new words, which it does by calculating the probability that adjacent characters occur together in order to decide whether they form a separate word. So the more context is available, the more accurately the sentence is understood and the more accurate the segmentation. Take "search engine optimization" as an example: dictionary matching may produce search/engine/optimization or search/index/drive/optimization (the two differ in where the Chinese characters are cut), but the later probability calculation finds that "search engine optimization" occurs as an adjacent run in context very many times, so on statistical grounds it is added to the segmentation index library. I used the same kind of example in my article "on the electric quotient and the circle participle test".
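The adjacency statistics described above can be sketched as follows. This is only an illustration under stated assumptions: the toy corpus is made up, the function names are hypothetical, and pointwise mutual information (PMI) with hand-picked thresholds stands in for whatever probability measure a real search engine uses.

```python
import math
from collections import Counter


def adjacency_scores(sentences):
    """Return {(left, right): (count, pmi)} for every adjacent token pair."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for (left, right), count in bigrams.items():
        p_pair = count / total_bi
        p_left = unigrams[left] / total_uni
        p_right = unigrams[right] / total_uni
        # PMI > 0 means the pair occurs together more often than chance predicts.
        scores[(left, right)] = (count, math.log2(p_pair / (p_left * p_right)))
    return scores


def discover_new_words(sentences, min_count=3, min_pmi=1.0):
    """Propose merged words for pairs that are both frequent and strongly bound."""
    return {
        left + right
        for (left, right), (count, pmi) in adjacency_scores(sentences).items()
        if count >= min_count and pmi >= min_pmi
    }


# Hypothetical corpus, already cut into dictionary words.
corpus = [
    ["搜索", "引擎", "优化"],
    ["搜索", "引擎", "优化", "技术"],
    ["搜索", "引擎", "排名"],
    ["网站", "优化"],
    ["学习", "搜索", "引擎"],
]
new_words = discover_new_words(corpus)
```

With this toy corpus and the default thresholds, only the pair 搜索 + 引擎 is frequent enough to be proposed, so `new_words` contains 搜索引擎. The sketch merges pairs only; running it again on text re-segmented with the enlarged dictionary is one way a longer unit such as 搜索引擎优化 could emerge.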

The application of Chinese word segmentation

Segmentation accuracy is very important for search engines, but if segmentation is too slow, even high accuracy is of no use to them. A search engine has to process hundreds of millions of pages, and if segmentation takes too long it seriously slows down how quickly the search engine can update its content. For search engines, therefore, segmentation must meet high standards of both accuracy and speed.

For us SEO practitioners, the principles and methods of word segmentation must be mastered, so that we can design a site whose relevance to a topic is easy for the search engine to determine. For example, suppose our website is about SEO training. When a user searches for that phrase, the search engine first segments it, for example into "SEO" and "training", and then matches each word separately in the index library. One more point is involved here, and it is my own summary: after segmentation, a phrase has a main word and a secondary word, and the main word is usually matched first, followed by the secondary word. Here "SEO" is obviously the main word, so it is matched first, and "training", the secondary word, comes next. As for how our site's layout and structure should be arranged, I leave that for everyone to think about.
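The "segment the query, then match each word separately" step can be pictured with a small inverted index. This is a sketch under assumptions, not how any particular search engine ranks pages: the page data, the function names, and the weighting that favours earlier query words (standing in for the main-word-first idea above) are all hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each term to the set of page ids whose token list contains it."""
    index = defaultdict(set)
    for page_id, tokens in pages.items():
        for token in tokens:
            index[token].add(page_id)
    return index


def search(index, query_terms):
    """Look up each query term separately; earlier terms weigh more."""
    scores = defaultdict(int)
    for position, term in enumerate(query_terms):
        weight = len(query_terms) - position  # first term gets the highest weight
        for page_id in index.get(term, ()):
            scores[page_id] += weight
    # Highest score first; page id breaks ties so the order is deterministic.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Hypothetical pages, already segmented into terms.
pages = {
    "a": ["SEO", "training", "course"],
    "b": ["SEO", "tools"],
    "c": ["cooking", "training"],
}
index = build_index(pages)
results = search(index, ["SEO", "training"])
```

Here page "a" contains both words and ranks first, page "b" matches only the main word "SEO", and page "c" matches only "training", so `results` is `["a", "b", "c"]`.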

Author: Shanhan. First published on the Shanhan SEO blog; original address: http://www.xiaohan86.com/2011061149.html. If you reprint this article, please credit the source.


