Detailed analysis of Chinese Word Segmentation technology for search engines

Source: Internet
Author: User

Someone in an SEO (Search Engine Optimization) QQ group asked me what Chinese word segmentation is. Being short of time, I only outlined the basic steps. Here we walk through an example to explain it in detail.

First, consider the following sentence (translated from Chinese): "Is Shanghai Qishang the best SEO consulting company in Shanghai?"

When the search engine (hereafter SE) reads the above sentence, its first pass cuts the sentence into single characters. The result is as follows, where each item stands for one character of the original sentence: shanghai, Shanghai, Qi, Shang, yes, Di, district, best, show, enterprise, industry, S, E, O, consulting, consulting, company, company,?

Note that the sentence contains 20 Chinese characters and 3 English letters, and that the first pass drops the repeated characters "上" ("up") and "海" ("sea"), which together spell "Shanghai" and appear twice. This kind of splitting is the simple single-character method. Of course, a search engine is not that naive and does not rely on such a simple segmentation method alone.

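Next, the search engine performs binary (two-character) word segmentation, pairing each character with the one that follows it. The result is as follows, where each item stands for a pair of adjacent characters in the original sentence: shanghai, haiqi, qishang, Shang is, is an enterprise, Enterprise, Business S, Se, EO, and O of Shanghai, Haiti, region, region, and region. consulting, consulting, consulting, company, company?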

Note that "Shanghai" occurs twice in the sentence but is kept only once, because the two occurrences really are the same word: there is no difference between the first "Shanghai" and the "Shanghai" in the middle. This is only the second pass; three-character and four-character segmentation follow.
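The binary pass, and the three-character and four-character passes that follow it, are all the same sliding-window operation with a different window size. The sketch below shows one way to write it in Python, assuming duplicates are dropped as described above; the function name `ngram_segment` and the sample string are hypothetical.

```python
def ngram_segment(text, n=2):
    """Return every run of n adjacent characters, left to right,
    keeping only the first occurrence of each run."""
    chars = [ch for ch in text if not ch.isspace()]
    seen = set()
    tokens = []
    for i in range(len(chars) - n + 1):
        gram = "".join(chars[i:i + n])
        if gram not in seen:
            seen.add(gram)
            tokens.append(gram)
    return tokens


sample = "上海在上海"  # hypothetical sample with "上海" repeated
print(ngram_segment(sample, n=2))  # ['上海', '海在', '在上']
print(ngram_segment(sample, n=3))  # ['上海在', '海在上', '在上海']
```

With `n=2` the second "上海" is recognized as a duplicate and dropped, which mirrors the behavior described for "Shanghai" above.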

SEOs may ask: what use is all this word splitting to the search engine? Don't worry, we are not finished yet.

Second, how does the SE filter these words? We all know that Chinese has many words, such as "what", "ah", and "where", that appear in almost every article. If the SE indexed all of them, its database could not keep up for long: almost every web page Baidu currently indexes contains these words, and handling such a volume would overwhelm the SE. Even if it could cope, few people search for meaningless keywords like "ah". So the SE defines these words as stop keywords, known in English as "stop words". Any keyword on the stop list is deleted.
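Stop-word filtering amounts to a set lookup on each token. The following Python sketch illustrates the idea; the stop list here is a tiny hypothetical one chosen for illustration, not the list any real search engine uses.

```python
# Hypothetical stop list: "啊" (ah), "的" (a possessive particle), "吗" (a question particle).
STOP_WORDS = {"啊", "的", "吗"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list, preserving order."""
    return [t for t in tokens if t not in stop_words]


tokens = ["上海", "啊", "公司", "的"]
print(remove_stop_words(tokens))  # ['上海', '公司']
```

Using a set keeps each lookup constant-time on average, which matters when the filter runs over every token of every indexed page.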

Maybe SEOs have started to ask: is a keyword like "yes" also split off as a word of its own? Don't worry, there is more to come. After the stop keywords are dropped, the remaining keywords are matched against the SE's dictionary. The original dictionary may have held only the basic entries of the Ci Hai dictionary, but with the growth of the Internet, the SE's word library is now far larger than the Ci Hai. This is explained in detail below.

Words such as Shanghai, region, excellent, enterprise, consulting, and company are certainly entries in the Ci Hai, so they must belong to the SE's word library.
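Matching candidates against the dictionary can be sketched as a membership test. In the Python sketch below, the word library is a small hypothetical set holding assumed Chinese renderings of the six words just listed, and the candidate list stands in for the output of the binary pass; neither comes from a real search engine.

```python
# Hypothetical word library: Shanghai, region, excellent,
# enterprise, consulting, company.
DICTIONARY = {"上海", "地区", "优秀", "企业", "咨询", "公司"}


def match_dictionary(candidates, dictionary=DICTIONARY):
    """Keep only the candidates that are known dictionary words."""
    return [c for c in candidates if c in dictionary]


# Hypothetical two-character candidates, some of which are not real words.
candidates = ["上海", "海地", "地区", "区优", "优秀", "秀企", "企业"]
print(match_dictionary(candidates))  # ['上海', '地区', '优秀', '企业']
```

Pairs that straddle a word boundary, such as "区优", are not in the dictionary and fall away at this step.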

Here some SEOs will ask: how does the SE decide that something like "SEO" is a word? If you look at the keywords Microsoft Bing recommends for "SEO", you can see that Bing, like Baidu and Google, cannot solve this problem: the recommended keywords are about "color". Then how do they analyze such a keyword? This relies on the SE's powerful statistics. When the SE finds that a large number of users search for the keyword "SEO", it adds that keyword to the dictionary. A new word is created at that point.
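The statistical step can be sketched as counting queries and promoting the frequent ones that the dictionary does not yet contain. The Python sketch below makes several assumptions not stated in the article: the function name `discover_new_words`, the `min_count` threshold, and the case-insensitive comparison are all hypothetical choices for illustration.

```python
from collections import Counter


def discover_new_words(query_log, dictionary, min_count=3):
    """Return queries searched at least min_count times that are
    not yet in the dictionary (compared case-insensitively)."""
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    return {
        query
        for query, count in counts.items()
        if count >= min_count and query not in dictionary
    }


# Hypothetical query log and dictionary.
log = ["seo", "SEO", "Seo", "上海", "abc"]
known = {"上海"}
print(discover_new_words(log, known))  # {'seo'}
```

A real system would need a much higher threshold and some protection against spam queries; the threshold of 3 here only keeps the example small.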

There is also another approach, one that Yu Jun, the father of Baidu Post Bar, and Li Yanhong, Baidu's CEO, both look forward to: segmentation by understanding. When a user searches for "the best SEO company in Shanghai", the SE would understand what information the user is looking for. It would find that users care most about the keyword "best", so websites containing "best" would be promoted to first place. This site reached its goal of first place in both rankings by exploiting this feature of Chinese word segmentation.

To sum up, an SE actually has three word segmentation methods. The first is string matching, such as the single-character method and the binary method, and it relies on the dictionary. The second relies on statistical data to acquire large numbers of new words; it is called statistical word segmentation. The third builds matching on top of statistics, that is, understanding: it works out which second keyword users turn to when the first keyword they searched for did not find what they wanted. For example, many people search for one keyword (rendered "best" in translation), do not get the SERP they want, and then search for a second keyword (also rendered "best"). Once a large amount of statistics shows that the second keyword was useful, it is joined to the first to form a new word.

The first two methods have been widely adopted by SEs. The third currently cannot gather statistics for words with low search volume, so understanding is not possible when no statistics are available.
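The third, understanding-based method is only loosely described above, so the following Python sketch is one possible reading of it rather than a known implementation. It assumes the SE can group queries into per-user sessions, counts how often one query is immediately followed by a different one, and joins frequent pairs into a new word. The function name, the session format, the `min_count` threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def merge_followup_queries(sessions, min_count=2):
    """Count (query, follow-up query) pairs across sessions and join
    the pairs seen at least min_count times into new words."""
    pair_counts = Counter()
    for queries in sessions:
        # Pair each query with the one the same user issued next.
        for first, second in zip(queries, queries[1:]):
            if first != second:
                pair_counts[(first, second)] += 1
    return [
        first + second
        for (first, second), count in pair_counts.items()
        if count >= min_count
    ]


# Hypothetical sessions: two users search "最" and then "优秀".
sessions = [["最", "优秀"], ["最", "优秀"], ["上海", "公司"]]
print(merge_followup_queries(sessions))  # ['最优秀']
```

The pair ("上海", "公司") appears only once, so it stays below the threshold. This mirrors the limitation noted above: without enough search volume there are no statistics to act on.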
