Baidu keyword algorithm - Chinese word segmentation algorithm - Analysis of keywords using word segmentation algorithms

Source: Internet
Author: User
Baidu's keyword algorithm uses word segmentation to analyze keywords. Understanding it lets us choose keywords more accurately and makes our SEO work more effective.

Understanding word segmentation technology matters a great deal for SEO work: it lets you analyze keywords from a scientific perspective and propose keyword deployment strategies. If Baidu does use the forward maximum matching algorithm, it follows that the weights of the segmented words are assigned in forward (left-to-right) order.

How do you design an efficient search engine? We can discuss how to design a practical one by looking at the techniques Baidu has adopted. Search engines involve many technical points, such as query processing, ranking algorithms, page crawling algorithms, caching mechanisms, and anti-spam. Commercial search engine providers such as Baidu and Google do not make these technical details public. We can, however, treat an existing search engine as a black box: by submitting inputs and examining the outputs it returns, we can roughly infer the unknown technical details inside.

This article describes and summarizes the query processing and Chinese word segmentation techniques used in Baidu's preprocessing stage, based on inductive analysis of search results and on common word segmentation algorithms. It will be easier to follow if you have some understanding of data structures and algorithms. In my personal view, concluding that Baidu uses forward maximum matching is not accurate enough: specialized dictionaries and general dictionaries carry different weights, and those weights are surely related to search frequency. On that basis, when words from several specialized dictionaries appear together, a bidirectional maximum matching algorithm is needed to decide which specialized word should be split off first. Of course, this is a personal conjecture and needs further study.
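For readers who have not met these algorithms, here is a minimal Python sketch of forward, backward, and bidirectional maximum matching. It is a textbook illustration, not Baidu's implementation: the toy dictionary, the `max_len` limit, and the tie-breaking rule (fewer tokens, then fewer single characters, then the backward result) are all assumptions, and the dictionary weights conjectured above are not modeled.

```python
# Hypothetical toy dictionary; a real segmenter would load a large lexicon.
DICTIONARY = {"研究", "研究生", "生命", "起源"}


def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens


def bidirectional_max_match(text, dictionary, max_len=4):
    """Run both scans and keep the result that looks less fragmented."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(tokens):
        return sum(1 for token in tokens if len(token) == 1)

    # Assumed tie-break: fewer single characters wins, otherwise backward.
    return fwd if singles(fwd) < singles(bwd) else bwd
```

On the string "研究生命起源", the forward scan greedily takes "研究生" and strands the single character "命", while the backward scan yields "研究", "生命", "起源". The bidirectional version picks the latter, which shows why a purely forward scan can mis-split when dictionary words overlap.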

Query Processing and Word Segmentation

With the rise of the search economy, people have begun to pay closer attention to the performance, technology, and daily traffic of the world's major search engines. An enterprise decides whether to advertise on a search engine based on its popularity and daily traffic. An ordinary Internet user chooses a search engine based on its performance and technology. A technician can take representative search engines as research subjects. The rise of the search economy proves once again the huge business opportunities of the Internet. Without search, the network would be nothing but empty, messy data and a large number of gold mines waiting to be laboriously dug.

Query processing and word segmentation are essential tasks for a Chinese search engine. Baidu, as a typical Chinese search engine, has always stressed that in "Chinese processing" it has key technologies and advantages that other search engines do not possess. So let's take a look at what core technologies Baidu uses.

Query Processing/Chinese word segmentation.

I. Query Processing

The user submits a query to the search engine. The search engine generally performs some processing after receiving the user's query, and then extracts the relevant information from the index database. So what does Baidu do after receiving user queries?

1. Suppose the submitted Chinese query contains English words. How does the search engine handle this? For example, for the query "BT download movie", Baidu keeps the English run inside the Chinese string as a single unit and uses it as a breakpoint that splits the Chinese text into separate parts. The query above is segmented in this way: whether the English in the middle is a dictionary word or a random string of characters, it is treated as a whole. To check this, query "dfdfdf Download Movie" and look at the results. The same holds if the query contains numbers.

2. Suppose the user submits more than one query string, for example "information retrieval theory tools". The search engine then splits the input into several subquery strings based on delimiters such as spaces and punctuation marks; the query above is parsed as three substrings (it is three space-separated terms in the original Chinese). This principle is simple, so let's move on.

3. Suppose the submitted query contains duplicate content. What should the search engine do? For example, for the query "theory tool theory", Baidu counts the repeated string only once, processing the query as the equivalent "theory tool". Google apparently does not merge the repeats; instead, it increases the weight of the repeated substring. How can we draw this conclusion? Submitting "theory tool" to Baidu returns 341,000 documents; note the results on the first page.

Next, submit the query "theory tool theory". The number of returned documents is still the same. By itself that does not prove much, so look at the ranking of the results on the first page: the order is completely unchanged, whereas Google's order changes somewhat. This suggests that Baidu merges repeated query strings into one. In addition, Baidu does not take the order in which the strings appear into account (Google does).
