Baidu Lee: Overview of the Search Engine Indexing System (I)

Source: Internet
Author: User

Two months have passed since Lee of the Baidu Webmaster Platform released information about search engine crawling last August, and Lee has now followed up with information about the search engine indexing system. In any case, Wood-wood SEO thinks Baidu's official announcements are still worth understanding and analyzing. Baidu's official announcement follows:

As we all know, the main work of a search engine consists of several major processes: crawling, storage, page analysis, indexing, and retrieval. Over the past few weeks we gave a brief introduction to the crawl-related processes. Today we give a brief introduction to the indexing system. Finding particular keywords among web pages numbering in the billions is like fishing for a needle in the sea. Given enough time the search could perhaps be completed, but users cannot afford to wait. From a user-experience point of view, we have to return satisfying results within milliseconds, otherwise users will simply leave. How can this requirement be met?

If we already know which pages contain each part of the user's query (after the query has been segmented into words), then retrieval can be imagined as intersecting the sets of pages that contain the different parts of the query, and the search becomes a matter of comparing page names and taking their intersection. In this way, it becomes possible to search hundreds of millions of pages in milliseconds. This is what is usually referred to as the inverted index and the intersection-based retrieval process. The basic procedure for building an inverted index is as follows:

  

(1) Page analysis identifies and marks the different parts of the original page, for example: title, keywords, content, links, anchors, comments, and other unimportant areas.

(2) Word segmentation actually includes word cutting, segmentation proper, synonym conversion, synonym replacement, and so on. Taking the segmentation of a page title as an example, it produces data such as: term text, Termid, word class, part of speech, etc.

(3) Once this preparation is complete, the next step is to build the inverted index, forming {term→doc}, which can be roughly understood as follows. Why "term->doc" rather than directly using "doc->term"?
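Steps (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not Baidu's implementation: the `PAGES` data, the whitespace-based `segment` function, and the helper names `build_index` and `docs_for` are all assumptions made for the example. It also hints at why the mapping runs term->doc: a query arrives as terms, so an index keyed by term returns the candidate pages directly, whereas a doc->term table would have to be scanned page by page.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical pages whose fields have already been marked (step 1).
PAGES: Dict[int, Dict[str, str]] = {
    1: {"title": "search engine basics",
        "content": "how a search engine crawls pages"},
    2: {"title": "inverted index",
        "content": "an index maps each term to pages"},
}


def segment(text: str) -> List[str]:
    """Stand-in for step 2: lowercase and split on whitespace.

    Real segmentation (especially for Chinese) and synonym handling
    are far more involved; this is only a placeholder.
    """
    return text.lower().split()


def build_index(pages: Dict[int, Dict[str, str]]):
    """Step 3: build term -> Termid and Termid -> [(doc id, field)]."""
    term_ids: Dict[str, int] = {}
    index: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for doc_id in sorted(pages):
        for field, text in pages[doc_id].items():
            seen = set()  # record each term once per field
            for term in segment(text):
                # Assign the next free id the first time a term appears.
                termid = term_ids.setdefault(term, len(term_ids))
                if termid in seen:
                    continue
                seen.add(termid)
                index[termid].append((doc_id, field))
    return term_ids, dict(index)


def docs_for(term: str, term_ids, index) -> List[int]:
    """Look up the sorted doc ids that contain a term in any field."""
    termid = term_ids.get(term)
    if termid is None:
        return []
    return sorted({doc_id for doc_id, _ in index[termid]})


term_ids, index = build_index(PAGES)
print(docs_for("pages", term_ids, index))
```

Storing the field next to each doc id is one possible way to keep the page-analysis marks from step (1) available at retrieval time; the source does not say how this information is actually stored.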

  

The above is the process of building the inverted index within the indexing system, which is an important link in enabling the search engine to achieve millisecond-level retrieval.
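The retrieval side, where the page lists of the different query parts are intersected, might look like the sketch below. It assumes each term's page list is kept sorted by page id so that two lists can be intersected in a single merge-style pass. `TOY_INDEX`, `intersect`, and `retrieve` are hypothetical names for illustration, not part of any Baidu system.

```python
from typing import Dict, List

# Hypothetical toy index: term -> sorted list of ids of pages containing it.
TOY_INDEX: Dict[str, List[int]] = {
    "search": [1, 2, 4, 7],
    "engine": [2, 3, 4, 9],
    "index": [2, 4, 5],
}


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted id lists by walking both in step."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def retrieve(query_terms: List[str], index: Dict[str, List[int]]) -> List[int]:
    """Return ids of pages that contain every term of a segmented query."""
    postings = [index.get(term, []) for term in query_terms]
    if not postings:
        return []
    # Starting from the shortest list keeps intermediate results small.
    postings.sort(key=len)
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
        if not result:
            break  # no page can match any more
    return result


print(retrieve(["search", "engine", "index"], TOY_INDEX))  # [2, 4]
```

Each intersection costs time proportional to the lengths of the two lists rather than to the total number of pages, which is what makes this approach practical at scale.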

That is the full text of what Baidu released. It is, of course, very brief; readers who want to learn more can see Wood-wood SEO's article "Not Understanding How Search Engines Work Is Like Running Naked", which I think explains things in more detail. In addition, a few words in the text above may be unfamiliar, so briefly: term is the word text, that is, the keyword; Termid is the word's identifier.

Article from: Wood-wood SEO blog Http://blog.sina.com.cn/mumuhouzi


