Two months have passed since Lee of the Baidu Webmaster platform released information about search engine crawling last August, and Lee has now followed up with information about the search engine's indexing system. In any case, Wood-wood SEO thinks Baidu's official announcements are still worth understanding and analyzing. The following is Baidu's official announcement:
As we all know, the main work of a search engine includes crawling, storage, page analysis, indexing, retrieval, and several other major processes. Over the past few weeks we gave a brief introduction to the crawl-related processes; today we briefly introduce the indexing system. Finding specific keywords among web pages numbered in the billions is like fishing for a needle in the sea. Given enough time the search might finish, but users cannot afford to wait. From a user-experience point of view, we have to return satisfactory results within milliseconds, otherwise we will simply lose those users. How can this requirement be met?
If we know which pages contain each keyword the user is looking for (after the query has been cut into words), then retrieval can be imagined as intersecting the sets of pages that contain the different parts of the segmented query, and the search becomes a matter of comparing and intersecting page names. In this way, it is possible to search through hundreds of millions of pages in milliseconds. This is what is usually called the inverted index and the intersection-based retrieval process. The following are the basic steps for building an inverted index:
(1) Page analysis identifies and marks the different parts of the original page, for example: title, keywords, content, link, anchor, comment, and other unimportant areas;
(2) Word segmentation actually includes cutting words, segmentation, synonym conversion, synonym replacement, and so on. Taking the segmentation of a page title as an example, it produces data such as: term text, Termid, part of speech, etc.;
(3) Once the preparation above is complete, the next step is to build the inverted index, forming {Term->doc}, which can be roughly understood as a mapping from each term to the pages that contain it. Why "Term->doc", rather than directly using "doc->term"?
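To make step (3) concrete, here is a minimal Python sketch of building a Term->doc mapping. It is only an illustration, not Baidu's implementation: the toy `docs` corpus, the doc ids, and both function names are assumptions made up for this example, and the pages are assumed to have already gone through steps (1) and (2).

```python
from collections import defaultdict

# Hypothetical toy corpus: doc id -> terms already produced by segmentation.
docs = {
    1: ["search", "engine", "index"],
    2: ["search", "crawl"],
    3: ["index", "crawl", "search"],
}

def build_inverted_index(docs):
    """Build term -> sorted list of doc ids (the Term->doc direction)."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sorted lists make it cheap to intersect them at query time.
    return {term: sorted(ids) for term, ids in postings.items()}

def docs_containing_by_scan(docs, term):
    """The doc->term direction: every document must be scanned for the term."""
    return sorted(doc_id for doc_id, terms in docs.items() if term in terms)

inverted_index = build_inverted_index(docs)

# One dictionary lookup with Term->doc ...
print(inverted_index["crawl"])
# ... versus a pass over every document with doc->term.
print(docs_containing_by_scan(docs, "crawl"))
```

Both calls return the same doc ids, but the second has to touch every page. With pages numbered in the billions, that difference is one plausible reading of why the index is keyed by term.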
The above is the process of building the inverted index within the indexing system, and it is a key link in enabling a search engine to retrieve results in milliseconds.
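The retrieval side described earlier, intersecting the page sets for each part of the segmented query, can be sketched in the same spirit. This is a hedged illustration only: `toy_index`, `intersect`, and `search` are hypothetical names, the posting lists are assumed to be sorted lists of doc ids, and a two-pointer merge is just one common way to intersect such lists.

```python
def intersect(a, b):
    """Intersect two sorted lists of doc ids with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query_terms):
    """Return the doc ids that contain every term of the segmented query."""
    lists = [index.get(term, []) for term in query_terms]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
        if not result:
            break  # no document can match once the intersection is empty
    return result

# Hypothetical Term->doc index with sorted posting lists.
toy_index = {
    "search": [1, 2, 3],
    "index": [1, 3],
    "crawl": [2, 3],
}

print(search(toy_index, ["search", "index"]))           # [1, 3]
print(search(toy_index, ["search", "index", "crawl"]))  # [3]
```

Each comparison only looks at doc ids, never at page content, which matches the idea that the search reduces to comparing and intersecting page names.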
Well, that is the full text of what Baidu released. It is, of course, very brief; readers who want to learn more can read Wood-wood SEO's "Not Understanding the Principles of Search Engines Is Running Naked", which I think explains things in more detail. In addition, a few words in the announcement above may be unfamiliar, so briefly: term is the word text, that is, the keyword; Termid is the word's identifier.
Article source: Wood-wood SEO blog Http://blog.sina.com.cn/mumuhouzi