Using search engine principles to explain what a crawler (spider) is

Source: Internet
Author: User

Many people regard the crawler as something almost magical, and this has produced one very common "lesson learned from practice": practice has supposedly proved that the Baidu crawler indexes original content within seconds!

Of course, to anyone who understands how a search engine works, this claim is unreliable. If practice is the way to verify the truth, there should first be a sound theoretical hypothesis for practice to verify. And since the crawler has no ability to analyze content, how could it tell, at the moment it fetches a page, whether that page's content is original or scraped from elsewhere?

Some people even think that the crawler refuses to crawl scraped content, which is stranger still: the crawler is not a prophet, so how could it know that a page is scraped before crawling it? (This leaves aside one special case: a search engine may use a site's overall proportion of original content to decide crawl priority, but that is a deeper topic.)

A search engine consists of four systems: download, analysis, indexing, and query. These four work largely independently, and deciding whether a page is scraped is the job of the analysis system. Moreover, presumably because of the cost of processing pages at very large scale, duplicate pages are generally indexed first and deleted only after a longer period. In other words, whether a search engine initially indexes a page is, at the very least, not decided by the quality of the page itself.
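
The separation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `FAKE_WEB`, the function names, and the use of an exact SHA-256 hash to spot duplicates are all hypothetical stand-ins, not how any real search engine does it. The point it shows is that the download step makes no judgement, so a duplicate page gets indexed first and is removed only when the analysis step runs later.

```python
import hashlib

# Hypothetical stand-in for the web: URL -> page text (illustrative only).
FAKE_WEB = {
    "http://a.example/1": "original article about crawlers",
    "http://b.example/copy": "original article about crawlers",
    "http://c.example/2": "a different page",
}

def download(url):
    """Download system: fetch the content, with no judgement about it."""
    return FAKE_WEB[url]

def index_pages(pages):
    """Index system: map each word to the set of URLs that contain it."""
    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def analyze_duplicates(pages):
    """Analysis system: return URLs whose text repeats an earlier page.

    Assumption: exact-hash comparison stands in for real duplicate detection.
    """
    seen = set()
    duplicates = []
    for url, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append(url)
        else:
            seen.add(digest)
    return duplicates

def query(index, word):
    """Query system: look a word up in the index."""
    return sorted(index.get(word, set()))

# Download and index everything first; nothing is filtered at this stage.
pages = {url: download(url) for url in FAKE_WEB}
index = index_pages(pages)
before = query(index, "crawlers")

# Some time later the analysis system runs and duplicates leave the index.
for url in analyze_duplicates(pages):
    for urls in index.values():
        urls.discard(url)
after = query(index, "crawlers")
```

In this sketch `before` contains both the original and the copy, while `after` contains only the page that was seen first.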

It has now been explained that the crawler cannot judge the quality of a page. In fact, strictly speaking, the crawler does not even extract links; it is simply a TCP/IP program that downloads pages. Link analysis still has to be done somewhere, otherwise the crawler could never reach new pages. To be exact, link analysis is assigned to the dispatcher. Crawler 1 fetches a page and hands it to dispatcher 1 for analysis. Dispatcher 1 puts every link it finds into URL library 1 and returns the links it considers important to crawler 1, so that crawler 1 fetches those important pages. At the same time, crawler 1 stores the pages it has fetched in page library 1; if a URL in URL library 1 corresponds to a page already in page library 1, it is not crawled again.
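
The division of labour between crawler 1 and dispatcher 1 might look roughly like the following Python sketch. Everything named here (`FAKE_WEB`, `Crawler`, `Dispatcher`, `crawl`, the regular expression used to find links) is a hypothetical illustration, and the sketch simply treats every previously unseen link as "important", which is an assumption rather than anything the text specifies.

```python
import re
from collections import deque

# Hypothetical tiny web: URL -> HTML (illustrative only).
FAKE_WEB = {
    "http://site.example/": '<a href="http://site.example/a">A</a> '
                            '<a href="http://site.example/b">B</a>',
    "http://site.example/a": '<a href="http://site.example/">home</a> '
                             '<a href="http://site.example/b">B</a>',
    "http://site.example/b": "no links here",
}

class Crawler:
    """Only downloads; it neither extracts links nor judges content."""
    def fetch(self, url):
        return FAKE_WEB.get(url, "")

class Dispatcher:
    """Extracts links, records them in the URL library, picks what to fetch."""
    def __init__(self, seed):
        self.url_library = {seed}

    def analyze(self, html):
        new_links = []
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in self.url_library:
                self.url_library.add(link)
                # Assumption: every unseen link counts as "important".
                new_links.append(link)
        return new_links

def crawl(seed):
    crawler, dispatcher = Crawler(), Dispatcher(seed)
    page_library = {}            # fetched pages, keyed by URL
    pending = deque([seed])      # links handed back to the crawler
    while pending:
        url = pending.popleft()
        if url in page_library:  # already fetched: do not crawl again
            continue
        html = crawler.fetch(url)
        page_library[url] = html
        pending.extend(dispatcher.analyze(html))
    return page_library, dispatcher.url_library

page_library, url_library = crawl("http://site.example/")
```

Note that `Crawler.fetch` never inspects the HTML; all knowledge of links lives in `Dispatcher`, which is the stricter reading of "crawler" used in this article.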

Large commercial search engines run many crawlers at the same time. In that case each "dispatcher" exchanges information with a "master scheduler", which determines the specific work of each crawler. If you see several crawlers fetching the same page many times within a short period, it is usually because this scheduling was not done well.
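
One way such coordination could work is sketched below in Python. This is an assumption for illustration only: the text does not say how a master scheduler divides the work, and the `MasterScheduler` class, the `assign_crawler` function, and the choice of hashing the host name are all hypothetical. The sketch shows the two properties that prevent repeated fetching: each URL is accepted only once, and each URL belongs to exactly one crawler.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Map a URL to one crawler by hashing its host (an assumed policy)."""
    host = urlsplit(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

class MasterScheduler:
    """Receives URLs from dispatchers and hands each one to a single crawler."""
    def __init__(self, num_crawlers):
        self.num_crawlers = num_crawlers
        self.claimed = set()                        # URLs already assigned
        self.queues = [[] for _ in range(num_crawlers)]

    def submit(self, url):
        """Return True if the URL was newly assigned, False if already known."""
        if url in self.claimed:
            return False
        self.claimed.add(url)
        self.queues[assign_crawler(url, self.num_crawlers)].append(url)
        return True

master = MasterScheduler(num_crawlers=3)
first = master.submit("http://site.example/a")    # reported by dispatcher 1
second = master.submit("http://site.example/a")   # reported again by dispatcher 2
master.submit("http://site.example/b")
total_queued = sum(len(q) for q in master.queues)
```

Here `first` is accepted and `second` is rejected, so the page is queued once no matter how many dispatchers report it. Without a shared record like `claimed`, several crawlers could each decide to fetch the same page.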

That said, counting the "dispatcher" and similar components as part of the crawler program is not wrong either. One description is simply stricter and the other looser. Either way, the crawler itself only downloads; at most it gains a few extra tricks from the dispatcher for deciding what to download.

This article is from http://www.csdinuan.com. Reprinting is allowed, but please keep the link.


