How Search Engine Spiders Crawl Links

Source: Internet
Author: User

Search engine spiders are quite mysterious to most of us, which is why this article invokes Spider-Man. Of course, we are neither Baidu nor Google, so we can only explore the mystery, not reveal it. This article is fairly basic and is meant to share one way of thinking with readers who are new to the topic; experts can skip it.

Intuitively, we picture a search engine spider crawling the way a real spider moves across its web. For example, a Baidu spider finds a link, follows it to a page, then follows the links inside that page and keeps going. The resulting picture looks like a spider's web, or a large tree. This model is correct in principle, but it is not accurate in detail.
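The intuitive model can be sketched in a few lines. This is only an illustration: the `TOY_WEB` dictionary and the `naive_crawl` function are hypothetical names, and the "web" is an in-memory mapping rather than real pages.

```python
# Toy "web": each page maps to the links found on it (hypothetical data).
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def naive_crawl(start, web):
    """Follow every discovered link immediately, depth first."""
    visited = []
    stack = [start]
    while stack:
        url = stack.pop()
        if url in visited or url not in web:
            continue  # already crawled, or not a known page
        visited.append(url)
        # Push links in reverse so they are visited in page order.
        stack.extend(reversed(web[url]))
    return visited


print(naive_crawl("a.example/", TOY_WEB))
```

Here every link is followed the moment it is found, which is exactly the assumption the rest of the article questions.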

A search engine keeps an index library of site URLs. Spiders set out from the search engine's servers, crawl pages along the URLs the search engine already holds, and bring the page content back. Once a page has been collected, the search engine analyzes it and separates the content from the links (content handling is left aside here). The links that come out of this analysis are not crawled immediately. Instead, each link and its anchor text are recorded for analysis, comparison, and calculation, and only then is the link placed into the URL index library. Once a link is in the URL index library, a spider will be sent to crawl it.
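The flow described above might look roughly like the following sketch. Everything here is an assumption made for illustration: the `UrlIndex` class, the `crawl_round` function, and the `min_mentions` threshold are hypothetical, and the real analysis a search engine performs is not public.

```python
from dataclasses import dataclass, field


@dataclass
class UrlIndex:
    """Hypothetical URL index library: only URLs stored here get crawled."""
    crawlable: list = field(default_factory=list)  # accepted URLs, in order
    pending: dict = field(default_factory=dict)    # url -> anchor texts seen

    def record(self, url, anchor_text):
        # A discovered link is only recorded with its anchor text, not crawled.
        self.pending.setdefault(url, []).append(anchor_text)

    def analyze(self, min_mentions=1):
        # Stand-in for "analysis, comparison, and calculation": accept a URL
        # once it has been recorded at least min_mentions times.
        for url, anchors in list(self.pending.items()):
            if url in self.crawlable:
                del self.pending[url]
            elif len(anchors) >= min_mentions:
                self.crawlable.append(url)
                del self.pending[url]


def crawl_round(index, web, crawled):
    """Crawl every indexed URL not yet crawled, then analyze new links."""
    for url in list(index.crawlable):
        if url in crawled:
            continue
        crawled.add(url)
        for target, anchor in web.get(url, []):
            index.record(target, anchor)
    index.analyze()


# Toy data: page -> list of (link target, anchor text).
web = {"a.example/": [("b.example/", "B site")], "b.example/": []}
index = UrlIndex(crawlable=["a.example/"])
crawled = set()

crawl_round(index, web, crawled)  # b.example/ is indexed but not yet crawled
crawl_round(index, web, crawled)  # b.example/ is crawled in a later round
```

The point of the design is the delay: a link found during one round is only crawled in a later round, after it has passed through the index.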

In other words, when a page gains an external link (a backlink), a spider will not necessarily crawl that page right away; a process of analysis and calculation comes first. Even if the external link is removed after a spider has crawled the page that carried it, the search engine may already have recorded the link, so the target may still be crawled later. And if the spider later revisits the page that carried the external link and finds that the link is gone, or that the page itself returns a 404, the search engine most likely just reduces the weight of that external link rather than deleting it from the URL index library.
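A minimal sketch of this "reduce the weight, do not delete" behavior follows. The function name `recheck_external_link`, the weight table, and the `penalty` factor of 0.5 are all hypothetical; the article only says the weight goes down, not by how much.

```python
def recheck_external_link(link_weights, source, target, status, links_found,
                          penalty=0.5):
    """Re-evaluate one recorded external link after revisiting its source page.

    link_weights maps (source page, target URL) to a weight.
    status is the HTTP status of the source page on this visit.
    links_found lists the link targets currently present on the source page.
    """
    key = (source, target)
    if key not in link_weights:
        return None  # this link was never recorded
    link_missing = target not in links_found
    if status == 404 or link_missing:
        # Assumed behavior: lower the weight but keep the record.
        link_weights[key] *= penalty
    return link_weights[key]


weights = {("blog.example/post", "shop.example/"): 1.0}
# The link has been removed from the page: weight drops, record stays.
print(recheck_external_link(weights, "blog.example/post", "shop.example/",
                            200, []))
print(("blog.example/post", "shop.example/") in weights)
```

Because the entry is never deleted, the target URL remains known to the index even after the link that introduced it has disappeared.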

So a link that no longer exists on a page can still have an effect.
