Chinese Search Engine Technology Unveiled: Web Spiders (2)

Source: e800.com.cn


Basic Principles of Web Spiders

"Web spider" is a figurative name: if the Internet is likened to a spider's web, the spider is the program that crawls across it. A web spider finds web pages by following their link addresses. Starting from one page of a website (usually the homepage), it reads the content of the page, finds the other link addresses on it, and follows those links to the next pages, continuing until every page of the website has been crawled. If the whole Internet is treated as one website, a web spider can in principle use the same method to capture every web page on the Internet.
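The link-following loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the three-page `SITE` dictionary, the `example.com` URLs, and the names `LinkExtractor` and `crawl_site` are all assumptions made for the example, and pages are read from memory rather than fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical three-page site, held in memory instead of being downloaded.
SITE = {
    "http://example.com/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://example.com/about": '<a href="/">Home</a>',
    "http://example.com/news": '<a href="/about">About</a>',
}


def crawl_site(start_url, fetch):
    """Follow links from start_url until no unvisited page remains."""
    visited = []
    pending = [start_url]
    while pending:
        url = pending.pop(0)
        if url in visited:
            continue
        html = fetch(url)
        if html is None:  # the page could not be retrieved
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page they appear on.
            pending.append(urljoin(url, href))
    return visited


pages = crawl_site("http://example.com/", SITE.get)
print(pages)
```

Passing `SITE.get` as the `fetch` argument keeps the sketch runnable offline; a real spider would substitute a function that downloads the page.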

For a search engine, capturing every web page on the Internet is practically impossible. According to currently published figures, even the largest search engine captures only about 40% of all web pages. One reason is the limits of crawling technology: a spider cannot traverse every page, and many pages cannot be reached through links from other pages. The other reason is storage and processing. If the average page is 20 KB (including images), 10 billion pages come to about 200,000 GB. Even if that much could be stored, downloading it is a problem: at 20 KB per second per machine, about 340 machines would have to download continuously for a year to fetch every page. Such a large volume of data also reduces efficiency when serving searches. For these reasons, the web spiders of many search engines crawl only the important pages, and the main criterion for judging a page's importance during crawling is its link depth.
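The storage and download estimate can be checked with a short calculation. The sketch below uses only the figures given in the text (10 billion pages, 20 KB per page, 20 KB per second per machine) and assumes decimal units (1 GB = 1,000,000 KB) and no protocol overhead; both are assumptions of this example.

```python
PAGES = 10_000_000_000          # 10 billion pages (figure from the text)
PAGE_SIZE_KB = 20               # average page size in KB (figure from the text)
RATE_KB_PER_S = 20              # download rate per machine (figure from the text)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

total_kb = PAGES * PAGE_SIZE_KB
total_gb = total_kb / 1_000_000  # assumption: decimal units, 1 GB = 10^6 KB

# Number of machines that must download non-stop for one year.
machine_years = total_kb / (RATE_KB_PER_S * SECONDS_PER_YEAR)

print(f"storage: {total_gb:,.0f} GB")
print(f"machines for one year: {machine_years:.0f}")
```

Under these assumptions the total is 200,000 GB and the machine count comes out at roughly 317, in the same range as the 340 quoted in the text. The source does not say what rounding or overhead its figure includes.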

Web spiders generally use one of two strategies when capturing pages: breadth first or depth first (as shown in the figure). Breadth first means that the spider first captures all the pages linked from the starting page, then picks one of those pages and captures all the pages linked from it, and so on. This is the most commonly used method, because it lets the spider process pages in parallel, which increases crawling speed. Depth first means that the spider follows a single chain of links from the starting page until that chain is finished, then moves on to the next starting page and follows its links in the same way. The advantage of this method is that the spider is easier to design. The figure makes the difference between the two strategies clearer.
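The difference in visiting order between the two strategies can be seen on a small example. The link graph `LINKS` below is hypothetical (it is not the graph from the original figure), and the function names are chosen for this sketch only.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def breadth_first(start):
    """Visit every page linked from a page before going one level deeper."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for linked in LINKS[page]:
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return order


def depth_first(start):
    """Follow one chain of links to its end before trying the next link."""
    order, seen, stack = [], set(), [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push in reverse so the first link on the page is followed first.
        stack.extend(reversed(LINKS[page]))
    return order


print(breadth_first("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(depth_first("A"))    # ['A', 'B', 'D', 'E', 'C', 'F']
```

The only structural difference is the container: a queue yields breadth-first order, a stack yields depth-first order.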

Because it is impossible to capture every web page, some web spiders limit the number of layers they will access on less important websites. For example, in the figure, A is the starting page and belongs to layer 0; B, C, D, E, and F belong to layer 1; G and H belong to layer 2; and I belongs to layer 3. If the spider's access limit is set to 2 layers, page I will not be accessed. This is why some pages of a website can be found through a search engine while others cannot. For website designers, a flat site structure helps search engines capture more of the site's pages.
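A layer limit of this kind can be sketched as a breadth-first crawl that records the layer of each page and stops following links once the limit is reached. The text gives only the layer of each page, so the specific links in `LINKS` below are an assumption chosen to reproduce those layers; the function name `crawl_with_layer_limit` is likewise hypothetical.

```python
from collections import deque

# Assumed links, chosen so that the layers match the text:
# A = layer 0; B..F = layer 1; G, H = layer 2; I = layer 3.
LINKS = {
    "A": ["B", "C", "D", "E", "F"],
    "B": [],
    "C": [],
    "D": [],
    "E": ["G", "H"],
    "F": [],
    "G": ["I"],
    "H": [],
    "I": [],
}


def crawl_with_layer_limit(start, max_layer):
    """Return {page: layer} for every page within max_layer of start."""
    layer_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if layer_of[page] == max_layer:
            continue  # links on this page would exceed the limit
        for linked in LINKS[page]:
            if linked not in layer_of:
                layer_of[linked] = layer_of[page] + 1
                queue.append(linked)
    return layer_of


layers = crawl_with_layer_limit("A", 2)
print(layers)
print("I" in layers)  # False: page I sits on layer 3
```

Raising the limit to 3 brings page I into the result, which illustrates why a flatter site structure leaves fewer pages beyond the spider's reach.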

Web spiders often run into encrypted data and page permissions when accessing websites, since some pages can be viewed only by members. A website owner can of course use a protocol to keep web spiders from crawling (described in the following section). However, some websites that sell reports want search engines to index those reports without letting searchers read them for free. Such sites need to give the web spider a user name and password. The spider can then crawl the protected pages with the permissions it has been given and make them searchable. When a searcher clicks through to view such a page, the searcher must also pass the corresponding permission check.
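How a spider might present site-supplied credentials can be sketched as follows. The text does not say which login mechanism such sites use; this example assumes HTTP Basic authentication, and the URL, user name, password, and function name are all hypothetical. The request is only constructed, not sent.

```python
import base64
import urllib.request


def build_authenticated_request(url, username, password):
    """Build a request carrying credentials (assumes HTTP Basic auth)."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical report page and spider account.
req = build_authenticated_request(
    "http://example.com/reports/1", "spider", "secret"
)
print(req.full_url)
print(req.get_header("Authorization"))

# A real spider would then fetch the page, for example with
# urllib.request.urlopen(req), and index the returned content.
```

Sites that use a login form or another scheme would need a different approach; the point is only that the spider sends credentials the site has issued to it, while ordinary visitors are still asked to authenticate.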
