Spider Technology: Several Problems to Be Solved When Designing a Spider (Author: Dudu Bird Studio)

Source: Internet
Author: User
Tags: website, server

Author: Dudu Bird Studio, http://hi.baidu.com/dudubirdstudio. (Copyright reserved; reprints must indicate the source.)

The spider is an important component of the whole search engine system and can be said to be the foundation of the search engine. It not only supplies the search engine with its objects of retrieval, a massive volume of web data, but also enables the search engine to rise from a retrieval tool to an information integration platform.
The essence of a search engine is information integration, and that integration builds a user platform. This is what makes search engines such attractive and profitable targets for commercialization.

A good search engine must be equipped with a good spider, and a good spider must be carefully designed by its designers.

A spider's design should address the following problems:

1. Crawling Efficiency
Crawling efficiency is related to the performance of the computer hardware, the amount of hardware, and the network bandwidth. However, improving crawling efficiency is not simply a matter of adding hardware; the goal is to crawl as many web pages as possible within a given period using limited hardware resources.
Common methods for improving crawling efficiency include:
(1) Multi-threaded concurrent crawling (see the sketch after this list).
(2) Single-threaded crawling with non-blocking I/O.
(3) Distributed crawling: the crawling work is spread across multiple servers. Search engines such as Google also distribute geographically, deploying crawling server clusters on backbone networks in countries around the world.
(4) Crawling efficiency also depends largely on the server and bandwidth of the target website. Therefore, a spider must be able to estimate the load and bandwidth of each website's server and must have a good scheduling policy, so that different website servers are visited at appropriate, differing frequencies.
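
Below is a minimal Python sketch of method (1), multi-threaded concurrent crawling. The seed URLs, thread count, and timeout are illustrative assumptions; politeness and error handling are deliberately left out here and are covered in later sections.

    # Minimal multi-threaded crawling sketch: worker threads overlap network
    # waits, so limited hardware fetches more pages per unit time than a
    # single blocking loop would.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    SEED_URLS = [
        "http://example.com/",
        "http://example.org/",
        "http://example.net/",
    ]

    def fetch(url, timeout=10):
        """Download one page; return (url, body), or (url, None) on failure."""
        try:
            with urlopen(url, timeout=timeout) as resp:
                return url, resp.read()
        except Exception:
            return url, None

    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, body in pool.map(fetch, SEED_URLS):
            print(url, len(body) if body is not None else "failed")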

2. Crawling Quality
The purpose of a spider is not to crawl every web page on the Internet, but to crawl the important and up-to-date ones.
How can high-quality pages be crawled, and which pages count as high quality? Answering this requires the designer to understand the Internet, users' habits, and some common sense.
From the Internet's point of view, the links between the pages of websites are a very important resource. Therefore, when crawling pages, the spider must record the links between them for later link analysis, so that page quality can be evaluated from the link relationships.
Indicators of web page importance include:
(1) The link depth of the page.
(2) The number of inbound links to the page.
(3) The number of inbound links to the page's parent pages.
(4) The number of duplicates of the page.
These indicators rest on the following common-sense observations:
(1) The most important pages of a website are placed near the front, that is, at the shallower levels. The home page and the pages it links to are all important.
(2) If many pages, and important pages in particular, link to a page, that page is important, much as the SCI citation index evaluates papers: the more often a paper is cited, the more important it is.
(3) The second point concerns citation; this one concerns reprinting. The more often a page is reposted, the more important it is, just as with newspapers and magazines: good articles are widely reproduced by other media.
In a concrete implementation, apart from the first indicator, the other three can only be obtained through statistics in the pre-processing phase.
Therefore, the spider should first obtain as many home pages as possible, and then adopt a breadth-first crawling policy starting from each website's home page, as sketched below.
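
The following toy sketch computes indicators (1) and (2), link depth and inbound-link count, over a small hypothetical link graph. The graph and the combined score are illustrative assumptions, not a standard ranking formula.

    # Compute link depth (breadth-first from the home page) and inbound-link
    # counts for a hypothetical site graph, then combine them into a rough score.
    from collections import deque

    LINKS = {                      # page -> pages it links to
        "/":       ["/news", "/about"],
        "/news":   ["/news/a", "/news/b"],
        "/about":  ["/news/a"],
        "/news/a": [],
        "/news/b": ["/news/a"],
    }

    def link_depths(home="/"):
        """Breadth-first search gives each page's depth below the home page."""
        depth, queue = {home: 0}, deque([home])
        while queue:
            page = queue.popleft()
            for child in LINKS.get(page, []):
                if child not in depth:
                    depth[child] = depth[page] + 1
                    queue.append(child)
        return depth

    def inlink_counts():
        """Count how many pages link to each page."""
        counts = {page: 0 for page in LINKS}
        for targets in LINKS.values():
            for target in targets:
                counts[target] = counts.get(target, 0) + 1
        return counts

    depths, inlinks = link_depths(), inlink_counts()
    for page in LINKS:
        score = inlinks[page] - depths.get(page, 99)   # shallow and well-linked pages win
        print(page, "depth:", depths.get(page), "inlinks:", inlinks[page], "score:", score)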

3. Crawling Politeness
Polite crawling means: do not crawl pages that a website does not allow to be crawled, control the frequency of access to each website, and make sure the spider's crawling does not affect normal user access. Therefore, the spider must:
(1) Limit the number of pages crawled from one website per unit of time.
(2) Limit the number of threads/processes crawling the same website simultaneously.
(3) Control the time interval between successive requests to the same website.
(4) Follow the robots.txt protocol, robots meta tags, and sitemap conventions, and never access directories that it is not allowed to access.
(5) Identify itself in the requests it sends when fetching pages, using the User-Agent and From header fields to state the spider's identity, a contact email address, and the URL of a page describing the spider, as in the sketch below.
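
Here is a minimal politeness sketch combining points (1), (3), (4), and (5): check robots.txt before fetching, enforce a per-host delay, and identify the spider in the request headers. The User-Agent string, contact address, delay value, and robots.txt URL are placeholders for illustration.

    # Polite fetching: robots.txt check, per-host delay, spider identification.
    import time
    from urllib import robotparser
    from urllib.request import Request, urlopen

    USER_AGENT  = "ExampleSpider/0.1 (+http://example.com/spider-info)"
    CRAWL_DELAY = 2.0        # seconds between requests to the same host
    _last_fetch = {}         # host -> time of the previous request

    def allowed(url, robots_url):
        """Respect robots.txt: only fetch URLs the site permits."""
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def polite_fetch(url, host):
        """Wait out the per-host delay, then fetch with identifying headers."""
        wait = CRAWL_DELAY - (time.time() - _last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        req = Request(url, headers={
            "User-Agent": USER_AGENT,
            "From": "spider-admin@example.com",   # contact email for site owners
        })
        _last_fetch[host] = time.time()
        with urlopen(req, timeout=10) as resp:
            return resp.read()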

4. Avoiding Repeated Crawling
Repeated crawling has several causes:
(1) A large number of pages on the Internet are linked from other pages, so the URL of one page appears on many different pages. This requires the spider to deduplicate URLs (see the sketch at the end of this section).
(2) Pages are reprinted by other sites, so the same article appears on pages with different URLs. This requires the spider to deduplicate content, which is difficult to implement; at present, many search engine companies have not solved this problem effectively.
(3) A page's URL can be written in multiple forms, because of the correspondence between domain names and IP addresses.
A URL corresponds to one page, but it can be expressed in either of these two ways:
[protocol://]domain-name[:port][/path/filename]
[protocol://]dotted-decimal-IP-address[:port][/path/filename]
Domain names and IP addresses can be related in the following ways:
-- One-to-one: http://www.baidu.com and http://220.231.39.97 point to the same page.
-- One-to-many (DNS round robin): http://www.163.com, http://202.108.42.73, and http://202.108.42.91 point to the same page.
-- Many-to-one (virtual hosting): multiple domain names correspond to the same IP address, but different URLs may point to different pages; or a single website has several domain names mapped to the same IP address, for example www.netease.com and www.163.com point to the same page.
-- Many-to-many: a website has multiple domain names and uses DNS round robin, so each domain name corresponds to multiple IP addresses.
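
Below is a minimal URL-deduplication sketch for cause (1), with a simple canonicalization step that also smooths over some of the representation differences in cause (3). The normalization rules are illustrative assumptions; a production spider would also resolve domain/IP aliases and normalize query-parameter order.

    # Canonicalize URLs and remember the ones already crawled.
    from urllib.parse import urlsplit, urlunsplit

    seen = set()

    def normalize(url):
        """Lower-case scheme and host, drop default ports and fragments."""
        parts = urlsplit(url)
        host = parts.hostname or ""
        netloc = host if parts.port in (None, 80, 443) else f"{host}:{parts.port}"
        return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))

    def is_new(url):
        """True the first time a canonical URL is seen, False afterwards."""
        key = normalize(url)
        if key in seen:
            return False
        seen.add(key)
        return True

    print(is_new("http://WWW.Example.com:80/index.html#top"))  # True  (first time)
    print(is_new("http://www.example.com/index.html"))         # False (same canonical form)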

5. Updating Crawled Data
Keeping crawled data up to date is a very important issue: it determines whether users can immediately search for the latest news and content. However, because of the enormous number of pages on the Internet, a full crawl takes a long cycle, and if every page had to be re-downloaded for each update, the update period would inevitably be very long.
Pages the spider has already crawled may later be modified or deleted. The spider should therefore check these pages for updates regularly and refresh the original page repository, the extracted database, and the index database accordingly.
The Internet also continuously produces new pages, which the spider must crawl as well.
Different websites have different update cycles; some are long and some are short.
The spider classifies websites by their update cycle, and the crawling cycle for each website varies with its update cycle.
Generally, when the spider refreshes the original page repository, it does not need to re-download the page for every URL; for most pages, an HTTP HEAD request or a conditional GET request is enough to check for updates, as in the sketch below.
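
A minimal conditional-GET sketch, assuming the previous crawl stored the page's Last-Modified and/or ETag headers. A 304 Not Modified response means the stored copy is still current and the body need not be downloaded again. The URL and header values below are placeholders.

    # Re-check a page with a conditional GET instead of a full re-download.
    from urllib.error import HTTPError
    from urllib.request import Request, urlopen

    def fetch_if_changed(url, last_modified=None, etag=None):
        """Return the new body, or None if the server reports no change."""
        headers = {}
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        if etag:
            headers["If-None-Match"] = etag
        try:
            with urlopen(Request(url, headers=headers), timeout=10) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code == 304:          # Not Modified: keep the stored copy
                return None
            raise

    # Example call with placeholder values from a hypothetical earlier crawl:
    # body = fetch_if_changed("http://example.com/",
    #                         last_modified="Tue, 01 Jan 2030 00:00:00 GMT")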

6. Content Extraction
A spider crawls files of many types: HTML and XML web pages; formatted documents such as DOC, PPT, XLS, and PDF; and multimedia data such as images, audio, and video. For each type, the spider must extract plain text.
For documents such as DOC and PDF, which are generated by software from professional vendors, the vendors provide corresponding text-extraction interfaces.
For HTML and XML pages, besides the title and body there is a great deal of copyright information, advertising links, and common channel links. These have nothing to do with the body text, so when extracting page content the spider must also filter out these useless links.
For multimedia files, images, and similar content, the file's subject is usually indicated by the anchor text of the links pointing to it (that is, the link text) and by nearby file annotations. Many multimedia files also carry file attributes; taking these attributes into account helps the spider understand the file's content better.
Web content extraction is generally implemented as plug-ins: a plug-in management service program dispatches pages of different formats to different plug-ins. The advantage of this approach is good extensibility: whenever a new file type appears in the future, its handler can be written as a plug-in and added to the plug-in management service, as sketched below.
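
A small sketch of the plug-in idea: extractor plug-ins register themselves by content type, and a manager dispatches each fetched file to the matching plug-in. The extractors here are deliberately trivial placeholders; real plug-ins would call format-specific parsing libraries or vendor interfaces.

    # Plug-in registry: map content types to text extractors.
    from html.parser import HTMLParser

    EXTRACTORS = {}                       # content type -> extractor function

    def extractor(content_type):
        """Decorator that registers an extractor plug-in for one content type."""
        def register(func):
            EXTRACTORS[content_type] = func
            return func
        return register

    @extractor("text/html")
    def extract_html(data):
        """Strip tags and return the visible text of an HTML page."""
        chunks, parser = [], HTMLParser()
        parser.handle_data = chunks.append       # collect text nodes only
        parser.feed(data.decode("utf-8", errors="replace"))
        return " ".join(c.strip() for c in chunks if c.strip())

    @extractor("text/plain")
    def extract_plain(data):
        return data.decode("utf-8", errors="replace")

    def extract(content_type, data):
        """Dispatch to the registered plug-in; unknown types yield empty text."""
        plugin = EXTRACTORS.get(content_type)
        return plugin(data) if plugin else ""

    print(extract("text/html", b"<html><body><h1>Title</h1><p>Body text.</p></body></html>"))

Adding support for a new format later only means writing one more decorated function; the dispatch code itself does not change.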

7. Estimating Hardware Investment, Crawling Speed, Crawl Duration, and Data Volume
Sun Tzu said that what is planned in advance succeeds and what is not planned fails, which stresses the importance of planning: many quantities must be known beforehand.
When designing a spider, you must consider how long it will take to crawl one billion pages, how much hard disk space will be needed, and how many servers will be needed. Reasonable best-case and maximum values can be obtained through estimation.
For example, how many machines should be used for crawling, and how many crawling processes/threads should each machine run?
The hardware resources to consider include (a back-of-envelope sketch follows the list):
-- LAN bandwidth (transmission rate)
-- Internet access bandwidth
-- LAN latency (roughly 1 to 10 ms)
-- Internet latency (roughly 100 to 500 ms)
-- Server request and response time
-- CPU utilization
-- Memory size and utilization
-- Hard disk size and read/write speed
-- System load
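
Below is a back-of-envelope sketch of such an estimate. Every number (average page size, usable bandwidth per machine, fleet size) is an illustrative assumption; measured values should be substituted for a real plan.

    # Rough estimate of storage and crawl time for one billion pages.
    PAGES           = 1_000_000_000        # pages to crawl
    AVG_PAGE_BYTES  = 20 * 1024            # assumed average page size: 20 KB
    MACHINES        = 20                   # assumed number of crawling servers
    PER_MACHINE_BPS = 10 * 1_000_000 / 8   # assumed 10 Mbit/s usable per machine, in bytes/s

    total_bytes = PAGES * AVG_PAGE_BYTES
    storage_tib = total_bytes / 1024**4
    fleet_bps   = MACHINES * PER_MACHINE_BPS
    crawl_days  = total_bytes / fleet_bps / 86_400

    print(f"Raw page storage : {storage_tib:.1f} TiB")        # about 18.6 TiB
    print(f"Fleet throughput : {fleet_bps / 1e6:.1f} MB/s")   # about 25 MB/s
    print(f"One full crawl   : {crawl_days:.1f} days")        # about 9.5 days

Changing any assumption (smaller pages, more machines, more bandwidth) scales the results linearly, which is exactly why the estimate is worth doing before buying hardware.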

The above are several issues that should be taken into consideration when designing a spider. This is only a rough overview of the major aspects; in practice the technology is still a craft of fine workmanship, and the engineers must polish it continuously.

Note: Copyright reserved; reprints must indicate the source. Author: Dudu Bird Studio, http://hi.baidu.com/dudubirdstudio.