A Look at Web Crawler Design from larbin

The Internet is a huge, unstructured database. Effectively searching and organizing its data has enormous application prospects, especially for XML-based structured data such as RSS: internal storage can be organized far more flexibly, and retrieval, organization, and presentation will be used ever more widely, with growing demands on timeliness and readability. All of these applications rest on crawlers as their information source, so an efficient, flexible, and scalable crawler is of irreplaceable importance to them.

When designing a crawler, efficiency is the first concern. For network communication over TCP/IP there are several programming models to choose from.

The first is single-threaded blocking I/O, the simplest and easiest to implement. A trivial crawler can even be built directly from shell commands such as curl and pcregrep. Its inefficiency is just as obvious: because every step blocks, DNS resolution, connection establishment, writing the request, and reading the result each stall the thread in turn, so the machine's resources can never be fully used.
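
As a concrete illustration, here is a minimal blocking fetch in C++ (error handling trimmed, HTTP/1.0 and port 80 assumed); every call in it, from getaddrinfo to read, stalls the single thread, which is exactly the bottleneck described above:

    // Minimal blocking HTTP GET over port 80: DNS lookup, connect, write and
    // read all stall the calling thread one after another.
    #include <netdb.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <string>

    std::string blocking_fetch(const std::string &host, const std::string &path) {
        addrinfo hints{}, *res = nullptr;
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host.c_str(), "80", &hints, &res) != 0)   // blocking DNS resolution
            return "";
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {  // blocking connect
            freeaddrinfo(res);
            if (fd >= 0) close(fd);
            return "";
        }
        freeaddrinfo(res);
        std::string req = "GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
        write(fd, req.data(), req.size());                        // blocking write of the request
        std::string page;
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)              // blocking reads of the reply
            page.append(buf, n);
        close(fd);
        return page;
    }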

The second is multi-threaded blocking I/O: create many blocking threads, each requesting a different URL. Compared with the first model it uses the machine, and especially the network, much more effectively, because many threads are waiting on the network at the same time. The cost is higher CPU consumption, and the performance impact of frequently switching between that many user-level threads is worth weighing.
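
A sketch of this multi-threaded variant, reusing the blocking_fetch function from the previous sketch: one thread per URL, so the kernel overlaps the network waits while the per-thread cost grows with the number of URLs:

    #include <string>
    #include <thread>
    #include <utility>
    #include <vector>

    // One blocking thread per URL: the kernel overlaps the network waits,
    // at the price of one stack and one scheduler entry per in-flight request.
    void fetch_all(const std::vector<std::pair<std::string, std::string>> &urls) {
        std::vector<std::thread> workers;
        for (const auto &u : urls)
            workers.emplace_back([u] { blocking_fetch(u.first, u.second); });  // copy the pair into the thread
        for (auto &t : workers)
            t.join();                                                          // wait for every fetch to finish
    }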

The third is single-threaded non-blocking I/O. This is a widely used model, on both clients and servers: open many non-blocking connections in one thread and use select, poll, or epoll to learn each connection's state, so every request is serviced as soon as it becomes ready. This makes full use of the network while keeping local CPU consumption to a minimum. It does require that DNS lookups, connects, and reads/writes all be asynchronous and non-blocking. The DNS part is the complicated one, and ADNS can be adopted as a solution; the other three operations are comparatively simple and can be implemented directly in the program.
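
The following sketch shows only the core pattern of this third model, not larbin's actual code: many non-blocking sockets driven from one thread by poll(). The names are invented for illustration, DNS is assumed to have been resolved asynchronously beforehand (e.g. via ADNS), and connect() is assumed to have already been issued on O_NONBLOCK sockets:

    #include <algorithm>
    #include <poll.h>
    #include <string>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <vector>

    struct Conn { int fd; std::string request; size_t sent = 0; std::string page; bool done = false; };

    // Drive many non-blocking connections from one thread with poll(): push
    // request bytes when a socket becomes writable, read when it becomes readable.
    void drive(std::vector<Conn> &conns) {
        size_t remaining = conns.size();
        while (remaining > 0) {
            std::vector<pollfd> pfds;
            for (auto &c : conns)
                if (!c.done)
                    pfds.push_back({c.fd, short(c.sent < c.request.size() ? POLLOUT : POLLIN), 0});
            poll(pfds.data(), pfds.size(), 1000);                 // wait up to 1 s for any socket
            for (auto &p : pfds) {
                auto it = std::find_if(conns.begin(), conns.end(),
                                       [&](const Conn &c) { return c.fd == p.fd; });
                if (p.revents & POLLOUT) {                        // writable: send what is still pending
                    ssize_t n = send(it->fd, it->request.data() + it->sent,
                                     it->request.size() - it->sent, 0);
                    if (n > 0) it->sent += n;
                } else if (p.revents & (POLLIN | POLLHUP | POLLERR)) {
                    char buf[4096];
                    ssize_t n = recv(it->fd, buf, sizeof(buf), 0);
                    if (n > 0) it->page.append(buf, n);           // partial read; poll again later
                    else { close(it->fd); it->done = true; --remaining; }  // EOF or error: finished
                }
            }
        }
    }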

With the efficiency question settled, the next thing to consider is the concrete design.

URLs should be handled by a dedicated class that can display a URL, parse it, and extract its host, port, and file (path) components.
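
A minimal sketch of such a class (the names are illustrative, not larbin's exact interface), parsing http://host[:port]/path into its parts:

    #include <string>

    // Minimal URL holder: parses "http://host[:port]/path" into its components.
    class Url {
        std::string host_, file_;
        int port_ = 80;
    public:
        explicit Url(const std::string &u) {
            std::string rest = u.compare(0, 7, "http://") == 0 ? u.substr(7) : u;
            size_t slash = rest.find('/');
            std::string hostport = rest.substr(0, slash);
            file_ = (slash == std::string::npos) ? "/" : rest.substr(slash);
            size_t colon = hostport.find(':');
            if (colon == std::string::npos) {
                host_ = hostport;
            } else {
                host_ = hostport.substr(0, colon);
                port_ = std::stoi(hostport.substr(colon + 1));
            }
        }
        const std::string &host() const { return host_; }
        int port() const { return port_; }
        const std::string &file() const { return file_; }
        std::string str() const {                         // re-display the URL
            return "http://" + host_ + (port_ == 80 ? "" : ":" + std::to_string(port_)) + file_;
        }
    };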

URLs then need to be deduplicated, which calls for a large URL hash table.

To also avoid fetching duplicate page content, a document hash table is needed as well.
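
One simple way to implement both tables is a large bitmap indexed by a hash, where a set bit means "probably seen before". This is only a sketch, not larbin's exact structure; the real table would be sized carefully and persisted, and hash collisions mean a few URLs or pages may be skipped by mistake:

    #include <functional>
    #include <string>
    #include <vector>

    // Large bitmap indexed by a hash: bounded memory in exchange for a small
    // chance of treating an unseen key as already seen (hash collision).
    class SeenSet {
        std::vector<bool> bits_;
    public:
        explicit SeenSet(size_t nbits) : bits_(nbits, false) {}
        // Returns true if the key was already marked; marks it either way.
        bool test_and_set(const std::string &key) {
            size_t i = std::hash<std::string>{}(key) % bits_.size();
            bool seen = bits_[i];
            bits_[i] = true;
            return seen;
        }
    };

    // Usage: one instance keyed by URL strings, a second keyed by a fingerprint
    // of the page body to drop duplicate content.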

The crawled URLs need to be recorded, and because there are so many of them they have to be written to disk, so a FIFO class is also needed (call it urlsdisk).
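
A minimal sketch of such a disk-backed FIFO of URL strings, assuming one URL per line; a real implementation would keep the file handles open and buffer in memory rather than reopening the file on every call:

    #include <fstream>
    #include <string>

    // Disk-backed FIFO of URL strings: pushes append to a file, pops advance a
    // separate read offset, so the queue can grow far beyond available RAM.
    class DiskFifo {
        std::string path_;
        std::streampos read_pos_ = 0;
    public:
        explicit DiskFifo(std::string path) : path_(std::move(path)) {}
        void push(const std::string &url) {
            std::ofstream out(path_, std::ios::app);
            out << url << '\n';
        }
        bool pop(std::string &url) {
            std::ifstream in(path_);
            in.seekg(read_pos_);
            if (!std::getline(in, url)) return false;    // queue exhausted for now
            read_pos_ = in.tellg();
            return true;
        }
    };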

The URLs waiting to be crawled need a FIFO class of their own. On a restart, URLs are read back from the crawled-URL FIFO and written into this one. While running, the crawler reads from this FIFO and adds its URLs to the per-host URL lists. URLs can also be taken directly from the previous FIFO, but at a lower priority than the ones here, since they have, after all, already been crawled.

A crawler generally covers many sites, yet DNS for one site should only be requested once, so the host name has to be separated from the URL and handled by a class of its own.

Once a host name has been resolved, the result needs an IP address class of its own so that it is ready for use during connect.
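
A sketch combining the last two points: a cache holding one resolved address per host, so DNS is queried only once per site and the stored address is ready for connect(). The class name is invented here, and blocking getaddrinfo stands in for the asynchronous resolver (such as ADNS) a real crawler would use:

    #include <cstring>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <string>
    #include <unordered_map>

    // One resolved IPv4 address cached per host name, so DNS runs once per site.
    class HostCache {
        std::unordered_map<std::string, sockaddr_in> cache_;
    public:
        // Returns false if the name cannot be resolved.
        bool lookup(const std::string &host, sockaddr_in &out) {
            auto it = cache_.find(host);
            if (it != cache_.end()) { out = it->second; return true; }
            addrinfo hints{}, *res = nullptr;
            hints.ai_family = AF_INET;
            hints.ai_socktype = SOCK_STREAM;
            if (getaddrinfo(host.c_str(), nullptr, &hints, &res) != 0) return false;
            sockaddr_in addr;
            std::memcpy(&addr, res->ai_addr, sizeof(addr));   // keep the resolved address
            freeaddrinfo(res);
            cache_[host] = addr;
            out = addr;
            return true;
        }
    };

    // The cached sockaddr_in is exactly what connect() needs later; only the
    // port has to be filled in per URL, e.g. addr.sin_port = htons(url.port());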

An HTML document parsing class is also needed to analyse fetched pages, pick out the URLs they contain, and add them to urlsdisk.
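
A crude sketch of the extraction step: scan the page for href="..." attributes and collect their values. A real parser also has to resolve relative links against the page's base URL, honour robots.txt, and cope with malformed markup:

    #include <string>
    #include <vector>

    // Crude link extraction: find every href="..." and collect the quoted value.
    std::vector<std::string> extract_links(const std::string &page) {
        std::vector<std::string> links;
        size_t pos = 0;
        while ((pos = page.find("href=\"", pos)) != std::string::npos) {
            pos += 6;                                     // skip past href="
            size_t end = page.find('"', pos);
            if (end == std::string::npos) break;
            links.push_back(page.substr(pos, end - pos));
            pos = end + 1;
        }
        return links;
    }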

Add a few string and scheduling classes, and a simple crawler is basically complete.

The above is essentially larbin's design. larbin also does some extra work in its implementation, such as a built-in web server and handling of special file types. One weak point of larbin's design is that slow sites pile up and occupy a large number of connections, which needs improvement. Also, for a large-scale crawler this covers only the fetching part; for distributed expansion you would still need to add centralized URL management and scheduling, along with distributed algorithms for the spiders at the front end.
