Source: http://blog.chinaunix.net/u1/34978/showart_422243.html
Internet crawler design as seen in larbin
Yudund
2005.12.16
Please credit the source when reprinting.
The Internet is a huge unstructured database, and searching and organizing its data effectively has enormous application potential. With XML-based structured data such as RSS in particular, content can be organized more flexibly, and retrieving and presenting it will become ever more widespread, while demands on timeliness and readability keep rising. All of this rests on crawlers and information sources, so an efficient, flexible, and scalable crawler is essential to these applications.
The first thing to consider when designing a crawler is efficiency. For TCP/IP network programming there are several approaches.
The first is single-threaded blocking I/O, which is the simplest and easiest to implement. For example, a simple crawler can be built directly in the shell from commands such as curl and pcregrep. Its inefficiency is just as obvious: DNS resolution, connection establishment, writing the request, and reading the result all block and each adds latency, so the server's resources can never be fully used.
The second is multi-threaded blocking I/O: create several blocking threads, each requesting a different URL. Compared with the first approach it uses machine resources, especially the network, more effectively, because many threads working at once keep the link busy. It also consumes considerably more CPU, and the cost of frequent switching between user-level threads is worth considering.
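As a rough illustration of the multi-threaded blocking approach, here is a minimal Python sketch. It is not larbin code: the names fetch and crawl_all, the pool size, and the timeout are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking fetch: DNS lookup, connect, send and read all block this thread."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_all(urls, fetcher=fetch, workers=8):
    """Fetch every URL on a pool of blocking worker threads.

    Returns a dict mapping each URL to its body, or to the exception raised.
    """
    def safe_fetch(url):
        try:
            return fetcher(url)
        except Exception as exc:  # keep one bad URL from aborting the batch
            return exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Each worker still spends most of its time blocked, so throughput grows with the number of threads, and so does the scheduling overhead mentioned above.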
The third is single-threaded non-blocking I/O, which is widely used on both clients and servers. A single thread opens many non-blocking connections and uses select, poll, or epoll to check their status, so each one is serviced as soon as it is ready. This makes full use of the network while keeping local CPU usage to a minimum. It requires DNS lookups, connects, reads, and writes all to be asynchronous and non-blocking. The first of these is the complicated one, and adns can be adopted as a solution; the other three can be implemented directly in the program.
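The single-threaded non-blocking approach can be sketched in Python with the standard selectors module, which picks epoll, kqueue, poll, or select depending on the platform. This only illustrates the idea and is not how larbin itself is written. The function name fetch_all and the (ip, port, request) tuple format are assumptions, and IP addresses are passed in so that no blocking DNS lookup happens inside the loop.

```python
import selectors
import socket


def fetch_all(targets, timeout=5.0):
    """Fetch from several (ip, port, request_bytes) targets in one thread.

    Returns a list of raw responses in the same order as targets.
    """
    sel = selectors.DefaultSelector()
    results = [b""] * len(targets)
    for index, (ip, port, request) in enumerate(targets):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        sock.connect_ex((ip, port))  # returns at once; the connect goes on in the background
        # The socket becomes writable once the connect has finished (or failed).
        sel.register(sock, selectors.EVENT_WRITE, [index, request])

    while sel.get_map():
        events = sel.select(timeout)
        if not events:  # nothing became ready in time: give up on the rest
            break
        for key, mask in events:
            sock, state = key.fileobj, key.data
            try:
                if mask & selectors.EVENT_WRITE:
                    sent = sock.send(state[1])
                    state[1] = state[1][sent:]
                    if not state[1]:  # request fully written, now wait for the reply
                        sel.modify(sock, selectors.EVENT_READ, state)
                else:
                    chunk = sock.recv(4096)
                    if chunk:
                        results[state[0]] += chunk
                        continue
                    sel.unregister(sock)  # empty read: the peer closed the connection
                    sock.close()
            except BlockingIOError:  # not actually ready yet; wait for the next event
                continue
            except OSError:  # connect or I/O failure: drop this connection
                sel.unregister(sock)
                sock.close()

    for key in list(sel.get_map().values()):  # close whatever timed out
        key.fileobj.close()
    sel.close()
    return results
```

All connections make progress inside one loop, so a slow server delays only its own entry in the result list.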
With efficiency settled, the next question is the concrete design.
URLs should be handled by a dedicated class that covers printing and parsing a URL and extracting its host, port, and file.
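A URL class of this kind might look like the following Python sketch. The class name Url and its attribute names are made up for the example, and only http and https are handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


class Url:
    """Split a URL into the pieces a crawler needs: host, port and file."""

    def __init__(self, text):
        parts = urlsplit(text.strip())
        if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
            raise ValueError(f"unsupported URL: {text!r}")
        self.scheme = parts.scheme
        self.host = parts.hostname  # urlsplit already lower-cases the host
        self.port = parts.port or DEFAULT_PORTS[parts.scheme]
        self.file = parts.path or "/"  # the fragment is dropped on purpose
        if parts.query:
            self.file += "?" + parts.query

    def __str__(self):
        """Return the normalized URL, omitting the port when it is the default."""
        default = self.port == DEFAULT_PORTS[self.scheme]
        netloc = self.host if default else f"{self.host}:{self.port}"
        return f"{self.scheme}://{netloc}{self.file}"
```

Normalizing on construction means that two spellings of the same address compare equal later, when URLs are checked against the hash table.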
URLs then have to be deduplicated, which requires a large URL hash table.
To deduplicate page content as well, a document hash table is also needed.
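Both tables can be sketched with one small Python class that stores a short digest per item instead of the full string. This is an assumption-laden illustration, not larbin's own data structure: the class name SeenTable and the 64-bit SHA-1 prefix are choices made for the example.

```python
import hashlib


class SeenTable:
    """Remember fixed-size digests instead of full strings to save memory."""

    def __init__(self):
        self._digests = set()

    def add(self, data):
        """Record data (str or bytes); return True if it was not seen before."""
        if isinstance(data, str):
            data = data.encode("utf-8")
        digest = hashlib.sha1(data).digest()[:8]  # 64-bit fingerprint
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

One instance keyed on normalized URLs plays the role of the URL hash table, and a second instance keyed on page bodies plays the role of the document hash table.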
Crawled URLs need to be recorded. Because there are so many of them they have to be written to disk, so a FIFO class is needed too (call it urlsdisk).
URLs waiting to be crawled need a FIFO class as well. On restart, URLs are read back from the crawled-URL FIFO and written into this one. The running crawler reads URLs from this FIFO and adds them to the URL list of the corresponding host class. URLs can also be read directly from the crawled-URL FIFO, but with lower priority than those in this one, since they have already been crawled.
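A disk-backed FIFO can be as simple as an append-only text file plus a read offset. The Python sketch below is hypothetical (the class name DiskFifo and the one-URL-per-line format are assumptions) and makes no claim about larbin's on-disk format. The read offset lives only in memory, so a fresh instance starts again from the first line, which is roughly what replaying the crawled URLs after a restart needs.

```python
class DiskFifo:
    """Append-only URL queue kept in a text file, one URL per line."""

    def __init__(self, path):
        self.path = path
        self._read_pos = 0  # byte offset of the next unread line
        open(path, "ab").close()  # create the file if it does not exist

    def put(self, url):
        """Append one URL to the end of the queue."""
        with open(self.path, "ab") as f:
            f.write(url.encode("utf-8") + b"\n")

    def get(self):
        """Return the next URL, or None when the queue is empty."""
        with open(self.path, "rb") as f:
            f.seek(self._read_pos)
            line = f.readline()
            if not line.endswith(b"\n"):  # nothing left (or a partial write)
                return None
            self._read_pos = f.tell()
        return line[:-1].decode("utf-8")
```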
A crawler generally covers many sites, but DNS should be queried only once per site. Host names therefore have to be separated from URLs and handled by a class of their own.
Once a host name has been resolved, the resulting IP address needs its own class so that it can be used at connect time.
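The idea of resolving each host only once can be sketched as a small cache in Python. The class name HostTable is invented for the example, and the default resolver, socket.gethostbyname, is blocking; a real non-blocking crawler would plug in an asynchronous resolver such as adns instead.

```python
import socket


class HostTable:
    """Resolve each host name once and reuse the address for later URLs."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver  # blocking by default; swap in an async one
        self._addresses = {}
        self.lookups = 0  # number of real DNS queries made so far

    def address(self, host):
        """Return the cached IP address for host, resolving it on first use."""
        host = host.lower()
        if host not in self._addresses:
            self.lookups += 1
            self._addresses[host] = self._resolver(host)
        return self._addresses[host]
```

Every URL on the same site then reuses the cached address when it is time to connect.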
An HTML parsing class is also needed to analyze pages, extract the URLs they contain, and add them to urlsdisk.
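Link extraction can be sketched with Python's standard html.parser. The names LinkParser and extract_links are assumptions for this example. Only a href attributes are followed, relative links are resolved against the page URL, and fragments are dropped.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect absolute http(s) links from the <a href="..."> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":  # HTMLParser lower-cases tag and attribute names
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in html, in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The returned URLs are what would be checked against the URL hash table and then appended to urlsdisk.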
Add a few string-handling and scheduling classes and a simple crawler is basically complete.
The above is essentially larbin's design. The actual implementation adds some special handling, such as a built-in web server and processing of special file types. larbin also has a design weakness: slow sites pile up and tie up a large number of connections, which needs improving. In addition, for a large-scale crawler this covers only the fetching part; scaling out to a distributed system also requires centralized URL management and scheduling, plus a distribution algorithm for the front-end spiders.
Usage notes for the larbin web crawler
larbin is a crawler tool. I recently came across an article online about larbin, an efficient search-engine crawler, and I prefer it to Nutch's crawler because it is written in C++, which is close to C. I know C well, so I can modify it myself and learn some C++ along the way (experience from the past few years tells me that modifying someone else's code is a much faster way to learn a technology than writing hello world from scratch). So I began my arduous trial of larbin.
Looking back, the problems I ran into came from not reading the documentation carefully. Next time I will read it properly, even when it is in English. Trying things blindly just wastes time.
Larbin official address: http://larbin.sourceforge.net/index-eng.html
1. Compile
This deserves a mention, ha ha, because the code from the official site does not compile (on Linux with GCC):
./configure
make
gcc -O3 -Wall -D_REENTRANT -c -o parse.o parse.c
parse.c:115: error: conflicting types for 'adns__parse_domain'
internal.h:571: error: previous declaration of 'adns__parse_domain' was here
parse.c:115: error: conflicting types for 'adns__parse_domain'
internal.h:571: error: previous declaration of 'adns__parse_domain' was here
gmake[1]: *** [parse.o] Error 1
gmake[1]: Leaving directory '/home/LEO/larbin-2.6.3/adns'
make: *** [all] Error 2
The function prototype and its definition are inconsistent. To fix this, open ./adns/internal.h and comment out lines 568-571.
2. Run
There is not much to say here: ./larbin starts it. Configure it first in larbin.conf; I will not go into the configuration either.
Once it is running, you can check its status at http://host:8081, which is a nice touch. larbin.conf also has an inputPort 1976 setting, meaning URLs to crawl can be added while it runs. That is a very good idea, but how do you add them? At first I tried http://host:1976, which did not work and returned an error. After a long time trying with no result, I finally traced it with GDB. It turns out you can simply run telnet host 1976 and add URLs directly. Later I saw that the documentation says exactly this. I nearly fainted.
3. Result
Haha. Before leaving work I set it running. When I went to bed that night, I dreamed that my search engine had overtaken Baidu and was catching up with Google. I was so excited.
At work the next day I checked the results and found nothing in the directory except some fifo* files, which was depressing. There was nothing for it but to read the documentation. In "How to customize larbin" I found the following description:
The first thing you can define is the module you want to use for output. This defines what you want to do with the pages larbin gets. Here are the different options:
DEFAULT_OUTPUT: This module mainly does nothing, except statistics.
SIMPLE_SAVE: This module saves pages on disk. It stores 2000 files per directory (with an index).
MIRROR_SAVE: This module saves pages on disk with the hierarchy of the site they come from. It uses one directory per site.
STATS_OUTPUT: This module makes some stats on the pages. In order to see the results, see http://localhost:8081/output.html.
By default nothing is output. So I carefully read the only two documents on the official site, modified options.h, and recompiled, and this time I got results.
What I changed in my options:
SIMPLE_SAVE: simply saves pages to disk, 2000 files per directory, with an index.
CGILEVEL=0: also handle server-side pages, that is, URLs whose query strings contain characters such as ?, &, and = are crawled as well.
NO_DUP
The rest can be changed as needed. For details, see "How to customize larbin".
4. Problems
In use I found that a page cannot be fetched if its URL contains unencoded (not URL-encoded) Chinese characters. A quick look suggested that fileNormalize in src/utils/url.cc is involved. So I wrote an encodurl function and called it in the constructor of the url class, which solved the problem.
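The encodurl function itself is not shown here, so the following Python sketch only illustrates what such a function might do and is not the author's code. The name encode_url is hypothetical. It assumes the non-ASCII characters sit in the path or query and should be encoded as UTF-8; a site that expects GBK would need a different encoding argument.

```python
from urllib.parse import quote


def encode_url(url, encoding="utf-8"):
    """Percent-encode non-ASCII characters and spaces, leaving URL syntax intact."""
    # Reserved characters and existing %XX escapes are listed as safe so that
    # an already encoded URL passes through unchanged.
    return quote(url, safe=":/?#[]@!$&'()*+,;=%~", encoding=encoding)
```

For example, encode_url("http://example.com/\u4e2d\u6587") gives "http://example.com/%E4%B8%AD%E6%96%87".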
Because I needed more customization, this tool did not seem to meet my needs, and in the end I did not use it. Instead I built something that suits me in Perl on top of WWW::SimpleRobot. Besides, Perl should be no slower than C++ at string processing. Overall, though, the performance of this tool is good. Haha.
Still, I wrote this up as a warning to friends who do not read documentation (hopefully few), and to myself: read the documentation carefully.