Using C# to Implement Multi-Threaded Control of a Spider/Crawler Program


In the article "Making a crawler/spider program (C# language)", we introduced the basic implementation of a crawler program. The crawler's functionality works, but its download speed may be slow because of two efficiency problems:

1. Analysis and download cannot run at the same time. In "Making a crawler/spider program (C# language)", the crawler worked in two steps: analysis and download. In a single-threaded program the two cannot overlap: the network connection sits idle during analysis, and the longer the analysis takes, the lower the download efficiency. The reverse also holds: no analysis can happen while a download is in progress, and the next analysis can start only after the download finishes. Once the problem is stated this way, the remedy suggests itself: run analysis and download in separate threads.

2. Downloads are single-threaded. Most readers have used download managers that let you set the number of download threads (recent versions typically default to 10, earlier ones to 5). Such tools split a file into as many parts as there are threads, and each thread downloads its own part, which can greatly improve download speed. Careful users will notice, however, that for a fixed bandwidth, speed rises with the thread count only up to a certain point, where it peaks. A crawler, as a specialized download tool, can hardly be called efficient without multithreading; its whole purpose in the information age is to gather information quickly. A crawler therefore needs to download multiple web pages at the same time, with a controllable number of threads.

Now that we understand and have analyzed the problem, we can solve it.

Multithreading is not difficult to implement in C#: the System.Threading namespace provides full support for it.

To start a new thread, you need the following initialization:

```csharp
ThreadStart startDownload = new ThreadStart(DownLoad);
// thread start delegate: each thread executes DownLoad().
// Note: DownLoad() must be a method that takes no parameters.
Thread downloadThread = new Thread(startDownload); // instantiate the new thread
downloadThread.Start();                            // start the thread
```

Because the method a thread starts with cannot take parameters, sharing resources among multiple threads takes a little care. Class-level variables solve the problem (other techniques exist, but I find this one the easiest to use). Once you know how to start multi-threaded downloads, a few questions remain:
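As a sketch of the class-level-variable approach, the parameterless thread method can read its work from a shared field instead of from arguments. All names here (Spider, urlQueue, Run) are illustrative, not from the original article; Queue.Synchronized wraps the queue for thread-safe access:

```csharp
using System;
using System.Collections;
using System.Threading;

public class Spider
{
    // Class-level field shared by all download threads:
    // a thread-safe queue of URLs waiting to be downloaded.
    private Queue urlQueue = Queue.Synchronized(new Queue());

    public void Run()
    {
        urlQueue.Enqueue("http://example.com/"); // seed address (illustrative)
        Thread downloadThread = new Thread(new ThreadStart(DownLoad));
        downloadThread.Start();
        downloadThread.Join();
    }

    // DownLoad takes no parameters, as ThreadStart requires;
    // it gets its work items from the shared class-level queue instead.
    private void DownLoad()
    {
        while (urlQueue.Count > 0)
        {
            string url = (string)urlQueue.Dequeue();
            Console.WriteLine("downloading " + url);
            // ... fetch the page and enqueue any newly found links here ...
        }
    }
}
```

With several threads running DownLoad() at once, each Dequeue hands out a different URL, which is exactly the resource sharing a parameterless thread method cannot get through arguments.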

1. How to control the number of threads?

2. How to prevent multiple threads from downloading the same webpage?

3. How to determine that a thread has finished?

4. How to make the threads stop?

Below are some solutions to these problems:

1. The number of threads can be controlled with a simple for loop, much like a beginner's first looping exercise.

For example, if the user has specified n (an int variable) threads, you can open n threads as follows:

```csharp
Thread[] downloadThread;
// Declare the array of download threads. One advantage of C# is that the array
// length need not be given at declaration; it is fixed only when the array is
// allocated. This should be a class-level field so other methods can control the threads.
ThreadStart startDownload = new ThreadStart(DownLoad); // each thread executes DownLoad()
downloadThread = new Thread[n];                        // allocate the threads and fix their total number
for (int i = 0; i < n; i++)                            // start the specified number of threads
{
    downloadThread[i] = new Thread(startDownload);     // assign the thread start delegate
    downloadThread[i].Start();                         // start the threads one by one
}
```

That is all it takes to control how many threads are started.

2. The next problem: all the threads call the same DownLoad() method, so how do we keep them from downloading the same web page at the same time?

This problem can be solved with a URL address table in which each address may be claimed by only one thread. The specific implementation:

Create a table in a database with four columns: one stores the URL address, two more store the thread that claimed the address and the number of times the address has been requested, and the last stores the downloaded content. (The thread column is not strictly necessary.) When a thread claims an address, it writes its own thread number into the thread column and increments the request count, so no other thread can claim that page. If the download succeeds, the content is saved into the content column; if it fails, the content column stays empty, which later serves as one signal that the page should be downloaded again. Once the request count reaches the retry limit (the maximum number of times an address should be tried, which the user can set), the thread moves on to the next URL address. The main code is as follows (using VFP as an example):

<Create the table>

```foxpro
CREATE TABLE (ctablename) (curl M, ctext M, ldowned I, threadNum I)
&& creates the table ctablename.DBF with four fields: URL address, text content,
&& download attempt count, and thread flag (initial value -1; thread numbers start from 0)
```

<Claim a URL address>

```foxpro
cfullname = (ctablename) + '.dbf'    && add the extension
USE (cfullname)
GO TOP
LOCATE FOR (EMPTY(ALLTRIM(ctext)) AND ldowned < 2 AND ;
    (threadNum = thisNum OR threadNum = -1))
&& find a URL that has not yet been downloaded successfully, still has retries left,
&& and is claimable by this thread; thisNum is the current thread's number,
&& which can be passed in as a parameter
gotUrl = curl
recNum = RECNO()
IF recNum <= RECCOUNT()    && if such a URL address was found in the list
    UPDATE (cfullname) SET ldowned = (ldowned + 1), threadNum = thisNum ;
        WHERE RECNO() = recNum
    && mark the record as claimed: add 1 to the download count and set the thread flag
ENDIF
```

<Save the downloaded content>

```foxpro
cfulltablename = (ctablename) + '.dbf'
USE (cfulltablename)
SET EXACT ON
LOCATE FOR curl = (csiteurl)    && csiteurl is a parameter: the URL whose content was downloaded
recNumNow = RECNO()             && record number of the record containing this address
UPDATE (cfulltablename) SET ctext = (ccontent) WHERE RECNO() = recNumNow
&& store the downloaded content for the corresponding address
```

<Insert a new address>

```foxpro
ctablename2 = (ctablename) + '.dbf'
USE (ctablename2)
GO TOP
SET EXACT ON
LOCATE FOR curl = (cnewurl)     && look for the address
IF RECNO() > RECCOUNT()         && if the address is not yet in the table
    SET CARRY OFF
    INSERT INTO (ctablename2) (curl, ctext, ldowned, threadNum) ;
        VALUES ((cnewurl), "", 0, -1)
    && add the new address to the list
ENDIF
```

This solves the conflicts among multiple threads. Of course, the de-duplication problem can also be solved entirely in C#: keep a temporary text file at the application root holding every URL seen so far, with whatever attributes each entry needs, though searching it will probably be less efficient than querying a database.
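For readers who would rather stay in C# than use VFP, the claim/update logic of the table above can be sketched in memory. This is a minimal illustration under my own naming (UrlEntry, UrlTable, GetAUrl are not from the article); the lock statement plays the role of the database's record locking:

```csharp
using System.Collections;

// One row of the URL table: address, content, attempt count, claiming thread.
public class UrlEntry
{
    public string Url;
    public string Text = "";      // downloaded content (ctext)
    public int Downed = 0;        // download attempt count (ldowned)
    public int ThreadNum = -1;    // claiming thread number (-1 = unclaimed)
}

public class UrlTable
{
    private ArrayList entries = new ArrayList();

    // Claim one URL for the given thread, mirroring the LOCATE/UPDATE above:
    // not yet downloaded, fewer than two attempts, unclaimed or owned by this thread.
    public string GetAUrl(int thisNum)
    {
        lock (entries)
        {
            foreach (UrlEntry e in entries)
            {
                if (e.Text == "" && e.Downed < 2 &&
                    (e.ThreadNum == thisNum || e.ThreadNum == -1))
                {
                    e.Downed++;           // count the attempt
                    e.ThreadNum = thisNum; // mark the record as claimed
                    return e.Url;
                }
            }
            return "";                    // nothing claimable: signals "no URL found"
        }
    }

    // Add an address only if it is not already present (the <insert new address> step).
    public void AddUrl(string url)
    {
        lock (entries)
        {
            foreach (UrlEntry e in entries)
                if (e.Url == url) return; // duplicate: ignore
            UrlEntry n = new UrlEntry();
            n.Url = url;
            entries.Add(n);
        }
    }

    // Store downloaded content for an address (the <download content> step).
    public void SaveContent(string url, string content)
    {
        lock (entries)
        {
            foreach (UrlEntry e in entries)
                if (e.Url == url) { e.Text = content; return; }
        }
    }
}
```

Because GetAUrl both finds and marks a record inside one lock, two threads can never claim the same page, which is the same guarantee the database table provides.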

3. Judging when a thread is finished is tricky, because a thread is always searching for new links. A reasonable assumption: if a thread fails to claim a new URL after N consecutive attempts, it has probably downloaded all the links there are. The main code is as follows:

```csharp
string url = "";
int times = 0;
while (url == "")                  // keep looking until a matching record is found
{
    url = GetUrl.GetAUrl(......);  // call the GetAUrl method and try to get a URL
    if (url == "")                 // if none was found
    {
        times++;                   // count the failed attempt
        if (times > N)             // if we have tried enough times, end the thread
        {
            downloadThread[i].Abort(); // exit this thread
        }
        continue;                  // otherwise try again
    }
    times = 0;                     // a URL was found: reset the attempt counter
    // process the URL in the next step
}
```

4. This problem is comparatively simple, because Question 1 already recommended keeping the threads in a class-level array, which makes them easy to control: just end them with a for loop. The code is as follows:

```csharp
for (int i = 0; i < n; i++)    // close all n threads
{
    downloadThread[i].Abort(); // close the threads one by one
}
```

And with that, the spider program is complete. In C#, the implementation really is that simple.

A reminder for readers: the author only offers one idea and one feasible solution here, by no means the best one; even this solution leaves plenty of room for improvement for readers to think about.

Finally, the environment I used:

Windows XP Professional SP2

VFP 9.0

Visual Studio .NET 2003 Enterprise Edition
