Multi-threaded control of spider/crawler programs (C# language)

Source: Internet
Author: User

The article "Crawler/Spider Program Production (C# language)" introduced the basic method of implementing a crawler, and the crawler described there already works. It does have an efficiency problem, however: downloading can be slow. There are two causes:

1. Analysis and download cannot proceed at the same time. The "Crawler/Spider Program Production (C# language)" article introduced the two steps of a crawler: analysis and download. In a single-threaded program the two cannot run concurrently. In other words, the network sits idle during analysis, and the longer the analysis takes, the lower the download efficiency. The reverse also holds: no analysis can be done while downloading, and the next round of analysis can start only after the download stops. Once the problem is stated this way, the fix suggests itself: run analysis and download in different threads.

2. Only a single download thread. I believe everyone has used a download manager to fetch resources from the internet, where the number of threads can be set (recent versions default to 10; earlier ones defaulted to 5). It splits the file into as many parts as there are threads, and each thread downloads its own part, which is likely to improve download efficiency. I believe everyone has seen multithreading speed up a download. Careful users will notice, though, that for a given bandwidth more threads do not always mean more speed: throughput peaks at a certain thread count. As a special kind of download tool, how can a crawler be called efficient without multithreading? Isn't the whole point of a crawler in the information age to get information quickly? The crawler therefore needs multiple threads (a controllable number of them) downloading web pages at the same time.

Now that the problem has been analyzed, we can set about solving it:

Multithreading is not difficult to implement in C#. The System.Threading namespace provides the multithreading support.

To open a new thread, you need the following initialization:

ThreadStart startdownload = new ThreadStart(DownLoad); // thread start setting: each thread executes DownLoad(). Note: DownLoad() must be a method without parameters
Thread downloadthread = new Thread(startdownload); // instantiate the new thread to be started
downloadthread.Start(); // start the thread
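The three lines above can be put into a complete program along the following lines. This is only a minimal sketch: the class name SingleThreadDemo, the fields pageUrl and downloadedLength, and the Run helper are made up for illustration, and the body of DownLoad() is a placeholder rather than a real download.

```csharp
using System;
using System.Threading;

public static class SingleThreadDemo
{
    // Class-level fields (hypothetical names): ThreadStart takes no arguments,
    // so input and output travel through shared state.
    private static string pageUrl = "";
    private static int downloadedLength;

    // Stand-in for the article's DownLoad(): parameterless and returns void.
    private static void DownLoad()
    {
        // A real crawler would fetch pageUrl here; this sketch only records its length.
        downloadedLength = pageUrl.Length;
    }

    public static int Run(string url)
    {
        pageUrl = url;
        ThreadStart startdownload = new ThreadStart(DownLoad);
        Thread downloadthread = new Thread(startdownload);
        downloadthread.Start();
        downloadthread.Join(); // wait for the worker before reading its result
        return downloadedLength;
    }

    public static void Main()
    {
        Console.WriteLine(Run("http://example.com/"));
    }
}
```

Join() is used here only so that the result can be read safely after the worker has finished; a crawler would normally let its threads keep running.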

Because a method started through ThreadStart cannot take parameters, sharing resources among threads becomes more awkward. We can solve this with class-level variables (there are other ways, of course, but I find this the easiest to use). Once you know how to start multithreaded downloads, several questions may come to mind:

1. How to control the number of threads?

2. How to prevent several threads from downloading the same web page?

3. How to determine that a thread has finished?

4. How to control the ending of threads?

Here are some solutions to these issues:

1. The number of threads can be controlled with a for loop, just as in the programs we wrote when first learning to program.

For example, if the user has specified n threads (n is an int variable), the n threads can be started as follows:

Thread[] downloadthread; // declare the download threads. A convenience of C# is that an array declaration does not need a length; the length can be given when the array is created. This declaration should be at class level, so that other methods can control the threads as well
ThreadStart startdownload = new ThreadStart(DownLoad); // thread start setting: each thread executes DownLoad()
downloadthread = new Thread[n]; // allocate the thread array, which fixes the total number of threads
for (int i = 0; i < n; i++) // start the specified number of threads
{
    downloadthread[i] = new Thread(startdownload); // assign the thread start setting
    downloadthread[i].Start(); // start the threads one by one
}

As you can see, controlling the number of threads that are started is easy.
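A runnable version of this loop might look like the sketch below. The class name FixedThreadsDemo and the counters nextThreadNumber and finishedCount are assumptions added for illustration. Since DownLoad() takes no parameters, each worker obtains its own number from a class-level counter.

```csharp
using System;
using System.Threading;

public static class FixedThreadsDemo
{
    private static Thread[] downloadthread = new Thread[0]; // class-level, as suggested above
    private static int nextThreadNumber = -1; // hands out thread numbers 0, 1, 2, ...
    private static int finishedCount;         // how many workers have run to completion

    private static void DownLoad()
    {
        // Each worker claims a unique number without needing a parameter.
        int thisnum = Interlocked.Increment(ref nextThreadNumber);
        Console.WriteLine("thread " + thisnum + " running");
        // Placeholder for the real download work.
        Interlocked.Increment(ref finishedCount);
    }

    public static int StartAndWait(int n)
    {
        nextThreadNumber = -1;
        finishedCount = 0;
        ThreadStart startdownload = new ThreadStart(DownLoad);
        downloadthread = new Thread[n];
        for (int i = 0; i < n; i++)
        {
            downloadthread[i] = new Thread(startdownload);
            downloadthread[i].Start();
        }
        for (int i = 0; i < n; i++)
        {
            downloadthread[i].Join(); // wait until every worker has finished
        }
        return finishedCount;
    }

    public static void Main()
    {
        Console.WriteLine(StartAndWait(5) + " threads finished");
    }
}
```

The order of the "thread k running" lines is not fixed, because the threads run concurrently.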

2. The next problem is this: all threads call the DownLoad() method, so how do we keep them from downloading the same page at the same time?

This is also easy to solve: set up a table of URL addresses in which each address may be claimed by only one thread. A concrete implementation:

You can use a database to create a table with four columns. One column stores the URL address, two columns hold the thread that claimed the address and the number of times the address has been requested, and the last column stores the downloaded content. (The thread column is not strictly necessary.) When a thread claims an address, it sets the thread column to its own thread number and records one request in the request-count column, so that no other thread can claim the page. If the download succeeds, the content is saved to the content column. If it fails, the content column stays empty, which serves as one of the criteria for downloading again. If the download fails repeatedly, the thread moves on to the next URL address once the retry limit is reached (the limit being the maximum number of requests per address, which can be made user-configurable). The main code is as follows (using VFP as an example):

< set up table >

CREATE TABLE (ctablename) (curl M, ctext M, ldowned I, threadnum I) && create the table ctablename.dbf with four fields: address, text content, number of download attempts, and thread flag (initial value -1; thread flags are integers starting from 0)

< extract URL address >

cfullname = (ctablename) + '.dbf' && add the extension to the table name
USE (cfullname)
GO TOP
LOCATE FOR (EMPTY(ALLTRIM(ctext)) AND ldowned < 2 AND (threadnum = thisnum OR threadnum = -1)) && find a URL address that has not yet been downloaded successfully and that this thread is entitled to download; thisnum is the number of the current thread and can be passed in as a parameter
goturl = curl
recnum = RECNO()
IF recnum <= RECCOUNT() THEN && if such a URL address was found in the table
    UPDATE (cfullname) SET ldowned = (ldowned + 1), threadnum = thisnum WHERE RECNO() = recnum && mark this record as requested: add 1 to the download count and set the thread flag to the number of this thread
ENDIF

< download content >

cfulltablename = (ctablename) + '.dbf'
USE (cfulltablename)
SET EXACT ON
LOCATE FOR curl = (csiteurl) && csiteurl is a parameter: the URL address that the downloaded content belongs to
recnumnow = RECNO() && get the number of the record containing this address
UPDATE (cfulltablename) SET ctext = (ccontent) WHERE RECNO() = recnumnow && store the downloaded content for this address

< Insert new address >

ctablename = (ctablename) + '.dbf'
USE (ctablename)
GO TOP
SET EXACT ON
LOCATE FOR curl = (cnewurl) && check whether this address already exists
IF RECNO() > RECCOUNT() THEN && if this address is not yet in the table
    SET CARRY OFF
    INSERT INTO (ctablename) (curl, ctext, ldowned, threadnum) VALUES ((cnewurl), "", 0, -1) && add the new address to the list
ENDIF

This resolves the conflicts between threads. Of course, the problem can also be handled in C# itself: just create a temporary file (a text file will do), save all the URL addresses in it, and attach the appropriate attributes to each of them. Lookups, however, may not be as fast as with a database.
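For comparison, the same four-column idea can be sketched in C#. Note that this sketch swaps in an in-memory list guarded by a lock instead of the database table or the temporary file mentioned above. The class name UrlTable, its method names, and the MaxAttempts constant are assumptions made for illustration; the fields mirror curl, ctext, ldowned and threadnum.

```csharp
using System;
using System.Collections.Generic;

public class UrlTable
{
    private class Row
    {
        public string Url = "";
        public string Text = "";   // downloaded content, empty until a download succeeds
        public int Downed;         // number of download attempts (ldowned)
        public int ThreadNum = -1; // claiming thread, -1 means unclaimed (threadnum)
    }

    private const int MaxAttempts = 2; // mirrors "ldowned < 2" in the VFP code
    private readonly object gate = new object();
    private readonly List<Row> rows = new List<Row>();

    // Adds a URL only if it is not already in the table; returns true when added.
    public bool AddUrl(string newUrl)
    {
        lock (gate)
        {
            foreach (Row r in rows)
            {
                if (r.Url == newUrl) return false;
            }
            Row row = new Row();
            row.Url = newUrl;
            rows.Add(row);
            return true;
        }
    }

    // Claims the next URL this thread may download, or returns "" if there is none.
    public string GetAUrl(int thisnum)
    {
        lock (gate)
        {
            foreach (Row r in rows)
            {
                bool mine = r.ThreadNum == thisnum || r.ThreadNum == -1;
                if (r.Text.Length == 0 && r.Downed < MaxAttempts && mine)
                {
                    r.Downed++;
                    r.ThreadNum = thisnum;
                    return r.Url;
                }
            }
            return "";
        }
    }

    // Stores the downloaded content for the given URL.
    public void SaveContent(string siteUrl, string content)
    {
        lock (gate)
        {
            foreach (Row r in rows)
            {
                if (r.Url == siteUrl) { r.Text = content; return; }
            }
        }
    }

    public static void Main()
    {
        UrlTable table = new UrlTable();
        table.AddUrl("http://example.com/a");
        Console.WriteLine(table.GetAUrl(0));       // thread 0 claims the address
        Console.WriteLine(table.GetAUrl(1) == ""); // thread 1 finds nothing left to claim
    }
}
```

Because the check and the update happen inside the same lock, two threads cannot both claim one address. With a database table, the equivalent guarantee has to come from the database's own locking.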

3. It is hard to tell when a thread has finished, because it is always looking for new links. My view is that if a thread fails N times in a row to obtain a new URL, it can be assumed that all links have already been downloaded. The main code is as follows:

string url = "";
int times = 0;
while (url == "") // while no matching record has been found, keep looking for one
{
    url = Geturl.getaurl(...); // call the getaurl method to try to get a URL value
    if (url == "") // if none was found
    {
        times++; // one more attempt
        if (times > N) // enough attempts have been made
        {
            downloadthread[i].Abort(); // exit the thread (i is this thread's index)
        }
        continue; // make the next attempt
    }
    times = 0; // a URL was found: reset the attempt counter
    // proceed with the next step for the resulting URL
}
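The "give up after N consecutive failures" rule can also be written so that a worker simply returns from its method instead of aborting itself. The sketch below is one possible arrangement under stated assumptions: the queue pending and the local GetAUrl() are hypothetical stand-ins for the URL table and Geturl.getaurl(...), the counter processed stands in for the real download and analysis, and the Sleep interval is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class WorkerLoopDemo
{
    private const int N = 3; // consecutive empty results tolerated before a worker stops
    private static readonly object gate = new object();
    private static readonly Queue<string> pending = new Queue<string>();
    private static int processed;

    // Hypothetical stand-in for Geturl.getaurl(...): returns "" when nothing is available.
    private static string GetAUrl()
    {
        lock (gate)
        {
            return pending.Count > 0 ? pending.Dequeue() : "";
        }
    }

    private static void DownLoad()
    {
        int times = 0;
        while (true)
        {
            string url = GetAUrl();
            if (url == "")
            {
                times++;
                if (times > N) return; // give up: leaving the method ends the thread
                Thread.Sleep(10);      // brief pause before the next attempt
                continue;
            }
            times = 0;                            // a success resets the failure counter
            Interlocked.Increment(ref processed); // placeholder for download and analysis
        }
    }

    public static int Run(string[] urls, int threadCount)
    {
        processed = 0;
        lock (gate)
        {
            pending.Clear();
            foreach (string u in urls) pending.Enqueue(u);
        }
        Thread[] downloadthread = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            downloadthread[i] = new Thread(new ThreadStart(DownLoad));
            downloadthread[i].Start();
        }
        foreach (Thread t in downloadthread) t.Join();
        return processed;
    }

    public static void Main()
    {
        Console.WriteLine(Run(new string[] { "a", "b", "c", "d" }, 2)); // prints 4
    }
}
```

The short pause between attempts gives other threads time to add newly discovered links before this worker concludes that nothing is left.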

4. This problem is relatively simple. As suggested under question 1, the threads are declared as a class-level array, which makes them easy to control. A for loop is enough to end them. The code is as follows:

for (int i = 0; i < n; i++) // stop all n threads
{
    downloadthread[i].Abort(); // stop the threads one by one
}
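Abort() stops a thread from outside by raising an exception inside it, and on .NET Core and .NET 5 or later Thread.Abort is not supported (it throws PlatformNotSupportedException). A gentler alternative, sketched below, is a class-level flag that every worker checks. The names StopDemo, stopRequested and exitedCleanly are assumptions for illustration, and the sleep intervals are arbitrary.

```csharp
using System;
using System.Threading;

public static class StopDemo
{
    private static volatile bool stopRequested; // class-level flag polled by every worker
    private static Thread[] downloadthread = new Thread[0];
    private static int exitedCleanly;           // workers that left their loop normally

    private static void DownLoad()
    {
        while (!stopRequested)
        {
            Thread.Sleep(5); // placeholder for one unit of download work
        }
        Interlocked.Increment(ref exitedCleanly);
    }

    public static int StartThenStop(int n)
    {
        stopRequested = false;
        exitedCleanly = 0;
        downloadthread = new Thread[n];
        for (int i = 0; i < n; i++)
        {
            downloadthread[i] = new Thread(new ThreadStart(DownLoad));
            downloadthread[i].Start();
        }
        Thread.Sleep(50);     // let the workers run for a moment
        stopRequested = true; // ask all workers to finish
        for (int i = 0; i < n; i++)
        {
            downloadthread[i].Join(); // wait for each worker to leave its loop
        }
        return exitedCleanly;
    }

    public static void Main()
    {
        Console.WriteLine(StartThenStop(3) + " threads stopped");
    }
}
```

With this approach a worker always finishes its current unit of work before exiting, so a page is never left half-written.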

That completes the spider program. With C#, it turns out to be quite simple to implement.

Here I would also like to remind readers: I have only offered one line of thinking and one workable solution, which is by no means the best. Even the program itself leaves plenty of room for improvement, which I leave to the reader to think about.

Finally, the environment I used:

WinXP SP2 Pro

VFP 9.0

Visual Studio .NET 2003 Chinese Enterprise Edition


