C# Spider and Crawler (for technical discussion only)

Source: Internet
Author: User

In the article "Making crawler/spider programs (C# Language)", we introduced the basic way to implement a crawler program. That version works, but it may download slowly because of an efficiency problem, which has two causes:

1. Analysis and download cannot run at the same time. "Making crawler/spider programs (C# Language)" described the crawler's two steps: analysis and download. In a single-threaded program the two cannot overlap, so the network sits idle during analysis, and the longer the analysis takes, the lower the download efficiency. The reverse also holds: no analysis can happen during a download, and the next analysis starts only after the download stops. Stated this way, the fix is obvious: run analysis and download on different threads.

2. Downloads are single-threaded. Most readers have used a download manager such as Internet Express, which lets you set the number of threads (the default has been 5 or 10, depending on the version). It splits a file into as many parts as there are threads, and each thread downloads its own part, which can improve download efficiency. Most of us add threads for exactly that reason. Careful users will notice, however, that with fixed bandwidth the speed rises with the thread count only up to a point, where it peaks. A crawler is a specialized download tool; without multithreading, how can it be efficient? In the Information Age, the whole purpose of a crawler is to obtain information quickly. A crawler therefore needs multiple threads (in a controllable number) downloading web pages at the same time.
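As a rough illustration of the first point, the sketch below runs download and analysis on two threads connected by a queue, so that neither step waits for the other to finish a whole batch. It is only a sketch under assumptions: FakeDownload and FakeAnalyze are hypothetical stand-ins (no network access, no real link extraction), and BlockingCollection requires .NET 4 or later.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class PipelineSketch
{
    // Hypothetical stand-in for a real HTTP download: returns fake page text.
    static string FakeDownload(string url)
    {
        return "<html>content of " + url + "</html>";
    }

    // Hypothetical stand-in for link analysis: just measures the page length.
    static int FakeAnalyze(string page)
    {
        return page.Length;
    }

    public static List<int> Run(IEnumerable<string> urls)
    {
        // Bounded queue: the downloader blocks if it gets 8 pages ahead.
        var pages = new BlockingCollection<string>(8);
        var results = new List<int>();

        var downloader = new Thread(() =>
        {
            foreach (string url in urls)
            {
                pages.Add(FakeDownload(url));
            }
            pages.CompleteAdding(); // tell the analyzer no more pages will come
        });

        var analyzer = new Thread(() =>
        {
            // Blocks while the queue is empty; ends after CompleteAdding().
            foreach (string page in pages.GetConsumingEnumerable())
            {
                results.Add(FakeAnalyze(page));
            }
        });

        downloader.Start();
        analyzer.Start();
        downloader.Join();
        analyzer.Join();
        return results; // only the analyzer wrote to it, and it has finished
    }
}
```

Calling PipelineSketch.Run(new[] { "a", "bb" }) returns one length per page, in the order the pages were downloaded.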

Now that we understand the problem, we can solve it.

Multithreading is not difficult to implement in C#. The System.Threading namespace provides the classes that support multiple threads.

To start a new thread, you need the following initialization:

    ThreadStart startDownload = new ThreadStart(Download); // thread start setting: each thread executes Download(). Note: Download() must be a method without parameters.
    Thread downloadThread = new Thread(startDownload); // create the new thread object
    downloadThread.Start(); // start the thread
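For reference, here is the same initialization as a small compilable sketch. The class name, the two static fields, and the example URL are assumptions made for illustration; because Download() takes no parameters, the sketch passes data in and out through class-level fields.

```csharp
using System;
using System.Threading;

public static class StartThreadSketch
{
    // Class-level state shared with the thread (hypothetical example values).
    static string pageUrl = "http://example.com/";
    static string status = "not started";

    // Parameterless, as ThreadStart requires. A real crawler would fetch
    // pageUrl here; this placeholder only records that it ran.
    static void Download()
    {
        status = "downloaded " + pageUrl;
    }

    public static string Run()
    {
        ThreadStart startDownload = new ThreadStart(Download);
        Thread downloadThread = new Thread(startDownload);
        downloadThread.Start();
        downloadThread.Join(); // wait for the thread before reading its result
        return status;
    }
}
```

Join() is not part of the original snippet; it is added here so that the caller reads status only after the thread has finished.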

Because a method started through ThreadStart cannot take parameters, sharing resources among multiple threads is less direct. Class-level variables solve this (other approaches exist, but I find this one the easiest to use). Once you know how to start multithreaded downloads, a few questions arise:

1. How to control the number of threads?

2. How to prevent multiple threads from downloading the same webpage?

3. How to determine when a thread should end?

4. How to control the end of a thread?

Below are some solutions to these problems:

1. We can control the number of threads with a for loop, much like the simple loop programs beginners write.

For example, if n (an int variable) is the number of threads you have specified, the following code starts n threads:

    // Declare the download threads. In C# the array length need not be given
    // at declaration; it can be set when the array is created. Declare this at
    // class level so that other methods can control the threads.
    Thread[] downloadThread;
    ThreadStart startDownload = new ThreadStart(Download); // thread start setting: each thread executes Download()
    downloadThread = new Thread[n]; // allocate the array and fix the total number of threads
    for (int i = 0; i < n; i++) // start the specified number of threads
    {
        downloadThread[i] = new Thread(startDownload); // assign the thread start setting
        downloadThread[i].Start(); // start the threads one by one
    }

As you can see, controlling the number of threads started is easy.
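A compilable version of this loop might look like the sketch below. The class name and the pagesDone counter are assumptions added for illustration: the counter is a class-level field that every thread updates, and Interlocked.Increment keeps the concurrent updates from being lost.

```csharp
using System;
using System.Threading;

public static class FixedThreadsSketch
{
    static Thread[] downloadThread; // class level, so other methods can reach it
    static int pagesDone;           // shared by all threads (hypothetical counter)

    // Placeholder for the real download work: each thread counts one page.
    static void Download()
    {
        Interlocked.Increment(ref pagesDone);
    }

    public static int Run(int n)
    {
        pagesDone = 0;
        ThreadStart startDownload = new ThreadStart(Download);
        downloadThread = new Thread[n];

        for (int i = 0; i < n; i++)
        {
            downloadThread[i] = new Thread(startDownload);
            downloadThread[i].Name = "download-" + i; // a name helps when debugging
            downloadThread[i].Start();
        }

        // Wait for every thread so the count is final when we return.
        for (int i = 0; i < n; i++)
        {
            downloadThread[i].Join();
        }
        return pagesDone;
    }
}
```

With this placeholder Download(), FixedThreadsSketch.Run(5) starts five threads and returns 5.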

2. The next problem: all threads call the Download() method, so how do we keep them from downloading the same web page at the same time?

This can be solved with a URL address table in which each address can be claimed by only one thread. The implementation is as follows.

You can build the table in a database. It has four columns: one stores the URL address, two store the thread assigned to the address and the number of times the address has been requested, and the last stores the downloaded content. (The thread column is optional.) When a thread claims an address, the thread column is set to the current thread's number and the request count is incremented, so no other thread can claim that page. If the download succeeds, the content is saved to the content column. If it fails, the content column stays empty, which is one of the criteria for downloading again. If the download keeps failing, the thread retries until the retry limit is reached (the maximum number of requests per address, which the user can set) and then claims the next URL address. The main code is as follows (using VFP as an example):

<Create table>

    CREATE TABLE (ctablename) (curl M, ctext M, ldowned I, threadnum I) && create the table ctablename.DBF with four fields: address, text content, number of download attempts, and thread flag (initial value -1; thread flags are integers starting from 0)

<Extract URL address>

    cfullname = (ctablename) + '.dbf' && add the extension to the table name
    USE (cfullname)
    GO TOP
    LOCATE FOR EMPTY(ALLTRIM(ctext)) AND ldowned < 2 AND (threadnum = thisnum OR threadnum = -1) && find a URL that has not been downloaded successfully and that this thread is allowed to download; thisnum is the current thread's number and can be passed in as a parameter
    goturl = curl
    recnum = RECNO()
    IF recnum <= RECCOUNT() THEN && if such a URL was found in the list
        UPDATE (cfullname) SET ldowned = (ldowned + 1), threadnum = thisnum WHERE RECNO() = recnum && mark the record as claimed: add 1 to the download count and set the thread flag to this thread's number
    ENDIF

<Download content>

    cfulltablename = (ctablename) + '.dbf'
    USE (cfulltablename)
    SET EXACT ON
    LOCATE FOR curl = (csiteurl) && csiteurl is a parameter: the URL address the downloaded content belongs to
    recnumnow = RECNO() && get the number of the record containing this address
    UPDATE (cfulltablename) SET ctext = (ccontent) WHERE RECNO() = recnumnow && store the content for this address

<Insert a new address>

    ctablename = (ctablename) + '.dbf'
    USE (ctablename)
    GO TOP
    SET EXACT ON
    LOCATE FOR curl = (cnewurl) && check whether this address is already in the table
    IF RECNO() > RECCOUNT() THEN && if this address does not exist yet
        SET CARRY OFF
        INSERT INTO (ctablename) (curl, ctext, ldowned, threadnum) VALUES (cnewurl, "", 0, -1) && add the new address to the list
    ENDIF

This resolves the conflicts between threads. De-duplication can of course also be done in C# itself: create a temporary text file in the root directory that stores all URL addresses, with the corresponding attributes for each one, although lookups may be less efficient than with a database.
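As one possible C# counterpart to the table, the sketch below keeps the same four pieces of information in memory instead of in a file or database. Everything in it is an assumption for illustration: the UrlTable class, its method names, and the use of a Dictionary guarded by a lock. The lock makes "find an unclaimed address and mark it" a single step, so two threads cannot claim the same row between the lookup and the update.

```csharp
using System;
using System.Collections.Generic;

public class UrlTable
{
    // One row per URL, mirroring the ctext, ldowned and threadnum fields.
    private class Entry
    {
        public string Text = "";
        public int Downed = 0;
        public int ThreadNum = -1; // -1 means not yet claimed by any thread
    }

    private readonly Dictionary<string, Entry> rows = new Dictionary<string, Entry>();
    private readonly object gate = new object();
    private readonly int maxTries;

    public UrlTable(int maxTries)
    {
        this.maxTries = maxTries;
    }

    // Insert a new address; returns false if it is already in the table.
    public bool Add(string url)
    {
        lock (gate)
        {
            if (rows.ContainsKey(url)) return false;
            rows[url] = new Entry();
            return true;
        }
    }

    // Claim an address for thread thisNum; returns "" when none is available.
    public string Claim(int thisNum)
    {
        lock (gate)
        {
            foreach (KeyValuePair<string, Entry> row in rows)
            {
                Entry e = row.Value;
                bool allowed = e.ThreadNum == thisNum || e.ThreadNum == -1;
                if (e.Text == "" && e.Downed < maxTries && allowed)
                {
                    e.Downed++;
                    e.ThreadNum = thisNum;
                    return row.Key;
                }
            }
            return "";
        }
    }

    // Store the downloaded content for an address.
    public void Save(string url, string content)
    {
        lock (gate)
        {
            Entry e;
            if (rows.TryGetValue(url, out e)) e.Text = content;
        }
    }
}
```

Passing 2 as maxTries matches the ldowned < 2 condition in the VFP code. A thread would call Claim with its own number, download the returned address, and call Save on success; a failed download simply leaves the content empty so the same thread can claim the address again until the limit is reached.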

3. Deciding when a thread should end is difficult, because the thread is always searching for new links. One workable rule: if a thread has failed to obtain a new URL after N consecutive attempts, assume that all links have been downloaded. The main code is as follows:

    string url = "";
    int times = 0;
    while (true) // keep looking for a URL until the attempt limit is reached
    {
        url = geturl.getaurl(......); // call the getaurl method to obtain a URL value
        if (url == "") // if none was found
        {
            times++; // increase the number of attempts
            if (times > N) // the attempt limit has been reached
            {
                break; // leave the loop; the thread ends when its start method returns
            }
            continue; // make the next attempt
        }
        times = 0; // a URL was found, so reset the attempt count
        // Process the URL.
    }
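The same give-up rule can be written as a self-contained sketch. The names here are assumptions: a ConcurrentQueue stands in for the URL table, maxEmptyTries plays the role of N, and "processing" a URL is reduced to counting it. The short sleep between empty attempts is also an addition, so that an idle thread does not spin at full speed while other threads may still be adding links.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class WorkerLoopSketch
{
    // Processes URLs until the queue has been found empty more than
    // maxEmptyTries times in a row; returns how many URLs were processed.
    public static int Work(ConcurrentQueue<string> pending, int maxEmptyTries)
    {
        int processed = 0;
        int times = 0;

        while (true)
        {
            string url;
            if (!pending.TryDequeue(out url))
            {
                times++;                         // one more empty attempt
                if (times > maxEmptyTries) break; // assume everything is done
                Thread.Sleep(10);                // brief pause before retrying
                continue;
            }

            times = 0;   // found work, so reset the empty-attempt count
            processed++; // placeholder for downloading and analyzing url
        }
        return processed;
    }
}
```

Returning from the method is what ends the thread, so no call to Abort is needed inside the worker.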

4. This one is relatively simple: question 1 recommended declaring the threads as a class-level array, which makes them easy to control. A for loop can end them all. The code is as follows:

    for (int i = 0; i < n; i++) // stop all n threads
    {
        downloadThread[i].Abort(); // stop the threads one by one
    }
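One caveat: Thread.Abort works only on the .NET Framework; on .NET Core and .NET 5 or later it throws PlatformNotSupportedException. A common alternative is to ask the threads to stop and let each one return by itself. The sketch below shows that idea with a CancellationTokenSource; it is a swapped-in technique rather than the approach described above, and the class name, the exited counter, and the sleep that stands in for downloading are all assumptions.

```csharp
using System;
using System.Threading;

public static class StopThreadsSketch
{
    // Starts n worker threads, asks them all to stop, and returns how many
    // exited cleanly.
    public static int RunAndStop(int n)
    {
        var cts = new CancellationTokenSource();
        int exited = 0;
        Thread[] downloadThread = new Thread[n];

        for (int i = 0; i < n; i++)
        {
            downloadThread[i] = new Thread(() =>
            {
                // Check the stop request between units of work.
                while (!cts.Token.IsCancellationRequested)
                {
                    Thread.Sleep(5); // stand-in for downloading one page
                }
                Interlocked.Increment(ref exited);
            });
            downloadThread[i].Start();
        }

        cts.Cancel(); // request a stop; the threads notice at their next check

        for (int i = 0; i < n; i++)
        {
            downloadThread[i].Join(); // wait until each thread has returned
        }
        return exited;
    }
}
```

Because each thread finishes its current unit of work before checking the flag, a page is never abandoned halfway through, which an abort cannot guarantee.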

With that, the spider program is complete. With C#, the implementation really is this simple.

I would like to remind readers that this article offers only an idea and one feasible solution, not the best one. Even within this solution, there is plenty of room for improvement, which is left for readers to think about.
