Implementing multithreaded control of a spider/crawler program in C#

Source: Internet
Author: User
Tags: thread

The article "Crawler/Spider Program Production (C# language)" introduced the basic methods for implementing a crawler, and the crawler described there is functionally complete. It does, however, have an efficiency problem: downloads may be slow. There are two reasons for this:

1. Analysis and downloading cannot proceed at the same time. The article "Crawler/Spider Program Production (C# language)" introduced the crawler's two steps: analysis and download. In a single-threaded program, the two cannot run concurrently. In other words, the network sits idle during analysis, so the longer the analysis takes, the lower the download efficiency. The reverse also holds: nothing can be analyzed while a download is in progress, and the next analysis can start only after the download stops. Once the problem is stated this way, I think everyone will have the same idea: wouldn't running analysis and download in different threads solve it?

2. Downloading uses only a single thread. I expect everyone has used a download tool to fetch resources from the Internet, and such tools let you set the number of threads (in recent versions the default is 10; earlier it was 5). The tool splits the file into as many parts as there are threads, and each thread downloads its own part, which is likely to improve download efficiency. I believe everyone has seen multithreading speed up a download. Careful users will notice, however, that for a given bandwidth more threads do not always mean more speed; throughput peaks at a certain thread count. A crawler is a specialized download tool, so how can it claim to be efficient without multithreading? Isn't the whole purpose of a crawler in the information age to obtain information quickly? The crawler therefore needs multiple threads (with a controllable count) downloading web pages at the same time.
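The first point, overlapping analysis with download, can be sketched as a two-thread pipeline. This is only an illustration under stated assumptions, not the article's implementation: the class name PipelineDemo, the method Analyze(), and the fake page strings are hypothetical, Thread.Sleep stands in for real network and parsing work, and the hand-off uses BlockingCollection<T> from System.Collections.Concurrent, a technique the article itself does not mention. The threading calls come from the System.Threading namespace that the article introduces further on.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public class PipelineDemo
{
    // Pages handed from the download thread to the analysis thread.
    private static BlockingCollection<string> pages;
    private static int analyzed; // written only by the analysis thread

    // Hypothetical downloader: produces three fake pages.
    private static void DownLoad()
    {
        for (int i = 0; i < 3; i++)
        {
            Thread.Sleep(50); // stands in for network latency
            pages.Add("<html>page " + i + "</html>");
        }
        pages.CompleteAdding(); // signal that no more pages will arrive
    }

    // Hypothetical analyzer: consumes pages while downloads continue.
    private static void Analyze()
    {
        // Blocks until a page is available; ends after CompleteAdding().
        foreach (string page in pages.GetConsumingEnumerable())
        {
            Thread.Sleep(20); // stands in for link extraction
            analyzed++;
        }
    }

    public static int Run()
    {
        pages = new BlockingCollection<string>();
        analyzed = 0;
        Thread downloader = new Thread(new ThreadStart(DownLoad));
        Thread analyzer = new Thread(new ThreadStart(Analyze));
        downloader.Start();
        analyzer.Start();
        downloader.Join(); // wait for both threads to finish
        analyzer.Join();
        return analyzed;
    }

    public static void Main()
    {
        Console.WriteLine("Pages analyzed: " + Run());
    }
}
```

While the analyzer works on one page, the downloader is already fetching the next, so neither the network nor the CPU has to wait for the other step to finish completely.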

Now that the problem has been identified and analyzed, the next step is to solve it:

Multithreading is not difficult to implement in C#. The System.Threading namespace provides multithreading support.

To start a new thread, the following initialization is needed:

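ThreadStart startDownload = new ThreadStart( DownLoad );
// Thread start setup: every thread executes DownLoad(). Note: DownLoad() must be a method that takes no parameters
Thread downloadThread = new Thread( startDownload ); // instantiate the new thread to be started
downloadThread.Start(); // start the thread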
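For readers who want to run the snippet above, here is a minimal self-contained sketch. The class name SingleThreadDemo, the Finished flag, and the body of DownLoad() are assumptions made for illustration; the article does not show what DownLoad() actually does, so Thread.Sleep stands in for the download work.

```csharp
using System;
using System.Threading;

public static class SingleThreadDemo
{
    // Class-level flag so the caller can see that DownLoad() ran.
    public static volatile bool Finished;

    // Stand-in for the article's DownLoad(); it takes no parameters,
    // as the ThreadStart delegate requires.
    public static void DownLoad()
    {
        Thread.Sleep(100); // placeholder for real download work
        Finished = true;
    }

    public static void Main()
    {
        ThreadStart startDownload = new ThreadStart(DownLoad);
        Thread downloadThread = new Thread(startDownload);
        downloadThread.Start();
        downloadThread.Join(); // block until the thread has finished
        Console.WriteLine("Download finished: " + Finished);
    }
}
```

The call to Join() is an addition for the demo: without it, Main could print the flag before the new thread has done its work.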

Because the method wrapped by a ThreadStart delegate cannot take parameters, sharing resources among threads becomes more troublesome. We can, however, solve this with class-level variables (there are of course other ways, but I think this one is the easiest to use). Once you know how to start multithreaded downloads, several questions may come to mind:

1. How do we control the number of threads?

2. How do we prevent multiple threads from downloading the same web page?

3. How do we determine that a thread has ended?

4. How do we control when a thread ends?
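Before turning to these questions, the earlier point about passing data to a parameterless thread method can be sketched as follows. This is an illustration only: the class name SharedStateDemo, the fields sync, log and baseUrl, the example.com URLs, and the method DownLoadWithArg() are all hypothetical. The second thread shows one of the "other ways", the ParameterizedThreadStart delegate from System.Threading, which passes a single object argument to the thread method.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class SharedStateDemo
{
    // Class-level variables: visible to every thread in this class.
    private static readonly object sync = new object();
    private static readonly List<string> log = new List<string>();
    private static string baseUrl = "http://example.com/"; // hypothetical URL

    // Parameterless, so it fits ThreadStart; its input comes from a class-level field.
    private static void DownLoad()
    {
        lock (sync) { log.Add("ThreadStart saw " + baseUrl); }
    }

    // Alternative: ParameterizedThreadStart hands over one object argument.
    private static void DownLoadWithArg(object url)
    {
        lock (sync) { log.Add("Parameterized saw " + (string)url); }
    }

    public static List<string> Run()
    {
        lock (sync) { log.Clear(); }
        Thread a = new Thread(new ThreadStart(DownLoad));
        Thread b = new Thread(new ParameterizedThreadStart(DownLoadWithArg));
        a.Start();
        b.Start("http://example.com/page"); // argument is passed to DownLoadWithArg
        a.Join();
        b.Join();
        lock (sync) { return new List<string>(log); }
    }

    public static void Main()
    {
        foreach (string line in Run()) Console.WriteLine(line);
    }
}
```

The lock around every access to log matters: class-level variables are shared by all threads, so unsynchronized writes to the list could corrupt it. The order of the two log lines is not guaranteed, since the threads run concurrently.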

Here are some solutions to these issues:

1. The number of threads can be controlled with a for loop, just like the loops we wrote when first learning to program.

For example, suppose the user has specified n threads (n is an int variable). The following code starts n threads.

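Thread[] downloadThread;
// Declare the download threads. An advantage of C# is that the array length need not be given at declaration; it can be set when the array is created.
// This declaration should be at class level, so that other methods can also control the threads.
ThreadStart startDownload = new ThreadStart( DownLoad );
// Thread start setup: every thread executes DownLoad()
downloadThread = new Thread[ n ]; // allocate the thread array, fixing the total number of threads
for( int i = 0; i < n; i++ ) // start the specified number of threads
{
    downloadThread[i] = new Thread( startDownload ); // assign the thread start setup
    downloadThread[i].Start(); // start the threads one by one
}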

As you can see, controlling the number of threads that are started is easy.
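The loop above can be placed in a complete program roughly as follows. This is a sketch under assumptions: the class name CrawlerThreads, the method StartAndWait(), and the completed counter are hypothetical, and DownLoad() only sleeps in place of a real page download. The Join() loop is an addition so the demo can report how many threads finished.

```csharp
using System;
using System.Threading;

public class CrawlerThreads
{
    // Class-level, so other methods can reach the threads later.
    private static Thread[] downloadThread;
    private static int completed; // number of threads that have finished

    // Stand-in for the article's DownLoad().
    private static void DownLoad()
    {
        Thread.Sleep(50); // placeholder for downloading a page
        Interlocked.Increment(ref completed); // thread-safe increment
    }

    public static int StartAndWait(int n)
    {
        completed = 0;
        ThreadStart startDownload = new ThreadStart(DownLoad);
        downloadThread = new Thread[n]; // fix the total number of threads
        for (int i = 0; i < n; i++)
        {
            downloadThread[i] = new Thread(startDownload);
            downloadThread[i].Start();
        }
        // Wait for every thread before reading the counter.
        foreach (Thread t in downloadThread) t.Join();
        return completed;
    }

    public static void Main()
    {
        Console.WriteLine("Threads completed: " + StartAndWait(5));
    }
}
```

Interlocked.Increment is used because several threads update the same class-level counter; a plain completed++ could lose updates when two threads increment at the same moment.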
