GJM: Using C# to Implement a Web Crawler (1) [reprint]


Web crawlers play a major role in information retrieval and processing, and are an important tool for collecting network information.

Next, we will walk through a simple crawler implementation.

The crawler workflow is as follows:

The crawler downloads network resources starting from a specified URL, until the resources at that address and at all of its sub-addresses have been downloaded.

Next, we will analyze the crawler implementation step by step.

 

1. The to-download and downloaded sets

To keep track of the URLs waiting to be downloaded and to prevent repeated downloads, we need two sets: one for URLs still to be downloaded, and one for URLs already downloaded.

Because some other URL-related information, such as depth, needs to be stored along with each URL, I use a Dictionary to hold these URLs.

The specific type is Dictionary<string, int>, where the string is the URL and the int is the depth of that URL relative to the base URL.

At each step, check the to-download set. If it is empty, the download is complete. If it still contains URLs, take the first one, add it to the downloaded set, and download that URL's resource.
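The two dictionaries and the take-next-URL step described above can be sketched as follows. This is a minimal illustration, not the article's exact code; the class name UrlFrontier and method names AddUrl/TakeNext are assumptions, modeled on the _urlsUnload/_urlsLoaded fields used later in the article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class UrlFrontier
{
    // URL -> depth, waiting to be downloaded
    private Dictionary<string, int> _urlsUnload = new Dictionary<string, int>();
    // URL -> depth, already downloaded
    private Dictionary<string, int> _urlsLoaded = new Dictionary<string, int>();

    // Add a URL only if it has never been seen in either set,
    // which prevents repeated downloads.
    public void AddUrl(string url, int depth)
    {
        if (!_urlsUnload.ContainsKey(url) && !_urlsLoaded.ContainsKey(url))
        {
            _urlsUnload[url] = depth;
        }
    }

    // Take the first pending URL, move it to the downloaded set, and return it;
    // returns null when everything has been downloaded.
    public string TakeNext(out int depth)
    {
        depth = 0;
        if (_urlsUnload.Count == 0)
        {
            return null; // download is complete
        }
        var first = _urlsUnload.First();
        depth = first.Value;
        _urlsLoaded.Add(first.Key, first.Value);
        _urlsUnload.Remove(first.Key);
        return first.Key;
    }
}
```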

 

2. HTTP request and response

C# provides the HttpWebRequest and HttpWebResponse classes, which encapsulate HTTP requests and responses, so the implementation is much easier.

To improve download efficiency, we can issue multiple concurrent requests and download resources from several URLs at the same time. A simple way to do this is with asynchronous requests.

You can use the following method to control the number of concurrent requests:

 1 private void DispatchWork()
 2 {
 3     if (_stop) // determine whether to abort the download
 4     {
 5         return;
 6     }
 7     for (int i = 0; i < _reqCount; i++)
 8     {
 9         if (!_reqsBusy[i]) // is this worker instance idle?
10         {
11             RequestResource(i); // ask the worker instance to request a resource
12         }
13     }
14 }

Because no new threads are explicitly created, a "worker instance" is used to represent a logical worker thread.

1 private bool[] _reqsBusy = null; // each element indicates whether a worker instance is busy
2 private int _reqCount = 4; // number of worker instances

Every time a worker instance completes its work, the corresponding element of _reqsBusy is set to false and DispatchWork is called, so that DispatchWork can assign new tasks to idle instances.

 

Next, send the request.

 1 private void RequestResource(int index)
 2 {
 3     int depth;
 4     string url = "";
 5     try
 6     {
 7         lock (_locker)
 8         {
 9             if (_urlsUnload.Count <= 0) // any URLs left to download?
10             {
11                 _workingSignals.FinishWorking(index); // set the worker instance's state to Finished
12                 return;
13             }
14             _reqsBusy[index] = true;
15             _workingSignals.StartWorking(index); // set the working state to Working
16             depth = _urlsUnload.First().Value; // take the first undownloaded URL
17             url = _urlsUnload.First().Key;
18             _urlsLoaded.Add(url, depth); // add the URL to the downloaded set
19             _urlsUnload.Remove(url); // remove it from the to-download set
20         }
21
22         HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
23         req.Method = _method; // request method
24         req.Accept = _accept; // accepted content
25         req.UserAgent = _userAgent; // user agent
26         RequestState rs = new RequestState(req, url, depth, index); // callback method parameter
27         var result = req.BeginGetResponse(new AsyncCallback(ReceivedResource), rs); // asynchronous request
28         ThreadPool.RegisterWaitForSingleObject(result.AsyncWaitHandle, // register a timeout handler
29             TimeoutCallback, rs, _maxTime, true);
30     }
31     catch (WebException we)
32     {
33         MessageBox.Show("RequestResource " + we.Message + url + we.Status);
34     }
35 }

Line 7 takes a mutex lock to keep concurrent tasks synchronized; _locker is a member variable of type object.

Line 9 checks whether the to-download set is empty. If it is, the current worker instance's state is set to Finished; if not, it is set to Working and a URL is taken out to start downloading. When all worker instances are Finished, the download is complete. Since DispatchWork is called every time a URL finishes downloading, Finished worker instances may be activated to start working again.
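The _workingSignals object with its StartWorking, FinishWorking, and IsFinished members is used throughout but never shown in this part. A minimal sketch (the class name WorkingUnitCollection and its internals are assumptions) could be:

```csharp
// Tracks which worker instances are currently working;
// all workers Finished means the whole download is complete.
class WorkingUnitCollection
{
    private bool[] _working; // one flag per worker instance

    public WorkingUnitCollection(int count)
    {
        _working = new bool[count];
    }

    public void StartWorking(int index)  { _working[index] = true; }
    public void FinishWorking(int index) { _working[index] = false; }

    public bool IsFinished()
    {
        foreach (bool w in _working)
        {
            if (w) return false; // at least one worker is still busy
        }
        return true;
    }
}
```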

Line 26 packs the request's additional information into the parameter for the asynchronous callback; it will be discussed later.

Line 27 starts the asynchronous request. A callback method must be passed in to handle the response, along with the parameter for that callback.

Line 28 registers a timeout handler, TimeoutCallback, for the asynchronous request. The maximum wait time is _maxTime, and the timeout is handled only once. The request's additional information is passed in as the callback's parameter.

 

RequestState is defined as follows:

 1 class RequestState
 2 {
 3     private const int BUFFER_SIZE = 131072; // size of the receive buffer
 4     private byte[] _data = new byte[BUFFER_SIZE]; // buffer for received packets
 5     private StringBuilder _sb = new StringBuilder(); // accumulates all received characters
 6
 7     public HttpWebRequest Req { get; private set; } // the request
 8     public string Url { get; private set; } // requested URL
 9     public int Depth { get; private set; } // depth of this request relative to the base URL
10     public int Index { get; private set; } // worker instance id
11     public Stream ResStream { get; set; } // response data stream
12     public StringBuilder Html
13     {
14         get
15         {
16             return _sb;
17         }
18     }
19
20     public byte[] Data
21     {
22         get
23         {
24             return _data;
25         }
26     }
27
28     public int BufferSize
29     {
30         get
31         {
32             return BUFFER_SIZE;
33         }
34     }
35
36     public RequestState(HttpWebRequest req, string url, int depth, int index)
37     {
38         Req = req;
39         Url = url;
40         Depth = depth;
41         Index = index;
42     }
43 }

TimeoutCallback is defined as follows:

 1 private void TimeoutCallback(object state, bool timedOut)
 2 {
 3     if (timedOut) // did the request time out?
 4     {
 5         RequestState rs = state as RequestState;
 6         if (rs != null)
 7         {
 8             rs.Req.Abort(); // cancel the request
 9             _reqsBusy[rs.Index] = false; // reset the working state (kept inside the null check to avoid a NullReferenceException)
10             DispatchWork(); // assign a new task
11         }
12     }
13 }

 

The next step is to process the request response.

 1 private void ReceivedResource(IAsyncResult ar)
 2 {
 3     RequestState rs = (RequestState)ar.AsyncState; // retrieve the parameter
 4     HttpWebRequest req = rs.Req;
 5     string url = rs.Url;
 6     try
 7     {
 8         HttpWebResponse res = (HttpWebResponse)req.EndGetResponse(ar); // get the response
 9         if (_stop) // determine whether to abort the download
10         {
11             res.Close();
12             req.Abort();
13             return;
14         }
15         if (res != null && res.StatusCode == HttpStatusCode.OK) // was the response successful?
16         {
17             Stream resStream = res.GetResponseStream(); // get the resource stream
18             rs.ResStream = resStream;
19             var result = resStream.BeginRead(rs.Data, 0, rs.BufferSize, // asynchronously read the data
20                 new AsyncCallback(ReceivedData), rs);
21         }
22         else // the response failed
23         {
24             res.Close();
25             rs.Req.Abort();
26             _reqsBusy[rs.Index] = false; // reset the working state
27             DispatchWork(); // assign a new task
28         }
29     }
30     catch (WebException we)
31     {
32         MessageBox.Show("ReceivedResource " + we.Message + url + we.Status);
33     }
34 }

Line 19 reads the data stream asynchronously because we used an asynchronous request; otherwise the data cannot be received properly.

This asynchronous read proceeds packet by packet: as soon as a packet is received, the callback method ReceivedData is invoked, and the received data is processed there.

The call also passes in rs.Data, the buffer that receives the data, and rs.BufferSize, its size.

 

The next step is to receive and process data.

 1 private void ReceivedData(IAsyncResult ar)
 2 {
 3     RequestState rs = (RequestState)ar.AsyncState; // retrieve the parameter
 4     HttpWebRequest req = rs.Req;
 5     Stream resStream = rs.ResStream;
 6     string url = rs.Url;
 7     int depth = rs.Depth;
 8     string html = null;
 9     int index = rs.Index;
10     int read = 0;
11
12     try
13     {
14         read = resStream.EndRead(ar); // get the result of the read
15         if (_stop) // determine whether to abort the download
16         {
17             rs.ResStream.Close();
18             req.Abort();
19             return;
20         }
21         if (read > 0)
22         {
23             MemoryStream ms = new MemoryStream(rs.Data, 0, read); // wrap the received data in a memory stream
24             StreamReader reader = new StreamReader(ms, _encoding);
25             string str = reader.ReadToEnd(); // read all characters
26             rs.Html.Append(str); // append them to what was received so far
27             var result = resStream.BeginRead(rs.Data, 0, rs.BufferSize, // asynchronously read the next packet
28                 new AsyncCallback(ReceivedData), rs);
29             return;
30         }
31         html = rs.Html.ToString();
32         SaveContents(html, url); // save locally
33         string[] links = GetLinks(html); // extract the links
34         AddUrls(links, depth + 1); // filter the links and add them to the to-download set
35
36         _reqsBusy[index] = false; // reset the working state
37         DispatchWork(); // assign a new task
38     }
39     catch (WebException we)
40     {
41         MessageBox.Show("ReceivedData Web " + we.Message + url + we.Status);
42     }
43 }

Line 14 obtains the size of the data that was read. If read > 0, more data may still remain, so line 27 issues another request to read the next packet;

if read <= 0, all data has been received. rs.Html then holds the complete HTML, and we can proceed to the next step.

Line 26 appends the newly received string to the end of the previously saved string, eventually yielding the complete HTML string.
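The GetLinks and AddUrls helpers called above are not shown in this part. A minimal, regex-based sketch of GetLinks is given below; the class name LinkExtractor, the regex pattern, and the restriction to absolute http(s) links are all assumptions for illustration, not the article's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LinkExtractor
{
    // Hypothetical sketch of GetLinks: scan the HTML for href attributes
    // and return the absolute http(s) URLs found.
    public static string[] GetLinks(string html)
    {
        const string pattern = @"href\s*=\s*[""'](?<url>https?://[^""'\s>]+)[""']";
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links.ToArray();
    }
}
```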

 

Then let's take a look at how all tasks are processed.

 1 private void StartDownload()
 2 {
 3     _checkTimer = new Timer(new TimerCallback(CheckFinish), null, 0, 300);
 4     DispatchWork();
 5 }
 6
 7 private void CheckFinish(object param)
 8 {
 9     if (_workingSignals.IsFinished()) // are all worker instances Finished?
10     {
11         _checkTimer.Dispose(); // stop the timer
12         _checkTimer = null;
13         if (DownloadFinish != null && _ui != null) // is the completion event registered?
14         {
15             _ui.Dispatcher.Invoke(DownloadFinish, _index); // raise the event
16         }
17     }
18 }

Line 3 creates a timer that calls CheckFinish every 300 ms to check whether all tasks are complete.
Line 15 raises a completion event that client programs can register for; _index stores the number of URLs downloaded so far.

The event is defined as follows:

1 public delegate void DownloadFinishHandler(int count);
2
3 /// <summary>
4 /// Raised after downloading and analysis are complete.
5 /// </summary>
6 public event DownloadFinishHandler DownloadFinish = null;
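A client program can subscribe to this event before starting the crawl. The sketch below is a self-contained, hypothetical illustration: the Spider class name and its SimulateFinish method are assumptions that stand in for the real crawler, which would raise DownloadFinish from CheckFinish as shown above.

```csharp
using System;

public delegate void DownloadFinishHandler(int count);

// Assumed name for the crawler class that owns the DownloadFinish event.
class Spider
{
    public event DownloadFinishHandler DownloadFinish = null;

    // Stands in for a real crawl: raises the event as CheckFinish would.
    public void SimulateFinish(int count)
    {
        if (DownloadFinish != null)
        {
            DownloadFinish(count);
        }
    }
}

class Demo
{
    static void Main()
    {
        var spider = new Spider();
        spider.DownloadFinish += count => Console.WriteLine("Downloaded {0} URLs.", count);
        spider.SimulateFinish(42);
    }
}
```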
GJM: Reproduced from http://www.cnblogs.com/Jiajun/archive/2012/06/17/2552458.html. If there are any copyright problems, please contact me at 993056011@163.com.
