GJM: Implementing a Web Crawler with C# (II)


Web crawlers play a large role in information retrieval and processing; they are an important tool for collecting network information.

The following introduces a simple implementation of such a crawler.

The crawler's workflow is as follows:

Starting from a specified URL, the crawler downloads network resources until the resource at that address and at all of its child addresses have been downloaded.
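Conceptually, this boils down to the loop below (a rough sketch only, not the article's code: TakeNextUrl, Download, and AddUrl are hypothetical helper names, and the real implementation that follows replaces this loop with asynchronous requests):

while (_urlsUnload.Count > 0)              // while URLs remain to be downloaded
{
    string url = TakeNextUrl();            // move one URL from the to-download set to the downloaded set
    string html = Download(url);           // fetch the resource at that URL
    foreach (string link in GetLinks(html))
    {
        AddUrl(link);                      // queue child addresses, skipping duplicates
    }
}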

The implementation of the crawler is analyzed step by step below.

1. The to-download collection and the downloaded collection

To record which URLs still need to be downloaded, and to prevent duplicate downloads, we need two separate collections: one for the URLs waiting to be downloaded and one for the URLs that have already been downloaded.

Because each URL is saved together with some related information, such as its depth, I use a Dictionary to store these URLs.

The concrete type is Dictionary<string, int>, where the string is the URL and the int is the depth of that URL relative to the base URL.
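For example, the two collections might be declared as follows (a sketch; the field names _urlsUnload and _urlsLoaded match the code shown later):

private Dictionary<string, int> _urlsUnload = new Dictionary<string, int>(); // to download: URL -> depth
private Dictionary<string, int> _urlsLoaded = new Dictionary<string, int>(); // already downloaded: URL -> depth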

The to-download collection is checked each time a task starts: if it is empty, everything has been downloaded; if it still contains a URL, the first one is taken out, added to the downloaded collection, and the resource at that URL is downloaded.

2. HTTP Requests and Responses

C# already provides wrapped HTTP request and response classes, HttpWebRequest and HttpWebResponse, which make the implementation much easier.
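As a quick illustration of the two classes, a minimal synchronous fetch might look like this (a sketch assuming using System.Net and System.IO; the crawler itself uses the asynchronous methods instead):

HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://example.com/");
using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())        // blocks until the response arrives
using (StreamReader reader = new StreamReader(res.GetResponseStream()))
{
    string html = reader.ReadToEnd(); // the page content
}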

To improve download efficiency, we can run multiple requests concurrently and download the resources of several URLs at the same time; a simple way to do this is to use asynchronous requests.

The degree of concurrency can be controlled as follows:

private void DispatchWork()
{
    if (_stop) // check whether the download should be aborted
    {
        return;
    }
    for (int i = 0; i < _reqCount; i++)
    {
        if (!_reqsBusy[i]) // check whether this work instance is free
        {
            RequestResource(i); // let this work instance request a resource
        }
    }
}

Because no new threads are explicitly created, a "work instance" is used to represent a logical worker thread:

private bool[] _reqsBusy = null; // each element indicates whether a work instance is busy
private int _reqCount = 4;       // number of work instances

Each time a work instance completes its work, the corresponding element of _reqsBusy is set to false and DispatchWork is called, so that DispatchWork can assign a new task to the idle instance.

Next comes sending the request:

 1 private void RequestResource(int index)
 2 {
 3     int depth;
 4     string url = "";
 5     try
 6     {
 7         lock (_locker)
 8         {
 9             if (_urlsUnload.Count <= 0) // check whether any URLs are left to download
10             {
11                 _workingSignals.FinishWorking(index); // set the work instance's state to Finished
12                 return;
13             }
14             _reqsBusy[index] = true;
15             _workingSignals.StartWorking(index); // set the working state to Working
16             depth = _urlsUnload.First().Value; // take the first URL that has not been downloaded
17             url = _urlsUnload.First().Key;
18             _urlsLoaded.Add(url, depth); // add the URL to the downloaded collection
19             _urlsUnload.Remove(url); // remove the URL from the to-download collection
20         }
21
22         HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
23         req.Method = _method; // request method
24         req.Accept = _accept; // accepted content
25         req.UserAgent = _userAgent; // user agent
26         RequestState rs = new RequestState(req, url, depth, index); // parameter for the callback method
27         var result = req.BeginGetResponse(new AsyncCallback(ReceivedResource), rs); // asynchronous request
28         ThreadPool.RegisterWaitForSingleObject(result.AsyncWaitHandle, // register the timeout handler
29             TimeoutCallback, rs, _maxTime, true);
30     }
31     catch (WebException we)
32     {
33         MessageBox.Show("RequestResource " + we.Message + url + we.Status);
34     }
35 }

Line 7 takes a mutex lock to keep the concurrent tasks synchronized; _locker is a member variable of type object.

Line 9 checks whether the to-download collection is empty. If it is, the current work instance's state is set to Finished; if not, the state is set to Working and a URL is taken out to start downloading. When every work instance is Finished, the whole download is complete. Because DispatchWork is called after each URL finishes downloading, a Finished work instance may be activated and start working again.
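The _workingSignals object that tracks these states is not defined in this part of the article. A minimal sketch consistent with the calls used here (StartWorking, FinishWorking, IsFinished) might look like this:

class WorkingUnitCollection
{
    private bool[] _working; // one flag per work instance, true while it is working

    public WorkingUnitCollection(int count)
    {
        _working = new bool[count];
    }

    public void StartWorking(int index)
    {
        _working[index] = true;
    }

    public void FinishWorking(int index)
    {
        _working[index] = false;
    }

    public bool IsFinished() // finished only when no instance is still working
    {
        foreach (bool w in _working)
        {
            if (w)
            {
                return false;
            }
        }
        return true;
    }
}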

Line 26 packages the additional information about the request, which is passed as the parameter of the asynchronous request's callback method; this is discussed further below.

Line 27 starts the asynchronous request, passing in the callback method that handles the response together with the callback's parameter.

Line 28 registers a timeout handler, TimeoutCallback, for the asynchronous request. The maximum wait time is _maxTime, the timeout is handled only once, and the request's additional information is passed as the parameter of the callback method.

The definition of RequestState is:

class RequestState
{
    private const int BUFFER_SIZE = 131072; // size of the receive buffer
    private byte[] _data = new byte[BUFFER_SIZE]; // buffer for received packets
    private StringBuilder _sb = new StringBuilder(); // stores all received characters

    public HttpWebRequest Req { get; private set; } // the request
    public string Url { get; private set; } // the requested URL
    public int Depth { get; private set; } // relative depth of this request
    public int Index { get; private set; } // number of the work instance
    public Stream ResStream { get; set; } // the response data stream

    public StringBuilder Html
    {
        get { return _sb; }
    }

    public byte[] Data
    {
        get { return _data; }
    }

    public int BufferSize
    {
        get { return BUFFER_SIZE; }
    }

    public RequestState(HttpWebRequest req, string url, int depth, int index)
    {
        Req = req;
        Url = url;
        Depth = depth;
        Index = index;
    }
}

The definition of TimeoutCallback is:

private void TimeoutCallback(object state, bool timedOut)
{
    if (timedOut) // check whether the request timed out
    {
        RequestState rs = state as RequestState;
        if (rs != null)
        {
            rs.Req.Abort(); // abort the request
            _reqsBusy[rs.Index] = false; // reset the working state
        }
        DispatchWork(); // assign a new task
    }
}

The next step is to handle the response to the request:

 1 private void ReceivedResource(IAsyncResult ar)
 2 {
 3     RequestState rs = (RequestState)ar.AsyncState; // the parameter passed in when the request was made
 4     HttpWebRequest req = rs.Req;
 5     string url = rs.Url;
 6     try
 7     {
 8         HttpWebResponse res = (HttpWebResponse)req.EndGetResponse(ar); // get the response
 9         if (_stop) // check whether the download should be aborted
10         {
11             res.Close();
12             req.Abort();
13             return;
14         }
15         if (res != null && res.StatusCode == HttpStatusCode.OK) // check whether the response was obtained successfully
16         {
17             Stream resStream = res.GetResponseStream(); // get the resource stream
18             rs.ResStream = resStream;
19             var result = resStream.BeginRead(rs.Data, 0, rs.BufferSize, // asynchronously read the data
20                 new AsyncCallback(ReceivedData), rs);
21         }
22         else // the response failed
23         {
24             res.Close();
25             rs.Req.Abort();
26             _reqsBusy[rs.Index] = false; // reset the working state
27             DispatchWork(); // assign a new task
28         }
29     }
30     catch (WebException we)
31     {
32         MessageBox.Show("ReceivedResource " + we.Message + url + we.Status);
33     }
34 }

Line 19 reads the data stream asynchronously. Because we issued an asynchronous request, the data could not be received properly any other way.

The asynchronous read proceeds packet by packet: as soon as one packet is received, the callback method ReceivedData is invoked, and the received data is processed inside that method.

The call also passes in the buffer that receives the data, rs.Data, and the size of that buffer, rs.BufferSize.

Next comes receiving and processing the data:

 1 private void ReceivedData(IAsyncResult ar)
 2 {
 3     RequestState rs = (RequestState)ar.AsyncState; // get the parameter
 4     HttpWebRequest req = rs.Req;
 5     Stream resStream = rs.ResStream;
 6     string url = rs.Url;
 7     int depth = rs.Depth;
 8     string html = null;
 9     int index = rs.Index;
10     int read = 0;
11
12     try
13     {
14         read = resStream.EndRead(ar); // get the result of the read
15         if (_stop) // check whether the download should be aborted
16         {
17             rs.ResStream.Close();
18             req.Abort();
19             return;
20         }
21         if (read > 0)
22         {
23             MemoryStream ms = new MemoryStream(rs.Data, 0, read); // create a memory stream from the received data
24             StreamReader reader = new StreamReader(ms, _encoding);
25             string str = reader.ReadToEnd(); // read all characters
26             rs.Html.Append(str); // append to what was received before
27             var result = resStream.BeginRead(rs.Data, 0, rs.BufferSize, // asynchronously read the next packet
28                 new AsyncCallback(ReceivedData), rs);
29             return;
30         }
31         html = rs.Html.ToString();
32         SaveContents(html, url); // save to disk
33         string[] links = GetLinks(html); // extract the links in the page
34         AddUrls(links, depth + 1); // filter the links and add them to the to-download collection
35         _reqsBusy[index] = false; // reset the working state
36         DispatchWork(); // assign a new task
37     }
38     catch (WebException we)
39     {
40         MessageBox.Show("ReceivedData Web " + we.Message + url + we.Status);
41     }
42 }

Line 14 obtains the size of the data just read, read. If read > 0, the data may not be complete yet, so line 27 issues another request to read the next packet; if read <= 0, all data has been received, rs.Html contains the complete HTML data, and the next processing step can begin.

Line 26 appends the newly received string to the string saved so far, eventually yielding the complete HTML string.
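The helpers SaveContents, GetLinks, and AddUrls are not defined in this part of the article. As one possible sketch, GetLinks could extract href attributes with a regular expression (assuming using System.Text.RegularExpressions and System.Collections.Generic; a real crawler would also have to resolve relative URLs):

private string[] GetLinks(string html)
{
    const string pattern = @"href=""([^""\s>]+)"""; // match href="..." attributes
    MatchCollection matches = Regex.Matches(html, pattern, RegexOptions.IgnoreCase);
    List<string> links = new List<string>();
    foreach (Match m in matches)
    {
        links.Add(m.Groups[1].Value); // the captured URL
    }
    return links.ToArray();
}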

Next is the process of determining whether all tasks are done:

 1 private void StartDownload()
 2 {
 3     _checkTimer = new Timer(new TimerCallback(CheckFinish), null, 0, 300);
 4     DispatchWork();
 5 }
 6
 7 private void CheckFinish(object param)
 8 {
 9     if (_workingSignals.IsFinished()) // check whether all work instances are Finished
10     {
11         _checkTimer.Dispose(); // stop the timer
12         _checkTimer = null;
13         if (DownloadFinish != null && _ui != null) // check whether the completion event is registered
14         {
15             _ui.Dispatcher.Invoke(DownloadFinish, _index); // raise the event
16         }
17     }
18 }

Line 3 creates a timer that calls CheckFinish every 300 ms to determine whether all tasks are complete.
Line 15 raises the task-completion event, which the client program can register for. _index stores the number of URLs downloaded so far.

The definition of the event is:

public delegate void DownloadFinishHandler(int count);

/// <summary>
/// Raised after every link has been downloaded and analyzed
/// </summary>
public event DownloadFinishHandler DownloadFinish = null;
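On the client side, registering for the event might look like this (a hypothetical usage example; _spider stands for an instance of the crawler class):

_spider.DownloadFinish += new DownloadFinishHandler(OnDownloadFinish);

private void OnDownloadFinish(int count)
{
    Console.WriteLine("Downloaded and analyzed " + count + " URLs.");
}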
GJM: Reprinted on 2016-11-16 from http://www.cnblogs.com/Jiajun/archive/2012/06/17/2552458.html. Please contact me at [email protected] if the author's copyright is affected.
