C# is particularly well suited to building spider programs because it has built-in HTTP access and multithreading support, and both capabilities are critical for a spider. The key issues to address when building a spider are:
(1) HTML parsing: an HTML parser is required to analyze every page the spider encounters.
(2) Page processing: each downloaded page must be processed. The content may be saved to disk or analyzed further.
(3) Multithreading: only with multiple threads can a spider be truly efficient.
(4) Determining when the work is complete: do not underestimate this problem. It is not easy to tell whether the task has finished, especially in a multithreaded environment.
I. HTML Parsing
The HTML parser provided in this article is implemented by the ParseHTML class and is easy to use. First, create an instance of the class and set its Source property to the HTML document to be parsed:
ParseHTML parse = new ParseHTML();
parse.Source = "<p>Hello World</p>";
Next, loop over all the text and tags in the HTML document. The loop usually begins with a while statement that tests the Eof method:
while (!parse.Eof())
{
    char ch = parse.Parse();
The Parse method returns the characters of the HTML document one at a time, but only the characters that are not part of the markup. When it encounters an HTML tag, it returns 0 to signal that a tag was found. The tag can then be retrieved with the GetTag() method.
    if (ch == 0)
    {
        HTMLTag tag = parse.GetTag();
    }
}
One of the most important tasks of a spider is to find every href attribute, which can be done with a C# indexer. For example, the following code extracts the value of the href attribute, if there is one.
Attribute href = tag["href"];
string link = href.Value;
After obtaining the Attribute object, you can read the attribute's value through Attribute.Value.
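The ParseHTML class itself is not listed in this section, so the following is only a rough, self-contained sketch of the same goal (collecting href values) using a regular expression instead of the article's parser. The HrefExtractor class and its Extract method are hypothetical names introduced for illustration, and the pattern only handles quoted attribute values inside anchor tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not the article's ParseHTML class: it pulls href
// values out of anchor tags with a regular expression.
public static class HrefExtractor
{
    // Matches href="..." or href='...' inside an <a ...> tag.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the href values in the order they appear in the document.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        string html = "<p>Hello <a href=\"/about.html\">about</a> and " +
                      "<A HREF='http://myhost.com/'>home</A></p>";
        foreach (string link in Extract(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

A regular expression is convenient for a quick experiment, but a real tag-by-tag parser such as the one described above copes better with unquoted attributes, comments, and malformed markup.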
II. Processing HTML Pages
Next, let's look at how to handle HTML pages. The first step is to download the page, which can be done with the HttpWebRequest class available to C# programs:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);
response = request.GetResponse();
stream = response.GetResponseStream();
This code obtains a stream from the response. Before doing anything else, determine whether the file is binary or text, because the two types are handled differently. The following code checks whether the file is binary.
if (!response.ContentType.ToLower().StartsWith("text/"))
{
    SaveBinaryFile(response);
    return null;
}
string buffer = "", line;
If the file is not a text file, it is saved as a binary file. For a text file, first create a StreamReader from the stream, then append the file's content to the buffer one line at a time.
reader = new StreamReader(stream);
while ((line = reader.ReadLine()) != null)
{
    buffer += line + "\r\n";
}
After the entire file is loaded, save it as a text file.
SaveTextFile(buffer);
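The text branch can be tried without a network connection. The sketch below is an illustration under stated assumptions rather than the article's code: PageReader, IsText, and ReadText are hypothetical names, a MemoryStream stands in for the response stream, and a StringBuilder replaces string concatenation to avoid repeated copying on large pages.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper that mirrors the content-type test and the
// line-by-line read described above.
public static class PageReader
{
    // True when the Content-Type header denotes text,
    // for example "text/html; charset=utf-8".
    public static bool IsText(string contentType)
    {
        return contentType != null &&
               contentType.ToLowerInvariant().StartsWith("text/", StringComparison.Ordinal);
    }

    // Reads the stream line by line and joins the lines with "\r\n".
    public static string ReadText(Stream stream)
    {
        var buffer = new StringBuilder();
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                buffer.Append(line).Append("\r\n");
            }
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("<html>\n<body>Hello</body>\n</html>");
        using (var stream = new MemoryStream(bytes))
        {
            Console.WriteLine(IsText("text/html; charset=utf-8"));
            Console.Write(ReadText(stream));
        }
    }
}
```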
Let's look at how these two types of file are saved.
The content type of a binary file does not start with "text/". The spider saves a binary file directly to disk without any further processing, because a binary file contains no HTML and therefore no links for the spider to follow. The steps for writing a binary file are as follows.
First, prepare a buffer to hold the binary content temporarily.
byte[] buffer = new byte[1024];
Next, determine the local path and name for the file. Suppose you are downloading the content of the myhost.com website to the local c:\test folder, and the online path and name of the binary file is http://myhost.com/images/logo.gif. The local path and name should then be c:\test\images\logo.gif, and the images subdirectory must exist under c:\test. This task is handled by the convertFilename method.
string filename = convertFilename(response.ResponseUri);
The convertFilename method splits the HTTP address into its parts and creates the corresponding directory structure. Once the name and path of the output file are known, you can open the input stream that reads the web page and the output stream that writes the local file.
Stream outStream = File.Create(filename);
Stream inStream = response.GetResponseStream();
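The body of convertFilename is not shown in this section. As a rough idea of what such a mapping might look like, here is a minimal sketch. FileNameMapper and ToLocalPath are hypothetical names, the base directory is passed in as a parameter, and the choice of "index.html" for URLs that end in a slash is an assumption. The sketch does not sanitize characters that are invalid in local file names.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the convertFilename method described above.
public static class FileNameMapper
{
    // Maps a URL to a path under baseDir and creates any missing
    // directories along the way.
    public static string ToLocalPath(Uri uri, string baseDir)
    {
        // AbsolutePath is e.g. "/images/logo.gif"; split it into segments.
        string[] parts = uri.AbsolutePath.Split(
            new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);

        string path = baseDir;
        foreach (string part in parts)
        {
            path = Path.Combine(path, part);
        }

        // A URL ending in "/" names a directory; assume a default file name.
        if (parts.Length == 0 || uri.AbsolutePath.EndsWith("/", StringComparison.Ordinal))
        {
            path = Path.Combine(path, "index.html");
        }

        Directory.CreateDirectory(Path.GetDirectoryName(path));
        return path;
    }

    public static void Main()
    {
        string baseDir = Path.Combine(Path.GetTempPath(), "spider-test");
        Uri uri = new Uri("http://myhost.com/images/logo.gif");
        Console.WriteLine(ToLocalPath(uri, baseDir));
    }
}
```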
Next, read the content of the web file and write it to the local file, which is easily done in a loop.
int l;
do
{
    l = inStream.Read(buffer, 0, buffer.Length);
    if (l > 0)
        outStream.Write(buffer, 0, l);
} while (l > 0);
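The same copy loop can be exercised in isolation. In the sketch below, StreamCopier and Copy are hypothetical names, and MemoryStream objects stand in for the response stream and the output file so that the example runs without a network or disk.

```csharp
using System;
using System.IO;

// Hypothetical helper wrapping the chunked copy loop shown above.
public static class StreamCopier
{
    // Copies everything from input to output in 1024-byte chunks and
    // returns the number of bytes copied.
    public static long Copy(Stream input, Stream output)
    {
        byte[] buffer = new byte[1024];
        long total = 0;
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void Main()
    {
        byte[] data = new byte[3000];
        new Random(1).NextBytes(data);
        using (var input = new MemoryStream(data))
        using (var output = new MemoryStream())
        {
            long copied = Copy(input, output);
            Console.WriteLine(copied); // prints 3000
        }
    }
}
```

The using statements close both streams even if an exception is thrown. On .NET Framework 4 and later, Stream.CopyTo performs the same kind of copy.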
III. Multithreading
We use the DocumentWorker class to encapsulate all the operations involved in downloading a URL. Each DocumentWorker instance, once created, enters a loop and waits for the next URL to process. The main loop of DocumentWorker is shown below:
while (!m_spider.Quit)
{
    m_uri = m_spider.ObtainWork();
    m_spider.SpiderDone.WorkerBegin();
    string page = GetPage();
    if (page != null)
        ProcessPage(page);
    m_spider.SpiderDone.WorkerEnd();
}
This loop runs until the Quit flag is set to true, which happens when the user clicks the "Cancel" button. Inside the loop, we call ObtainWork to get a URL. ObtainWork waits until a URL is available, and a URL becomes available only after another thread has parsed a document and found a link. The Done class uses the WorkerBegin and WorkerEnd calls to determine when the entire download operation has finished.
Figure 1 shows that the spider lets the user choose the number of threads to use. In practice, the optimum number of threads depends on many factors. If your machine is powerful or has two processors, you can use a larger number of threads. If network bandwidth or machine performance is limited, adding more threads will not necessarily improve performance.
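The ObtainWork method is described above but not listed. One plausible way to get the "wait until a URL is available" behavior is a queue guarded by a monitor, as in the sketch below. WorkQueue, AddWork, and the duplicate check are assumptions made for illustration and may differ from the article's spider class.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical work queue; the article does not show how ObtainWork
// is implemented.
public class WorkQueue
{
    private readonly Queue<Uri> m_waiting = new Queue<Uri>();
    private readonly HashSet<string> m_seen = new HashSet<string>();
    private readonly object m_lock = new object();

    // Adds a URL unless it was queued before; returns true if it was added.
    public bool AddWork(Uri uri)
    {
        lock (m_lock)
        {
            if (!m_seen.Add(uri.AbsoluteUri))
                return false;
            m_waiting.Enqueue(uri);
            Monitor.Pulse(m_lock); // wake one waiting worker
            return true;
        }
    }

    // Blocks until a URL is available, then removes and returns it.
    public Uri ObtainWork()
    {
        lock (m_lock)
        {
            while (m_waiting.Count == 0)
                Monitor.Wait(m_lock);
            return m_waiting.Dequeue();
        }
    }

    public static void Main()
    {
        var queue = new WorkQueue();
        var worker = new Thread(() => Console.WriteLine(queue.ObtainWork()));
        worker.Start();
        queue.AddWork(new Uri("http://myhost.com/"));
        worker.Join();
    }
}
```

Note that a worker blocked inside this ObtainWork does not see a quit flag until a URL arrives, so a real spider would also need a way to wake its workers when the user cancels.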
IV. Is the Task Completed?
Downloading files on several threads at once improves performance, but it also introduces thread management problems. One of the hardest is deciding when the spider has finished its work. We use a dedicated class, Done, to make that decision.
First, we need to define what "finished" means. The spider's task is complete only when no URL is waiting to be downloaded and every worker thread has finished processing. In other words, the work is complete when no URL is either waiting for download or being downloaded.
The Done class provides a WaitDone method that waits until the Done object detects that the spider has completed its work. The code for WaitDone is as follows.
public void WaitDone()
{
    Monitor.Enter(this);
    while (m_activeThreads > 0)
    {
        Monitor.Wait(this);
    }
    Monitor.Exit(this);
}
The WaitDone method waits until there are no active threads. Note, however, that there are no active threads at the very start of the download either, so the spider could easily stop as soon as it begins. To solve this problem, we need a second method, WaitBegin, which waits for the spider to enter its working stage. The usual call order is to call WaitBegin first and then WaitDone, which waits for the spider to complete its work. The code for WaitBegin is as follows:
public void WaitBegin()
{
    Monitor.Enter(this);
    while (!m_started)
    {
        Monitor.Wait(this);
    }
    Monitor.Exit(this);
}
The WaitBegin method waits until the m_started flag is set. The flag is set by WorkerBegin, which a worker thread calls when it starts processing a URL. WorkerEnd is called when processing ends. Together, WorkerBegin and WorkerEnd let the Done object track the current working state. The code for WorkerBegin is as follows:
public void WorkerBegin()
{
    Monitor.Enter(this);
    m_activeThreads++;
    m_started = true;
    Monitor.Pulse(this);
    Monitor.Exit(this);
}
The WorkerBegin method first increments the number of active threads, then sets the m_started flag, and finally calls Pulse to notify any thread that may be waiting for a worker to start. As mentioned above, the method that waits on the Done object in this case is WaitBegin. After a URL has been processed, the WorkerEnd method is called:
public void WorkerEnd()
{
    Monitor.Enter(this);
    m_activeThreads--;
    Monitor.Pulse(this);
    Monitor.Exit(this);
}
The WorkerEnd method decrements the m_activeThreads counter and calls Pulse to release any thread that may be waiting on the Done object. As mentioned above, the method that waits in this case is WaitDone.
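Putting the four methods together gives a small class that can be tried on its own. The sketch below is one possible assembly and differs from the listings above in three ways that are choices made here, not the article's: it uses the lock statement so the monitor is released even if an exception occurs, it locks on a private object rather than on this, and it uses PulseAll so that every waiting thread rechecks its condition. The Sleep call is only a stand-in for downloading a page.

```csharp
using System;
using System.Threading;

// Sketch of the Done class assembled from the methods described above.
public class Done
{
    private readonly object m_lock = new object();
    private int m_activeThreads = 0;
    private bool m_started = false;

    // Waits until at least one worker has started processing.
    public void WaitBegin()
    {
        lock (m_lock)
        {
            while (!m_started)
                Monitor.Wait(m_lock);
        }
    }

    // Waits until no worker is active.
    public void WaitDone()
    {
        lock (m_lock)
        {
            while (m_activeThreads > 0)
                Monitor.Wait(m_lock);
        }
    }

    // Called by a worker when it starts processing a URL.
    public void WorkerBegin()
    {
        lock (m_lock)
        {
            m_activeThreads++;
            m_started = true;
            Monitor.PulseAll(m_lock);
        }
    }

    // Called by a worker when it has finished processing a URL.
    public void WorkerEnd()
    {
        lock (m_lock)
        {
            m_activeThreads--;
            Monitor.PulseAll(m_lock);
        }
    }

    public static void Main()
    {
        var done = new Done();
        var worker = new Thread(() =>
        {
            done.WorkerBegin();
            Thread.Sleep(50); // stand-in for downloading a page
            done.WorkerEnd();
        });
        worker.Start();

        done.WaitBegin(); // returns once the worker has started
        done.WaitDone();  // returns once the worker has finished
        Console.WriteLine("worker finished");
    }
}
```

With several workers, the active count can briefly reach zero between one URL finishing and the next being picked up, so a production spider would probably also check that the queue of waiting URLs is empty before declaring the work complete.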