A spider is a very useful kind of program on the Internet. Search engines use spiders to collect web pages into their databases; enterprises use spiders to monitor competitors' websites and track changes; individual users can use a spider to download pages for offline reading; developers can use one to scan their own sites for broken links ... different users put spiders to different uses. So how does a spider program actually work?
A spider is a semi-automated program. Just as a real spider travels across its web, a spider program travels across the Internet by following links between pages. The program is only semi-automated because it always needs an initial link (a starting point), but from there it decides where to go on its own: it scans the links contained in the starting page, visits the pages those links point to, and then analyzes and follows the links contained in those pages in turn. In theory a spider will eventually visit every page on the Internet, because almost every page is referenced by at least a few others.
This article describes how to build a spider program in C# that can download the content of an entire website into a specified directory; its user interface is shown in Figure 1. You can easily use the handful of core classes presented here to build your own spider.
C# is particularly well suited to building spider programs because it has built-in HTTP access and multithreading support, and both capabilities are critical for a spider. The key issues to address when building a spider are:
- HTML parsing: an HTML parser is needed to analyze every page the spider encounters.
- Page processing: every downloaded page must be processed; its content may be saved to disk or analyzed further.
- Multithreading: only with multithreading can a spider be truly efficient.
- Determining completion: it is surprisingly hard to tell when the job is done, especially in a multithreaded environment.
I. HTML Parsing
C# itself does not include HTML parsing support, although it does support XML parsing. XML has a strict syntax, however, and a parser designed for XML is of little use with HTML, whose syntax is far looser. We therefore need to write our own HTML parser. The parser provided with this article is highly self-contained, and you can easily reuse it in any other C# program that needs to process HTML.
The HTML parser provided with this article is implemented by the ParseHTML class and is easy to use: first create an instance of the class and set its Source property to the HTML document to be parsed:
ParseHTML parse = new ParseHTML(); parse.Source = "<p>Hello World</p>";
Next, we can loop over all of the text and tags contained in the HTML document. Typically this takes the form of a while loop that tests the Eof method:
while(!parse.Eof()) { char ch = parse.Parse();
The Parse method returns the characters of the HTML document; only non-HTML text is returned this way. When an HTML tag is encountered, Parse returns 0, and the tag can then be retrieved with the GetTag() method:
if(ch==0) { HTMLTag tag = parse.GetTag(); }
One of the most important jobs of a spider is to find every href attribute, which can be done with a C# indexer. For example, the following code extracts the value of the href attribute (if present):
Attribute href = tag["href"]; string link = href.Value;
Once you have the Attribute object, its value is available through Attribute.Value.
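To tie these pieces together, here is a minimal sketch of a loop that walks an HTML document and collects every href value it finds. It assumes the ParseHTML, HTMLTag and Attribute classes from the accompanying source code behave exactly as described above; the pageHtml and links variables are only illustrative.

// Minimal sketch: collect all href values from a page, assuming the article's ParseHTML/HTMLTag/Attribute classes.
List<string> links = new List<string>();          // requires using System.Collections.Generic;
ParseHTML parse = new ParseHTML();
parse.Source = pageHtml;                           // pageHtml: the downloaded document as a string
while( !parse.Eof() )
{
    char ch = parse.Parse();
    if( ch == 0 )                                  // 0 means a tag was encountered
    {
        HTMLTag tag = parse.GetTag();
        Attribute href = tag["href"];              // assumed to be null when the tag has no href
        if( href != null )
            links.Add(href.Value);
    }
}

Each collected link would then be resolved against the current page's URL and queued for download.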
II. Processing HTML Pages
Next let's look at how HTML pages are processed. The first step is to download the page, which can be done with the HttpWebRequest class provided by C#:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri); response = request.GetResponse(); stream = response.GetResponseStream();
This gives us a stream for the response. Before doing anything else, we need to determine whether the file is binary or text, because the two types are handled differently. The following code determines whether the content is binary:
if( !response.ContentType.ToLower().StartsWith("text/") ) { SaveBinaryFile(response); return null; } string buffer = "",line;
If the content is not text, we treat it as a binary file. For a text file, we first create a StreamReader over the stream and append the file's content to the buffer one line at a time:
StreamReader reader = new StreamReader( stream );
while( (line = reader.ReadLine()) != null )
{
    buffer += line + "\r\n";
}
After the entire file has been loaded, it is saved as a text file:
SaveTextFile( buffer );
Now let's look at how each of these two file types is stored.
A binary file is one whose content type does not start with "text/". The spider saves a binary file straight to disk without any extra processing, because a binary file contains no HTML and therefore no further links for the spider to follow. Here is how a binary file is written.
First, prepare a buffer to hold the file content temporarily:
byte[] buffer = new byte[1024];
Next, determine the local path and file name under which the file will be saved. Suppose we are downloading the content of the myhost.com website into the local folder c:\test, and the online path of the binary file is http://myhost.com/images/logo.gif; the local path and name should then be c:\test\images\logo.gif, and the images subdirectory must already exist under c:\test. This part of the job is handled by the convertFilename method:
string filename = convertFilename( response.ResponseUri );
The convertFilename method splits the HTTP address apart and creates the corresponding directory structure. Once the output file's name and path are known, we can open the input stream for reading the web page and the output stream for writing the local file:
Stream outStream = File.Create( filename );
Stream inStream = response.GetResponseStream();
Then we read the content of the remote file and write it to the local file, which is easily done with a loop:
int l;
do
{
    l = inStream.Read(buffer, 0, buffer.Length);
    if( l > 0 ) outStream.Write(buffer, 0, l);
} while( l > 0 );
After the whole file has been written, both streams are closed:
outStream.Close();
inStream.Close();
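The article's source code contains the actual convertFilename implementation; the helper below is only an illustrative sketch of the idea (mapping the URL path onto the output folder and creating the matching subdirectories). The m_outputPath field and the index.html fallback are assumptions, not taken from the original code.

// Illustrative sketch only; not the article's actual convertFilename implementation.
// Assumes a hypothetical string field m_outputPath, e.g. @"c:\test" (requires using System.IO;).
string convertFilename( Uri uri )
{
    // Turn "/images/logo.gif" into "images\logo.gif", relative to the output folder.
    string relative = uri.AbsolutePath.TrimStart('/').Replace('/', Path.DirectorySeparatorChar);
    if( relative.Length == 0 || relative.EndsWith(Path.DirectorySeparatorChar.ToString()) )
        relative += "index.html";                                   // default file name for directory URLs
    string filename = Path.Combine( m_outputPath, relative );
    Directory.CreateDirectory( Path.GetDirectoryName(filename) );   // ensure the subdirectories exist
    return filename;
}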
Text files, by comparison, are easier to handle. The content type of a text file always starts with "text/". Assume the file has already been downloaded into a string; that string can be used both to extract the links contained in the page and to save the page to disk. The following code saves the text file:
string filename = convertFilename( m_uri ); StreamWriter outStream = new StreamWriter( filename ); outStream.Write(buffer); outStream.Close();
Here, we first open a file output stream, write the buffer content to the stream, and finally close the file.
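For reference, the download and save steps above are the pieces that make up the GetPage method called from the worker loop in the next section. The following is only a sketch of how they might fit together; the implementation in the accompanying source code may differ in detail.

// Sketch of a GetPage-style method built from the snippets above; the article's actual code may differ.
string GetPage()
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create( m_uri );
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();

    // Binary content (content type not starting with "text/") is saved directly to disk.
    if( !response.ContentType.ToLower().StartsWith("text/") )
    {
        SaveBinaryFile( response );
        return null;
    }

    // Text content is read line by line into a string buffer, then saved.
    string buffer = "", line;
    StreamReader reader = new StreamReader( stream );
    while( (line = reader.ReadLine()) != null )
        buffer += line + "\r\n";

    SaveTextFile( buffer );
    return buffer;     // returned so the caller can extract further links from it
}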
III. Multithreading
Multithreading makes a computer appear to perform more than one operation at a time. Unless the machine actually has multiple processors, however, the apparent simultaneity is only an illusion created by switching rapidly between threads. In general, multithreading genuinely speeds up a program in only two situations: when the computer has multiple processors, and when the program spends a lot of time waiting for external events.
A spider is a textbook example of the second situation: every time it requests a URL it must wait for the file to finish downloading before it can request the next one. If the spider can request multiple URLs at the same time, the total download time is clearly reduced.
We therefore use the DocumentWorker class to encapsulate everything involved in downloading one URL. Each DocumentWorker instance runs a loop, waiting for the next URL to process. Here is the main loop of DocumentWorker:
while(!m_spider.Quit ) { m_uri = m_spider.ObtainWork(); m_spider.SpiderDone.WorkerBegin(); string page = GetPage(); if(page!=null) ProcessPage(page); m_spider.SpiderDone.WorkerEnd(); }
This loop runs until the Quit flag is set to true (which happens when the user clicks the "Cancel" button). Inside the loop we call ObtainWork to get a URL; ObtainWork blocks until a URL becomes available, which happens only after some other thread has parsed a document and found a link. The Done class uses the WorkerBegin and WorkerEnd methods to determine when the entire download job is finished.
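ObtainWork itself is not listed in the article. The sketch below shows one way such a blocking method could be written with the Monitor class; the m_workload queue is a hypothetical field, not a name taken from the original source.

// Illustrative sketch of a blocking ObtainWork; not the article's actual code.
// Assumes a hypothetical Queue<Uri> m_workload field on the spider class.
public Uri ObtainWork()
{
    Monitor.Enter(this);
    try
    {
        // Block until another thread queues a URL (and pulses this object) or the spider quits.
        while( m_workload.Count == 0 && !Quit )
            Monitor.Wait(this);
        return Quit ? null : m_workload.Dequeue();
    }
    finally
    {
        Monitor.Exit(this);
    }
}

A matching AddWork method would enqueue a URL under the same lock and call Monitor.Pulse to wake a waiting worker.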
As Figure 1 shows, the spider lets the user choose how many threads to use. In practice the optimal number of threads depends on many factors: on a fast machine, or one with multiple processors, a larger number of threads works well, whereas with limited network bandwidth or machine performance, adding threads does not necessarily improve throughput.
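The article does not show how the worker threads are started. A minimal sketch of launching the user-chosen number of DocumentWorker threads might look like this; the constructor signature and the Process method name are assumptions, not names from the original source.

// Sketch only: start the requested number of worker threads.
// The DocumentWorker(this) constructor and the Process method are assumed names.
for( int i = 0; i < threadCount; i++ )
{
    DocumentWorker worker = new DocumentWorker(this);                // give the worker a reference back to the spider
    Thread thread = new Thread( new ThreadStart(worker.Process) );   // Process would contain the main loop shown above
    thread.Start();
}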
IV. Is the Task Completed?
Downloading files in multiple threads improves performance, but it also introduces thread-management problems. One of the trickiest is: when has the spider actually finished its work? A dedicated class, Done, is used to make that determination.
First, it is worth spelling out exactly what "finished" means. The spider's job is complete only when no URL in the system is waiting to be downloaded and all worker threads have finished their processing; in other words, finished means there is no URL either waiting for download or currently being downloaded.
The Done class provides a WaitDone method that blocks until the Done object detects that the spider has completed its work. Here is the code for WaitDone:
public void WaitDone() { Monitor.Enter(this); while ( m_activeThreads>0 ) { Monitor.Wait(this); } Monitor.Exit(this); }
The WaitDone method waits until there are no more active threads. Note, however, that there are no active threads at the very start of a download, which could cause the spider to conclude immediately that it is finished. To solve this problem we need a second method, WaitBegin, which waits for the spider to enter its "real" working phase. The usual calling order is: call WaitBegin first, then call WaitDone, which then waits for the spider to finish. Here is the code for WaitBegin:
public void WaitBegin() { Monitor.Enter(this); while ( !m_started ) { Monitor.Wait(this); } Monitor.Exit(this); }
The WaitBegin method waits until the m_started flag is set. The m_started flag is set by WorkerBegin, which a worker thread calls when it begins processing a URL; WorkerEnd is called when that processing ends. Together, WorkerBegin and WorkerEnd let the Done object keep track of the current working state. Here is the code for WorkerBegin:
public void WorkerBegin() { Monitor.Enter(this); m_activeThreads++; m_started = true; Monitor.Pulse(this); Monitor.Exit(this); }
WorkerBegin first increments the active-thread count, then sets the m_started flag, and finally calls Pulse to notify any thread that may be waiting for a worker to start; as mentioned above, that would be a thread blocked in WaitBegin on the Done object. After a URL has been processed, the WorkerEnd method is called:
public void WorkerEnd() { Monitor.Enter(this); m_activeThreads--; Monitor.Pulse(this); Monitor.Exit(this); }
WorkerEnd decrements the m_activeThreads counter and calls Pulse to release any thread that may be waiting on the Done object; as mentioned above, that would be a thread blocked in WaitDone.
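Putting it all together, the control thread (for example, the handler for the "Download" button) would typically drive the spider roughly as follows. Only the SpiderDone object and the Quit flag come from the article; the surrounding method and the AddWork/StartWorkers helpers are hypothetical names used for illustration.

// Sketch of how a control thread might drive the spider; AddWork/StartWorkers are assumed helper names.
void RunSpider( Uri startUri, int threadCount )
{
    AddWork( startUri );            // queue the starting URL
    StartWorkers( threadCount );    // launch the DocumentWorker threads

    SpiderDone.WaitBegin();         // wait until at least one worker has actually started
    SpiderDone.WaitDone();          // then wait until nothing is waiting for or being downloaded

    Quit = true;                    // tell the worker loops to exit
}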
Conclusion: This article has introduced the basics of writing an Internet spider. The source code linked below will help you explore the topic further; it is quite flexible, and you can easily adapt it for your own programs.
Source code: http://myblog.workgroup.cn/files/folders/csharp/entry1639.aspx