Use C# 2.0 to implement a web spider

Last Update:2018-12-07 Source: Internet

Author: User

Abstract: This article discusses how to use C# 2.0 to capture network resources. Starting from an entry website (such as http://www.comprg.com.cn), the program follows links to scan the websites reachable from it and downloads the web resources they point to onto the local machine. Other analysis tools can then process these resources further, for example by extracting keywords and building classified indexes. The downloaded resources can also serve as the data source for a search engine of the same kind as Google.
Key words: C# 2.0, HTML, web spider, key tree, regular expression

I. Introduction

In recent years, search engines, led by Google, have attracted more and more attention. Before Google appeared, many search service providers collected information from the Internet manually and classified it to serve as the data source of their search engines. Yahoo, for example, started out with thousands of people collecting information from the Internet. Information classified this way is user-friendly and accurate, but with the explosive growth of Internet information, collecting it by hand became impractical. Google changed this completely. Its unconventional approach is to have a program fetch network resources from the Internet around the clock (7x24), analyze the resources downloaded to the local machine with intelligent algorithms, and then index the analyzed data, forming a complete search engine that needs essentially no manual intervention. In this mode a search engine can gather a very large share of the information on the Internet within a few days, saving a great deal of money and time. One of the most important components of such a search engine is the web spider that supplies its data. In other words, implementing a web spider is the first and most important step in implementing a search engine.

II. Basic implementation ideas and steps of a web spider

The main function of a web spider is to continuously download network resources from the Internet. The basic idea is to start from one or more entry URLs, download and analyze the network resources they point to, extract the URLs contained in those resources, and repeat the process until no unvisited URL remains. The following steps describe how a program can implement this.

1. Specify one or more entry URLs (such as http://www.comprg.com.cn) and add them to the download queue (at this point the queue contains only the entry URLs).
2. The thread responsible for downloading takes one or more URLs from the download queue and downloads the network resources they point to onto the local machine. Before downloading, it should check whether the URL has already been downloaded and, if so, skip it. If the download queue is empty and all download threads are sleeping, every network resource reachable from the entry URLs has been downloaded. The web spider then reports that the download is complete and stops.
3. Analyze the downloaded network resources that have not been analyzed yet (generally HTML code) and extract the URLs they contain (for example, the value of the href attribute of the <a> tag).
4. Add the URLs obtained in step 3 to the download queue, then go back to step 2.
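
The four steps can be sketched as a single-threaded loop. This is a minimal illustration written for a current .NET runtime, not the article's implementation: the class name CrawlLoopSketch and the fetchLinks delegate are assumptions, and fetchLinks stands in for "download the resource and return the URLs found in it" so that the loop can run without network access.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlLoopSketch
{
    // fetchLinks is a hypothetical stand-in for downloading a resource
    // and extracting the URLs it contains.
    public static List<string> Crawl(IEnumerable<string> entryUrls,
                                     Func<string, IEnumerable<string>> fetchLinks)
    {
        Queue<string> queue = new Queue<string>();     // step 1: the download queue
        HashSet<string> seen = new HashSet<string>();  // URLs that were queued before
        List<string> downloaded = new List<string>();

        foreach (string url in entryUrls)
            if (seen.Add(url)) queue.Enqueue(url);

        while (queue.Count > 0)                        // stop when the queue is empty
        {
            string url = queue.Dequeue();              // step 2: take a URL and download it
            downloaded.Add(url);
            foreach (string link in fetchLinks(url))   // step 3: analyze and collect URLs
                if (seen.Add(link))                    // skip URLs seen before
                    queue.Enqueue(link);               // step 4: back into the queue
        }
        return downloaded;
    }
}
```

Passing a lookup over a small in-memory link graph as fetchLinks is enough to watch the loop terminate once every reachable URL has been visited.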

III. Data input and output

As the steps above show, reading URLs from and writing URLs to the download queue runs through the whole system. The download queue could be implemented with the .NET Queue class, but keep in mind that the Internet holds not dozens or hundreds of URLs but tens of millions. That many URLs obviously cannot be kept in an in-memory queue object, so they need to go into larger storage: the hard disk.
This article uses an ordinary text file to hold the URLs waiting to be downloaded and analyzed (this text file is the download queue). Each row of the file stores one URL. Because the URLs are kept in a text file, the file has to be read and written, so this section implements a FileIO class for operating on it.
Before implementing the FileIO class, consider how the text file is used. To act as a queue, the file must support appending rows at the end and reading rows from the beginning. Appending comes first. The implementation code is as follows:

Implementation code for appending rows to a file
// These two fields are members of the FileIO class
private FileStream fsw;
private StreamWriter sw;

// Create the file stream and StreamWriter used to append rows to a file
public void OpenWriteFile(string file)
{
    if (!File.Exists(file)) // if the file does not exist, create it first
        File.Create(file).Close();
    // Open the file in append mode
    fsw = new FileStream(file, FileMode.Append, FileAccess.Write, FileShare.ReadWrite);
    // Create a StreamWriter on top of the FileStream
    sw = new StreamWriter(fsw);
}
// Close the write stream
public void CloseWriteFile()
{
    if (fsw != null) // check the write stream, not the read stream
        fsw.Close();
}
// Append one line to the file
public void WriteLine(string s)
{
    sw.WriteLine(s);
    sw.Flush(); // flush the write buffer so the reading stream can see this row
}

When implementing the code above, note that FileShare.ReadWrite must be passed when creating the FileStream object. Otherwise the file cannot be opened by two or more streams at once, and the reading stream introduced below would not be able to open the file already opened by the writing stream. The implementation code for reading rows from a file is as follows:

Implementation code for reading rows from a file
// These two fields are members of the FileIO class
private FileStream fsr;
private StreamReader sr;

// Create the file stream and StreamReader used to read rows from a file
public void OpenReadFile(string file)
{
    if (!File.Exists(file)) // if the file does not exist, create it first
        File.Create(file).Close();
    fsr = new FileStream(file, FileMode.OpenOrCreate, FileAccess.Read,
        FileShare.ReadWrite);
    sr = new StreamReader(fsr);
}
// Close the read stream
public void CloseReadFile()
{
    if (fsr != null)
        fsr.Close();
}
// Read one row from the file
public string ReadLine()
{
    if (sr.EndOfStream) // at the end of the file, return null
        return null;
    return sr.ReadLine();
}

In addition to the code for reading and writing files, FileIO provides an IsEOF method that reports whether the read position has reached the end of the file. The implementation code of the IsEOF method is as follows:

Implementation code of the IsEOF method
// Reports whether the read position is at the end of the file
public bool IsEOF()
{
    return sr.EndOfStream;
}
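
To see the append-and-read pattern work on one file, the following is a minimal self-contained sketch. The class name FileQueueSketch and its Enqueue and Dequeue members are hypothetical and only illustrate the idea. The sketch opens a writer and a reader on the same path, both with FileShare.ReadWrite, which is the condition the article stresses.

```csharp
using System;
using System.IO;

// Minimal sketch of a file-backed URL queue (all names are hypothetical).
public sealed class FileQueueSketch : IDisposable
{
    private readonly StreamWriter writer;
    private readonly StreamReader reader;

    public FileQueueSketch(string path)
    {
        // Both streams pass FileShare.ReadWrite so they can coexist on one file.
        // FileMode.Append creates the file if it does not exist yet.
        writer = new StreamWriter(new FileStream(
            path, FileMode.Append, FileAccess.Write, FileShare.ReadWrite));
        reader = new StreamReader(new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite));
    }

    // Append one URL as a new row at the end of the file.
    public void Enqueue(string url)
    {
        writer.WriteLine(url);
        writer.Flush(); // make the row visible to the reading stream
    }

    // Return the next unread row, or null when no unread row is available.
    public string Dequeue()
    {
        return reader.ReadLine();
    }

    public void Dispose()
    {
        reader.Dispose();
        writer.Dispose();
    }
}
```

Rows come back in the order they were appended, which is what makes the text file usable as a queue.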

The FileIO class is not only used to read and write the download queue. As discussed later, when the web spider downloads network resources with multiple threads, each thread stores what it downloads in its own directory. Each directory contains an index.txt file that records the URLs of the network resources in that directory. FileIO is also used to append URLs to index.txt (index.txt never needs to be read, only appended to).

IV. Implementation of thread classes

To make the web spider download as fast as possible on limited hardware, the cheapest and quickest approach is multithreading. .NET Framework 2.0 provides rich threading support, and its core class is Thread. A thread is typically created and started with code like the following:

Demo code for using a thread in C#
private void Fun()
{
    // code to be executed by the thread
}
public void TestThread()
{
    Thread thread;
    thread = new Thread(Fun); // create a thread object that will run Fun
    thread.Start(); // start the thread
}

Although creating and running a thread this way is simple, it is still not transparent: the caller has to use the Thread class explicitly. Next we implement a MyThread class that wraps thread creation. Any C# class becomes a thread class simply by inheriting from it. The code for the MyThread class is as follows:

Implementation code of the MyThread class
// Any C# class that inherits MyThread becomes a thread class
class MyThread
{
    private Thread thread;
    public MyThread()
    {
        thread = new Thread(Run); // create the thread object
    }
    // Holds the code the thread runs. Subclasses of MyThread must override this method.
    public virtual void Run()
    {
    }
    public void Start()
    {
        thread.Start(); // start the thread, which begins executing Run
    }
    // Suspend the calling thread for the given number of milliseconds
    public void Sleep(int millisecondsTimeout)
    {
        Thread.Sleep(millisecondsTimeout);
    }
}

We can use the MyThread class as in the following code:

Code of the ThreadClass class used for testing
class ThreadClass : MyThread
{
    public override void Run()
    {
        // code to be executed by the thread
    }
}

// Test the ThreadClass class
public void TestThreadClass()
{
    ThreadClass tc = new ThreadClass();
    tc.Start(); // start the thread, which runs the Run method
}

Compared with using the Thread class directly, this code is more convenient, more intuitive, and easier to use, and it has a more object-oriented feel.
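
The wrapper idea can be tried end to end with the small sketch below. It follows the same pattern but is not the article's class: WorkerSketch and CountingWorker are hypothetical names, Run is made abstract so a subclass cannot forget to override it, and the Join method is an addition that lets the caller wait for the thread to finish.

```csharp
using System;
using System.Threading;

// Same idea as a thread base class, plus a Join method (an addition for this sketch).
public abstract class WorkerSketch
{
    private readonly Thread thread;

    protected WorkerSketch()
    {
        // The delegate binds to the subclass override of Run.
        thread = new Thread(Run);
    }

    // The code the thread executes; subclasses supply it.
    protected abstract void Run();

    public void Start() { thread.Start(); }

    // Block the caller until Run has returned.
    public void Join() { thread.Join(); }
}

// Example subclass: sums the integers 1 to 100 on its own thread.
public sealed class CountingWorker : WorkerSketch
{
    public int Total;

    protected override void Run()
    {
        for (int i = 1; i <= 100; i++)
            Total += i;
    }
}
```

A caller creates a CountingWorker, calls Start and then Join, and reads Total afterwards. Reading the field only after Join avoids racing with the worker thread.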

V. Download network resources with multiple threads

Web spiders generally download network resources with multiple threads, though how the threads are organized differs from one spider to another. For simplicity and ease of understanding, the web spider in this article has each thread download into its own directory, that is, one directory per thread. When the number of network resources in the current directory reaches a limit (such as 5000), the thread creates a new directory and starts counting again from 0. This section describes DownloadThread, the thread class that downloads network resources. Its main job is to take a batch of URLs from the download queue, then download and analyze them. DownloadThread relies on several other important classes that are described later. The DownloadThread implementation code is as follows:

DownloadThread class code
class DownloadThread : MyThread
{
    // ParseResource downloads and analyzes network resources
    private ParseResource pr = new ParseResource();
    private int currentCount = 0; // number of resources in the current download directory
    // Appends one row per downloaded URL to index.txt in the current directory
    private FileIO fileIO = new FileIO();
    private string path; // the current download directory (ending with "\")
    private string[] patterns; // URLs matching any of these regular expressions are not downloaded
    public bool stop = false; // set to true to make the thread exit
    public int threadID; // ID of this thread, used to tell threads apart

    public DownloadThread(string[] patterns)
    {
        pr.FindUrl += FindUrl; // subscribe to the FindUrl event
        this.patterns = patterns;
    }
    // Event handler invoked each time a URL is found
    private void FindUrl(string url)
    {
        Common.AddUrl(url); // add the URL to the download queue
    }
    private void OpenFile() // open index.txt in the current download directory
    {
        fileIO.CloseWriteFile();
        fileIO.OpenWriteFile(path + Common.IndexFile);
    }
    public override void Run() // thread body
    {
        List<string> urls = new List<string>();
        path = Common.GetDir(); // get a download directory
        OpenFile();
        while (!stop)
        {
            // Wait in a loop while the download queue is empty
            while (!stop && urls.Count == 0)
            {
                Common.GetUrls(urls, 20); // take up to 20 URLs from the download queue
                if (urls.Count == 0) // no URL was obtained
                {
                    // Tell the system that this thread is waiting.
                    // If all threads are waiting,
                    // every network resource has been downloaded.
                    Common.ThreadWait(threadID);
                    Sleep(5000); // sleep for 5 seconds
                }
            }
            StringBuilder sb = new StringBuilder();
            foreach (string url in urls) // download and analyze the URLs just obtained
            {
                if (stop) break;
                // If the number of files in the download directory has reached the maximum,
                // create a new directory and continue downloading there
                if (currentCount >= Common.MaxCount)
                {
                    path = Common.GetDir();
                    OpenFile();
                    currentCount = 0; // restart the count for the new directory
                }
                // Each downloaded file is named with a 5-digit sequence number and no extension,
                // for example 00001 or 00002. The following statement formats the file name.
                string s = string.Format("{0:D5}", currentCount + 1);
                sb.Remove(0, sb.Length);
                sb.Append(s);
                sb.Append(":");
                sb.Append(url);
                try
                {
                    // Download and analyze the current URL
                    pr.Parse(url, path + s, patterns);
                    Common.Count++;
                    // Write "sequence number:URL" to index.txt
                    fileIO.WriteLine(sb.ToString());
                    currentCount++;
                }
                catch (Exception)
                {
                    // a resource that fails to download or parse is skipped
                }
            }
            urls.Clear();
        }
    }
}
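
The article does not show how the Common class decides that all threads are waiting. The following is a purely hypothetical sketch of one way to track it: WaitTrackerSketch, its constructor argument, and the ThreadRun method are assumptions made for illustration and are not taken from the article's source code.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: tracks which download threads are idle.
public sealed class WaitTrackerSketch
{
    private readonly object sync = new object();
    private readonly HashSet<int> waiting = new HashSet<int>();
    private readonly int threadCount;

    public WaitTrackerSketch(int threadCount)
    {
        this.threadCount = threadCount;
    }

    // Called by a thread that found the download queue empty.
    // Returns true when every thread is now waiting, meaning the crawl is finished.
    public bool ThreadWait(int threadId)
    {
        lock (sync)
        {
            waiting.Add(threadId);
            return waiting.Count == threadCount;
        }
    }

    // Called by a thread that obtained URLs again and is no longer waiting.
    public void ThreadRun(int threadId)
    {
        lock (sync)
        {
            waiting.Remove(threadId);
        }
    }
}
```

A thread that is still downloading is never in the waiting set, so the count can reach the total only when no thread can add further URLs to the queue.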

VI. Analyze network resources

Analyzing downloaded network resources is one of the most important functions of a web spider. Here, analysis mainly means extracting the href attribute values of <a> tags from HTML code. To keep the explanation simple, the web spider in this article extracts only the URLs in the href attribute of <a> tags. It extracts href step by step. First, the opening <a> tag is pulled out of the HTML code, leaving out </a> and the text before it. For example, from <a href="http://www.comprg.com.cn">comprg</a> only <a href="http://www.comprg.com.cn"> is extracted, and comprg</a> is ignored because it contains no URL.
This article uses a state machine to extract <a> tags. It has five states (0 to 4). The first state is the initial state and the last is the termination state. Reaching the last state means an <a> tag has been obtained. The state machine is shown in Figure 1.

Figure 1

The double circle in the figure marks the termination state. The state changes according to the characters read from the HTML file, as follows.

Status 0: on '<', switch to status 1. On any other character, the status is unchanged.
Status 1: on 'a' or 'A', switch to status 2. On any other character, switch to status 0.
Status 2: on a space or tab (\t), switch to status 3. On any other character, switch to status 0.
Status 3: on '>', an <a> tag has been obtained successfully (the termination state). On any other character, the status is unchanged.

The implementation code for extracting <a> tags is as follows.

Implementation of the GetA method
// Extract <a> tags
private void GetA()
{
    char[] buffer = new char[1024];
    int state = 0;
    string a = "";

    while (!sr.EndOfStream)
    {
        int n = sr.Read(buffer, 0, buffer.Length);
        for (int i = 0; i < n; i++)
        {
            switch (state)
            {
                case 0: // status 0
                    if (buffer[i] == '<') // read '<'
                    {
                        a += buffer[i];
                        state = 1; // switch to status 1
                    }
                    break;
                case 1: // status 1
                    if (buffer[i] == 'a' || buffer[i] == 'A') // read 'a' or 'A'
                    {
                        a += buffer[i];
                        state = 2; // switch to status 2
                    }
                    else
                    {
                        a = "";
                        state = 0; // switch to status 0
                    }
                    break;
                case 2: // status 2
                    if (buffer[i] == ' ' || buffer[i] == '\t') // read a space or '\t'
                    {
                        a += buffer[i];
                        state = 3;
                    }
                    else
                    {
                        a = "";
                        state = 0; // switch to status 0
                    }
                    break;
                case 3: // status 3
                    if (buffer[i] == '>') // read '>': an <a> tag has been obtained
                    {
                        a += buffer[i];
                        try
                        {
                            string url = GetUrl(GetHref(a)); // obtain the value of the href attribute in <a>
                            if (url != null)
                            {
                                if (FindUrl != null)
                                    FindUrl(url); // raise the FindUrl event
                            }
                        }
                        catch (Exception)
                        {
                        }
                        a = ""; // clear the buffer so the next tag starts empty
                        state = 0; // after obtaining an <a> tag, switch back to status 0
                    }
                    else
                        a += buffer[i];
                    break;
            }
        }
    }
}
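
The transitions between statuses 0 to 3 can be exercised without a file by running them over an in-memory string. The sketch below is an illustration under that assumption: AnchorScannerSketch and FindAnchorTags are hypothetical names, and instead of raising an event the method simply returns the opening tags it finds.

```csharp
using System.Collections.Generic;
using System.Text;

public static class AnchorScannerSketch
{
    // Returns every "<a ...>" opening tag found in html, using statuses 0 to 3.
    public static List<string> FindAnchorTags(string html)
    {
        List<string> tags = new List<string>();
        StringBuilder tag = new StringBuilder();
        int state = 0;
        foreach (char c in html)
        {
            switch (state)
            {
                case 0: // waiting for '<'
                    if (c == '<') { tag.Append(c); state = 1; }
                    break;
                case 1: // expecting 'a' or 'A'
                    if (c == 'a' || c == 'A') { tag.Append(c); state = 2; }
                    else { tag.Length = 0; state = 0; }
                    break;
                case 2: // expecting a space or tab after the tag name
                    if (c == ' ' || c == '\t') { tag.Append(c); state = 3; }
                    else { tag.Length = 0; state = 0; }
                    break;
                case 3: // collecting characters until '>'
                    tag.Append(c);
                    if (c == '>')
                    {
                        tags.Add(tag.ToString());
                        tag.Length = 0; // start the next tag with an empty buffer
                        state = 0;
                    }
                    break;
            }
        }
        return tags;
    }
}
```

Tags such as <abbr> or </a> are rejected in status 2 and status 1 respectively, because the character after the 'a' is not a space or tab, or the character after '<' is not an 'a'.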

In the GetA method, every state transition except a return to status 0 appends the character just read to the string variable a. As soon as the content of a can no longer be an <a> tag, the method clears a, switches back to status 0, and continues reading.
GetA relies on an important method, GetHref, to obtain the href part of the <a> tag. The GetHref method is implemented as follows:

Implementation of the GetHref method
// Obtain href from <a>
private string GetHref(string a)
{
    try
    {
        // Regular expression that matches the href attribute.
        // The unquoted form stops at whitespace or '>' so the closing '>' is not captured.
        string p = @"href\s*=\s*('[^']*'|""[^""]*""|[^\s>]+)";
        MatchCollection matches = Regex.Matches(a, p,
            RegexOptions.IgnoreCase |
            RegexOptions.ExplicitCapture);

        foreach (Match nextMatch in matches)
        {
            return nextMatch.Value; // return the first href
        }
        return null;
    }
    catch (Exception)
    {
        throw;
    }
}

The GetHref method uses a regular expression to obtain href from the <a> tag. A well-formed href attribute in <a> comes in three forms, which differ in what surrounds the URL: double quotation marks, single quotation marks, or nothing. The three cases are as follows:
Case 1: <a href="http://www.comprg.com.cn">comprg</a>
Case 2: <a href='http://www.comprg.com.cn'>comprg</a>
Case 3: <a href=http://www.comprg.com.cn>comprg</a>
In the GetHref method, p holds the regular expression that covers all three cases. Applied to the cases above, it yields the following href text:

href obtained from case 1: href="http://www.comprg.com.cn"
href obtained from case 2: href='http://www.comprg.com.cn'
href obtained from case 3: href=http://www.comprg.com.cn
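
The three cases can be checked with a short self-contained sketch. It assumes that the unquoted alternative should stop at whitespace or '>', so that case 3 yields the href text without the closing bracket of the tag. HrefSketch is a hypothetical class name used only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class HrefSketch
{
    // Quoted values keep their quotes; an unquoted value stops at whitespace or '>'.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*('[^']*'|""[^""]*""|[^\s>]+)",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns the href attribute text (name, '=', and value), or null if there is none.
    public static string GetHref(string tag)
    {
        Match m = HrefPattern.Match(tag);
        return m.Success ? m.Value : null;
    }
}
```

Because the quoted alternatives are tried first, a quoted value may contain spaces, while an unquoted value ends at the first whitespace character or '>'.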

After obtaining the href text, the URL has to be extracted from it. The GetUrl method does this. Its implementation code is as follows:

Implementation of the GetUrl method
// Extract the URL from href
private string GetUrl(string href)
{
    try
    {
        if (href == null) return href;
        int n = href.IndexOf('='); // find the position of '='
        string s = href.Substring(n + 1);
        int begin = 0, end = 0;
        string sign = "";
        if (s.Contains("\"")) // case 1: double quotation marks
            sign = "\"";
        else if (s.Contains("'")) // case 2: single quotation marks
            sign = "'";
        else // case 3: no quotation marks
            return GetFullUrl(s.Trim());
        begin = s.IndexOf(sign);
        end = s.LastIndexOf(sign);

        return GetFullUrl(s.Substring(begin + 1, end - begin - 1).Trim());
    }
    catch (Exception)
    {
        throw;
    }
}

One more point needs attention when obtaining a URL. Some URLs are relative paths, that is, they lack the "http://host" part, yet the URL must be saved as a complete path. The complete path therefore has to be derived from the relative one. The GetFullUrl method does this. Its implementation code is as follows:

Implementation code of the GetFullUrl method
// Convert a relative path to an absolute path
private string GetFullUrl(string url)
{
    try
    {
        if (url == null) return url;
        if (ProcessPattern(url)) return null; // filter out URLs that should not be downloaded
        // A URL that starts with http:// or https:// is already absolute; return it as is
        if (url.ToLower().StartsWith("http://") || url.ToLower().StartsWith("https://"))
            return url;
        Uri parentUri = new Uri(parentUrl);
        string port = "";
        if (!parentUri.IsDefaultPort)
            port = ":" + parentUri.Port.ToString();
        if (url.StartsWith("/")) // the URL starts with "/": place it directly after the host
            return parentUri.Scheme + "://" + parentUri.Host + port + url;
        else // the URL does not start with "/": place it after the parent URL's directory
        {
            string s = "";
            s = parentUri.LocalPath.Substring(0, parentUri.LocalPath.LastIndexOf("/"));
            return parentUri.Scheme + "://" + parentUri.Host + port + s + "/" + url;
        }
    }
    catch (Exception)
    {
        throw;
    }
}
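
As an alternative to assembling the string by hand, the .NET Uri class can resolve a relative URL against a base URL through its Uri(Uri, String) constructor. The sketch below shows that swapped-in technique, not the article's method. UrlResolveSketch is a hypothetical name, and the pattern filtering step is left out.

```csharp
using System;

public static class UrlResolveSketch
{
    // Resolves url against parentUrl; absolute URLs come back unchanged.
    public static string GetFullUrl(string parentUrl, string url)
    {
        Uri resolved = new Uri(new Uri(parentUrl), url);
        return resolved.AbsoluteUri;
    }
}
```

This form also handles inputs such as "../page.html", which the manual concatenation would pass through unresolved.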

ParseResource also provides a way to filter out, by regular expression, URLs that should not be downloaded. This is implemented by the ProcessPattern method. The implementation code is as follows:

Implementation code of the ProcessPattern method
// Returns true if the URL matches one of the patterns, false otherwise
private bool ProcessPattern(string url)
{
    foreach (string p in patterns)
    {
        if (Regex.IsMatch(url, p, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture)
            && !p.Equals(""))
            return true;
    }
    return false;
}

Before the ParseResource class analyzes HTML code, it first downloads the HTML into the local thread directory, then opens it with a FileStream and reads the data to be analyzed. For the rest of the ParseResource implementation, see the source code.
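
The regular-expression filter can be tried in isolation with the minimal sketch below. PatternFilterSketch and IsExcluded are hypothetical names, and the image-extension pattern in the usage note is only an example of something a crawl might skip.

```csharp
using System.Text.RegularExpressions;

public static class PatternFilterSketch
{
    // Returns true when url matches any non-empty pattern, meaning it should be skipped.
    public static bool IsExcluded(string url, string[] patterns)
    {
        foreach (string p in patterns)
        {
            if (p.Length == 0) continue; // an empty pattern would match every URL
            if (Regex.IsMatch(url, p, RegexOptions.IgnoreCase))
                return true;
        }
        return false;
    }
}
```

For example, the pattern \.(jpg|gif)$ excludes image links regardless of letter case, while page URLs ending in .html pass through.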

VII. Implementation of the key tree

When URLs are collected, some of them will inevitably be obtained more than once. Duplicate URLs greatly increase the spider's download time and make other analysis tools process the same HTML repeatedly. Duplicates therefore have to be filtered out so that each URL is downloaded only once. The simplest way is to save every downloaded URL in a collection and, before downloading a new URL, check whether it is already in the collection. If it is, the URL is skipped.
This looks simple, but the spider deals with a very large number of URLs. Saving them in a List-like collection not only takes a lot of memory, it also makes lookups slow: with 1 million URLs, each new URL has to be searched for among 1 million entries. Search algorithms such as binary search help, but with very large data volumes the cost of searching still grows. A new storage structure is therefore needed. It should have two features:

1. Use as little memory as possible to store URLs.
2. Find a URL as quickly as possible (ideally, in a time that does not depend on the number of URLs).

Consider the first feature. URLs are generally long, say 50 characters on average. At 50 characters each, 1 million URLs take about 50 MB of storage even at one byte per character. Since URLs are saved only to check whether a URL has been seen, it is enough to save the hash code of each URL, at the cost that two different URLs with the same hash code are treated as one. A hash code is an int, so it takes far less space than the URL string.
For the second feature, we can use the key tree, a standard data structure. Take the number 4532 as an example. It is first converted to a string. Each node of the key tree has 10 slots (0 to 9), one per digit. The storage structure for 4532 is shown in Figure 2:

Figure 2

From the data structure above, we can see that finding an integer is only related to the number of digits of this integer, and is irrelevant to the number of integers. The implementation code of this key tree is as follows:

KeyTree implementation code
class KeyTreeNode // structure of a key tree node
{
    // Pointers to the nodes for the next digit
    public KeyTreeNode[] pointers = new KeyTreeNode[10];
    // End flags: true means an integer ends at this digit of the current node
    public bool[] endFlag = new bool[10];
}
class KeyTree
{
    private KeyTreeNode rootNode = new KeyTreeNode(); // root node
    // Add an unsigned integer to the key tree
    public void Add(uint n)
    {
        string s = n.ToString();
        KeyTreeNode tempNode = rootNode;
        int index = 0;
        for (int i = 0; i < s.Length; i++)
        {
            index = int.Parse(s[i].ToString()); // value of the current digit
            if (i == s.Length - 1) // last digit: set its end flag to true
            {
                tempNode.endFlag[index] = true;
                break;
            }
            if (tempNode.pointers[index] == null) // create the next node if it does not exist yet
                tempNode.pointers[index] = new KeyTreeNode();
            tempNode = tempNode.pointers[index];
        }
    }
    // Determine whether an integer exists
    public bool Exists(uint n)
    {
        string s = n.ToString();
        KeyTreeNode tempNode = rootNode;
        int index = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (tempNode != null)
            {
                index = int.Parse(s[i].ToString());
                // n exists if the end flag of its last digit is true
                if ((i == s.Length - 1) && (tempNode.endFlag[index] == true))
                    return true;
                else
                    tempNode = tempNode.pointers[index];
            }
            else
                return false;
        }
        return false;
    }
}

In the code above, KeyTreeNode uses an end flag rather than checking whether a pointer is null, because one stored integer can be a prefix of another, as with 4321 and 432. If only pointers were used, 432 would be considered present as soon as 4321 was saved. With the end flag, the flag for digit 2 in that node is false, which shows that 432 has not been stored. The UrlFilter class below uses this key tree to process URLs.
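
The 4321 versus 432 case can be checked concretely with the compact sketch below. It follows the same digit-by-digit idea with hypothetical names (KeyTreeSketch and its nested Node class) and computes each digit by character arithmetic instead of parsing.

```csharp
public sealed class KeyTreeSketch
{
    private sealed class Node
    {
        public readonly Node[] Next = new Node[10];     // one slot per digit 0 to 9
        public readonly bool[] EndFlag = new bool[10];  // true if a number ends on this digit
    }

    private readonly Node root = new Node();

    // Store n by walking one node per decimal digit.
    public void Add(uint n)
    {
        string s = n.ToString();
        Node node = root;
        for (int i = 0; i < s.Length; i++)
        {
            int digit = s[i] - '0';
            if (i == s.Length - 1) { node.EndFlag[digit] = true; return; }
            if (node.Next[digit] == null) node.Next[digit] = new Node();
            node = node.Next[digit];
        }
    }

    // The number of steps depends only on the number of digits in n.
    public bool Exists(uint n)
    {
        string s = n.ToString();
        Node node = root;
        for (int i = 0; i < s.Length && node != null; i++)
        {
            int digit = s[i] - '0';
            if (i == s.Length - 1) return node.EndFlag[digit];
            node = node.Next[digit];
        }
        return false;
    }
}
```

After adding 4321, a lookup for 432 walks the same nodes but finds the end flag for the digit 2 unset, so it reports that 432 is absent until 432 itself is added.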

Implementation code of the UrlFilter class
// Normalizes a URL before adding it to the key tree.
// For example, http://www.comprg.com.cn and http://www.comprg.com.cn/ are the same URL,
// so their hash codes must also be the same.
class UrlFilter
{
    public static KeyTree urlHashCode = new KeyTree();
    private static object syncUrlHashCode = new object();
    private static string ProcessUrl(string url) // normalize the URL
    {
        try
        {
            Uri uri = new Uri(url);
            string s = uri.PathAndQuery;
            if (s.Equals("/"))
                s = "";
            return uri.Host + s;
        }
        catch (Exception)
        {
            throw;
        }
    }
    private static bool Exists(string url) // determine whether the URL has been seen
    {
        try
        {
            lock (syncUrlHashCode)
            {
                url = ProcessUrl(url);
                return urlHashCode.Exists((uint)url.GetHashCode());
            }
        }
        catch (Exception)
        {
            throw;
        }
    }

    public static bool IsOK(string url)
    {
        return !Exists(url);
    }
    // Add the normalized URL to the key tree
    public static void AddUrl(string url)
    {
        try
        {
            lock (syncUrlHashCode)
            {
                url = ProcessUrl(url);
                urlHashCode.Add((uint)url.GetHashCode());
            }
        }
        catch (Exception)
        {
            throw;
        }
    }

}
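
One point to keep in mind with this approach is that string.GetHashCode is not guaranteed to return the same value across processes or runtime versions. That is harmless while the hash codes live only in memory, but it matters if they are ever saved between runs. The sketch below shows the same normalization together with a swapped-in 32-bit FNV-1a hash that is stable everywhere. UrlKeySketch and its method names are hypothetical, and FNV-1a is a substitute chosen for illustration, not something the article uses.

```csharp
using System;
using System.Text;

public static class UrlKeySketch
{
    // Host plus path and query, with a bare "/" dropped,
    // so that a URL with and without the trailing slash gets one key.
    public static string Normalize(string url)
    {
        Uri uri = new Uri(url);
        string s = uri.PathAndQuery;
        if (s == "/") s = "";
        return uri.Host + s;
    }

    // FNV-1a, 32-bit, over the UTF-8 bytes of the text.
    // Gives the same value in every process, unlike string.GetHashCode.
    public static uint StableHash(string text)
    {
        uint hash = 2166136261; // FNV offset basis
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;
            hash = unchecked(hash * 16777619); // FNV prime, wrapping on overflow
        }
        return hash;
    }
}
```

The value returned by StableHash(Normalize(url)) can be passed to a key tree in place of the cast hash code.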

VIII. Implementation of other parts

At this point all the core code of the web spider is complete. The remaining piece is a user interface that visualizes the download process. The interface is shown in Figure 3.

Figure 3

The interface uses a timer to poll the download status of the web spider every two seconds, including the number of URLs obtained and the number of network resources downloaded. This status information is stored in static variables of the Common class. For the Common class and the main interface code, see the source code provided with this article.

IX. Conclusion

At this point the whole web spider program is complete. In practice, however, a single machine is far from enough to download the resources of the entire network; many machines have to download together. This raises a hard problem: the machines must synchronize the URLs they have downloaded. Readers can build on the example in this article and turn it into a distributed web spider that downloads from multiple machines at the same time, which would greatly increase the download speed.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
