Search Engine Research: Network Spider Program Algorithm, Part I (5 parts in total)


How to Construct a Spider Program in C#
A spider is a very useful kind of program on the Internet. Search engines use spider programs to collect web pages into their databases; enterprises use spider programs to monitor competitor websites and track their changes; individual users use spider programs to download web pages for offline reading; developers use spider programs to scan their own sites for invalid links... Spider programs vary with the needs of their users. So how does a spider program work?

A spider is a semi-automated program. Just as a real spider travels across its web, a spider program travels the Internet in a similar way, by following the links between web pages. The spider program is only semi-automated because it always requires an initial link (a starting point); everything after that it determines for itself: it scans the links contained in the starting page, visits the pages those links point to, and then analyzes and follows the links contained in those pages in turn. In theory, a spider program will eventually visit every page on the Internet, because almost every page is referenced by at least a few other pages.
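Before getting into the C# details, here is a minimal sketch of that traversal idea, assuming nothing more than a queue of pending links and a set of already-visited pages. The class name CrawlSketch, the starting URL, and the DownloadAndExtractLinks placeholder are illustrative assumptions, not the article's own Spider class (which the sections below build up).

using System;
using System.Collections.Generic;

class CrawlSketch
{
    static void Main()
    {
        Queue<Uri> pending = new Queue<Uri>();       // links waiting to be visited
        HashSet<string> visited = new HashSet<string>(); // pages already seen, to avoid loops

        pending.Enqueue(new Uri("http://myhost.com/")); // the required starting point

        while (pending.Count > 0)
        {
            Uri current = pending.Dequeue();
            if (!visited.Add(current.AbsoluteUri))
                continue;                            // this page was already processed

            // Placeholder for the real work: fetch the page, save it,
            // and return the links it contains.
            foreach (Uri link in DownloadAndExtractLinks(current))
                pending.Enqueue(link);
        }
    }

    // Stub so the sketch compiles; the later sections show the real download
    // (HttpWebRequest) and link extraction (ParseHTML) code.
    static IEnumerable<Uri> DownloadAndExtractLinks(Uri page)
    {
        return new Uri[0];
    }
}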

This article describes how to use C# to construct a spider program that can download the content of an entire website to a specified directory; the program's running interface is shown in Figure 1. You can easily use the several core classes presented in this article to construct a spider program of your own.
C# is particularly suitable for constructing spider programs because it has built-in HTTP access and multithreading capabilities, and these two capabilities are critical for spider programs. The following are the key issues to be addressed when constructing a spider program:

(1) HTML parsing: an HTML parser is required to analyze every page that the spider program encounters.

(2) Page processing: each downloaded page needs to be processed. The downloaded content may be saved to disk or analyzed and processed further.

(3) Multithreading: only with multithreading can a spider program be truly efficient.

(4) Determining when the work is complete: do not underestimate this problem. Deciding whether the task has finished is not easy, especially in a multithreaded environment.

I. HTML Parsing

The C# language itself does not support HTML parsing, though it does support XML parsing. However, XML has a strict syntax, and a parser designed for XML is useless for HTML, because HTML syntax is much looser. Therefore, we need to design an HTML parser ourselves. The parser provided in this article is highly independent, and you can easily reuse it in other scenarios where you need to process HTML in C#.

The HTML parser provided in this article is implemented by the ParseHTML class and is easy to use: first, create an instance of this class and set its Source property to the HTML document to be parsed:

ParseHTML parse = new ParseHTML();
parse.Source = "<p>Hello World</p>";

Next, we can use a loop to examine all of the text and tags contained in the HTML document. Typically, this check is a while loop that tests the Eof() method:

while(!parse.Eof())
{
   char ch = parse.Parse();

The Parse method returns the characters contained in the HTML document, and it returns only the non-HTML (text) characters. When an HTML tag is encountered, Parse returns 0 to indicate that a tag has been reached. Once a tag has been encountered, we can use the GetTag() method to process it:

if(ch == 0)
{
   HTMLTag tag = parse.GetTag();
}

Generally, one of the most important tasks of a spider program is to find every href attribute, which can be done using the C# indexer. For example, the following code extracts the value of the href attribute (if one is present):

Attribute href = tag["href"];
string link = href.Value;

After obtaining the Attribute object, you can read the attribute's value through its Value property.
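Putting these pieces together, a single loop can scan a whole document and collect every link it contains. The following sketch assumes the ParseHTML, HTMLTag and Attribute classes provided with this article; the class name LinkExtractor and the method ExtractLinks are illustrative names, not part of the article's code.

using System.Collections.Generic;

class LinkExtractor
{
    // Scan an HTML document and return the value of every href attribute found.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();

        ParseHTML parse = new ParseHTML();
        parse.Source = html;

        while (!parse.Eof())
        {
            char ch = parse.Parse();
            if (ch == 0)                      // 0 means an HTML tag was reached
            {
                HTMLTag tag = parse.GetTag();
                Attribute href = tag["href"]; // indexer lookup described above
                if (href != null)
                    links.Add(href.Value);
            }
        }
        return links;
    }
}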

II. Processing HTML Pages

Next, let's look at how to handle HTML pages. The first thing to do is download the HTML page, which can be done with the HttpWebRequest class provided by the .NET Framework:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(m_uri);
response = request.GetResponse();
stream = response.GetResponseStream();

Next, we obtain a stream from the request. Before doing any other processing, we determine whether the file is a binary file or a text file, because the two types are handled differently. The following code checks whether the file is binary:

if(!response.ContentType.ToLower().StartsWith("text/"))
{
   SaveBinaryFile(response);
   return null;
}
string buffer = "", line;

If the file is not a text file, we read it as a binary file. If it is a text file, we first create a StreamReader from the stream and then append the file's content to the buffer one line at a time:

reader = new StreamReader(stream);
while((line = reader.ReadLine()) != null)
{
   buffer += line + "\r\n";
}

After the entire file is loaded, save it as a text file.

SaveTextFile(buffer);
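SaveTextFile itself belongs to the article's Spider class and is not listed in this part. As a rough idea of what such a helper might do, here is a minimal sketch; the extra filename parameter and the class name TextSaver are assumptions made for illustration (in the article, the local path would come from the same URL-to-path mapping used for binary files, described below).

using System.IO;
using System.Text;

class TextSaver
{
    // Write the accumulated page text to a local file (sketch only).
    public static void SaveTextFile(string filename, string buffer)
    {
        using (StreamWriter writer = new StreamWriter(filename, false, Encoding.UTF8))
        {
            writer.Write(buffer);
        }
    }
}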

Now let's look at how these two different types of files are saved.

A binary file's content type declaration does not start with "text/". The spider program saves a binary file directly to disk and needs no additional processing, because a binary file does not contain HTML and therefore contains no further HTML links for the spider to follow. Here is how a binary file is written.

First, prepare a buffer to temporarily hold the binary file's content.

byte[] buffer = new byte[1024];

Next, determine the path and name under which the file will be saved locally. If we want to download the content of the myhost.com website into the local c:\test folder, and the binary file's online path and name are http://myhost.com/images/logo.gif, then its local path and name should be c:\test\images\logo.gif. At the same time, we must make sure that the images subdirectory has been created under the c:\test directory. This part of the task is completed by the ConvertFilename method, a sketch of which follows.
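The article's actual ConvertFilename implementation is not listed in this part; the following is only a sketch of the kind of mapping it performs, based on the description above. The outputPath parameter (for example c:\test) and the names FilenameMapper and ConvertFilenameSketch are assumptions made for illustration.

using System;
using System.IO;

class FilenameMapper
{
    // Map an online URI such as http://myhost.com/images/logo.gif to a local
    // path such as c:\test\images\logo.gif, creating any missing sub-directories.
    public static string ConvertFilenameSketch(Uri uri, string outputPath)
    {
        string relative = uri.AbsolutePath.TrimStart('/')
                             .Replace('/', Path.DirectorySeparatorChar);
        string filename = Path.Combine(outputPath, relative);

        // Make sure the directory part (e.g. c:\test\images) exists.
        Directory.CreateDirectory(Path.GetDirectoryName(filename));

        return filename;
    }
}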

string filename = ConvertFilename(response.ResponseUri);

The ConvertFilename method splits the HTTP address apart and creates the corresponding directory structure. After the output file's name and path have been determined, we can open the input stream for reading the web page and the output stream for writing the local file.

Stream outStream = File.Create(filename);
Stream inStream = response.GetResponseStream();
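To finish saving the binary file, a typical approach is a read/write loop that reuses the 1024-byte buffer prepared above. The following is a sketch of such a loop, given as an illustrative assumption rather than the article's listed code.

int bytesRead;
do
{
    // Read a chunk from the web response and write it to the local file.
    bytesRead = inStream.Read(buffer, 0, buffer.Length);
    if (bytesRead > 0)
        outStream.Write(buffer, 0, bytesRead);
} while (bytesRead > 0);

outStream.Close();
inStream.Close();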
