A C# Crawler
The following article, which can be found on CodeProject, is studied first:
Introduction
A web crawler (also known as an ant or a spider) is a program that automatically captures web page data on the World Wide Web. Web crawlers are generally used to capture a large number of web pages for later use by a search engine. The crawled pages are indexed by specialized programs (such as Lucene and DotLucene) to speed up searching. Crawlers can also be used as link checkers or HTML code validators. A more recent use is checking e-mail addresses and preventing trackback spam.
Crawler Overview
In this article, I introduce a simple crawler program written in C#. The program captures web pages starting from a target URL that you supply. Usage is quite simple: enter the address of the website you want to crawl and press "Go".
The crawler has a queue that stores the URLs to be crawled. This design is the same as that of some large search engines. Crawling is multi-threaded: each thread retrieves a URL from the URL queue, crawls it, and stores the captured page in the specified storage folder. Web requests are made with the C# Socket library. The links on the page currently being crawled are parsed and saved to the URL queue (the crawl depth is configured in the settings).
View status
This program provides three status views:
Capture thread list
Detailed information of each capture thread
View error information
View threads
The thread list shows all working threads. Each thread takes a URI from the URI queue and connects to it.
View requests
The requests view displays a list of all recently downloaded pages and the details of their HTTP headers.
The request header of each request is displayed like this:
GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
The response header is displayed like this:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding, User-Agent
Via: 1.1 webcache (netcache netapp/6.0.1p3)
There is also a list of the links found while parsing each recently downloaded page:
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html
Settings
This program provides some parameter settings, including:
MIME types
Storage destination folder
Maximum number of crawling threads
And so on...
File Type
These are the MIME types of the files the crawler is allowed to download. The crawler comes with a default list of types, and users can add, edit, and delete MIME types. Users can also choose to allow all MIME types.
Output
The output settings include the download folder and the number of requests to keep in the requests view for reviewing request details.
Connection
The connection settings include:
Thread count: the number of concurrent crawling threads;
Thread sleep time when refs queue empty: how long each thread sleeps when the refs queue is empty;
Thread sleep time between two connections: how long each thread sleeps after handling a request. This is a very important value, because it keeps the crawler from putting a heavy load on the hosts it crawls;
Connection timeout: the timeout applied to each connection;
Navigate through pages to a depth of: the maximum crawling depth;
Keep same URL server: restricts crawling to pages on the same host as the original URL;
Keep connection alive: keeps the socket connection open to avoid the time spent reconnecting.
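The "Keep same URL server" option can be illustrated with a minimal sketch. It assumes that both URLs are absolute (a relative link would first have to be resolved against the page it was found on) and that comparing host names is what the option does. The names SameHostFilter and IsSameHost are hypothetical and do not come from the crawler's source.

```csharp
using System;

static class SameHostFilter
{
    // Hypothetical helper: true when candidateUrl is an absolute URL
    // whose host matches the host of originalUrl.
    public static bool IsSameHost(string originalUrl, string candidateUrl)
    {
        Uri original = new Uri(originalUrl, UriKind.Absolute);
        Uri candidate;
        if (!Uri.TryCreate(candidateUrl, UriKind.Absolute, out candidate))
            return false; // not an absolute URL, so it is rejected in this sketch

        // Host names are case-insensitive.
        return string.Equals(original.Host, candidate.Host,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

With this helper, a link such as http://www.cnn.com/search/ would be kept when the crawl started at http://www.cnn.com/, while a link to any other host would be skipped.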
Advanced
The advanced settings include:
Code page: the encoding used for downloaded text pages;
Restricted words: a user-defined list that lets the user keep the crawler away from bad pages;
Restricted host extensions: a user-defined list of host extensions to skip, so that the crawler is not blocked by such hosts;
Restricted file extensions: a user-defined list of file extensions to skip, to avoid parsing non-text data.
Points of interest
Keeping connections alive:
Keep-alive is a request from the client to the server to keep the connection open after the response is completed, so that it can be reused for later requests. The client asks for it by adding an HTTP header to its request, as in the following request:
GET /CNN/programs/Nancy.Grace/ HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
"Connection: Keep-Alive" tells the server not to close the connection. The server is free to keep it open or to close it, but it should tell the client which it has decided. The server can tell the client that it keeps the connection open by including "Connection: keep-alive" in its reply, as follows:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Vary: Accept-Encoding, User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (netcache netapp/6.0.1p3)
Or it can tell the client that it refuses, as follows:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: Close
Server: Apache
Vary: Accept-Encoding, User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (netcache netapp/6.0.1p3)
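As a minimal sketch of how a client might read the server's decision out of replies like the two above: the code looks for the Connection header in the raw header text. The names KeepAliveCheck and ServerKeepsAlive are hypothetical. Treating a missing header as "close" is an assumption that matches HTTP/1.0 behaviour, and it is not taken from the crawler's source.

```csharp
using System;

static class KeepAliveCheck
{
    // Hypothetical helper: returns true when the raw response header
    // contains "Connection: keep-alive" (compared case-insensitively).
    public static bool ServerKeepsAlive(string rawHeader)
    {
        string[] lines = rawHeader.Split(new[] { "\r\n" },
                                         StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon <= 0)
                continue; // status line or malformed line: no header name

            string name = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            if (name.Equals("Connection", StringComparison.OrdinalIgnoreCase))
                return value.Equals("keep-alive", StringComparison.OrdinalIgnoreCase);
        }
        return false; // assumption: no Connection header means the server closes
    }
}
```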
WebRequest and WebResponse problems:
When I started this article, my code used the WebRequest and WebResponse classes, as in the following code:
// Requires: using System.IO; using System.Net;
WebRequest request = WebRequest.Create(uri);
WebResponse response = request.GetResponse();
Stream streamIn = response.GetResponseStream();
BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
// Read the body in 10 KB chunks until the stream is exhausted.
while ((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
{
    nTotalBytes += nBytes;
}
reader.Close();
streamIn.Close();
response.Close();
This code works well, but it has a very serious problem: the WebRequest class function GetResponse blocks all other requests until the retrieved response is closed, as in the last line of the previous code. I noticed that there was always only one thread downloading while the others were waiting in GetResponse. To solve this serious problem, I implemented two classes of my own, MyWebRequest and MyWebResponse. They use the Socket class to manage connections. They are similar to WebRequest and WebResponse, but they support concurrent responses. In addition, MyWebRequest has a KeepAlive flag to support keep-alive connections. So my new code looks like this:
// Requires: using System.Net.Sockets;
request = MyWebRequest.Create(uri, request /* to Keep-Alive */, KeepAlive);
MyWebResponse response = request.GetResponse();
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
while ((nBytes = response.socket.Receive(RecvBuffer, 0, 10240, SocketFlags.None)) > 0)
{
    nTotalBytes += nBytes;
    // On a keep-alive connection the server does not close the socket,
    // so stop once Content-Length bytes have arrived.
    if (response.KeepAlive && nTotalBytes >= response.ContentLength && response.ContentLength > 0)
        break;
}
if (response.KeepAlive == false)
    response.Close();
I just replaced GetResponseStream with direct access to the socket member of the MyWebResponse class. To make that work, I used a simple trick so that the next read from the socket starts right after the reply header: the header is read one byte at a time, which makes it possible to tell exactly where it ends, as in the following code:
/* reading response header */
Header = "";
byte[] bytes = new byte[10];
while (socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
{
    Header += Encoding.ASCII.GetString(bytes, 0, 1);
    // A blank line ("\r\n\r\n") marks the end of the header.
    if (bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
        break;
}
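The same byte-at-a-time idea can be tried without a network connection. The sketch below is not the author's MyWebResponse code. It reads from any Stream (a MemoryStream in the usage example) instead of a Socket, and the names HeaderReader and ReadHeader are hypothetical. It shows that once the blank line has been consumed, the stream is positioned at the first byte of the page body.

```csharp
using System;
using System.IO;
using System.Text;

static class HeaderReader
{
    // Hypothetical helper: reads one byte at a time until the blank line
    // ("\r\n\r\n") that ends an HTTP header, and returns the header text.
    // The stream is left positioned at the first byte of the body.
    public static string ReadHeader(Stream stream)
    {
        var header = new StringBuilder();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            header.Append((char)b); // header bytes are ASCII
            if (b == '\n' && header.Length >= 4 &&
                header.ToString(header.Length - 4, 4) == "\r\n\r\n")
                break;
        }
        return header.ToString();
    }
}
```

For example, given a MemoryStream holding "HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nhello", a call to HeaderReader.ReadHeader returns the header including the blank line, and the next read from the stream starts at "hello".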
Therefore, the user of the MyWebResponse class simply continues receiving from the first position of the page.
Thread management:
The number of threads in the crawler is defined by the user in the settings. Its default value is 10 threads, but it can be changed in the connection settings. The crawler code handles this change in the ThreadCount property, as shown in the following code:
private int ThreadCount
{
    get { return nThreadCount; }
    set
    {
        Monitor.Enter(this.listViewThreads);
        for (int nIndex = 0; nIndex < value; nIndex++)
        {
            // Create and start a thread for every slot that is empty or not suspended.
            if (threadsRun[nIndex] == null || threadsRun[nIndex].ThreadState != ThreadState.Suspended)
            {
                threadsRun[nIndex] = new Thread(new ThreadStart(ThreadRunFunction));
                threadsRun[nIndex].Name = nIndex.ToString();
                threadsRun[nIndex].Start();
                if (nIndex == this.listViewThreads.Items.Count)
                {
                    ListViewItem item = this.listViewThreads.Items.Add((nIndex + 1).ToString(), 0);
                    string[] subItems = { "", "0", "0%" };
                    item.SubItems.AddRange(subItems);
                }
            }
            // Resume a thread that was suspended earlier.
            else if (threadsRun[nIndex].ThreadState == ThreadState.Suspended)
            {
                ListViewItem item = this.listViewThreads.Items[nIndex];
                item.ImageIndex = 1;
                item.SubItems[2].Text = "Resume";
                threadsRun[nIndex].Resume();
            }
        }
        nThreadCount = value;
        Monitor.Exit(this.listViewThreads);
    }
}
If the user increases ThreadCount, the code creates new threads or resumes suspended ones. Otherwise, the program leaves the job of suspending the extra working threads to the threads themselves, as follows. Each working thread has a name equal to its index in the thread array. If a thread's name value is greater than ThreadCount, the thread finishes its current job and then goes into suspended mode.
Crawling depth:
This is the depth to which the crawler goes while navigating. Each URL has an initial depth equal to its parent's depth plus one, and the first URL, entered by the user, has depth 0. A URL fetched from any page is inserted at the end of the URL queue, which makes the queue "first in, first out". All threads can insert into the queue at any time, as in the following code:
void EnqueueUri(MyUri uri)
{
    // Only one thread at a time may touch the queue.
    Monitor.Enter(queueURLS);
    try
    {
        queueURLS.Enqueue(uri);
    }
    catch (Exception)
    {
    }
    Monitor.Exit(queueURLS);
}
Each thread can retrieve the first URL in the queue in order to request it, as in the following code:
MyUri DequeueUri()
{
    Monitor.Enter(queueURLS);
    MyUri uri = null;
    try
    {
        uri = (MyUri)queueURLS.Dequeue();
    }
    catch (Exception)
    {
        // The queue is empty: uri stays null.
    }
    Monitor.Exit(queueURLS);
    return uri;
}
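To show how the depth rule and the first-in-first-out queue work together, here is a minimal single-threaded sketch. It is not the crawler's implementation: the names CrawlItem and DepthLimitedCrawl are hypothetical, the getLinks delegate stands in for "download the page and parse its links", and the set of already seen URLs is an added assumption that keeps the example from visiting a page twice.

```csharp
using System;
using System.Collections.Generic;

class CrawlItem
{
    public string Url;
    public int Depth;
    public CrawlItem(string url, int depth) { Url = url; Depth = depth; }
}

static class DepthLimitedCrawl
{
    // Visits pages in first-in-first-out order, starting at depth 0.
    // Links found on a page get the page's depth plus one, and pages
    // at maxDepth are visited but their links are not followed.
    public static List<string> Run(string startUrl, int maxDepth,
                                   Func<string, IEnumerable<string>> getLinks)
    {
        var queue = new Queue<CrawlItem>();
        var seen = new HashSet<string>();   // assumption: skip URLs already queued
        var visited = new List<string>();

        queue.Enqueue(new CrawlItem(startUrl, 0));
        seen.Add(startUrl);

        while (queue.Count > 0)
        {
            CrawlItem item = queue.Dequeue();
            visited.Add(item.Url);
            if (item.Depth >= maxDepth)
                continue; // depth limit reached: do not follow links

            foreach (string link in getLinks(item.Url))
            {
                if (seen.Add(link)) // Add returns false for a URL seen before
                    queue.Enqueue(new CrawlItem(link, item.Depth + 1));
            }
        }
        return visited;
    }
}
```

In the multi-threaded crawler, each worker would use the locked EnqueueUri and DequeueUri functions in place of the plain Queue calls used here.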