When building web crawlers and scraping tools, you often need to monitor and parse pages. Monitoring means checking whether the page content has been updated. The most direct way to tell whether a webpage has changed is to designate one region of the page as the monitored area, capture the content of that area on each pass, and compare it with the locally saved copy or the most recent capture. If the two differ, the page has changed and can then be parsed. This method is reliable and almost foolproof. However, every scan has to download the page, extract the monitored area, and then run a string comparison, so the whole process is time-consuming.
In practice, many websites serve static content such as images, HTML, and JS files. These files are prepared on the server in advance, and a visitor simply downloads them. For such static pages, the HTTP status code can be used to determine whether the content has changed.
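The region-comparison approach described above can be sketched roughly as follows. This is only an illustration: the `RegionMonitor` class, its method names, and the idea of delimiting the monitored area with a start and end marker string are assumptions made for the example, not part of the original project.

```csharp
using System;

static class RegionMonitor
{
    // Returns the text between startMarker and endMarker,
    // or null if either marker cannot be found in the page.
    public static string ExtractRegion(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += startMarker.Length;

        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }

    // Compares the monitored region of a freshly downloaded page
    // with the region saved from the previous scan.
    public static bool RegionHasChanged(string savedRegion, string html,
                                        string startMarker, string endMarker)
    {
        string current = ExtractRegion(html, startMarker, endMarker);
        // A missing region is reported as a change so the caller re-inspects the page.
        if (current == null) return true;
        return !string.Equals(current, savedRegion, StringComparison.Ordinal);
    }
}
```

The caller still has to download the full page before calling `RegionHasChanged`, which is exactly the cost the status-code approach tries to avoid.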
The relevant status code is 304 (Not Modified). Its definition reads: "if the client sends a conditional GET request that has been allowed, and the content of the document (since the last access or according to the request conditions) has not changed, the server should return this status code". This definition makes the mechanism clear. We only need to add the last access time to the request header (the If-Modified-Since header) when sending the request, and then decide based on the status code the server returns. Generally, if the page has changed, the server returns status code 200; if the page has not changed, it returns 304.
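To make the conditional GET concrete, here is a minimal sketch of what such a request looks like on the wire. The `ConditionalGetSketch` class and its method names are hypothetical and exist only for illustration; in real code the HTTP library writes this header for you.

```csharp
using System;
using System.Globalization;

static class ConditionalGetSketch
{
    // Formats a timestamp as an HTTP date (RFC 1123 style, always in GMT).
    // A DateTime with an unspecified kind is treated as local time by ToUniversalTime.
    public static string ToHttpDate(DateTime time)
    {
        return time.ToUniversalTime().ToString("R", CultureInfo.InvariantCulture);
    }

    // Builds the text of a conditional GET request carrying the last access time.
    public static string BuildRequestText(string host, string path, DateTime lastAccess)
    {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "If-Modified-Since: " + ToHttpDate(lastAccess) + "\r\n"
             + "\r\n";
    }
}
```

For example, passing `new DateTime(2024, 1, 15, 8, 30, 0, DateTimeKind.Utc)` as the last access time produces the header line `If-Modified-Since: Mon, 15 Jan 2024 08:30:00 GMT`. A server that honors the header answers with 304 and an empty body if the resource is older than that time.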
.NET provides a complete API for network communication. The implementation is shown next. This example requests the banner page (http://www.stats.gov.cn/top.html) of the National Bureau of Statistics and checks whether the page has changed within the last three days and within the last three months, respectively.
using System;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        string url = "http://www.stats.gov.cn/top.html";
        // Check whether the page has changed within the last three days.
        DateTime modifiedSince = DateTime.Now.AddDays(-3);
        // Output: False
        Console.WriteLine(PageHasChanged(url, modifiedSince));
        // Check whether the page has changed within the last three months.
        modifiedSince = DateTime.Now.AddMonths(-3);
        // Output: True
        Console.WriteLine(PageHasChanged(url, modifiedSince));
    }

    private static bool PageHasChanged(string url, DateTime modifiedSince)
    {
        bool changed = false;
        // Set up the request.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // The key setting: send the given time in the If-Modified-Since header.
        request.IfModifiedSince = modifiedSince;
        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                // Decide from the returned status code whether the content has changed.
                // 200 means the page has changed.
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    changed = true;
                }
                else if (response.StatusCode == HttpStatusCode.NotModified)
                {
                    changed = false;
                }
            }
        }
        catch (WebException ex)
        {
            // For an unchanged page, GetResponse throws, and the 304 status
            // has to be read from the response attached to the exception.
            HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse != null && errorResponse.StatusCode == HttpStatusCode.NotModified)
            {
                changed = false;
            }
            else
            {
                // Rethrow any other failure without resetting the stack trace.
                throw;
            }
        }
        return changed;
    }
}
Note that an exception is thrown when the page content has not changed, so the status code can only be obtained from the exception. In addition, a 200 status code is sometimes returned even though the page has not changed, because some servers do not honor the If-Modified-Since header in the request. Experience from real projects shows that this method works for most static pages.
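One possible way to soften the second caveat is to also look at the Last-Modified header when the server answers 200. The sketch below is an assumption-laden illustration, not the method used in the project: the `ChangeCheck` class and its method names are hypothetical, it uses `HttpClient` (which returns a 304 response instead of throwing), and it only helps when the server actually sends a Last-Modified header.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

static class ChangeCheck
{
    // Decides whether a page changed, given the status code and the
    // Last-Modified header value (null if the server did not send one).
    public static bool HasChanged(HttpStatusCode status, DateTimeOffset? lastModified,
                                  DateTimeOffset since)
    {
        if (status == HttpStatusCode.NotModified) return false;

        // The server may have ignored If-Modified-Since and answered 200 anyway,
        // so compare its Last-Modified value with our timestamp ourselves.
        if (status == HttpStatusCode.OK && lastModified.HasValue)
        {
            return lastModified.Value > since;
        }

        // No usable evidence: this sketch reports a change so the caller re-inspects the page.
        return true;
    }

    // Sends a conditional GET and applies the rule above.
    public static async Task<bool> PageHasChangedAsync(HttpClient client, string url,
                                                       DateTimeOffset since)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Get, url))
        {
            request.Headers.IfModifiedSince = since;
            using (HttpResponseMessage response = await client.SendAsync(request))
            {
                DateTimeOffset? lastModified = response.Content?.Headers.LastModified;
                return HasChanged(response.StatusCode, lastModified, since);
            }
        }
    }
}
```

Treating every other status code as "changed" is a deliberate simplification here. A real crawler would probably want to handle errors such as 404 or 500 separately.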
If you know of a better method, especially for determining whether dynamic pages have been updated, please get in touch.