Use the HTTP status code to check whether the webpage content is updated

Source: Internet
Author: User

Web crawlers and scraping tools often need to both monitor and parse pages. Monitoring means checking whether the page content has been updated. The most direct way to detect a change is to designate a region of the page as the monitored area, capture that region's content on each visit, and compare it with the locally saved (most recently captured) copy. If the two differ, the page has changed and is then parsed. This method is reliable and almost foolproof. However, every scan has to download the page, extract the monitored region, and perform a string comparison, so the whole process is time-consuming. In practice, many websites serve static content such as image slices, HTML, and JS files. The server has this content prepared ahead of time, and a user simply downloads it on access. For such static pages, the HTTP status code can be used to determine whether the content has changed.
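The region comparison described above can be sketched as follows. This is only an illustration, not the author's implementation: the class name RegionMonitor, its method names, and the idea of locating the region with two marker strings are assumptions. It also stores a SHA-256 fingerprint of the region rather than the full text, which is a substitution for the plain string comparison the article describes.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: names and the marker-based extraction are assumptions.
public static class RegionMonitor
{
    // Returns the text between startMarker and endMarker,
    // or null when either marker cannot be found.
    public static string ExtractRegion(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += startMarker.Length;

        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }

    // Reduces the region to a short fingerprint so the full text
    // does not have to be kept between scans.
    public static string Fingerprint(string region)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(region));
            return BitConverter.ToString(hash);
        }
    }

    // Compares the freshly downloaded page against the saved fingerprint.
    public static bool RegionHasChanged(string savedFingerprint, string html,
                                        string startMarker, string endMarker)
    {
        string region = ExtractRegion(html, startMarker, endMarker);
        if (region == null) return true; // markers missing: treat as changed
        return Fingerprint(region) != savedFingerprint;
    }
}
```

A caller would save the result of Fingerprint after the first scan and pass it to RegionHasChanged on each later scan. The page still has to be downloaded every time, which is the cost the rest of this article tries to avoid.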

The relevant status code is 304 (Not Modified). Its definition reads: "If the client has sent a conditional GET request that is allowed, and the content of the document has not changed (since the last access, or according to the request conditions), the server should return this status code." This explains the mechanism: add the time of the last access to the request in the If-Modified-Since header, and then judge by the status code the server returns. In general, the server returns 200 when the page has changed and 304 when it has not.
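To make the two halves of this exchange concrete, here is a minimal sketch of what goes into the header and how the returned status code is read. The class ConditionalGet and its method names are hypothetical; the only assumptions about HTTP itself are that If-Modified-Since carries a GMT date in the RFC 1123 layout and that 200 and 304 have the meanings given above.

```csharp
using System;
using System.Globalization;
using System.Net;

// Hypothetical helper: the names are illustrative only.
public static class ConditionalGet
{
    // Formats the last access time as it would appear in the
    // If-Modified-Since header, e.g. "Tue, 15 Nov 1994 08:12:31 GMT".
    public static string ToHeaderValue(DateTime lastAccess)
    {
        // "R" is the RFC 1123 pattern; it does not convert to UTC itself,
        // so the conversion is done explicitly first.
        return lastAccess.ToUniversalTime().ToString("R", CultureInfo.InvariantCulture);
    }

    // true  = the server sent the page again (treated as changed)
    // false = the server answered 304 Not Modified
    // null  = any other status says nothing about changes
    public static bool? Interpret(HttpStatusCode status)
    {
        switch (status)
        {
            case HttpStatusCode.OK: return true;
            case HttpStatusCode.NotModified: return false;
            default: return null;
        }
    }
}
```

In real code the header is usually not formatted by hand; assigning a DateTime to the request's IfModifiedSince property produces it.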

.NET provides a complete API for network communication. The implementation follows. This example requests the banner page (http://www.stats.gov.cn/top.html) of the National Bureau of Statistics and checks whether the page has changed within the last three days and within the last three months, respectively.

using System;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        string url = "http://www.stats.gov.cn/top.html";

        // Check whether the page has changed within the last three days.
        DateTime modifiedSince = DateTime.Now.AddDays(-3);
        // Printed False when this article was written.
        Console.WriteLine(PageHasChanged(url, modifiedSince));

        // Check whether the page has changed within the last three months.
        modifiedSince = DateTime.Now.AddMonths(-3);
        // Printed True when this article was written.
        Console.WriteLine(PageHasChanged(url, modifiedSince));
    }

    private static bool PageHasChanged(string url, DateTime modifiedSince)
    {
        bool changed = false;

        // Set up the request.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // The key setting: send the given time in the If-Modified-Since header.
        request.IfModifiedSince = modifiedSince;

        try
        {
            // The using block closes the response even if an error occurs.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                // Decide from the returned status code whether the content changed.
                // 200 means the page has changed.
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    changed = true;
                }
                else if (response.StatusCode == HttpStatusCode.NotModified)
                {
                    changed = false;
                }
            }
        }
        catch (WebException ex)
        {
            // For an unchanged page, GetResponse throws, and the 304 status
            // has to be read from the exception. Response is null when the
            // failure was not an HTTP response (for example a DNS error).
            HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse != null && errorResponse.StatusCode == HttpStatusCode.NotModified)
            {
                changed = false;
            }
            else
            {
                // Rethrow without resetting the stack trace.
                throw;
            }
        }

        return changed;
    }
}

Note that in this example an unchanged page surfaces as an exception: GetResponse throws a WebException for the 304 response, so the status code can only be read from the exception's Response property. In addition, a 200 status code does not always mean that the page has changed, because some servers ignore the If-Modified-Since header in the request. In practice, on a real project, this method proved able to detect changes for most static pages.
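One possible guard against servers that ignore If-Modified-Since is to also look at the Last-Modified header of a 200 response, when the server sends one. The sketch below is an assumption layered on top of the article's method, not something the article tested. The class ChangeCheck and its method are hypothetical, and the header text is assumed to come from something like response.Headers["Last-Modified"].

```csharp
using System;
using System.Globalization;
using System.Net;

// Hypothetical helper: a cross-check for servers that always answer 200.
public static class ChangeCheck
{
    public static bool HasChanged(HttpStatusCode status, string lastModifiedHeader,
                                  DateTimeOffset modifiedSince)
    {
        // 304: the server itself says nothing changed.
        if (status == HttpStatusCode.NotModified) return false;

        // TryParse returns false for a null or malformed header value.
        DateTimeOffset lastModified;
        bool parsed = DateTimeOffset.TryParse(
            lastModifiedHeader,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal,
            out lastModified);

        // 200 with a Last-Modified date no later than our reference time:
        // the server probably ignored the condition, and the page is old.
        if (status == HttpStatusCode.OK && parsed && lastModified <= modifiedSince)
        {
            return false;
        }

        // Everything else is treated as changed, to stay on the safe side.
        return true;
    }
}
```

This only helps when the server reports an honest Last-Modified value. Pages without that header still fall back to the 200 result.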

If you have better methods, especially to determine whether dynamic pages are updated, please contact us.