ASP.NET C# Web Page Scraping Techniques


I. Detecting Page Updates
We know that the information on most web pages is constantly updated, which means a crawler has to fetch new content periodically. But how long is "periodically", i.e. how often should we re-fetch a page? In practice this interval is the page's cache time: re-fetching a page within its cache window is pointless and only puts extra load on the target server.
For example, suppose I want to scrape the cnblogs homepage. First I clear the page cache:

From Last-Modified to Expires we can see that the cnblogs cache time is 2 minutes, and the Date header also shows the current server time. If I then

refresh the page, the browser sends that Date back as If-Modified-Since, asking the server whether the browser's cached copy has expired:

Finally, the server compares the two times: when If-Modified-Since >= Last-Modified, it returns 304. Incidentally, the request also carries a surprising amount of cookie information.

In real development, once we know a site's cache policy, we can have the crawler re-fetch every 2 minutes. Of course, such settings can be configured and maintained by the data team. Next, let's simulate this with a crawler.
The code is as follows:
using System;
using System.Net;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            DateTime prevDateTime = DateTime.MinValue;

            for (int i = 0; i < 10; i++)
            {
                try
                {
                    var url = "http://cnblogs.com";

                    var request = (HttpWebRequest)WebRequest.Create(url);

                    // A HEAD request returns only the headers, which is all we need here.
                    request.Method = "HEAD";

                    // From the second request on, send the last known server time
                    // as If-Modified-Since.
                    if (i > 0)
                    {
                        request.IfModifiedSince = prevDateTime;
                    }

                    request.Timeout = 3000;

                    var response = (HttpWebResponse)request.GetResponse();

                    var code = response.StatusCode;

                    // If the server returns status 200, the page is deemed updated;
                    // remember the server time for the next round.
                    if (code == HttpStatusCode.OK)
                    {
                        prevDateTime = Convert.ToDateTime(response.Headers[HttpResponseHeader.Date]);
                    }

                    Console.WriteLine("current server status code: {0}", code);
                }
                catch (WebException ex)
                {
                    // GetResponse throws on non-success codes, including 304 (Not Modified).
                    if (ex.Response != null)
                    {
                        var code = (ex.Response as HttpWebResponse).StatusCode;

                        Console.WriteLine("current server status code: {0}", code);
                    }
                }
            }
        }
    }
}
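If cnblogs still caches for two minutes as described above, this should print 200 for the first request and NotModified (304) for requests that fall inside the cache window. On .NET 4.5 and later, the same conditional HEAD check can also be written with HttpClient, which reports a 304 as an ordinary status code instead of throwing. A minimal sketch (the URL and loop count are just placeholders):

using System;
using System.Net;
using System.Net.Http;

class ConditionalHeadDemo
{
    static void Main()
    {
        var url = "http://cnblogs.com"; // placeholder target
        DateTimeOffset? lastServerDate = null;

        using (var client = new HttpClient())
        {
            for (int i = 0; i < 10; i++)
            {
                var request = new HttpRequestMessage(HttpMethod.Head, url);

                // Send the last known server time, just like IfModifiedSince above.
                if (lastServerDate != null)
                    request.Headers.IfModifiedSince = lastServerDate;

                var response = client.SendAsync(request).Result;

                // 200 means the page changed; remember the server's Date header.
                if (response.StatusCode == HttpStatusCode.OK)
                    lastServerDate = response.Headers.Date;

                // Unlike HttpWebRequest, a 304 arrives here as a normal status code.
                Console.WriteLine("current server status code: {0}", response.StatusCode);
            }
        }
    }
}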


II. Page Encoding Problems

Sometimes we have already captured the page, but when we go to parse it, everything is garbled. Really maddening. For example:


You may vaguely remember that the html meta tag has a charset attribute that records the encoding, and another key point is that the response.CharacterSet property also records the encoding. Let's try again.

Still garbled. Painful. This time we have to look at what is actually being exchanged in the http headers: why can the browser display the page correctly while the crawler cannot?

After inspecting the http headers, we finally get it: the browser declares that it can handle three compression methods: gzip, deflate, and sdch, and the server responds with gzip-compressed content. This, by the way, is also one of the most common web performance optimizations.
The code is as follows:
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            // ISO-8859-1 is what the server reports when no charset is declared;
            // for a Chinese page, fall back to gb2312.
            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            // If the body is gzip-compressed, decompress it before decoding the text.
            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();
        }
    }
}
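Incidentally, HttpWebRequest can also do the decompression for us: setting its AutomaticDecompression property both sends the Accept-Encoding header and unwraps gzip/deflate bodies transparently. A minimal sketch, assuming .NET 2.0 or later and a gb2312-encoded page:

using System;
using System.IO;
using System.Net;
using System.Text;

class AutoDecompressDemo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://www.sohu.com/");

        // Sends "Accept-Encoding: gzip, deflate" and decompresses the response
        // automatically, so no GZipStream wrapper is needed.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            var html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}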


III. Page Parsing

Now that the page has been fetched with so much effort, it has to be parsed next. Regular-expression matching certainly works, but the workload is fairly heavy, so the tool widely recommended in the industry is HtmlAgilityPack, which parses the Html into an XML-like DOM and then lets you extract the desired content with XPath, greatly improving both development speed and performance. After all, Agility means agile. For the XPath syntax you can refer to the two W3CSchool diagrams.


The code is as follows:
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using HtmlAgilityPack;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();

            sr.Close();

            // Parse the downloaded HTML into a DOM we can query with XPath.
            var document = new HtmlDocument();

            document.LoadHtml(html);

            // Extract the title
            var title = document.DocumentNode.SelectSingleNode("//title").InnerText;

            // Extract the keywords meta tag
            var keywords = document.DocumentNode.SelectSingleNode("//meta[@name='keyword']").Attributes["content"].Value;
        }
    }
}
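Once the HtmlDocument is built, any other XPath query works the same way. As a small follow-up sketch (reusing the document object from the listing above; the XPath is just an example), here is how you might pull every hyperlink off the page. Note that SelectNodes returns null when nothing matches, so guard before iterating:

// Reuses the "document" variable from the previous listing.
var links = document.DocumentNode.SelectNodes("//a[@href]");

if (links != null)
{
    foreach (var link in links)
    {
        Console.WriteLine("{0} -> {1}", link.InnerText.Trim(), link.Attributes["href"].Value);
    }
}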


Okay, that's a wrap. Time for bed...
