ASP.NET C# Web Page Scraping Techniques


I. Detecting Page Updates
We know that the information on most web pages is constantly updated, which means a crawler has to fetch new content periodically. But how long is "periodically", i.e. how often should we re-fetch a page? In practice this interval is the page's cache time: re-fetching a page within its cache window is pointless and only puts extra load on the target server.
For example, suppose I want to scrape the cnblogs homepage. First I clear the page cache:

From Last-Modified to Expires we can see that the cnblogs cache time is 2 minutes, and the Date header also shows the current server time. If I then

refresh the page, the browser sends that Date back as If-Modified-Since, asking the server whether the browser's cached copy has expired:

Finally, the server compares the two times: when If-Modified-Since >= Last-Modified, it returns 304. Incidentally, the request also carries a surprising amount of cookie information.

In real development, once we know a site's cache policy, we can have the crawler re-fetch every 2 minutes. Of course, such settings can be configured and maintained by the data team. Next, let's simulate this with a crawler.
The code is as follows:
using System;
using System.Net;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            DateTime prevDateTime = DateTime.MinValue;

            for (int i = 0; i < 10; i++)
            {
                try
                {
                    var url = "http://cnblogs.com";

                    var request = (HttpWebRequest)WebRequest.Create(url);

                    // A HEAD request returns only the headers, which is all we need here.
                    request.Method = "HEAD";

                    // From the second request on, send the last known server time
                    // as If-Modified-Since.
                    if (i > 0)
                    {
                        request.IfModifiedSince = prevDateTime;
                    }

                    request.Timeout = 3000;

                    var response = (HttpWebResponse)request.GetResponse();

                    var code = response.StatusCode;

                    // If the server returns status 200, the page is deemed updated;
                    // remember the server time for the next round.
                    if (code == HttpStatusCode.OK)
                    {
                        prevDateTime = Convert.ToDateTime(response.Headers[HttpResponseHeader.Date]);
                    }

                    Console.WriteLine("current server status code: {0}", code);
                }
                catch (WebException ex)
                {
                    // GetResponse throws on non-success codes, including 304 (Not Modified).
                    if (ex.Response != null)
                    {
                        var code = (ex.Response as HttpWebResponse).StatusCode;

                        Console.WriteLine("current server status code: {0}", code);
                    }
                }
            }
        }
    }
}
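If cnblogs still caches for two minutes as described above, this should print 200 for the first request and NotModified (304) for requests that fall inside the cache window. On .NET 4.5 and later, the same conditional HEAD check can also be written with HttpClient, which reports a 304 as an ordinary status code instead of throwing. A minimal sketch (the URL and loop count are just placeholders):

using System;
using System.Net;
using System.Net.Http;

class ConditionalHeadDemo
{
    static void Main()
    {
        var url = "http://cnblogs.com"; // placeholder target
        DateTimeOffset? lastServerDate = null;

        using (var client = new HttpClient())
        {
            for (int i = 0; i < 10; i++)
            {
                var request = new HttpRequestMessage(HttpMethod.Head, url);

                // Send the last known server time, just like IfModifiedSince above.
                if (lastServerDate != null)
                    request.Headers.IfModifiedSince = lastServerDate;

                var response = client.SendAsync(request).Result;

                // 200 means the page changed; remember the server's Date header.
                if (response.StatusCode == HttpStatusCode.OK)
                    lastServerDate = response.Headers.Date;

                // Unlike HttpWebRequest, a 304 arrives here as a normal status code.
                Console.WriteLine("current server status code: {0}", response.StatusCode);
            }
        }
    }
}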


II. Page Encoding Problems

Sometimes we have already captured the page, but when we go to parse it, everything is garbled. Really maddening. For example:


You may vaguely remember that the html meta tag has a charset attribute that records the encoding, and another key point is that the response.CharacterSet property also records the encoding. Let's try again.

Still garbled. Painful. This time we have to look at what is actually being exchanged in the http headers: why can the browser display the page correctly while the crawler cannot?

After inspecting the http headers, we finally get it: the browser declares that it can handle three compression methods: gzip, deflate, and sdch, and the server responds with gzip-compressed content. This, by the way, is also one of the most common web performance optimizations.
The code is as follows:
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            // ISO-8859-1 is what the server reports when no charset is declared;
            // for a Chinese page, fall back to gb2312.
            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            // If the body is gzip-compressed, decompress it before decoding the text.
            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();
        }
    }
}
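Incidentally, HttpWebRequest can also do the decompression for us: setting its AutomaticDecompression property both sends the Accept-Encoding header and unwraps gzip/deflate bodies transparently. A minimal sketch, assuming .NET 2.0 or later and a gb2312-encoded page:

using System;
using System.IO;
using System.Net;
using System.Text;

class AutoDecompressDemo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://www.sohu.com/");

        // Sends "Accept-Encoding: gzip, deflate" and decompresses the response
        // automatically, so no GZipStream wrapper is needed.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            var html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}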


III. Page Parsing

Now that the page has been fetched with so much effort, it has to be parsed next. Regular-expression matching certainly works, but the workload is fairly heavy, so the tool widely recommended in the industry is HtmlAgilityPack, which parses the Html into an XML-like DOM and then lets you extract the desired content with XPath, greatly improving both development speed and performance. After all, Agility means agile. For the XPath syntax you can refer to the two W3CSchool diagrams.


The code is as follows:
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using HtmlAgilityPack;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();

            sr.Close();

            // Parse the downloaded HTML into a DOM we can query with XPath.
            var document = new HtmlDocument();

            document.LoadHtml(html);

            // Extract the title
            var title = document.DocumentNode.SelectSingleNode("//title").InnerText;

            // Extract the keywords meta tag
            var keywords = document.DocumentNode.SelectSingleNode("//meta[@name='keyword']").Attributes["content"].Value;
        }
    }
}
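Once the HtmlDocument is built, any other XPath query works the same way. As a small follow-up sketch (reusing the document object from the listing above; the XPath is just an example), here is how you might pull every hyperlink off the page. Note that SelectNodes returns null when nothing matches, so guard before iterating:

// Reuses the "document" variable from the previous listing.
var links = document.DocumentNode.SelectNodes("//a[@href]");

if (links != null)
{
    foreach (var link in links)
    {
        Console.WriteLine("{0} -> {1}", link.InnerText.Trim(), link.Attributes["href"].Value);
    }
}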


Okay, that's a wrap. Time for bed...
