ASP.NET C#: How to Crawl Page Information - Basic Application

One: Web Page Updates
We know that the information on web pages is constantly being updated, so we need to fetch the new information periodically. But how should we understand "periodically", that is, how often should we fetch a page? In fact, this interval is simply the page's cache time: within the cache window there is no need to re-crawl the page; doing so only puts extra pressure on other people's servers.
For example, say I want to crawl the blog homepage. First, clear the page cache and look at the HTTP response headers.

From Last-Modified to Expires, we can see that the blog's cache time is 2 minutes, and we can also see the current server time in the Date header. If I refresh the page again, that Date value becomes the If-Modified-Since request header and is sent back to the server, which uses it to decide whether the browser's cache has expired.

Finally, the server finds that If-Modified-Since >= Last-Modified, so it returns 304; I also noticed that the response carries an awful lot of cookie information.
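
If you prefer to inspect these cache headers from code rather than from the browser's dev tools, here is a minimal sketch; it is my addition, not part of the original walkthrough.

using System;
using System.Net;

namespace ConsoleApplication2
{
    public class HeaderInspector
    {
        static void Main(string[] args)
        {
            var request = (HttpWebRequest)WebRequest.Create("http://cnblogs.com");

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // The gap between Date and Expires reveals the cache window
                // (2 minutes in the example above).
                Console.WriteLine("Date:          {0}", response.Headers[HttpResponseHeader.Date]);
                Console.WriteLine("Last-Modified: {0}", response.Headers[HttpResponseHeader.LastModified]);
                Console.WriteLine("Expires:       {0}", response.Headers[HttpResponseHeader.Expires]);
            }
        }
    }
}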

In actual development, if we know a site's caching strategy, we can have the crawler fetch it once every 2 minutes; of course, all of these intervals can be made configurable and maintained by the data team. OK, let's simulate this with a crawler.
Copy the code as follows:

using System;
using System.Net;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            DateTime prevDateTime = DateTime.MinValue;

            // The loop bound was garbled in the original; 10 iterations is an assumed value.
            for (int i = 0; i < 10; i++)
            {
                try
                {
                    var url = "http://cnblogs.com";

                    var request = (HttpWebRequest)WebRequest.Create(url);

                    // A HEAD request retrieves only the headers, not the body.
                    request.Method = "HEAD";

                    // From the second request onward, send If-Modified-Since.
                    if (i > 0)
                    {
                        request.IfModifiedSince = prevDateTime;
                    }

                    request.Timeout = 3000;

                    var response = (HttpWebResponse)request.GetResponse();

                    var code = response.StatusCode;

                    // If the server returns 200, the page is considered updated;
                    // remember the server time.
                    if (code == HttpStatusCode.OK)
                    {
                        prevDateTime = Convert.ToDateTime(response.Headers[HttpResponseHeader.Date]);
                    }

                    Console.WriteLine("Current server status code: {0}", code);

                    // Close the response, or the connection pool runs dry inside the loop.
                    response.Close();
                }
                catch (WebException ex)
                {
                    // A 304 Not Modified surfaces as a WebException in .NET.
                    if (ex.Response != null)
                    {
                        var code = (ex.Response as HttpWebResponse).StatusCode;

                        Console.WriteLine("Current server status code: {0}", code);
                    }
                }
            }
        }
    }
}
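
The article says we can let the crawler fetch every 2 minutes but doesn't show the scheduling itself. Here is a minimal sketch of one way to wire that up, assuming a System.Threading.Timer is acceptable; the ScheduledChecker class name and the 2-minute period are my choices, not from the original.

using System;
using System.Net;
using System.Threading;

namespace ConsoleApplication2
{
    public class ScheduledChecker
    {
        static DateTime prevDateTime = DateTime.MinValue;

        static void Main(string[] args)
        {
            // Fire immediately, then every 2 minutes (120,000 ms), matching
            // the cache window observed in the headers above.
            using (var timer = new Timer(CheckForUpdate, null, 0, 120000))
            {
                Console.ReadLine(); // keep the process alive while the timer runs
            }
        }

        static void CheckForUpdate(object state)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create("http://cnblogs.com");
                request.Method = "HEAD";
                if (prevDateTime > DateTime.MinValue)
                    request.IfModifiedSince = prevDateTime;

                using (var response = (HttpWebResponse)request.GetResponse())
                {
                    // 200 means the page changed; remember the server time for the next round.
                    prevDateTime = Convert.ToDateTime(response.Headers[HttpResponseHeader.Date]);
                    Console.WriteLine("Page updated, server time {0}", prevDateTime);
                }
            }
            catch (WebException ex)
            {
                var response = ex.Response as HttpWebResponse;
                if (response != null && response.StatusCode == HttpStatusCode.NotModified)
                    Console.WriteLine("304: page unchanged, skip this round.");
            }
        }
    }
}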


Two: The Page Encoding Problem

Sometimes we have crawled the page and are just about to parse it, only to find that it is all garbled, which is genuinely maddening.


We may vaguely remember that the HTML meta tag has a charset attribute that records the encoding, and another key point is that the response.CharacterSet property also records the encoding method. Let's try again below.

Unbelievably, it is still garbled. How annoying! At this point we need to look at the site itself and see exactly what is being exchanged in the HTTP headers: why can a browser display the page normally while our crawler cannot?

Looking at the HTTP headers, we finally understand: the browser declares, via Accept-Encoding, that it can handle the gzip, deflate, and sdch compression methods, so the server sends the body gzip-compressed. (This, by the way, is one of the common web performance optimizations.)
Copy the code as follows:

using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            // Servers that don't declare a charset fall back to ISO-8859-1;
            // for these Chinese sites, assume gb2312 in that case.
            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            // If the body is gzip-compressed, wrap it in a decompression stream first.
            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();
        }
    }
}
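
As an aside (not covered in the original article), HttpWebRequest can also handle gzip and deflate transparently via its AutomaticDecompression property, which removes the manual GZipStream branch entirely; the gb2312 fallback is still an assumption for these Chinese sites.

using System;
using System.IO;
using System.Net;
using System.Text;

namespace ConsoleApplication2
{
    public class AutoDecompressDemo
    {
        static void Main(string[] args)
        {
            var request = (HttpWebRequest)WebRequest.Create("http://www.sohu.com/");

            // The framework sends Accept-Encoding and decompresses the body for us,
            // so no manual GZipStream branch is needed.
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
            {
                var html = sr.ReadToEnd();
                Console.WriteLine("Fetched {0} characters of HTML.", html.Length);
            }
        }
    }
}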


Three: Web Page Parsing

Having gone to such lengths to get the page, the next step is to parse it. Of course, regular-expression matching is one option, but the workload is relatively large. The tool most respected in the industry is probably HtmlAgilityPack, which parses HTML into an XML-style document so that XPath can be used to extract the specified content, greatly improving development speed with decent performance; after all, "Agility" also means agile. For the XPath syntax itself, a quick look at the w3cschool reference is enough.


Copy the code as follows:

using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using HtmlAgilityPack;

namespace ConsoleApplication2
{
    public class Program
    {
        static void Main(string[] args)
        {
            // var currentUrl = "http://www.mm5mm.com/";

            var currentUrl = "http://www.sohu.com/";

            var request = WebRequest.Create(currentUrl) as HttpWebRequest;

            var response = request.GetResponse() as HttpWebResponse;

            var encode = string.Empty;

            if (response.CharacterSet == "ISO-8859-1")
                encode = "gb2312";
            else
                encode = response.CharacterSet;

            Stream stream;

            if (response.ContentEncoding.ToLower() == "gzip")
            {
                stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
            }
            else
            {
                stream = response.GetResponseStream();
            }

            var sr = new StreamReader(stream, Encoding.GetEncoding(encode));

            var html = sr.ReadToEnd();

            sr.Close();

            var document = new HtmlDocument();

            document.LoadHtml(html);

            // Extract the title.
            var title = document.DocumentNode.SelectSingleNode("//title").InnerText;

            // Extract the keywords (XPath attribute-value matching is case-sensitive).
            var keywords = document.DocumentNode.SelectSingleNode("//meta[@name='Keywords']").Attributes["Content"].Value;
        }
    }
}
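
The article stops at single nodes; as one further sketch of my own (not from the original), SelectNodes returns a whole node list, which is handy for pulling every link off a page. Note that SelectNodes returns null rather than an empty list when nothing matches.

using System;
using HtmlAgilityPack;

namespace ConsoleApplication2
{
    public class LinkExtractor
    {
        static void Main(string[] args)
        {
            var document = new HtmlDocument();
            document.LoadHtml("<html><body><a href='/a'>First</a><a href='/b'>Second</a></body></html>");

            // Guard against null before iterating: no match means null, not an empty list.
            var links = document.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (var link in links)
                {
                    Console.WriteLine("{0} -> {1}", link.InnerText, link.GetAttributeValue("href", string.Empty));
                }
            }
        }
    }
}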


All right, time to call it a day and get some sleep.