.NET HTML-Parsing Class Library HtmlAgilityPack: Full Usage Instructions

Source: Internet
Author: User
Tags: set cookie, xpath

In a previous article (the SouFun real-estate data-acquisition demo using GeckoWebBrowser) I mentioned a C# class library for parsing HTML: HtmlAgilityPack.

Today I finally have time to tidy up a demo and share it.

HtmlAgilityPack is a free, open-source, third-party .NET mini class library for parsing HTML documents on the server side (in a B/S application, the client can parse HTML with JavaScript and jQuery instead). At the time this article was published, the latest version of HtmlAgilityPack was 1.4.6, available at http://htmlagilitypack.codeplex.com/. The latest version supports LINQ to Objects (in the style of LINQ to XML).
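To illustrate the LINQ to Objects support mentioned above, here is a minimal sketch (not from the original article) that filters anchors with standard LINQ operators on a hard-coded HTML fragment; the class and method names are the standard HtmlAgilityPack ones.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinqDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><li><a href='/a'>A</a></li><li><a>no href</a></li></ul>");

        // Descendants() returns IEnumerable<HtmlNode>, so LINQ to Objects
        // works directly: keep only anchors that actually carry an href.
        var links = doc.DocumentNode.Descendants("a")
                       .Where(a => a.Attributes["href"] != null)
                       .Select(a => a.Attributes["href"].Value);

        foreach (var href in links)
            Console.WriteLine(href);
    }
}
```

Note that `Attributes["href"]` is null when the attribute is absent, so the `Where` filter is what keeps the second anchor out of the result.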

Getting ready:

If you have NuGet installed, you can find and install the package directly.

Otherwise, download and unzip the release. It contains three files, of which you only need HtmlAgilityPack.dll (the assembly) and HtmlAgilityPack.xml (the documentation file, which provides Visual Studio 2008 IntelliSense tips and help text). Just reference the DLL in your solution; there is nothing to install, which is very convenient.

Add a using HtmlAgilityPack; directive at the top of your C# file so you can use the types in that namespace. In practice, almost everything revolves around the HtmlDocument class, which is very similar to the XmlDocument class in the Microsoft .NET Framework: XmlDocument operates on XML documents, while HtmlDocument operates on HTML documents (it can handle XML documents too), and both work in a DOM-based way. The differences are that HtmlDocument drops methods such as GetElementsByTagName and strengthens GetElementById (which can be used directly on HtmlDocument, but not on XmlDocument).
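As a quick sketch of that DOM-style usage (not from the original article), the following loads an HTML string and looks up an element by id; note that in the real HtmlAgilityPack API the method is spelled GetElementbyId, with a lowercase "b".

```csharp
using System;
using HtmlAgilityPack;

class DomDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // LoadHtml parses an in-memory string; Load() variants read from
        // files or streams instead.
        doc.LoadHtml("<html><body><div id='content'>Hello</div></body></html>");

        // Unlike XmlDocument, the id lookup lives directly on the document.
        HtmlNode node = doc.GetElementbyId("content");
        Console.WriteLine(node.InnerText);
    }
}
```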

In HtmlAgilityPack, nodes are located mainly with XPath expressions. An XPath reference is available at http://www.w3school.com.cn/xpath/xpath_syntax.asp if you want to study it on your own.

But you can get started with a few simple ones: for example, selecting a div by element name, by its class attribute, by its id, or by a prefix of its class.

XPath at this level is still fairly simple.

Here are a few XPath examples that we will use in the code below:

"//comment()" means "all comment nodes" in XPath.

1. Get the page title: doc.DocumentNode.SelectSingleNode("//title").InnerText; Explanation: "//title" in XPath matches all title nodes, and SelectSingleNode returns the single node that satisfies the condition.

2. Get all hyperlinks: doc.DocumentNode.Descendants("a")

3. Get the input named "kw", which is equivalent to GetElementsByName(): var kwBox = doc.DocumentNode.SelectSingleNode("//input[@name='kw']");

Explanation: "//input[@name='kw']" is also XPath syntax, matching input tags whose name attribute equals "kw".

"//li/h3/a[@href]": matches every a that carries an href attribute, under an h3 under an li. (Some anchors may carry only a JS event instead of an href.)

"//div[starts-with(@class, 'Content_single')]": all qualifying divs whose class starts with the string "Content_single".

A leading "//" matches all qualifying nodes anywhere under the document, while a leading "/div" matches only starting from the root.
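The selectors above can be exercised together in one small self-contained sketch (the HTML fragment is made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<html><head><title>Demo</title></head><body>
            <input name='kw' value='query'/>
            <ul><li><h3><a href='/item1'>Item 1</a></h3></li></ul>
            <div class='Content_single_x'>Body</div></body></html>");

        // "//title": SelectSingleNode returns the first matching node.
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//title").InnerText);

        // "//input[@name='kw']": attribute-equality predicate.
        var kwBox = doc.DocumentNode.SelectSingleNode("//input[@name='kw']");
        Console.WriteLine(kwBox.GetAttributeValue("value", ""));

        // "//li/h3/a[@href]": only anchors that actually have an href.
        foreach (var a in doc.DocumentNode.SelectNodes("//li/h3/a[@href]"))
            Console.WriteLine(a.GetAttributeValue("href", ""));

        // starts-with(): class-prefix match; SelectNodes returns null when
        // nothing matches, so guard before using the collection.
        var divs = doc.DocumentNode.SelectNodes(
            "//div[starts-with(@class, 'Content_single')]");
        Console.WriteLine(divs == null ? 0 : divs.Count);
    }
}
```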

That covers the preparation. Now let's walk through the steps for reading and parsing a web page with HtmlAgilityPack.

1. Read the URL:

HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();

HtmlAgilityPack.HtmlDocument doccc = hw.Load(url); // url is the address you need to parse

ArrayList imagePaths = GetHrefs(doccc);

There are two issues you may encounter here.

One is an encoding problem, and the other is that gzip responses are not supported.

First, the encoding problem. The workaround is to not let HtmlAgilityPack fetch the URL's data itself: fetch it yourself. You may ask: if I fetch it myself, will it still parse it for me?

Don't worry, it isn't that picky; it will happily parse HTML from any stream you hand it.

Here's how:

WebProxy proxyObject = new WebProxy(ip, port); // here I am using a proxy

// send a request to the specified address

HttpWebRequest httpWReq = (HttpWebRequest)WebRequest.Create(url);

httpWReq.Proxy = proxyObject;

httpWReq.Timeout = 10000;

HttpWebResponse httpWResp = (HttpWebResponse)httpWReq.GetResponse();

StreamReader sr = new StreamReader(httpWResp.GetResponseStream(), System.Text.Encoding.GetEncoding("UTF-8"));

Notice what the code above does: you choose the encoding yourself when creating the StreamReader, then hand the stream to HtmlAgilityPack:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

doc.Load(sr);

int res = CheckIsGoodProxy(doc); // this is my own parsing function, not relevant here

sr.Close();

httpWResp.Close();

httpWReq.Abort();

The other problem is stranger. When the server returns a gzip-compressed response, you get the error "'gzip' is not a supported encoding name".

After searching on Google for quite a while, I finally found a solution that does not require replacing HtmlWeb with HttpWebRequest or WebClient to make the request. The same hook can also be used to set cookies, disguise request headers, and so on.

The following code resolves it by modifying the request as it is initiated:

HtmlWeb webClient = new HtmlWeb();

HtmlAgilityPack.HtmlWeb.PreRequestHandler handler = delegate(HttpWebRequest request)

{

    request.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";

    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

    request.CookieContainer = new System.Net.CookieContainer();

    return true;

};

webClient.PreRequest += handler;

HtmlDocument doc = webClient.Load(this.GetUrl());

Perhaps the latest version of HtmlAgilityPack will fix this problem; I look forward to it.

2. Parse with XPath.

This step is relatively simple: use XPath to select the data you want, iterate over the nodes, and read out their values.

Sample code:

private ArrayList GetHrefs(HtmlAgilityPack.HtmlDocument _doc)
{
    try
    {
        images = new ArrayList();

        HtmlNodeCollection hrefs = _doc.DocumentNode.SelectNodes("//li/h3/a[@href]");
        HtmlNodeCollection hrefs2 = _doc.DocumentNode.SelectNodes("//div[starts-with(@class, 'Content_single')]");

        if (hrefs == null)
            return new ArrayList();

        foreach (HtmlNode href in hrefs)
        {
            images.Add(href.Attributes["src"].Value);

            string hreff = href.Attributes["href"].Value;
            string title = href.Attributes["title"].Value;

            // skip entries whose title contains unwanted keywords
            if (title.IndexOf("evil") >= 0)
            {
                continue;
            }

            if (title.IndexOf("spoof") >= 0)
            {
                continue;
            }

            if (title.IndexOf("ridiculous") >= 0)
            {
                continue;
            }

            // logic to save the data goes here
        }

        return images;
    }
    catch (Exception ex)
    {
        ShowLogMsg("Error: " + ex.Message + ex.StackTrace);
        return new ArrayList();
    }
}

For every HtmlNode, you read its data like this: img.Attributes["src"].Value
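One caveat worth knowing (my own addition, not from the original article): indexing Attributes for a missing attribute yields null, so chaining .Value throws a NullReferenceException. HtmlAgilityPack's GetAttributeValue method returns a default instead, which makes the loop above safer:

```csharp
using System;
using HtmlAgilityPack;

class AttrDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<img alt='logo'/>"); // note: no src attribute

        HtmlNode img = doc.DocumentNode.SelectSingleNode("//img");

        // Safe: returns the supplied default ("") instead of throwing
        // when the attribute is absent.
        string src = img.GetAttributeValue("src", "");
        string alt = img.GetAttributeValue("alt", "");

        Console.WriteLine("src='" + src + "', alt='" + alt + "'");
    }
}
```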
