C#: introduction to and application of HtmlAgilityPack, a jQuery-like HTML parsing library

Source: Internet
Author: User


Html Agility Pack: http://htmlagilitypack.codeplex.com/

 

The Html Agility Pack source code contains about 28 classes. It is not a very complex class library, but it is far from weak: its DOM parsing support is powerful enough to be comparable to jQuery's DOM operations :)

Introduction to basic classes and basic methods

Only a few basic classes are commonly used in Html Agility Pack. For DOM parsing, the two main classes are HtmlDocument and HtmlNode, along with the HtmlNodeCollection collection class.

 

HtmlDocument class

Before parsing the DOM, you need to load the original HTML file or HTML string. The HtmlDocument class encapsulates the methods for this. The following describes how to load HTML.


The HtmlDocument class defines multiple overloads of the Load method for loading HTML in different ways. There are two main kinds: loading HTML from a stream, and loading HTML from a physical path, as shown below:


Method: public void Load(TextReader reader)
Description: Loads HTML from the specified TextReader object.
Example:

HtmlDocument doc = new HtmlDocument();
StreamReader sr = File.OpenText("file path");
doc.Load(sr);

 

 


Several overloads build on the method above.

The main Stream-based overloads are:

(1) public void Load(Stream stream) // loads HTML from the specified Stream object

(2) public void Load(Stream stream, bool detectEncodingFromByteOrderMarks) // specifies whether to detect the encoding from the byte order marks at the start of the stream

(3) public void Load(Stream stream, Encoding encoding) // specifies the encoding

(4) public void Load(Stream stream, Encoding encoding, bool detectEncodingFromByteOrderMarks)

(5) public void Load(Stream stream, Encoding encoding, bool detectEncodingFromByteOrderMarks, int buffersize)


The main path-based overloads are:

(1) public void Load(string path)

(2) public void Load(string path, bool detectEncodingFromByteOrderMarks) // specifies whether to detect the encoding from the byte order marks at the start of the file

(3) public void Load(string path, Encoding encoding) // specifies the encoding

(4) public void Load(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)

(5) public void Load(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks, int buffersize)
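As a rough illustration of the Stream-based overloads listed above, the sketch below uses overload (3), Load(Stream, Encoding). The class name LoadFromStreamSketch and the method name LoadUtf8 are made up for this example, and a MemoryStream stands in for whatever stream the HTML really comes from.

```csharp
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class LoadFromStreamSketch
{
    // Hypothetical helper: parses HTML that is available as UTF-8 bytes.
    // A MemoryStream is used here only as a stand-in for any Stream.
    public static HtmlDocument LoadUtf8(string html)
    {
        var doc = new HtmlDocument();
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            // Overload (3): the caller states the encoding explicitly.
            doc.Load(stream, Encoding.UTF8);
        }
        return doc;
    }
}
```

Stating the encoding explicitly is one way to avoid garbled text when the source has no byte order mark; whether that is needed depends on the input.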

 

The HtmlDocument class also defines a method for loading HTML directly from a string, as follows:


Method: public void LoadHtml(string html)
Description: Loads HTML from the specified HTML string.
Example:

HtmlDocument doc = new HtmlDocument();
string html = "<div id=\"demo\"><span style=\"color:red;\"><h1>Hello World!</h1></span></div>";
doc.LoadHtml(html);

 

 


The HtmlDocument class also has methods for writing to the DOM. They are not covered in detail here; DOM writing with Html Agility Pack will be introduced in a later chapter. This article focuses on the details of DOM parsing.

 

HtmlNode class and HtmlNodeCollection class


Once the HTML has been loaded through HtmlDocument, the next step is to parse it, and DOM parsing revolves around the HtmlNode class. After parsing, the HtmlDocument class exposes the root HtmlNode object of the document through its DocumentNode attribute. To obtain the HtmlNode of a particular element, use the GetElementbyId(string id) method of the HtmlDocument class, which returns the HtmlNode object of the element with the specified id. How is the DOM accessed through an HtmlNode object? Before going into that, let's look at what the class offers.


The HtmlNode class implements the IXPathNavigable interface, which means the DOM can be queried with XPath. Anyone familiar with the XmlDocument class in the System.Xml namespace, especially its SelectNodes() and SelectSingleNode() methods, will find the HtmlNode class familiar. In effect, Html Agility Pack parses HTML into an XML-like document tree, so it supports some of the common XML query methods. This section briefly describes some commonly used HtmlNode members.

 

Main attributes of the HtmlNode class

1) Attributes

Gets the set of attributes of the current HTML element and returns an HtmlAttributeCollection object. For example, a div element may define several attributes, such as: <div id="title" name="title" class="class-name" title="title div">***</div>. The HtmlAttributeCollection returned by Attributes then contains the "id, name, class, title" information. The HtmlAttributeCollection class is a collection class that implements the IList<HtmlAttribute> interface, so you can access each member with the following code.

HtmlNode node = doc.GetElementbyId("title");
string titleValue = node.Attributes["title"].Value;


Or

foreach (HtmlAttribute attr in node.Attributes)
{
    Console.WriteLine("{0}={1}", attr.Name, attr.Value);
}


If the attribute name does not exist when you read it, Attributes["name"] returns null, so accessing .Value on the result throws a NullReferenceException.
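Because a missing attribute comes back as null, one possible way to read attributes defensively is the GetAttributeValue method of HtmlNode, which takes a default value. The sketch below is only an illustration; the class name AttributeSketch, the method name ReadAttribute, and the fallback parameter are invented for this example.

```csharp
using HtmlAgilityPack;

public static class AttributeSketch
{
    // Hypothetical helper: returns the attribute value of the element with
    // the given id, or the fallback when the element or attribute is missing.
    public static string ReadAttribute(string html, string id, string name, string fallback)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.GetElementbyId(id);
        if (node == null)
        {
            return fallback; // no element with that id
        }

        // GetAttributeValue returns the supplied default instead of null
        // when the attribute does not exist.
        return node.GetAttributeValue(name, fallback);
    }
}
```

For example, ReadAttribute(html, "title", "title", "") reads the title attribute of the div shown earlier, while asking for an attribute that is not defined simply yields the fallback string.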


2) FirstChild, LastChild, ChildNodes, and ParentNode attributes


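FirstChild attribute: returns the first of the current node's child nodes. Given the following HTML:


string html = "<div id=\"demo\"><span style=\"color:red;\"><h1>Hello World!</h1></span></div>";

FirstChild of the div node returns the "<span>" node. Likewise, LastChild returns the last child node, ChildNodes returns the collection of all direct child nodes, and ParentNode returns the parent of the current node.

The OuterHtml attribute returns the HTML of a node including the node's own tag, while InnerHtml returns only the HTML inside it, as shown below:

HtmlDocument doc = new HtmlDocument();
string html = "<div id=\"demo\"><span style=\"color:red;\"><h1>Hello World!</h1></span></div>";
doc.LoadHtml(html);


HtmlNode node = doc.GetElementbyId("demo");

Console.WriteLine(node.OuterHtml); // returns "<div id="demo"><span style="color:red;"><h1>Hello World!</h1></span></div>"
Console.WriteLine(node.InnerHtml); // returns "<span style="color:red;"><h1>Hello World!</h1></span>"


To obtain the text value of a node, use the InnerText attribute. InnerText filters out all HTML markup and returns only the text, as shown below:


Console.WriteLine(node.InnerText); // returns "Hello World!"

Main methods of the HtmlNode class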

The HtmlNode class provides a rich set of methods for querying the child nodes (elements) under the current node, as well as methods for querying its parent nodes. The main methods are listed below with descriptions.


Methods for getting ancestor nodes:

1) public IEnumerable<HtmlNode> Ancestors()

Gets the list of ancestor nodes of the current node (excluding the node itself).

2) public IEnumerable<HtmlNode> Ancestors(string name)

Gets the list of ancestor nodes with the specified name (excluding the node itself).

3) public IEnumerable<HtmlNode> AncestorsAndSelf()

Gets the list of ancestor nodes of the current node (including the node itself).

4) public IEnumerable<HtmlNode> AncestorsAndSelf(string name)

Gets the list of ancestor nodes with the specified name (including the node itself).

Methods for getting child nodes:

1) public IEnumerable<HtmlNode> DescendantNodes()

Gets the list of all nodes under the current node, including the children of its children (excluding the node itself).

2) public IEnumerable<HtmlNode> DescendantNodesAndSelf()

Gets the list of all nodes under the current node, including the children of its children (including the node itself).

3) public IEnumerable<HtmlNode> Descendants()

Gets the list of all descendant nodes under the current node, not only its direct children (excluding the node itself).

4) public IEnumerable<HtmlNode> DescendantsAndSelf()

Gets the list of all descendant nodes under the current node (including the node itself).

5) public IEnumerable<HtmlNode> Descendants(string name)

Gets the list of descendant nodes with the specified name under the current node.

6) public IEnumerable<HtmlNode> DescendantsAndSelf(string name)

Gets the list of descendant nodes with the specified name under the current node (including the node itself if its name matches).

7) public HtmlNode Element(string name)

Gets the first direct child element that matches the specified name.

8) public IEnumerable<HtmlNode> Elements(string name)

Gets the list of all direct child nodes with the specified name.

9) public HtmlNodeCollection SelectNodes(string xpath)

Gets the list of nodes that match the specified XPath expression.

10) public HtmlNode SelectSingleNode(string xpath)

Gets the first node that matches the specified XPath expression.


The 10 methods above are the main ways to query nodes. The class also has a series of methods for writing nodes, which are not covered in detail here.

Querying nodes with XPath is quite powerful, and it is as convenient as working with XML.
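To make the query methods a little more concrete, here is a minimal sketch that uses Descendants(string name) and SelectNodes(string xpath). The class QuerySketch, its method names, and the sample markup are all invented for illustration. One detail worth knowing: by default SelectNodes returns null rather than an empty collection when nothing matches, so the sketch guards against that.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class QuerySketch
{
    // Hypothetical sample markup, used only for this illustration.
    public const string SampleHtml =
        "<div id=\"list\"><a class=\"link\" href=\"/a\">A</a>" +
        "<a class=\"link\" href=\"/b\">B</a><span>note</span></div>";

    // Descendants(name): text of every descendant element with the given tag name.
    public static List<string> TextsByTag(string html, string tag)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants(tag).Select(n => n.InnerText).ToList();
    }

    // SelectNodes(xpath): number of matching nodes, treating "no match" as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count; // null means nothing matched
    }
}
```

With the sample markup, TextsByTag(QuerySketch.SampleHtml, "a") collects the text of both links, and CountMatches with an XPath that matches nothing returns 0 instead of throwing.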


Code for a simple example

The following example shows how to query the list of posts in the featured ("pick") area of cnblogs (Blog Park). It prints the recommendation count, title, summary, and footer information of each post to the console and to log.txt.

 


Code

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using HtmlAgilityPack;


namespace DemoCnBlogs
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the page.
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.cnblogs.com/pick");

            HtmlNode node = doc.GetElementbyId("post_list");

            StreamWriter sw = File.CreateText("log.txt");

            foreach (HtmlNode child in node.ChildNodes)
            {
                // Skip anything that is not a post item (text nodes, separators, ...).
                if (child.Attributes["class"] == null || child.Attributes["class"].Value != "post_item")
                    continue;

                // Re-parse the child's HTML as a standalone node.
                HtmlNode hn = HtmlNode.CreateNode(child.OuterHtml);

                // child.SelectSingleNode("//*[@class=\"titlelnk\"]").InnerText would always
                // query the entire document, which is not what we want here.
                // The query should be based on the HTML of the current child node.

                Write(sw, String.Format("Recommended: {0}", hn.SelectSingleNode("//*[@class=\"diggnum\"]").InnerText));
                Write(sw, String.Format("Title: {0}", hn.SelectSingleNode("//*[@class=\"titlelnk\"]").InnerText));
                Write(sw, String.Format("Introduction: {0}", hn.SelectSingleNode("//*[@class=\"post_item_summary\"]").InnerText));
                Write(sw, String.Format("Information: {0}", hn.SelectSingleNode("//*[@class=\"post_item_foot\"]").InnerText));

                Write(sw, "----------------------------------------");
            }

            sw.Close();

            Console.ReadLine();
        }

        // Writes the same line to the console and to the log file.
        static void Write(StreamWriter writer, string str)
        {
            Console.WriteLine(str);
            writer.WriteLine(str);
        }
    }
}
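The example above works around the "queries the entire document" behavior by re-parsing each child with HtmlNode.CreateNode. An alternative that may be simpler is a relative XPath: an expression starting with "//" is evaluated from the document root, while ".//" starts from the current node. The sketch below illustrates the difference; the class RelativeXPathSketch, its method names, and the sample markup are invented for this illustration and only loosely modeled on the post list.

```csharp
using HtmlAgilityPack;

public static class RelativeXPathSketch
{
    // Hypothetical markup with two post items, used only for illustration.
    public const string SampleHtml =
        "<div id=\"post_list\">" +
        "<div class=\"post_item\"><a class=\"titlelnk\">first</a></div>" +
        "<div class=\"post_item\"><a class=\"titlelnk\">second</a></div>" +
        "</div>";

    // Returns the second post item of the sample document.
    public static HtmlNode SecondItem()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        return doc.DocumentNode.SelectNodes("//div[@class='post_item']")[1];
    }

    // "//" is absolute: the search starts at the document root,
    // so the first title in the whole document is found.
    public static string TitleAbsolute(HtmlNode item)
    {
        return item.SelectSingleNode("//*[@class='titlelnk']")?.InnerText;
    }

    // ".//" is relative: the search is limited to the item's own subtree.
    public static string TitleRelative(HtmlNode item)
    {
        return item.SelectSingleNode(".//*[@class='titlelnk']")?.InnerText;
    }
}
```

Called on the second item, TitleAbsolute still finds the title of the first post, while TitleRelative finds the item's own title, so the extra CreateNode step is not strictly needed when the XPath is written relative to the node.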

 
