Html Agility Pack: http://htmlagilitypack.codeplex.com/
The Html Agility Pack source contains roughly 28 classes. It is not a very complex class library, but it is far from weak: its DOM parsing support is strong enough that working with it is comparable to manipulating the DOM with jQuery :)
Introduction to the basic classes and methods
Html Agility Pack has only a few commonly used base classes. For parsing the DOM there are just two, HtmlDocument and HtmlNode, plus the HtmlNodeCollection collection class.
HtmlDocument class
Before the DOM can be parsed, the original HTML file or HTML string has to be loaded. The HtmlDocument class encapsulates the methods for this, which are introduced below.
The HtmlDocument class defines several overloaded Load methods for loading HTML in different ways. There are two main kinds: one loads HTML from a stream, the other loads HTML from a physical path:
Method: public void Load(TextReader reader)
Description: loads HTML from the specified TextReader object
Example:
HtmlDocument doc = new HtmlDocument();
StreamReader sr = File.OpenText("file path");
doc.Load(sr);
Several overloads are derived from the method above.
The overloads that take a Stream object are:
(1) public void Load(Stream stream) // loads HTML from the specified stream object
(2) public void Load(Stream stream, bool detectEncodingFromByteOrderMarks) // specifies whether to detect the encoding from the byte order marks at the start of the stream
(3) public void Load(Stream stream, Encoding encoding) // specifies the encoding
(4) public void Load(Stream stream, Encoding encoding, bool detectEncodingFromByteOrderMarks)
(5) public void Load(Stream stream, Encoding encoding, bool detectEncodingFromByteOrderMarks, int buffersize)
The overloads that take a physical path are:
(1) public void Load(string path)
(2) public void Load(string path, bool detectEncodingFromByteOrderMarks) // specifies whether to detect the encoding from the byte order marks at the start of the file
(3) public void Load(string path, Encoding encoding) // specifies the encoding
(4) public void Load(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)
(5) public void Load(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks, int buffersize)
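As a rough illustration of the stream overloads, the sketch below parses HTML held in memory with an explicit encoding. The class name LoadDemo, the helper LoadUtf8 and the sample markup are made up for this example; only Load(Stream, Encoding) and GetElementbyId come from the library.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

static class LoadDemo
{
    // Hypothetical helper: parses an HTML string by way of a UTF-8 byte stream.
    public static HtmlDocument LoadUtf8(string html)
    {
        var doc = new HtmlDocument();
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            // Stream overload (3): the caller states the encoding explicitly.
            doc.Load(stream, Encoding.UTF8);
        }
        return doc;
    }

    static void Main()
    {
        // Sample markup is an assumption, not taken from a real page.
        HtmlDocument doc = LoadUtf8("<span id=\"msg\">hello</span>");
        Console.WriteLine(doc.GetElementbyId("msg").InnerText);
    }
}
```

The same pattern should apply to a FileStream or a network response stream; passing the encoding explicitly is mainly useful when the source has no byte order mark.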
The HtmlDocument class can also load HTML directly from an HTML string:
Method: public void LoadHtml(string html)
Description: loads HTML from the specified HTML string
Example:
HtmlDocument doc = new HtmlDocument();
string html = "<div id=\"demo\"><span style=\"color:red;\">Hello world!</span></div>";
doc.LoadHtml(html);
The HtmlDocument class also defines methods for writing to the DOM. They are not covered here and are left for a later chapter on writing the DOM with Html Agility Pack; this article focuses on how Html Agility Pack parses the DOM.
HtmlNode class and HtmlNodeCollection class
What comes after loading HTML through HtmlDocument? Parsing it, of course, and parsing the DOM brings us to the HtmlNode class. Once the HTML has been parsed, the HtmlDocument class exposes the root HtmlNode of the document through its DocumentNode property. To get the HtmlNode of a particular element, call the HtmlDocument method GetElementbyId(string id), which returns the HtmlNode for the HTML element with that id. How is the DOM accessed through an HtmlNode object? Before going into that, let's look at its features.
The HtmlNode class implements the IXPathNavigable interface, which means the DOM can be queried with XPath. Anyone who knows the XmlDocument class in the System.Xml namespace, and especially anyone who has used its SelectNodes() and SelectSingleNode() methods, will find the HtmlNode class familiar. Internally, Html Agility Pack parses HTML into an XML-like document structure, so it supports some of the query methods that are common in XML. The main commonly used members of HtmlNode are described briefly below.
Main properties of the HtmlNode class
1) Attributes property
Gets the collection of attributes of the current HTML element and returns an HtmlAttributeCollection object. A div element, for example, may define several attributes: <div id="title" name="title" class="class-name" title="title div">***</div>. The HtmlAttributeCollection returned by Attributes then contains "id, name, class, title". The HtmlAttributeCollection class is a collection class that implements the IList<HtmlAttribute> interface, so an attribute can be read by name or the whole collection can be enumerated:
HtmlNode node = doc.GetElementbyId("title");
string titleValue = node.Attributes["title"].Value;
Or
foreach (HtmlAttribute attr in node.Attributes)
{
    Console.WriteLine("{0}={1}", attr.Name, attr.Value);
}
When reading an attribute value, note that Attributes["name"] returns null if no attribute with that name exists.
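Because a missing attribute comes back as null, reading .Value directly can throw. A minimal sketch of a guarded read follows; the class AttributeDemo and the helper ReadAttribute are hypothetical names for this example, while the Attributes indexer and GetAttributeValue(name, default) are library members.

```csharp
using System;
using HtmlAgilityPack;

static class AttributeDemo
{
    // Hypothetical helper: returns the attribute value, or a fallback when it is missing.
    public static string ReadAttribute(HtmlNode node, string name, string fallback)
    {
        HtmlAttribute attr = node.Attributes[name]; // null when the attribute is absent
        return attr == null ? fallback : attr.Value;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Sample markup is an assumption made for this sketch.
        doc.LoadHtml("<div id=\"title\" class=\"class-name\">text</div>");
        HtmlNode node = doc.GetElementbyId("title");

        Console.WriteLine(ReadAttribute(node, "class", "(none)"));
        Console.WriteLine(ReadAttribute(node, "title", "(none)"));

        // GetAttributeValue performs the same check in a single call.
        Console.WriteLine(node.GetAttributeValue("title", "(none)"));
    }
}
```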
2) FirstChild, LastChild, ChildNodes and ParentNode properties
FirstChild property: returns the first of the current node's child nodes. Take the following HTML:
string html = "<div id=\"demo\"><span style=\"color:red;\">Hello world!</span><div id=\"innerDiv\">inner div</div></div>";
For the div with id "demo", FirstChild returns the <span style="color:red;"> node.
LastChild property: returns the last of the current node's child nodes. With the HTML above, it returns the <div id="innerDiv">inner div</div> node.
ChildNodes property: returns the collection of direct child nodes of the current node, excluding deeper descendants. With the HTML above, it returns the <span style="color:red;"> node and the <div id="innerDiv"> node.
ParentNode property: Returns the immediate parent node of the current node.
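The four navigation properties can be tried out with a short sketch like the one below. The class NavigationDemo, the helper Find and the markup are assumptions for illustration; the markup has the same shape as the snippet discussed above.

```csharp
using System;
using HtmlAgilityPack;

static class NavigationDemo
{
    // Hypothetical helper: parses a fragment and returns the element with the given id.
    public static HtmlNode Find(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.GetElementbyId(id);
    }

    static void Main()
    {
        string html = "<div id=\"demo\"><span style=\"color:red;\">Hello world!</span>"
                    + "<div id=\"innerDiv\">inner div</div></div>";
        HtmlNode demo = Find(html, "demo");

        Console.WriteLine(demo.FirstChild.Name);     // tag name of the first child
        Console.WriteLine(demo.LastChild.Id);        // id attribute of the last child
        Console.WriteLine(demo.ChildNodes.Count);    // direct children only
        Console.WriteLine(demo.LastChild.ParentNode == demo);
    }
}
```

Note that whitespace between tags is parsed into text nodes, so in hand-formatted HTML FirstChild is often a text node rather than an element.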
3) Getting HTML source and text
The HtmlNode class provides the OuterHtml and InnerHtml properties for getting the HTML source of the current node. The difference is that OuterHtml returns all of the HTML including the current node's own tag, while InnerHtml returns only the HTML of the child nodes inside the current node. For example:
HtmlDocument doc = new HtmlDocument();
string html = "<div id=\"demo\"><span style=\"color:red;\">Hello world!</span></div>";
doc.LoadHtml(html);
HtmlNode node = doc.GetElementbyId("demo");
Console.WriteLine(node.OuterHtml); // returns <div id="demo"><span style="color:red;">Hello world!</span></div>
Console.WriteLine(node.InnerHtml); // returns <span style="color:red;">Hello world!</span>
To get the text value of a node, use the InnerText property. InnerText filters out all HTML markup and returns only the text:
Console.WriteLine(node.InnerText); // returns Hello world!
The main methods of the HtmlNode class
The HtmlNode class provides a rich set of methods for querying the child nodes (elements) under the current node, as well as methods for querying its parent nodes (elements). The main methods and their usage are listed below.
Methods for getting parent nodes:
1) public IEnumerable<HtmlNode> Ancestors()
Gets the list of ancestor nodes of the current node (not including itself).
2) public IEnumerable<HtmlNode> Ancestors(string name)
Gets the list of ancestor nodes with the specified name (not including itself).
3) public IEnumerable<HtmlNode> AncestorsAndSelf()
Gets the list of ancestor nodes of the current node, including itself.
4) public IEnumerable<HtmlNode> AncestorsAndSelf(string name)
Gets the list of ancestor nodes with the specified name, including itself.
Methods for getting child nodes:
1) public IEnumerable<HtmlNode> DescendantNodes()
Gets the list of all nodes under the current node, including children of children (not including itself).
2) public IEnumerable<HtmlNode> DescendantNodesAndSelf()
Gets the list of all nodes under the current node, including children of children (including itself).
3) public IEnumerable<HtmlNode> Descendants()
Gets the list of descendant nodes under the current node (not including itself). Like DescendantNodes(), this walks the whole subtree; use the ChildNodes property when only the immediate children are wanted.
4) public IEnumerable<HtmlNode> DescendantsAndSelf()
Gets the list of descendant nodes under the current node, including itself.
5) public IEnumerable<HtmlNode> Descendants(string name)
Gets the list of descendant nodes under the current node that have the specified name.
6) public IEnumerable<HtmlNode> DescendantsAndSelf(string name)
Gets the list of descendant nodes under the current node that have the specified name, including itself.
7) public HtmlNode Element(string name)
Gets the first direct child node that matches the specified name.
8) public IEnumerable<HtmlNode> Elements(string name)
Gets the list of all direct child nodes that match the specified name.
9) public HtmlNodeCollection SelectNodes(string xpath)
Gets the list of nodes that match the specified XPath.
10) public HtmlNode SelectSingleNode(string xpath)
Gets the single node that matches the specified XPath.
These 10 methods are the main ways to query nodes. The class also has a series of methods for writing nodes; they are not introduced here and are left for a detailed introduction later.
Querying nodes with XPath is a powerful feature, and it is as convenient as working with XML.
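To make the difference between the subtree queries, the direct-child queries and XPath more concrete, here is a small sketch. The class QueryDemo, the helper LoadList and the markup are invented for the example; the markup only loosely imitates a list of posts.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class QueryDemo
{
    // Assumed markup, not taken from a real page.
    public const string Html =
        "<div id=\"post_list\">" +
        "<div class=\"post_item\"><a class=\"titlelnk\" href=\"/a\">First</a></div>" +
        "<div class=\"post_item\"><a class=\"titlelnk\" href=\"/b\">Second</a></div>" +
        "</div>";

    // Hypothetical helper: parses the sample and returns the list container.
    public static HtmlNode LoadList()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.GetElementbyId("post_list");
    }

    static void Main()
    {
        HtmlNode list = LoadList();

        // Descendants(name) walks the whole subtree, so it finds the nested links.
        Console.WriteLine(list.Descendants("a").Count());

        // Element/Elements only look at direct children; the links are one level deeper.
        Console.WriteLine(list.Element("a") == null);
        Console.WriteLine(list.Elements("div").Count());

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = list.SelectNodes(".//a[@class='titlelnk']");
        if (links != null)
        {
            foreach (HtmlNode link in links)
                Console.WriteLine("{0} -> {1}", link.InnerText, link.GetAttributeValue("href", ""));
        }
    }
}
```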
A simple code example
The following example queries the list of featured ("pick") blog posts on cnblogs (Blog Park) and writes each entry to the console and to a log file.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using HtmlAgilityPack;
namespace DemoCnBlogs
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.cnblogs.com/pick/");
            HtmlNode node = doc.GetElementbyId("post_list");
            StreamWriter sw = File.CreateText("log.txt");
            foreach (HtmlNode child in node.ChildNodes)
            {
                // Skip anything that is not a post entry (text nodes, other elements).
                if (child.Attributes["class"] == null || child.Attributes["class"].Value != "post_item")
                    continue;
                HtmlNode hn = HtmlNode.CreateNode(child.OuterHtml);
                // Querying with child.SelectSingleNode("//*[@class=\"titlelnk\"]").InnerText would always
                // search from the root of the entire document, which is not what we want here.
                // Re-creating the node from the child's own HTML makes that HTML the base of the query.
                Write(sw, String.Format("recommended: {0}", hn.SelectSingleNode("//*[@class=\"diggnum\"]").InnerText));
                Write(sw, String.Format("title: {0}", hn.SelectSingleNode("//*[@class=\"titlelnk\"]").InnerText));
                Write(sw, String.Format("description: {0}", hn.SelectSingleNode("//*[@class=\"post_item_summary\"]").InnerText));
                Write(sw, String.Format("info: {0}", hn.SelectSingleNode("//*[@class=\"post_item_foot\"]").InnerText));
                Write(sw, "----------------------------------------");
            }
            sw.Close();
            Console.ReadLine();
        }
        // Writes the same line to the console and to the log file.
        static void Write(StreamWriter writer, string str)
        {
            Console.WriteLine(str);
            writer.WriteLine(str);
        }
    }
}
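The example works around the document-wide "//" search by re-creating each node from its own HTML. A possible alternative, sketched below under assumed markup, is a relative XPath: a leading "." anchors the search at the node the method is called on. The class RelativeXPathDemo and the helper TitleOf are hypothetical names; the sketch parses a local string instead of fetching the live page.

```csharp
using System;
using HtmlAgilityPack;

static class RelativeXPathDemo
{
    // Hypothetical helper: returns the title text found inside one post item, or null.
    public static string TitleOf(HtmlNode item)
    {
        // ".//" searches only the subtree of `item`; "//" would start at the document root.
        HtmlNode link = item.SelectSingleNode(".//*[@class='titlelnk']");
        return link == null ? null : link.InnerText;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Assumed markup, shaped like the list the example above reads.
        doc.LoadHtml(
            "<div id=\"post_list\">" +
            "<div class=\"post_item\"><a class=\"titlelnk\">First</a></div>" +
            "<div class=\"post_item\"><a class=\"titlelnk\">Second</a></div>" +
            "</div>");

        foreach (HtmlNode item in doc.GetElementbyId("post_list").ChildNodes)
            Console.WriteLine(TitleOf(item));
    }
}
```

This avoids re-parsing each item's HTML, and the null check keeps the loop from throwing when an item lacks the expected element.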
Reposted from: http://www.cnblogs.com/huangcong/p/3408309.html
[Repost] C#: HtmlAgilityPack, a jQuery-like HTML parsing library: introduction to the basic classes and their use