1. Introduction to HtmlAgilityPack
The first problem you run into when crawling a website is fetching and parsing HTML. When only a small amount of information is needed from a page, a regular expression can match the target precisely. Regular expressions are hard to write, however, and their precision is hard to get right: a pattern that is too strict is tightly coupled to the original page and breaks as soon as the page markup changes, while a pattern that is too loose is likely to match far more than intended. This article therefore introduces HtmlAgilityPack, which locates the target by parsing the structure of the HTML.
HtmlAgilityPack is a class library for parsing HTML. It supports querying the parsed HTML with XPath, so HTML can be handled much like XML.
The HtmlAgilityPack source code is hosted on CodePlex: http://htmlagilitypack.codeplex.com/, but getting the latest version through NuGet is recommended.
2. Introduction to XPath
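XPath (XML Path Language) is a language for addressing parts of an XML document. It treats the document as a tree and provides the means to locate nodes in that tree. The main XPath path expressions are:
    nodename   selects the child elements named nodename
    /          selects from the root node
    //         selects matching nodes anywhere below the current node, regardless of their position
    .          selects the current node
    ..         selects the parent of the current node
    @          selects attributes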
These XML path expressions can also be used when parsing HTML because HtmlAgilityPack normalizes the HTML it loads: loosely structured or malformed markup is turned into a stricter, XHTML-like form, which can even be converted to XML. XPath can then be used to select and manipulate elements in the resulting DOM.
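As a rough illustration of how path expressions such as // and @ apply to HTML, the sketch below runs two XPath queries against a small inline document. The markup, the class name XPathDemo, and the method names are invented for this example; only the HtmlAgilityPack calls (LoadHtml, SelectNodes, SelectSingleNode, GetAttributeValue) come from the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class XPathDemo
{
    // Hypothetical sample markup, invented for this illustration.
    public const string Html =
        "<html><body>" +
        "<div id='menu'><a href='/home'>Home</a><a href='/news'>News</a></div>" +
        "<div class='intro'>Hello</div>" +
        "</body></html>";

    // "//div[@id='menu']/a": every <a> that is a direct child of the div with id "menu".
    public static string[] MenuLinks()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//div[@id='menu']/a");
        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (links == null) return new string[0];
        return links.Select(a => a.GetAttributeValue("href", "")).ToArray();
    }

    // "//div[@class='intro']": the div whose class attribute is exactly "intro".
    public static string IntroText()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode intro = doc.DocumentNode.SelectSingleNode("//div[@class='intro']");
        return intro == null ? null : intro.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", MenuLinks())); // prints: /home, /news
        Console.WriteLine(IntroText());                    // prints: Hello
    }
}
```

Both queries start from DocumentNode, so the leading // searches the whole document.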
3. Commonly Used APIs in HtmlAgilityPack
The classes used most often in HtmlAgilityPack are HtmlDocument, HtmlNodeCollection, HtmlNode, and HtmlWeb.
The first step is to load the HTML. Static HTML that already exists locally can be loaded with the Load() or LoadHtml() method of HtmlDocument; a page addressed by URL has to be fetched with the Load() method of HtmlWeb (HtmlWeb also has a Get() method, which saves the downloaded page to a file).
Whichever loading method is used, we end up working with an HtmlDocument instance. To obtain an HtmlNode or HtmlNodeCollection from it, start from the DocumentNode property of HtmlDocument, which is the root node of the whole HTML document and is itself an HtmlNode.
Once you have the document root node, you can use the XPath expressions described in the previous section to reach any node in the document.
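Here is a typical example that extracts a piece of useful content, the page title:
    using HtmlAgilityPack;

    // Download the page and parse it into a document.
    HtmlWeb htmlWeb = new HtmlWeb();
    HtmlDocument htmlDoc = htmlWeb.Load("http://www.baidu.com");
    // SelectSingleNode returns null when the XPath matches nothing.
    HtmlNode htmlNode = htmlDoc.DocumentNode.SelectSingleNode("//title");
    string title = htmlNode != null ? htmlNode.InnerText : string.Empty;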
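The example above needs network access. For HTML that is already in memory, the same lookup can be done with LoadHtml(), roughly as sketched below. The class name LoadDemo, the helper GetTitle, and the sample markup are assumptions made for this illustration.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LoadDemo
{
    // Hypothetical helper: returns the trimmed <title> text, or null if there is no <title>.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse a string instead of downloading a page
        HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
        return titleNode == null ? null : titleNode.InnerText.Trim();
    }

    public static void Main()
    {
        string html = "<html><head><title> Demo page </title></head><body></body></html>";
        Console.WriteLine(GetTitle(html)); // prints: Demo page
    }
}
```

Keeping the parsing step in a function that takes a string makes it easy to try XPath expressions offline before pointing the crawler at a live URL.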
Since HtmlNode is the most frequently used class, its commonly used properties and methods are listed below for reference.
Properties:
Attributes: gets the collection of attributes of the node.
ChildNodes: gets the collection of child nodes (including text nodes).
FirstChild: gets the first child node.
HasAttributes: indicates whether the node has any attributes.
HasChildNodes: indicates whether the node has any child nodes.
Id: gets the value of the node's id attribute.
InnerHtml: gets the HTML markup inside the node.
InnerText: gets the text content of the node. Unlike InnerHtml, which returns the content together with its HTML markup, InnerText strips the markup.
LastChild: gets the last child node.
Name: gets the name of the HTML element.
NextSibling: gets the next sibling node.
ParentNode: gets the parent node of the node.
PreviousSibling: gets the previous sibling node.
XPath: gets the XPath expression that locates the node in the document.
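To make the properties listed above more concrete, here is a small sketch that reads several of them from one node. The markup, the class name NodePropertiesDemo, and the helper LoadBox are invented for this example.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class NodePropertiesDemo
{
    // Hypothetical helper: parses a tiny fragment and returns its outer <div>.
    public static HtmlNode LoadBox()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='box'><span>Apple</span><em><b>Pear</b></em></div>");
        return doc.DocumentNode.SelectSingleNode("//div");
    }

    public static void Main()
    {
        HtmlNode box = LoadBox();
        Console.WriteLine(box.Name);                 // div
        Console.WriteLine(box.Id);                   // box
        Console.WriteLine(box.ChildNodes.Count);     // 2
        Console.WriteLine(box.FirstChild.InnerText); // Apple
        Console.WriteLine(box.LastChild.InnerHtml);  // <b>Pear</b>
        Console.WriteLine(box.LastChild.InnerText);  // Pear
        // Navigation properties link the nodes to each other.
        Console.WriteLine(box.FirstChild.NextSibling == box.LastChild);
        Console.WriteLine(box.LastChild.ParentNode.Name);
        // XPath reports where the node sits in the parsed tree.
        Console.WriteLine(box.XPath);
    }
}
```

The InnerHtml and InnerText lines for the last child show the difference noted in the list: the first keeps the nested markup, the second returns only the text.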
Methods:
HtmlNode AppendChild(HtmlNode newChild); Appends the given node to the end of the calling node's children.
void AppendChildren(HtmlNodeCollection newChildren); Appends the nodes in the given collection to the end of the calling node's children.
HtmlNode PrependChild(HtmlNode newChild); Inserts the given node at the front of the calling node's children.
void PrependChildren(HtmlNodeCollection newChildren); Inserts all nodes in the given collection at the front of the calling node's children.
HtmlNode Clone(); Clones this node into a new node.
HtmlNode CloneNode(bool deep); Clones the node into a new node; the parameter determines whether the child nodes are cloned as well.
HtmlNode CloneNode(string newName); Clones the node and changes the element name of the clone.
HtmlNode CloneNode(string newName, bool deep); Clones the node and changes the element name of the clone; the second parameter determines whether the child nodes are cloned as well.
void CopyFrom(HtmlNode node); Copies the given node, together with the subtree below it, into the calling node.
void CopyFrom(HtmlNode node, bool deep); Copies the given node into the calling node; the second parameter determines whether the subtree is copied as well.
static HtmlNode CreateNode(string html); Static method that creates a new node from an HTML string.
HtmlNode Element(string name); Gets the first child element with the given name.
bool GetAttributeValue(string name, bool def); Helper method that gets the value of the named attribute of this node as a Boolean. If the attribute is not found, the default value is returned.
int GetAttributeValue(string name, int def); Helper method that gets the value of the named attribute of this node as an integer. If the attribute is not found, the default value is returned.
string GetAttributeValue(string name, string def); Helper method that gets the value of the named attribute of this node as a string. If the attribute is not found, the default value is returned.
HtmlNode InsertAfter(HtmlNode newChild, HtmlNode refChild); Inserts the new node immediately after the second parameter node, so the two become siblings.
HtmlNode InsertBefore(HtmlNode newChild, HtmlNode refChild); Inserts the new node immediately before the second parameter node, so the two become siblings.
static bool IsCDataElement(string name); Determines whether the element with the given name is a CDATA element.
static bool IsClosedElement(string name); Determines whether the element with the given name is closed.
static bool IsEmptyElement(string name); Determines whether the element with the given name is defined as empty.
static bool IsOverlappedClosingElement(string text); Determines whether the text corresponds to the closing tag of a node that is allowed to overlap.
void Remove(); Removes the calling node from its parent's collection.
void RemoveAll(); Removes all child nodes and attributes of the calling node.
void RemoveAllChildren(); Removes all child nodes of the calling node.
HtmlNode RemoveChild(HtmlNode oldChild); Removes the specified child node from the calling node.
HtmlNode RemoveChild(HtmlNode oldChild, bool keepGrandChildren); Removes the specified child node from the calling node; the second parameter determines whether the grandchild nodes are kept.
HtmlNode ReplaceChild(HtmlNode newChild, HtmlNode oldChild); Replaces a child node of the calling node with a new node; the second parameter is the old node.
HtmlNodeCollection SelectNodes(string xpath); Gets a collection of nodes matching the XPath expression.
HtmlNode SelectSingleNode(string xpath); Gets the first node matching the XPath expression.
HtmlAttribute SetAttributeValue(string name, string value); Sets the value of an attribute on the calling node.
string WriteContentTo(); Saves all the children of the node to a string.
void WriteContentTo(TextWriter outText); Saves all the children of the node to the specified TextWriter.
string WriteTo(); Saves the current node to a string.
void WriteTo(TextWriter outText); Saves the current node to the specified TextWriter.
void WriteTo(XmlWriter writer); Saves the current node to the specified XmlWriter.
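A few of the methods listed above can be combined as in the following sketch, which reads attributes with defaults and then edits the tree. The markup, the data-rank attribute, the class name NodeMethodsDemo, and the helper EditBox are all invented for this illustration.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class NodeMethodsDemo
{
    // Hypothetical helper: edits a small fragment and returns the modified <div>.
    public static HtmlNode EditBox()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='box'><a href='/a' data-rank='2'>A</a><span>old</span></div>");
        HtmlNode box = doc.DocumentNode.SelectSingleNode("//div[@id='box']");
        HtmlNode link = box.SelectSingleNode("./a");

        // Typed overloads of GetAttributeValue fall back to the default when the attribute is missing.
        int rank = link.GetAttributeValue("data-rank", 0);          // attribute present
        string target = link.GetAttributeValue("target", "_self");  // attribute absent, default used
        Console.WriteLine("rank={0}, target={1}", rank, target);    // prints: rank=2, target=_self

        link.SetAttributeValue("href", "/b");                 // change an existing attribute
        box.AppendChild(HtmlNode.CreateNode("<em>new</em>")); // add a node built from a string
        box.SelectSingleNode("./span").Remove();              // detach the <span> from its parent
        return box;
    }

    public static void Main()
    {
        HtmlNode box = EditBox();
        Console.WriteLine(box.WriteTo()); // serialized markup after the edits
    }
}
```

After the edits, the div contains the updated link followed by the new em element, and the span is gone.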
4. In Practice
With the basics covered, let's work through a practical exercise. The snippet below, from a crawler, reads the total page count from a pagination node and then fetches and saves the commodity list page by page:
    // html, category, GetCommodityList, commodityRepository, and logger
    // are defined elsewhere in the crawler project.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html); // Load the HTML
    string pageNumberPath = @"//*[@id='j_toppage']/span/i";
    HtmlNode pageNumberNode = doc.DocumentNode.SelectSingleNode(pageNumberPath);
    if (pageNumberNode != null)
    {
        string sNumber = pageNumberNode.InnerText;
        int pageCount = int.Parse(sNumber); // parse once instead of on every iteration
        for (int i = 1; i <= pageCount; i++)
        {
            string pageUrl = string.Format("{0}&page={1}", category.Url, i);
            try
            {
                List<Commodity> commodityList = GetCommodityList(category, pageUrl.Replace("&page=1&", string.Format("&page={0}&", i)));
                commodityRepository.SaveList(commodityList);
            }
            catch (Exception ex) // Make sure an error on one page does not affect the other pages
            {
                logger.Error("Crawler: commodityRepository.SaveList(commodityList) threw an exception", ex);
            }
        }
    }
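The snippet above depends on types from the author's project, so it cannot run on its own. A self-contained sketch of the same idea, reading a page count with XPath and building one URL per page, might look like the following. The markup, the pager id, the example.com URL, and the names PagingDemo and BuildPageUrls are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class PagingDemo
{
    // Hypothetical helper: reads the total page count from the pagination
    // element and returns one URL per page (empty list if no count is found).
    public static List<string> BuildPageUrls(string html, string baseUrl)
    {
        var urls = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode countNode = doc.DocumentNode.SelectSingleNode("//*[@id='pager']/span/i");
        int pageCount;
        // TryParse avoids an exception when the node text is not a number.
        if (countNode == null || !int.TryParse(countNode.InnerText.Trim(), out pageCount))
            return urls;

        for (int i = 1; i <= pageCount; i++)
            urls.Add(string.Format("{0}&page={1}", baseUrl, i));
        return urls;
    }

    public static void Main()
    {
        // Invented pagination markup: "current page / total pages".
        string html = "<div id='pager'><span><b>1</b>/<i>3</i></span></div>";
        foreach (string url in BuildPageUrls(html, "http://example.com/list?cat=1"))
            Console.WriteLine(url);
        // prints:
        // http://example.com/list?cat=1&page=1
        // http://example.com/list?cat=1&page=2
        // http://example.com/list?cat=1&page=3
    }
}
```

Separating URL construction from downloading keeps the XPath logic testable without network access; the per-page try/catch from the original snippet would wrap the download step.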
5. Postscript
HtmlAgilityPack is a genuinely powerful HTML parsing library. I use only a small subset of its features, yet it already covers everything I need. If you have similar needs, give it a try.