Parsing HTML in C#

Source: Internet
Author: User

When we need to parse a web page, string searching is enough for very simple cases and regular expressions can handle more complex ones, but both become cumbersome because HTML itself is messy. The common img tag, for example, is often left unclosed in the page source (I have not yet figured out why), so parsing the page as XML fails for the same reason. Today I found a library for parsing HTML, tried it out, and found it very useful.

The library is called HTML Agility Pack. Its home page is here: http://htmlagilitypack.codeplex.com/

Here is another blog post that explains how to use it (in English): http://olussier.net/2010/03/30/easily-parse-html-documents-in-csharp/

I will simply paste the examples here so you can read them. Because the library treats HTML like XML, you cannot do without XPath. If you do not know XPath yet, learn it soon; similar selector syntax now appears in CSS as well, it is fairly simple, and the basics are enough to get started.

The most basic way to find a node is not SelectSingleNode but GetElementbyId, which is a difference from XmlDocument.

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();

// Creates an HtmlDocument object from a URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.somewebsite.com");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("mynode");

// If there is no node with that Id, someNode will be null
if (someNode != null)
{
    // Extracts all links within that node
    IEnumerable<HtmlNode> allLinks = someNode.Descendants("a");

    // Outputs the href for external links
    foreach (HtmlNode link in allLinks)
    {
        // Checks whether the link contains an href attribute
        if (link.Attributes.Contains("href"))
        {
            // Simple check: if the href begins with "http://", prints it out
            if (link.Attributes["href"].Value.StartsWith("http://"))
                Console.WriteLine(link.Attributes["href"].Value);
        }
    }
}

Using XPath

// Extracts all links under a specific node that have an href that begins with "http://"
HtmlNodeCollection allLinks = document.DocumentNode.SelectNodes("//*[@id='mynode']//a[starts-with(@href,'http://')]");

// SelectNodes returns null when nothing matches, so check before iterating
if (allLinks != null)
{
    // Outputs the href for external links
    foreach (HtmlNode link in allLinks)
        Console.WriteLine(link.Attributes["href"].Value);
}

A few more XPath expressions:

"//table[@id='1' or @id='2' or @id='3']//a[@onmousedown]"

"//ul[@id='wg0']//li[position()<4]/h3/a"

"//div[@class='resitem' and position()<4]/a"

"//li[@class='result' and position()<4]/a"

How to use:

I have just learned XPath path expressions, which are mainly used to search for nodes in an XML document. With an XPath expression you can quickly locate and access a node in an XML document. HTML is also an XML-like markup language, although its syntax is less strict. CodePlex hosts an open source project, HtmlAgilityPack, which lets you parse HTML files with XPath. The rest of this article describes how to use that class library.

XPath path expressions

An XPath path expression is used to select a node or a node set in an XML document.

1. Terminology: there are 7 types of node: element, attribute, text, namespace, processing instruction, comment, and document (root) node

2. Node relationships: parent, children, sibling, ancestor, descendant

3. Path expressions

nodename: selects the child nodes with that name. Example: childnode selects the childnode children of the current node, not its grandchildren

/: selects from the root node. Example: /root/childnode/grandsonnode

//: selects matching nodes at any depth. Example: //childnode selects all descendant nodes named childnode

.: the current node. Example: ./childnode selects the childnode children of the current node

..: the parent node. Example: ../nearnode selects the nearnode children of the parent node

@: selects attributes. Example: /root/childnode/@id selects the id attributes of the childnode elements
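
The path expressions above can be tried out with a small sketch. It assumes .NET's built-in System.Xml classes instead of HtmlAgilityPack so that it runs without extra packages; the sample XML and the class name PathExpressionDemo are made up for illustration.

```csharp
using System;
using System.Xml;

public static class PathExpressionDemo
{
    // Hypothetical sample document; element names follow the examples above.
    private const string Xml = @"<root>
  <childnode id='a'><grandsonnode>first</grandsonnode></childnode>
  <childnode id='b'><grandsonnode>second</grandsonnode></childnode>
  <nearnode><childnode>nested</childnode></nearnode>
</root>";

    public static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc;
    }

    public static void Main()
    {
        XmlDocument doc = Load();
        XmlNode root = doc.DocumentElement;

        // Relative name: only the direct childnode children of <root>.
        Console.WriteLine(root.SelectNodes("childnode").Count);
        // "./childnode" is the same selection written with an explicit current node.
        Console.WriteLine(root.SelectNodes("./childnode").Count);
        // "//": every childnode in the document, including the nested one.
        Console.WriteLine(doc.SelectNodes("//childnode").Count);
        // Absolute path starting at the root.
        Console.WriteLine(doc.SelectNodes("/root/childnode/grandsonnode").Count);
        // "@": the id attribute nodes themselves.
        foreach (XmlNode attr in doc.SelectNodes("/root/childnode/@id"))
            Console.WriteLine(attr.Value);
        // "..": step up to the parent, then down to its nearnode child.
        XmlNode first = doc.SelectSingleNode("/root/childnode");
        Console.WriteLine(first.SelectSingleNode("../nearnode").Name);
    }
}
```

HtmlAgilityPack evaluates XPath through the same .NET XPath engine, so the same expression strings should generally be usable with its SelectNodes method.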

4. Predicates

Predicates restrict a node set to make the selection more precise.

/root/book[1]: the first book node

/root/book[last()]: the last book node

/root/book[last()-1]: the second-to-last book node

/root/book[position()<5]: the first four book nodes

/root/book[@id]: the book nodes that have an id attribute

/root/book[@id='chinese']: the book nodes whose id attribute value is chinese

/root/book[price>35]/title: the title nodes of the book nodes whose price element value is greater than 35
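
As a quick check of the predicates above, here is a small sketch that runs each one against a made-up five-book document. It again assumes System.Xml rather than HtmlAgilityPack; the XML data, the class name PredicateDemo and the helper Titles are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PredicateDemo
{
    // Hypothetical data: five books, some with an id attribute.
    private const string Xml = @"<root>
  <book id='chinese'><title>A</title><price>20</price></book>
  <book id='math'><title>B</title><price>40</price></book>
  <book><title>C</title><price>36</price></book>
  <book id='art'><title>D</title><price>35</price></book>
  <book><title>E</title><price>50</price></book>
</root>";

    // Returns the text of every node matched by the XPath expression.
    public static List<string> Titles(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
            result.Add(node.InnerText);
        return result;
    }

    public static void Main()
    {
        string[] queries =
        {
            "/root/book[1]/title",
            "/root/book[last()]/title",
            "/root/book[last()-1]/title",
            "/root/book[position()<5]/title",
            "/root/book[@id]/title",
            "/root/book[@id='chinese']/title",
            "/root/book[price>35]/title"
        };
        // Print each query next to the titles it selects.
        foreach (string q in queries)
            Console.WriteLine(q + " -> " + string.Join(",", Titles(q)));
    }
}
```

Note that XPath positions start at 1, so position()<5 matches positions 1 through 4.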

5. Wildcards: XPath paths also support wildcards (*, @*, node(), text())

Examples: /bookstore/*

//title[@*]
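
The wildcards can be compared side by side with a short sketch. It assumes System.Xml and a made-up bookstore document; the class name WildcardDemo and the helper Count are hypothetical.

```csharp
using System;
using System.Xml;

public static class WildcardDemo
{
    // Hypothetical data: the second book also contains a comment node.
    private const string Xml = @"<bookstore>
  <book><title lang='en'>A</title><price>20</price></book>
  <book><title>B</title><!-- note --><price>40</price></book>
</bookstore>";

    // Returns how many nodes the XPath expression selects.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // "*": all element children of bookstore.
        Console.WriteLine(Count("/bookstore/*"));
        // "@*": title elements that have at least one attribute of any name.
        Console.WriteLine(Count("//title[@*]"));
        // "node()": child nodes of any kind (elements and the comment).
        Console.WriteLine(Count("/bookstore/book[2]/node()"));
        // "text()": text nodes directly under title elements.
        Console.WriteLine(Count("//title/text()"));
    }
}
```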

6. XPath axes

An axis defines a node set relative to the current node.

ancestor: all ancestors of the current node

attribute: all attributes of the current node

child: all children of the current node

descendant: all descendants of the current node (children, grandchildren, ...)

following: all nodes after the end tag of the current node

preceding: all nodes before the start tag of the current node

following-sibling: all siblings after the current node

preceding-sibling: all siblings before the current node

namespace: all namespace nodes of the current node

parent: the parent of the current node

self: the current node

Usage: axisname::nodetest[predicate]

Examples: ancestor::book

child::text()
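
To see what each axis returns, the following sketch evaluates a few axis expressions relative to the second book of a made-up document. It assumes System.Xml; the class name AxisDemo and the helper FromSecondBook are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AxisDemo
{
    // Hypothetical data: three sibling books.
    private const string Xml = @"<root>
  <book id='1'><title>A</title></book>
  <book id='2'><title>B</title></book>
  <book id='3'><title>C</title></book>
</root>";

    // Evaluates xpath with the second <book> as the current node
    // and returns the names of the matched nodes.
    public static List<string> FromSecondBook(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode second = doc.SelectSingleNode("/root/book[2]");
        var names = new List<string>();
        foreach (XmlNode n in second.SelectNodes(xpath))
            names.Add(n.Name);
        return names;
    }

    public static void Main()
    {
        string[] queries =
        {
            "ancestor::*",
            "parent::root",
            "self::book",
            "child::title",
            "attribute::id",
            "descendant::text()",
            "following-sibling::book",
            "preceding-sibling::book",
            "following::*",
            "preceding::*"
        };
        // Print each axis expression next to the names of the nodes it selects.
        foreach (string q in queries)
            Console.WriteLine(q + " -> " + string.Join(",", FromSecondBook(q)));
    }
}
```

The following and preceding axes also reach into the neighbouring books' children, which is why they return more nodes than the sibling axes.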

7. Operators

|: union of two node sets. Example: /root/book[1] | /root/book[3]

+, -, *, div, mod: arithmetic operators

=, !=, <, >, <=, >=: comparison operators

or, and: logical OR and logical AND
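
The operators can be exercised with a short sketch. It assumes System.Xml and its XPathNavigator class, whose Evaluate method returns numbers as double and booleans as bool; the XML data and the class name OperatorDemo are made up for illustration.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class OperatorDemo
{
    // Hypothetical data: three books with different prices.
    private const string Xml = @"<root>
  <book><title>A</title><price>20</price></book>
  <book><title>B</title><price>40</price></book>
  <book><title>C</title><price>36</price></book>
</root>";

    public static XPathNavigator Nav()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.CreateNavigator();
    }

    public static void Main()
    {
        XPathNavigator nav = Nav();

        // "|": union of two node sets.
        Console.WriteLine(nav.Select("/root/book[1] | /root/book[3]").Count);
        // Arithmetic: div is floating-point division, mod is the remainder.
        Console.WriteLine((double)nav.Evaluate("7 div 2"));
        Console.WriteLine((double)nav.Evaluate("7 mod 2"));
        // Comparison and logical operators produce booleans.
        Console.WriteLine((bool)nav.Evaluate("1 < 2 and 2 != 3"));
        // The same operators inside predicates.
        Console.WriteLine(nav.Select("/root/book[price >= 35 and price <= 40]").Count);
        Console.WriteLine(nav.Select("/root/book[price < 30 or price > 38]").Count);
    }
}
```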

// Remove comment, script, and style nodes (requires using System.Linq;)
node.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
    .ToList()
    .ForEach(n => n.Remove());

// Traverse all descendant nodes of a node
foreach (var htmlNode in node.Descendants())
{
}

Using the HtmlAgilityPack class library

1. First, get the HTML page data, which can be done with the WebRequest class

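// Requires: using System.IO; using System.Net; using System.Text;
public static string GetHtmlStr(string url)
{
    try
    {
        WebRequest rGet = WebRequest.Create(url);
        // Dispose the response, stream, and reader when done
        using (WebResponse rSet = rGet.GetResponse())
        using (Stream s = rSet.GetResponseStream())
        using (StreamReader reader = new StreamReader(s, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
    catch (WebException)
    {
        // Connection failed
        return null;
    }
}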

2. Load the HTML data with the HtmlDocument class

string htmlStr = GetHtmlStr("http://www.hao123.com");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlStr);
HtmlNode rootNode = doc.DocumentNode;

// XPath path expression: selects the last font child of every span node
// whose class attribute value is num.
// Adjust the XPath expression to the content of the web page.
string xpathString = "//span[@class='num']/font[last()]";
HtmlNodeCollection aa = rootNode.SelectNodes(xpathString); // all matching nodes are returned as a collection

if (aa != null)
{
    string innerText = aa[0].InnerText;
    string color = aa[0].GetAttributeValue("color", ""); // gets the color attribute; the second parameter is the default value
    // Try the other properties yourself
}

You can also get an HtmlDocument through the HtmlWeb class.

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
HtmlNode rootNode = doc.DocumentNode;

Addendum:

Query with multiple attribute conditions: //div[@align='center' and @height='24']

Select elements that have no class attribute: //div[not(@class)]
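
The two queries above can be tried against an inline HTML string. This is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced; the HTML snippet, the class name AttributeFilterDemo and the helper Count are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class AttributeFilterDemo
{
    // Hypothetical HTML; the <br> and <img> tags are deliberately left unclosed.
    private const string Html =
        "<html><body>" +
        "<div align='center' height='24'>one</div>" +
        "<div align='center' height='30' class='x'>two</div>" +
        "<div>three<br><img src='a.png'></div>" +
        "</body></html>";

    // Returns how many nodes the XPath expression selects (0 when nothing matches).
    public static int Count(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        // SelectNodes returns null rather than an empty collection on no match.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Both attribute conditions must hold.
        Console.WriteLine(Count("//div[@align='center' and @height='24']"));
        // div elements without a class attribute.
        Console.WriteLine(Count("//div[not(@class)]"));
    }
}
```

Wrapping the null check in a helper like Count keeps calling code from having to repeat it after every SelectNodes call.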
