[Repost] HtmlAgilityPack usage: XPath precautions

Source: Internet
Author: User
Tags: xpath

HtmlAgilityPack is a convenient open-source class library for parsing web content (see another post, "HTML parsing: HtmlAgilityPack, an XPath-based C# class library"). It selects document nodes efficiently using XPath path syntax, so once a page's HTML has been fetched, most of the parsing work comes down to writing XPath path expressions. Everything in this post was tested in VS2010 with .NET Framework 4.0 and C#, against a test page loaded from C:\test.html whose main container is a div with the ID "content".

1. HtmlAgilityPack node types

When using XPath expressions to select a specific node of a document, I found that a path expression written from my reading of the markup sometimes did not work, or selected the wrong content. In other cases SelectSingleNode or SelectNodes found nothing for the given XPath expression and returned null, which then surfaced as an exception when the result was used. It turned out that HtmlAgilityPack selects nodes strictly according to the XPath specification, which defines seven kinds of nodes (http://www.w3school.com.cn/xpath/xpath_nodes.asp): element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. An atomic value is a node with no parent or no children, an item is either an atomic value or a node, and on top of these come the parent, child, sibling, ancestor, and descendant relationships. Each HtmlNode object in HtmlAgilityPack encapsulates all of the items defined by the specification; they make up the content of a node object.
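
As a minimal sketch of guarding against queries that match nothing: in HtmlAgilityPack, SelectNodes and SelectSingleNode return null when no node matches, so the result should be checked before it is used. The helper name SelectOrEmpty and the sample markup are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SafeSelect
{
    // Hypothetical helper: yields an empty sequence instead of null when nothing matches.
    public static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode context, string xpath)
    {
        HtmlNodeCollection found = context.SelectNodes(xpath);
        if (found == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }
        return found;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"content\"><p>hello</p></div>"); // made-up sample markup

        // The sample contains no <span>, so a bare foreach over SelectNodes would fail here.
        foreach (HtmlNode node in SelectOrEmpty(doc.DocumentNode, "//span"))
        {
            Console.WriteLine(node.OuterHtml);
        }

        // SelectSingleNode needs the same null check.
        HtmlNode single = doc.DocumentNode.SelectSingleNode("//span");
        Console.WriteLine(single == null ? "no match" : single.InnerText); // no match
    }
}
```

Wrapping the call keeps the foreach loops used throughout this post safe even when the page structure changes.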


Because of this, when writing an XPath path expression you need to remember that HtmlAgilityPack treats text as a node too. Compared with the HTML structure as we usually picture it, there are extra text nodes to account for, and empty (whitespace-only) text nodes are counted as well. The same issue is familiar from JavaScript, where IE is the special case and compatibility code has to be written specifically for it. The following C# code prints an empty string, because the node it reads is an empty text node.

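    // Requires: using System; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    doc.Load(@"C:\test.html");
    HtmlNode main = doc.GetElementbyId("content");
    // FirstChild is the text node in front of the first child element.
    HtmlNode child = main.FirstChild;
    Console.WriteLine(child.InnerText);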

The output is blank rather than any visible text.


This confirms that the FirstChild of the div with the ID "content" is an empty text node. Relationship properties such as FirstChild, LastChild, NextSibling, and PreviousSibling therefore need to be used with caution, always keeping empty text nodes in mind.
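
One way to sidestep the empty text nodes is to filter children by node type. The sketch below lists the children of a container and then picks the first child that is an element. The helper name FirstElementChild and the inline markup are assumptions made for illustration; the original test page is not reproduced here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ChildNodeDemo
{
    // Hypothetical helper: first child that is an element, or null if there is none.
    public static HtmlNode FirstElementChild(HtmlNode parent)
    {
        return parent.ChildNodes.FirstOrDefault(n => n.NodeType == HtmlNodeType.Element);
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample markup; the line breaks and indentation become text nodes.
        doc.LoadHtml("<div id=\"content\">\n  <div>first</div>\n  <div>second</div>\n</div>");
        HtmlNode main = doc.GetElementbyId("content");

        // List every child with its node type, including whitespace-only text nodes.
        foreach (HtmlNode child in main.ChildNodes)
        {
            bool blank = string.IsNullOrWhiteSpace(child.InnerText);
            Console.WriteLine("{0} <{1}> blank={2}", child.NodeType, child.Name, blank);
        }

        Console.WriteLine(main.FirstChild.NodeType);          // Text
        Console.WriteLine(FirstElementChild(main).InnerText); // first
    }
}
```

The same filter can be applied to siblings when NextSibling or PreviousSibling lands on a text node.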

2. Understanding "//" and "./" in depth

The most important selectors in an XPath path expression are:

Expression   Description
nodename     Selects the child elements of the current node that are named nodename.
/            Selects from the root node.
//           Selects matching nodes anywhere in the document, regardless of their location.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.

These selectors are the basis for writing path expressions. Once a node has been selected, the "//" and "./" forms are easy to confuse when querying from it, so this post tests the difference between the two.

    • "//": An expression that begins with "//" is absolute. Even when it is run from a selected node, the search starts at the document root, and every matching node anywhere in the document is added to the results. To search only the descendants of the current node, write ".//" instead.
    • "./": The search starts at the currently selected node and covers only its direct child elements, not grandchildren or deeper descendants.
The following test code shows the difference, starting with "./div":
    HtmlDocument doc = new HtmlDocument();
    doc.Load(@"C:\test.html");
    HtmlNode main = doc.GetElementbyId("content");
    // "./div" matches only div elements that are direct children of main.
    HtmlNodeCollection nodes = main.SelectNodes("./div");
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine("=============start=============");
        Console.WriteLine(node.InnerText);
        Console.WriteLine("=============end===============");
    }
This prints only the div elements that are direct children of the "content" div.
Next, run the same test with "//div":
    HtmlDocument doc = new HtmlDocument();
    doc.Load(@"C:\test.html");
    HtmlNode main = doc.GetElementbyId("content");
    // "//div" matches div elements at any depth.
    HtmlNodeCollection nodes = main.SelectNodes("//div");
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine("=============start=============");
        Console.WriteLine(node.InnerText);
        Console.WriteLine("=============end===============");
    }
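
For a self-contained comparison that does not depend on the test page, the sketch below counts the matches for "./div", ".//div", and "//div" from the same context node. The inline markup and the helper names Count and LoadContent are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class AxisDemo
{
    // Hypothetical helper: number of matches, treating a null result as zero.
    public static int Count(HtmlNode context, string xpath)
    {
        HtmlNodeCollection found = context.SelectNodes(xpath);
        return found == null ? 0 : found.Count;
    }

    // Builds a small made-up document and returns the div with id "content".
    public static HtmlNode LoadContent()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id=\"outer\">" +
              "<div id=\"content\">" +
                "<div>a<div>a1</div></div>" +
                "<div>b</div>" +
              "</div>" +
            "</div>");
        return doc.GetElementbyId("content");
    }

    public static void Main()
    {
        HtmlNode main = LoadContent();
        Console.WriteLine(Count(main, "./div"));  // 2: direct children only
        Console.WriteLine(Count(main, ".//div")); // 3: all descendants of main
        Console.WriteLine(Count(main, "//div"));  // 5: every div in the document
    }
}
```

The last count includes the outer div and the "content" div itself, which shows that a leading "//" is not limited to the node the query was run from.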

These tests show that the two forms must be told apart when writing a path expression; otherwise the results will not be accurate. A longer path gives a further comparison:
    HtmlDocument doc = new HtmlDocument();
    doc.Load(@"C:\test.html");
    HtmlNode main = doc.GetElementbyId("content");
    // Same path, once anchored at the document and once at main.
    HtmlNode node1 = main.SelectSingleNode("//div[1]/div[2]");
    HtmlNode node2 = main.SelectSingleNode("./div[1]/div[2]");
    Console.WriteLine(node1.InnerText);
    Console.WriteLine(node2.InnerText);


In this test the two expressions can end up selecting the same node. Identical output does not mean the two forms are equivalent, so which one to use has to be decided case by case.
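
Whether the two paths agree depends on the document. As a sketch under assumed markup (not the original test page), the example below uses a layout in which the same pair of expressions selects different nodes; the helper name Select is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class PathDemo
{
    // Hypothetical helper: runs both queries from the "content" div of a made-up page.
    public static string[] Select()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id=\"outer\">" +
              "<div>x</div>" +
              "<div id=\"content\">" +
                "<div><div>p</div><div>q</div></div>" +
              "</div>" +
            "</div>");
        HtmlNode main = doc.GetElementbyId("content");

        // Anchored at the document: the first match is the second div inside "outer".
        HtmlNode fromRoot = main.SelectSingleNode("//div[1]/div[2]");
        // Anchored at main: the second div inside main's first child div.
        HtmlNode fromMain = main.SelectSingleNode("./div[1]/div[2]");

        return new[] { fromRoot.InnerText, fromMain.InnerText };
    }

    public static void Main()
    {
        string[] texts = Select();
        Console.WriteLine(texts[0]);
        Console.WriteLine(texts[1]);
    }
}
```

In this layout the document-anchored path should resolve to the "content" div itself, while the path anchored at the current node resolves to the inner div containing "q".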

These are some lessons I drew from writing XPath expressions while parsing HTML with HtmlAgilityPack. I hope readers who find them useful will join the discussion.
