Python XPath basic usage

Source: Internet
Author: User
Tags cdata processing instruction types of functions xpath contains

Transferred from: http://www.pythoner.cn/home/blog/python-xpath-basic-usage/

Date: PYTHONERCN 8 months, 3 weeks ago

When crawling web pages, locating the right HTML nodes is the key to extracting information. I use the lxml module (it parses XML documents and can of course parse HTML as well), and in particular XPath queries through lxml.html, to pull the data I need out of the HTML. Here are some basic uses of XPath:

Before looking at XPath's matching rules, let's cover some basic concepts. The first is the XPath data types. XPath has four data types:
Node set (node-set)
A node set is the collection of nodes returned by a path expression, that is, the nodes that match the path's conditions. No other data type can be converted to a node set.

Boolean (boolean)
A boolean is the value returned by a function or a boolean expression. As in general-purpose languages, it has two values, true and false. Booleans can be converted to and from numbers and strings.

String (string)
A string is a sequence of characters, and XPath provides a set of string functions. Strings can be converted to numbers and booleans.

Number (number)
In XPath a number is a double-precision 64-bit floating-point value. It includes the special values NaN (not-a-number), positive infinity, negative infinity, and positive and negative zero. Functions such as floor(), ceiling(), and round() return integer values, and numbers can be converted to booleans and strings.
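As a rough illustration of the four data types, the sketch below evaluates one expression of each kind with lxml. The small document and the variable names are made up for this example; the type mapping (node set to list, boolean to bool, string to str, number to float) is how lxml reports XPath results.

```python
from lxml import etree

# Hypothetical two-element document, used only for this illustration.
root = etree.fromstring('<r><e id="e1">12</e><e id="e2">30</e></r>')

nodes = root.xpath("//e")                # node set -> Python list of elements
flag = root.xpath("boolean(//e)")        # boolean  -> bool
text = root.xpath("string(//e[1]/@id)")  # string   -> str (an lxml "smart string")
total = root.xpath("sum(//e)")           # number   -> float

for name, value in [("nodes", nodes), ("flag", flag), ("text", text), ("total", total)]:
    print(name, type(value).__name__, value)
```

Note that the number comes back as a float even when its value is integral, which matches the floating-point definition above.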

The last three data types resemble their counterparts in other programming languages; only the node set is specific to the XML document tree. Because XPath operates on the document tree, you also need to understand XPath's node types. An XML document is logically made up of elements, CDATA sections, comments, and processing instructions; elements can carry attributes, and attributes can be used to declare namespaces. Accordingly, XPath distinguishes seven types of node:

Root node (Root Node)
The root node is the top of the tree and is unique. Every other element node in the tree is its child or descendant. In other respects the root node is handled like any other node. In XSLT, matching against a tree always starts at the root node.

Element node (Element Nodes)
There is an element node for every element in the document. The children of an element node can be element nodes, comment nodes, processing instruction nodes, and text nodes. An element node can be given a unique identifier (ID).
An element node has an expanded name, which consists of two parts: a namespace URI and a local name.

Text node (Text Nodes)
A text node holds a run of character data, including the characters inside CDATA sections. A text node never has an adjacent sibling text node, and it has no expanded name.

Attribute node (Attribute Nodes)
Each element node has an associated set of attribute nodes. The element is the parent of each of these attribute nodes, but an attribute node is not a child of its parent element. In other words, you can get from an attribute node to its element through the parent relationship, but listing the element's children does not return its attributes; the relationship is one-way. Attribute nodes are also not shared: different element nodes never share the same attribute node.
A defaulted attribute is treated the same as a specified one. If an attribute is declared in the DTD with the default #IMPLIED and is not specified on the element, the element's attribute set contains no node for it.
In addition, there are no attribute nodes for the attributes that declare namespaces. Namespace declarations correspond to a different type of node.
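The one-way relationship between an element and its attributes can be checked with lxml. This is a minimal sketch on a made-up two-element document: the child axis returns only the element and text children, the attribute axis (@) returns the attribute, and stepping to the parent (..) from the attribute leads back to the element.

```python
from lxml import etree

# Hypothetical document for this illustration.
root = etree.fromstring('<a id="a1"><b id="b1"/>text</a>')

children = root.xpath("/a/child::node()")  # element and text children, no attributes
attrs = root.xpath("/a/@*")                # attribute axis: the id attribute of a
parents = root.xpath("/a/@id/..")          # parent of the attribute node: the a element

print(len(children))               # the b element and the text node
print(list(attrs))                 # attribute values as strings
print([el.tag for el in parents])
```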

Namespace node (Namespace Nodes)
Each element node has an associated set of namespace nodes. In an XML document, namespaces are declared through reserved attributes, so in XPath this kind of node behaves much like an attribute node: its relationship to the parent element is one-way, and namespace nodes are not shared between elements.

Processing instruction node (Processing Instruction Nodes)
There is a processing instruction node for every processing instruction in the XML document. It also has an expanded name: the local part is the processing instruction's target, and the namespace part is empty.

Comment node (Comment Nodes)
There is a comment node for every comment in the document.

Now let's construct an XML document tree:

<a id="a1">
  <b id="b1">
    <c id="c1">
      <b name="b"/>
      <d id="d1"/>
      <e id="e1"/>
      <e id="e2"/>
    </c>
  </b>
  <b id="b2"/>
  <c id="c2">
    <b/>
    <d id="d2"/>
    <f/>
  </c>
  <e/>
</a>
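To follow along, the tree can be loaded with lxml. The sketch below assumes lowercase tag names and straight quotes throughout (XML is case-sensitive, so opening and closing tags must match exactly); the outline() helper is a hypothetical name introduced here to print the structure.

```python
from lxml import etree

XML = """
<a id="a1">
  <b id="b1">
    <c id="c1">
      <b name="b"/>
      <d id="d1"/>
      <e id="e1"/>
      <e id="e2"/>
    </c>
  </b>
  <b id="b2"/>
  <c id="c2">
    <b/>
    <d id="d2"/>
    <f/>
  </c>
  <e/>
</a>
"""
root = etree.fromstring(XML.strip())


def outline(el, depth=0):
    """Yield one indented line per element, in document order."""
    attrs = " ".join(f'{k}="{v}"' for k, v in el.attrib.items())
    yield "  " * depth + el.tag + (f" ({attrs})" if attrs else "")
    for child in el:
        yield from outline(child, depth + 1)


lines = list(outline(root))
print("\n".join(lines))
```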

Now, let's implement some basic methods that use XPath to match nodes in XML.

Path matching
Path matching resembles file-system path notation, which makes it easy to follow. The following symbols are used:

Symbol: /
Meaning: Separates the steps of a node path
Example: /a/c/d
Match result: The d children of the c children of a, i.e. the d element whose id is d2

Symbol: /
Meaning: On its own, the root node

Symbol: //
Meaning: Matches the sub-path that follows "//" at any depth in the document
Example: //e
Match result: All e elements; here, all three e elements

Example: //c/e
Match result: All e elements whose parent is a c element; here, the two e elements with ids e1 and e2

Symbol: *
Meaning: Wildcard for one step of a path
Example: /a/b/c/*
Match result: All child elements under a → b → c, i.e. the b element whose name is b, the d element whose id is d1, and the two e elements with ids e1 and e2

Example: /*/*/d
Match result: The d elements that have exactly two levels of elements above them; here, the d element whose id is d2

Example: //*
Match result: All elements

Symbol: |
Meaning: Union (logical OR) of two paths
Example: //b | //c
Match result: All b elements and all c elements
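The path expressions in this section can be tried against the sample tree with lxml. This is a sketch rather than a reference: the label() and match() helpers are hypothetical names introduced here, and the expressions are written with a leading // where they are meant to search the whole document.

```python
from lxml import etree

XML = """
<a id="a1">
  <b id="b1">
    <c id="c1">
      <b name="b"/> <d id="d1"/> <e id="e1"/> <e id="e2"/>
    </c>
  </b>
  <b id="b2"/>
  <c id="c2">
    <b/> <d id="d2"/> <f/>
  </c>
  <e/>
</a>
"""
root = etree.fromstring(XML.strip())


def label(el):
    """Describe an element as tag plus attributes, e.g. 'd[id=d2]' or 'f'."""
    return el.tag + "".join(f"[{k}={v}]" for k, v in el.attrib.items())


def match(expr):
    """Evaluate an XPath expression and return labels of the matched elements."""
    return [label(el) for el in root.xpath(expr)]


print(match("/a/c/d"))          # ['d[id=d2]']
print(match("//e"))             # ['e[id=e1]', 'e[id=e2]', 'e']
print(match("//c/e"))           # ['e[id=e1]', 'e[id=e2]']
print(match("/a/b/c/*"))        # ['b[name=b]', 'd[id=d1]', 'e[id=e1]', 'e[id=e2]']
print(match("/*/*/d"))          # ['d[id=d2]']
print(len(match("//*")))        # 13
print(len(match("//b | //c")))  # 6
```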
Location matching
The child elements of each element are ordered, so they can be selected by position. For example:

Example: /a/b/c/*[1]
Meaning: The first child element under a → b → c
Match result: The b element whose name is b

Example: /a/b/c/*[last()]
Meaning: The last child element under a → b → c
Match result: The e element whose id is e2

Example: /a/b/c/*[position() > 1]
Meaning: The child elements under a → b → c whose position is greater than 1
Match result: The d element whose id is d1 and the two e elements with ids e1 and e2
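A position predicate applies to the step it is attached to, which is easy to get wrong. The sketch below, using the sample tree and a hypothetical match() helper, attaches the predicate to * so that it selects among the children of c; for contrast, /a/b/c[1] selects the first c child of b, not the first child of c.

```python
from lxml import etree

XML = """
<a id="a1">
  <b id="b1">
    <c id="c1">
      <b name="b"/> <d id="d1"/> <e id="e1"/> <e id="e2"/>
    </c>
  </b>
  <b id="b2"/>
  <c id="c2">
    <b/> <d id="d2"/> <f/>
  </c>
  <e/>
</a>
"""
root = etree.fromstring(XML.strip())


def match(expr):
    """Return 'tag[attr=value]' labels for the elements an expression selects."""
    return [
        el.tag + "".join(f"[{k}={v}]" for k, v in el.attrib.items())
        for el in root.xpath(expr)
    ]


print(match("/a/b/c/*[1]"))               # ['b[name=b]']
print(match("/a/b/c/*[last()]"))          # ['e[id=e2]']
print(match("/a/b/c/*[position() > 1]"))  # ['d[id=d1]', 'e[id=e1]', 'e[id=e2]']
print(match("/a/b/c[1]"))                 # ['c[id=c1]'], the c element itself
```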

Attributes and attribute values
In XPath you can match elements by their attributes and attribute values. Note that an attribute name must be prefixed with "@". For example:

Example: //b[@id]
Meaning: All b elements that have an id attribute
Match result: The two b elements with ids b1 and b2

Example: //b[@*]
Meaning: All b elements that have at least one attribute
Match result: The two b elements with an id attribute and the one b element with a name attribute

Example: //b[not(@*)]
Meaning: All b elements that have no attributes
Match result: The b element under a → c

Example: //b[@id="b1"]
Meaning: The b element whose id is b1
Match result: The first b element under a
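The attribute predicates can be checked the same way. This is a sketch against the sample tree; match() is a hypothetical helper, and the expressions use // so that b elements at any depth are considered.

```python
from lxml import etree

XML = """
<a id="a1">
  <b id="b1">
    <c id="c1">
      <b name="b"/> <d id="d1"/> <e id="e1"/> <e id="e2"/>
    </c>
  </b>
  <b id="b2"/>
  <c id="c2">
    <b/> <d id="d2"/> <f/>
  </c>
  <e/>
</a>
"""
root = etree.fromstring(XML.strip())


def match(expr):
    """Return 'tag[attr=value]' labels for the elements an expression selects."""
    return [
        el.tag + "".join(f"[{k}={v}]" for k, v in el.attrib.items())
        for el in root.xpath(expr)
    ]


print(match("//b[@id]"))         # b elements that have an id attribute
print(match("//b[@*]"))          # b elements that have any attribute
print(match("//b[not(@*)]"))     # b elements without attributes
print(match('//b[@id="b1"]'))    # the b element whose id equals "b1"
```

Attribute values are compared as case-sensitive strings, so "b1" and "B1" are different values.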

Kinship matching
An XML document is a tree, so no node is isolated. The relationships between nodes are usually described in family terms such as parent, child, ancestor, descendant, and sibling, and these concepts (the XPath axes) can also be used when matching elements. For example:

Example: //e/parent::*
Meaning: The parent elements of all e elements
Match result: The a element whose id is a1 and the c element whose id is c1

Example: //f/ancestor::*
Meaning: The ancestor elements of all f elements
Match result: The a element whose id is a1 and the c element whose id is c2

Example: /a/child::*
Meaning: The child elements of a
Match result: The b elements with ids b1 and b2, the c element whose id is c2, and the e element without attributes

Example: /a/descendant::*
Meaning: All descendant elements of a
Match result: Every element except a itself

Example: //f/self::*
Meaning: The f elements themselves
Match result: The f element itself

Example: //f/ancestor-or-self::*
Meaning: All f elements and their ancestor elements
Match result: The f element, its parent c element, and the a element

Example: /a/c/descendant-or-self::*
Meaning: The a → c elements and all of their descendant elements
Match result: The c element whose id is c2 and its child elements b, d, and f

Example: /a/c/following-sibling::*
Meaning: All sibling elements that come after the a → c element
Match result: The e element without attributes

Example: /a/c/preceding-sibling::*
Meaning: All sibling elements that come before the a → c element
Match result: The two b elements with ids b1 and b2

Example: /a/b/c/following::*
Meaning: All elements that come after the a → b → c element in document order, excluding its descendants
Match result: The b element with id b2, the c element with id c2, the b element without attributes, the d element with id d2, the f element, and the e element without attributes

Example: /a/c/preceding::*
Meaning: All elements that come before the a → c element in document order, excluding its ancestors
Match result: The b element with id b2, the e element with id e2, the e element with id e1, the d element with id d1, the b element whose name is b, the c element with id c1, and the b element with id b1
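The axes can be explored with lxml as well. The sketch below collects each result as a set of labels, so it makes no claim about the order in which lxml returns the nodes; the match() helper and the label format are assumptions made for this example.

```python
from lxml import etree

XML = """
<a id="a1">
  <b id="b1">
    <c id="c1">
      <b name="b"/> <d id="d1"/> <e id="e1"/> <e id="e2"/>
    </c>
  </b>
  <b id="b2"/>
  <c id="c2">
    <b/> <d id="d2"/> <f/>
  </c>
  <e/>
</a>
"""
root = etree.fromstring(XML.strip())


def match(expr):
    """Return the set of 'tag[attr=value]' labels an expression selects."""
    return {
        el.tag + "".join(f"[{k}={v}]" for k, v in el.attrib.items())
        for el in root.xpath(expr)
    }


axes = [
    "//e/parent::*",
    "//f/ancestor::*",
    "//f/ancestor-or-self::*",
    "/a/child::*",
    "/a/c/descendant-or-self::*",
    "/a/c/following-sibling::*",
    "/a/c/preceding-sibling::*",
    "/a/b/c/following::*",
    "/a/c/preceding::*",
]
results = {expr: match(expr) for expr in axes}
for expr, found in results.items():
    print(f"{expr:32} {sorted(found)}")
```

Note that following:: skips the descendants of the context node and preceding:: skips its ancestors, which is why a never appears in the last result.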

Condition matching
Condition matching uses the boolean result of a function or expression to select the nodes that satisfy a condition. Four kinds of functions are commonly used: node-set functions, string functions, number functions, and boolean functions. The last() and position() functions mentioned earlier are examples. These functions help pinpoint exactly the nodes we need.

Some functions and what they do:

count() function: returns the number of nodes in the node set passed to it

number() function: converts its argument, for example the text of an attribute value, to a number

substring() function
Syntax: substring(value, start, length)
Returns the part of the string value that begins at position start (counting from 1) and is length characters long

sum() function: converts each node in a node set to a number and returns the total

These functions are only a small part of XPath; many others are not covered here, and XPath itself has continued to evolve in later versions. With these functions we can build more complex queries.
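The four functions described above can be tried on a small document. The order/item document, its price and sku attributes, and the variable names are all made up for this sketch; the functions themselves are standard XPath 1.0 functions, which lxml supports.

```python
from lxml import etree

# Hypothetical document with numeric and string attribute values.
root = etree.fromstring(
    '<order>'
    '<item price="2.5" sku="AB-1001"/>'
    '<item price="4" sku="AB-1002"/>'
    '</order>'
)

item_count = root.xpath("count(//item)")                # number of item elements
first_price = root.xpath("number(//item[1]/@price)")    # attribute text -> number
sku_digits = root.xpath("substring(//item[1]/@sku, 4, 4)")  # 4 chars from position 4
total = root.xpath("sum(//item/@price)")                # sum of all prices

# A function result used as a condition inside a predicate.
expensive = root.xpath("//item[number(@price) > 3]/@sku")

print(item_count, first_price, sku_digits, total, list(expensive))
# 2.0 2.5 1001 6.5 ['AB-1002']
```

Positions in substring() start at 1, not 0, so position 4 of "AB-1001" is the first digit.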

Of these matching methods, path matching is used the most: a node is located by a sub-path relative to the current path. The example below combines a path with attribute conditions:

import lxml.html

# HTML fragment to search
html = '<table cellspacing="0" cellpadding="0"><tbody><tr><td width="10%" nowrap="" style="padding-bottom:5px;">Quantity: 1</td><td width="90%" align="right" nowrap="" style="padding-right:5px;"></td></tr></tbody></table>'
doc = lxml.html.fromstring(html)
# Text of every td with this exact style and an empty nowrap attribute, excluding cells with align="right"
numlist = doc.xpath('//td[@style="padding-bottom:5px;" and @nowrap="" and not(@align="right")]/text()')
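Locating a node relative to the current one can be sketched by calling xpath() on an element that was already matched, with a path that starts with "./". The shortened HTML fragment and the variable names below are assumptions made for this illustration, as is the step that parses the quantity out of the cell text.

```python
import lxml.html

# Shortened, hypothetical version of the table used above.
html = (
    '<table><tbody><tr>'
    '<td nowrap="">Quantity: 1</td>'
    '<td align="right"></td>'
    '</tr></tbody></table>'
)
doc = lxml.html.fromstring(html)

quantities = []
for row in doc.xpath("//tr"):
    # "./td" is evaluated relative to this row, not to the whole document.
    cells = row.xpath("./td")
    first_text = row.xpath("string(./td[1])")
    # Assumes the cell text has the form "Quantity: <integer>".
    quantities.append(int(first_text.split(":")[1]))

print(len(cells), quantities)  # 2 [1]
```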
