Using XPath to parse HTML in Python

When crawling web pages, locating the right HTML nodes is the key to extracting information. I use the lxml module (built to parse XML document structure, though it handles HTML just as well) and its XPath support to parse HTML and pull out the data to crawl. First, we need a Python library that supports XPath. The libxml2 website currently recommends lxml as the Python binding; BeautifulSoup is another option, and if it is not too much trouble you can even get by with regular expressions. This article uses lxml as its example. Suppose we have the following HTML document:
<html>
  <body>
    <form>
      <div id='leftmenu'>
        <h3>text</h3>
        <ul id='China'><!-- first location -->
          <li>...</li>
          <li>...</li>
          ......
        </ul>
        <ul id='England'><!-- second location -->
          <li>...</li>
          <li>...</li>
          ......
        </ul>
      </div>
    </form>
  </body>
</html>

Parsing it directly with lxml:

import codecs
from lxml import etree

f = codecs.open("ceshi.html", "r", "utf-8")
content = f.read()
f.close()
tree = etree.HTML(content)

etree provides an HTML parsing function, so we can now run XPath directly against the HTML. Exciting, isn't it? Let's try it.

Before using XPath, let's look at jQuery and re as points of comparison.

This is easy to handle in jQuery, especially when the ul node has an id (such as <ul id='China'>):

$('#China').each(function() {...});

Or, spelled out step by step:

$ ("#leftmenu"). Children ("h3:contains (' text ')"). Next ("UL"). each (function () {...});

That is: find the node with id 'leftmenu', locate the h3 child beneath it that contains 'text', and then take the ul node that immediately follows.

In Python, handling this with re is a bit more awkward:

import re

# pattern reconstructed: grab everything between the heading and the closing </ul>
block_pattern = re.compile(u"<h3>text</h3>(.*?)</ul>", re.I | re.S)
m = block_pattern.findall(content)
item_pattern = re.compile(u"<li>(.*?)</li>", re.I | re.S)
items = item_pattern.findall(m[0])
for i in items:
    print i

So how do we do it with XPath? Essentially the same way as with jQuery:

Nodes=tree.xpath ("/descendant::ul[@id = 'China ']")

Of course, when there is no id we can only locate the node step by step, as in the jQuery example. The complete XPath would be written like this (note that tags in the source file may appear in any case, but they must be written in lowercase in the XPath, because lxml's HTML parser normalizes tag names to lowercase):

Nodes=tree.xpath (U"/html/body/form/div[@id = ' Leftmenu ']/h3[text () = ' text ']/following-sibling::ul[1] ")

An easier way is to target the id directly, jQuery-style:

Nodes=tree.xpath (U"//div[@id = ' Leftmenu ']/h3[text () = ' text ']/following-sibling::ul[1]" )

With either expression, nodes[0] is the first ul node immediately following the h3 whose text is 'text', and from it all of the ul's content can be listed.
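
For example, a minimal sketch that dumps the matched ul's raw markup (continuing from the query above; etree.tostring serializes an element back to text):

# serialize the matched element back to markup for inspection
print(etree.tostring(nodes[0]))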

If there are further nodes under the ul and we need the content of those deeper nodes, we can run a relative XPath from the matched node. For example, the following loop lists their text content:

nodes = nodes[0].xpath("li/a")
for n in nodes:
    print n.text

Comparing the three approaches, it should be clear that XPath and jQuery both parse the page according to its XML/HTML semantics, while re operates on plain text alone. re is fine against a simple page, but when the page structure is complex (say, a pile of divs nested back and forth), designing a suitable re pattern can be far more work than writing an XPath. In particular, with today's mainstream CSS-based page designs, most key nodes carry an id (all the more so on pages built with jQuery), and there XPath has a decisive advantage over re.

Appendix: a brief introduction to basic XPath syntax. For full details, please refer to the official XPath documentation.

XPath essentially describes a path through an XML document in tree fashion, with "/" separating one level from the next. The first "/" denotes the document's root node (note that this is not the outermost tag node of the document, but the document itself). For an HTML file, for example, the outermost node is addressed as "/html".
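
To illustrate the distinction, here is a minimal sketch (the markup is made up for illustration):

from lxml import etree

demo = etree.HTML("<html><body><p>hi</p></body></html>")
# "/html" addresses the outermost tag node, one level below the document root
print(demo.xpath("/html")[0].tag)   # -> html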

Likewise, ".." and "." denote the parent node and the current node, respectively.
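
For example, continuing with the tree parsed earlier (assuming ceshi.html holds the sample document from the top of the article):

# ".." walks up to the parent node; "." is the context node itself
ul = tree.xpath("//ul[@id='China']")[0]
print(ul.xpath("..")[0].get("id"))   # -> leftmenu, the enclosing div
print(ul.xpath(".")[0] is ul)        # -> True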

An XPath does not necessarily return a single node; it returns every node that matches. For example, "/html/head/script" in an HTML document selects all the script nodes in the head.
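
A quick sketch (markup made up for illustration):

from lxml import etree

demo = etree.HTML("<html><head><script>a()</script><script>b()</script></head><body></body></html>")
# xpath() always returns a list of every matching node
print(len(demo.xpath("/html/head/script")))   # -> 2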

To narrow the selection down, filter conditions are usually added, written between "[" and "]". For example, "/html/body/div[@id='main']" in an HTML document selects the div node in the body whose id is 'main'.

Here @id refers to the id attribute; other attributes such as @name, @value, @href, @src, @class and so on can be used in the same way.
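
A small sketch of filtering on one attribute and reading another (markup and names made up for illustration):

from lxml import etree

demo = etree.HTML("<html><body><div id='main'><a href='/x' class='nav'>x</a></div></body></html>")
print(demo.xpath("/html/body/div[@id='main']")[0].get("id"))   # -> main
print(demo.xpath("//a[@class='nav']")[0].get("href"))          # -> /x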

The function text() refers to the text a node directly contains. For example, given <div>hello<p>world</p></div>, "div[text()='hello']" selects this div, while 'world' is the p node's text().
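
The same example as a runnable sketch:

from lxml import etree

demo = etree.HTML("<html><body><div>hello<p>world</p></div></body></html>")
print(demo.xpath("//div[text()='hello']")[0].tag)   # -> div, matched on its own text
print(demo.xpath("//div/p/text()"))                 # -> ['world'], the text() of the nested p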

The function position() gives a node's position. For example, "li[position()=2]" selects the second li node, and it can be abbreviated to "li[2]".

Note, however, the order in which positional and attribute predicates are applied. "ul/li[5][@name='hello']" means: take the fifth li under the ul, and keep it only if its name is 'hello'; otherwise the result is empty. "ul/li[@name='hello'][5]" means something different: find the fifth li node named 'hello' under the ul.
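
A sketch of the difference (the markup is made up so that both orderings match something):

from lxml import etree

demo = etree.HTML("""<ul>
  <li name='hello'>1</li><li>2</li><li name='hello'>3</li>
  <li name='hello'>4</li><li name='hello'>5</li><li name='hello'>6</li>
</ul>""")
# the fifth li overall, kept because its name happens to be 'hello'
print(demo.xpath("//ul/li[5][@name='hello']/text()"))    # -> ['5']
# the fifth li among those named 'hello' (overall positions 1, 3, 4, 5, 6)
print(demo.xpath("//ul/li[@name='hello'][5]/text()"))    # -> ['6']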

In addition, "*" matches any node name. For example, "/html/body/*/span" selects every span at the second level under body, regardless of whether the intermediate node is a div, a p, or anything else.
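
For instance (markup made up for illustration):

from lxml import etree

demo = etree.HTML("<html><body><div><span>a</span></div><p><span>b</span></p></body></html>")
# "*" accepts any element name for the middle step
print([s.text for s in demo.xpath("/html/body/*/span")])   # -> ['a', 'b']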

The "descendant::" prefix can refer to arbitrary multi-layered intermediate nodes, which can also be omitted as a "/". For example, in the entire HTML document look for a div with id "leftmenu", you can use "/descendant::d iv[@id = ' Leftmenu ']", or simply use "//div[@id = ' leftmenu ']".

Finally, the "following-sibling::" prefix does what its name says: it selects nodes that follow at the same level. "following-sibling::*" is any following sibling, and "following-sibling::ul" is the next ul sibling.
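
A last sketch, using the shape of the sample document from the top of the article:

from lxml import etree

demo = etree.HTML("<div><h3>text</h3><ul id='China'></ul><ul id='England'></ul></div>")
h3 = demo.xpath("//h3")[0]
print(h3.xpath("following-sibling::ul[1]")[0].get("id"))   # -> China, the next ul sibling
print(len(h3.xpath("following-sibling::*")))               # -> 2, every later sibling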
