Use lxml and XPath to read a table from a web page and convert it to a pandas DataFrame

Source: Internet
Author: User
Tags xpath

lxml is a Python library for reading and writing HTML and XML data, and it can parse large files efficiently and reliably. Its lxml.html module provides an interface for processing HTML.

The lxml library has built-in support for XPath, so you can easily use XPath to get the contents of each tag in an HTML file.
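As a small illustration of this built-in support, the sketch below parses an inline HTML string with lxml.html.fromstring and reads the text of a few tags. The HTML fragment and the variable names are made up for the example and do not come from the page used later in this article.

```python
from lxml import html

# Hypothetical HTML fragment, used only for this illustration.
page = html.fromstring(
    "<html><body>"
    "<h1>Title</h1>"
    "<p class='intro'>First paragraph</p>"
    "<p>Second paragraph</p>"
    "</body></html>"
)

# xpath() returns a list of the matching elements.
paragraphs = page.xpath("//p")
texts = [p.text_content() for p in paragraphs]

# text() selects the text nodes directly, so the result is a list of strings.
heading = page.xpath("//h1/text()")
```

Here texts holds the text of both paragraphs and heading holds the single heading string.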

XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.
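To make the traversal of elements and attributes concrete, here is a minimal sketch using lxml.etree on a made-up XML document. The bookstore data is purely illustrative; the expressions themselves are standard XPath.

```python
from lxml import etree

# Hypothetical XML document, used only for this illustration.
root = etree.fromstring(
    "<bookstore>"
    "<book lang='en'><title>Learning XML</title></book>"
    "<book lang='de'><title>XML lernen</title></book>"
    "</bookstore>"
)

# '/' starts at the root: titles of the books directly under bookstore.
titles = root.xpath("/bookstore/book/title/text()")

# '//' matches anywhere in the document; '@' selects an attribute.
langs = root.xpath("//book/@lang")

# '..' steps from a node to its parent.
first_title = root.xpath("//title")[0]
parent_tag = first_title.xpath("..")[0].tag

# A predicate in square brackets filters by attribute value.
german = root.xpath("//book[@lang='de']/title/text()")
```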

XPath is a major element of the XSLT standard, and both XQuery and XPointer are built on XPath expressions.

Therefore, understanding XPath is the foundation of many advanced XML applications.

XPath's syntax is very simple. You can learn it from W3School, and a little over 10 minutes is enough.

All right, let's get to work. We will fetch the first table on this page.

from lxml.html import parse
from urllib.request import urlopen
# This uses Python 3; Python 2 may require: from urllib2 import urlopen

doc = parse(urlopen('http://www.w3school.com.cn/xpath/xpath_syntax.asp'))
# Open the URL and use parse to convert the page into a tree that can be searched with XPath

tables = doc.xpath('//table')
# Find all the tables in the document and return them as a list
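The same steps can be tried without network access. The sketch below passes a file-like object to lxml.html.parse instead of the urlopen response. The page content is made up for the example, and the class name dataintable is borrowed from the table shown in this article.

```python
from io import BytesIO
from lxml.html import parse

# Made-up page with two tables, standing in for the downloaded document.
page = BytesIO(
    b"<html><body>"
    b"<table class='dataintable'><tr><th>Expression</th></tr>"
    b"<tr><td>/</td></tr></table>"
    b"<table><tr><td>other</td></tr></table>"
    b"</body></html>"
)

doc = parse(page)

# '//table' matches every table element, in document order.
tables = doc.xpath("//table")

# A predicate on the class attribute narrows the match to one table.
data_tables = doc.xpath("//table[@class='dataintable']")
```

Indexing with tables[0] relies on the wanted table being first in the document; selecting by class is one way to be less dependent on page order.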

Let's look at the source code of the web page and find the table that needs to be retrieved:

<table class="dataintable">
<tr>
<th style="width:25%;">Expression</th>
<th>Description</th>
</tr>
<tr>
<td>nodename</td>
<td>Selects all child nodes of this node.</td>
</tr>
<tr>
<td>/</td>
<td>Selects from the root node.</td>
</tr>
<tr>
<td>//</td>
<td>Selects nodes in the document from the current node that match the selection, regardless of their location.</td>
</tr>
<tr>
<td>.</td>
<td>Selects the current node.</td>
</tr>
<tr>
<td>..</td>
<td>Selects the parent node of the current node.</td>
</tr>
<tr>
<td>@</td>
<td>Selects attributes.</td>
</tr>
</table>

The first row of the table is the header and the following rows are data, so we define a function to get them separately:

def _unpack(row, kind='td'):
    # Get the cells of the given tag type under this row
    elts = row.xpath('.//%s' % kind)
    # Use a list comprehension to return the text of each cell as a list
    return [val.text_content() for val in elts]
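To see what this helper returns, the sketch below repeats it and applies it to a made-up two-row table. The table content is hypothetical and only serves to show the header and data cases side by side.

```python
from lxml import html

def _unpack(row, kind="td"):
    # Collect the cells of the requested tag type below this row.
    elts = row.xpath(".//%s" % kind)
    return [val.text_content() for val in elts]

# Made-up table: one header row and one data row.
table = html.fromstring(
    "<table>"
    "<tr><th>Expression</th><th>Description</th></tr>"
    "<tr><td>/</td><td>Selects from the <b>root</b> node.</td></tr>"
    "</table>"
)
rows = table.xpath(".//tr")

header = _unpack(rows[0], kind="th")
cells = _unpack(rows[1])

# Asking the header row for td cells finds nothing.
no_cells = _unpack(rows[0])
```

Because text_content() joins the text of nested tags, the bold markup inside the second cell does not appear in the result.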

Next we combine the data and convert it to a DataFrame. pandas provides TextParser, which automatically converts the text values to the types we need.

from pandas.io.parsers import TextParser

def parse_options_data(table):
    # Taking the table as the current node, find its tr elements
    rows = table.xpath('.//tr')
    # Use the th cells of the first row as the header
    header = _unpack(rows[0], kind='th')
    # The remaining rows are the data
    data = [_unpack(r) for r in rows[1:]]
    # Return a DataFrame
    return TextParser(data, names=header).get_chunk()
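The type conversion is easier to see on data that contains numbers. The following sketch feeds TextParser a made-up list of string rows, of the kind text_content() would produce. The column names and values are hypothetical, and the behavior described in the comments is what recent pandas versions are expected to do.

```python
from pandas.io.parsers import TextParser

# Made-up rows of strings, as they would come out of text_content().
header = ["Name", "Count", "Price"]
data = [
    ["apple", "3", "1.5"],
    ["pear", "10", "0.25"],
]

# With names given, recent pandas versions treat every row as a data row.
frame = TextParser(data, names=header).get_chunk()

# The numeric-looking columns should now hold numbers rather than strings,
# so they can be summed directly.
total = frame["Count"].sum()
```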

To test:

content = parse_options_data(tables[0])

  nodename  Selects all child nodes of this node.
0        /  Selects from the root node.
1       //  Selects nodes in the document from the current node that match the selection, regardless of their location.
2        .  Selects the current node.
3       ..  Selects the parent node of the current node.
4        @  Selects attributes.
