lxml is a Python library for reading and writing data in HTML and XML formats, and it can parse large files efficiently and reliably. lxml provides a programming interface, lxml.html, that can be used to process HTML.
The lxml library has built-in support for XPath, so you can easily use XPath expressions to extract the contents of each tag in an HTML file.
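As a minimal sketch of this (not from the original page, and the snippet, tag names, and class are made up for illustration), you can parse an HTML string with lxml.html and pull out text with XPath:
from lxml import html
# parse a toy HTML string into an element tree
tree = html.fromstring('<html><body><p class="intro">Hello</p><p>World</p></body></html>')
print(tree.xpath('//p/text()'))                   # ['Hello', 'World']
print(tree.xpath('//p[@class="intro"]/text()'))   # ['Hello']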
XPath is a language for finding information in an XML document. XPath can be used to traverse the elements and attributes of an XML document.
XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.
Therefore, an understanding of XPath is the foundation of many advanced XML applications.
XPath's syntax is very simple; you can learn it from W3School in about ten minutes.
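As a quick illustration, the basic expressions can be tried out on a toy XML document with lxml (the bookstore document below is made up for this sketch):
from lxml import etree
root = etree.fromstring('<bookstore><book category="web"><title>XPath</title></book></bookstore>')
print(root.xpath('/bookstore/book/title/text()'))  # / selects from the root node -> ['XPath']
print(root.xpath('//title/text()'))                # // matches anywhere in the document -> ['XPath']
title = root.xpath('//title')[0]
print(title.xpath('..')[0].tag)                    # .. selects the parent node -> 'book'
print(root.xpath('//book/@category'))              # @ selects attributes -> ['web']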
All right, let's get to work. We will grab the first table on this page.
from lxml.html import parse
from urllib.request import urlopen
# Python 3; on Python 2 use: from urllib2 import urlopen
doc = parse(urlopen('http://www.w3school.com.cn/xpath/xpath_syntax.asp'))
# open the URL and use parse() to turn the page into a tree that can be queried with XPath
tables = doc.xpath('//table')
# find all the tables in the document; a list of table elements is returned
Let's look at the source code of the web page and find the table that we need to retrieve:
<table class="dataintable">
<tr>
<th style="width:25%;">Expression</th>
<th>Description</th>
</tr>
<tr>
<td>nodename</td>
<td>Selects all child nodes of this node.</td>
</tr>
<tr>
<td>/</td>
<td>Selects from the root node.</td>
</tr>
<tr>
<td>//</td>
<td>Selects nodes in the document from the current node that match the selection, regardless of their location.</td>
</tr>
<tr>
<td>.</td>
<td>Selects the current node.</td>
</tr>
<tr>
<td>..</td>
<td>Selects the parent node of the current node.</td>
</tr>
<tr>
<td>@</td>
<td>Selects attributes.</td>
</tr>
</table>
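Instead of relying on the table being the first one on the page, the same table could also be matched on its class attribute with an XPath predicate. This is just a sketch that reuses the doc object from above and the "dataintable" class seen in the source:
data_tables = doc.xpath('//table[@class="dataintable"]')
# only the tables whose class attribute is exactly "dataintable"
table = data_tables[0]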
The first row of the table is the header and the remaining rows are the data, so we define a function to extract either kind of row:
def _unpack(row, kind='td'):
    # get the cells of the given tag type (td or th) from this row
    elts = row.xpath('.//%s' % kind)
    # use a list comprehension to return the text content of each cell as a list
    return [val.text_content() for val in elts]
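As a quick check (assuming tables[0] is the syntax table found earlier), _unpack can be applied to a single row:
rows = tables[0].xpath('.//tr')
print(_unpack(rows[0], kind='th'))  # the th cells of the header row
print(_unpack(rows[1]))             # the td cells of the first data row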
The following function consolidates the data and converts it to a DataFrame. pandas provides a TextParser class that automatically converts text to the types we need.
from pandas.io.parsers import TextParser

def parse_options_data(table):
    # use the table as the current path and find all tr tags
    rows = table.xpath('.//tr')
    # the th cells of the first row form the header
    header = _unpack(rows[0], kind='th')
    # the remaining rows are the data
    data = [_unpack(r) for r in rows[1:]]
    # get_chunk() returns a DataFrame
    return TextParser(data, names=header).get_chunk()
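To see the type conversion at work, here is a tiny sketch with made-up data; numeric-looking strings should come back as numeric dtypes:
from pandas.io.parsers import TextParser
raw = [['1', '2.5'], ['3', '4.0']]
frame = TextParser(raw, names=['a', 'b']).get_chunk()
print(frame.dtypes)  # column a becomes int64, column b becomes float64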
Let's test it:
content = parse_options_data(tables[0])
print(content)
  nodename  Selects all child nodes of this node.
0        /  Selects from the root node.
1       //  Selects nodes in the document from the current node that match the selection, regardless of their location.
2        .  Selects the current node.
3       ..  Selects the parent node of the current node.
4        @  Selects attributes.