Python uses XSLT to extract webpage data
1. Introduction
In the article Python Web Crawler Content Extraction Tool, we explained the core component in detail: the pluggable content extractor class gsExtractor. This article records the programming experiments performed while determining the technical route of gsExtractor. This is the first part: we use XSLT to extract static webpage content and convert it into XML format.
2. Use the lxml library to extract webpage content
lxml is a Python library for processing XML quickly and flexibly. It supports XML Path Language (XPath) and Extensible Stylesheet Language Transformations (XSLT), and implements the common ElementTree API.
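To make those two capabilities concrete before we get to the real capture code, here is a minimal sketch. etree.HTML, .xpath() and etree.XSLT are real lxml APIs; the HTML snippet is made up purely for illustration.

from lxml import etree

# Parse a tiny HTML fragment (illustrative, not the capture target)
doc = etree.HTML("<html><body><p class='topic'>Hello</p></body></html>")

# XPath: select the text of every element whose class is 'topic'
print(doc.xpath("//*[@class='topic']/text()"))   # ['Hello']

# XSLT: wrap the same selection in a custom XML element
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <list><xsl:value-of select="//*[@class='topic']/text()"/></list>
  </xsl:template>
</xsl:stylesheet>""")

# Prints <list>Hello</list>, preceded by an XML declaration
print(str(etree.XSLT(xslt_root)(doc)))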
Over the past two days, I tested how to extract webpage content through XSLT in Python. The experiments are recorded as follows.
2.1 Capture target
Suppose we want to extract the topic titles and reply counts from the old forum on the GooSeeker official website. The entire list must be extracted and saved in XML format.
2.2 Source code 1: capture only the current page and display the result on the console
The advantage of Python is that a small amount of code solves the problem. Although the code below looks long, there are actually only a few Python function calls; the XSLT script occupies most of the space, and in this code it is just one long string. For why we chose XSLT rather than discrete XPath expressions or headache-inducing regular expressions, see the Python Real-Time Web Crawler Project Startup Instructions; we hope this architecture saves programmers more than half of their time.
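For contrast, here is a hedged sketch of the discrete-XPath alternative mentioned above. The class names ('topic', 'replies') and the id ('forum') are taken from the XSLT script below; `doc` is assumed to be the parsed page from that same code.

# Discrete-XPath approach (illustrative): every field needs its own
# expression plus Python glue to stitch fields back into records,
# which is exactly the per-field bookkeeping a single XSLT script avoids.
titles  = doc.xpath("//*[@id='forum']//*[@class='topic']/a/text()")
replies = doc.xpath("//*[@id='forum']//*[@class='replies']/text()")
for title, reply_count in zip(titles, replies):
    print(title, reply_count)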
You can copy and run the following code (tested on Windows 10 with Python 3.2):
from urllib import request
from lxml import etree

url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>

<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'">
<xsl:value-of select="a/text()"/>
</xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'">
<xsl:value-of select="text()"/>
</xsl:if>
</replies>
</item>
</xsl:template>

<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt_root)
result_tree = transform(doc)
print(result_tree)
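A few notes on how the pieces above fit together, based on reading the code itself. The stylesheet contains three templates: the one matching "/" emits the outer <list> element and hands control to the one matching the forum container (the element with id='forum'), which in turn applies the row template to every table row. The row template tries three XPath variants for both the title and the reply count (a deep descendant search, a direct child, and the node itself), so it still matches if the class attribute sits at a different nesting depth. On the Python side, etree.XSLT(xslt_root) compiles the stylesheet once, and transform(doc) returns a result-tree object. A small sketch of working with it (real lxml calls; the usage is illustrative):

result_tree = transform(doc)
print(str(result_tree))               # serialize the transformed XML as text
items = result_tree.xpath("//item")   # the result tree itself supports XPath
print(len(items), "item elements in the result")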
Please download the source code from the GitHub source listed at the end of this article.
2.3 Capture results
The results are as follows:
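The original post showed the result as a screenshot, which does not survive here. Purely to illustrate the shape of the output the stylesheet above produces (the title and count below are made-up placeholders, not actual captured data), each page comes out roughly as:

<?xml version="1.0"?>
<list>
  <item>
    <list>
      <item>
        <title>Some forum topic title</title>
        <replies>12</replies>
      </item>
      <!-- one <item> per forum row -->
    </list>
  </item>
</list>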
2.4 Source code 2: capture multiple pages and save the results to files
We can further modify the code from 2.2 to add page-flipping capture and to save each result to a file. The code is as follows:
from urllib import request
from lxml import etree
import time

xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>

<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'">
<xsl:value-of select="a/text()"/>
</xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'">
<xsl:value-of select="text()"/>
</xsl:if>
</replies>
</item>
</xsl:template>

<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

baseurl = "http://www.gooseeker.com/cn/forum/7"
basefilebegin = "jsk_bbs_"
basefileend = ".xml"
count = 1
while count < 12:
    url = baseurl + "?page=" + str(count)
    conn = request.urlopen(url)
    doc = etree.HTML(conn.read())
    transform = etree.XSLT(xslt_root)
    result_tree = transform(doc)
    print(str(result_tree))
    file_obj = open(basefilebegin + str(count) + basefileend, 'w', encoding='utf-8')
    file_obj.write(str(result_tree))
    file_obj.close()
    count += 1
    time.sleep(2)
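Two hedged refinements to the loop above, not part of the original code: building the query string with urllib.parse instead of string concatenation, and guarding the network call so one failed page does not abort the whole run. baseurl, etree, and the transform are assumed from the code above.

from urllib import request, parse
from urllib.error import URLError

for page in range(1, 12):
    url = baseurl + "?" + parse.urlencode({"page": page})  # -> ...?page=1
    try:
        doc = etree.HTML(request.urlopen(url, timeout=10).read())
    except URLError as exc:
        print("page", page, "failed:", exc)  # skip pages that fail to load
        continue
    # ...apply the transform and write the file exactly as above...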
We have added code for writing files and a loop that constructs the URL for each page flip. But what if the URL remains unchanged while flipping pages? That is, in fact, dynamic webpage content, which will be discussed below.
3. Summary
This is part of the verification process for the open-source Python general crawler project. In a crawler framework, the other parts are easy to make generic; the hard part is extracting webpage content and converting it into structured data, a component we call the extractor. However, with the help of GooSeeker's visual extraction rule generator, MS, generating extractors becomes very convenient, can be standardized, and can be plugged in to build a general-purpose crawler. Subsequent articles will explain how MS works with Python.
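As a purely hypothetical illustration of the "pluggable extractor" idea described above (the real gsExtractor interface is defined in the article cited in the introduction; the class and method names here are invented for this sketch), the key point is that the XSLT rule is injected as data, so the crawler body never changes when the target site does:

from lxml import etree

class XsltExtractor:
    """Hypothetical pluggable extractor: swap the XSLT rule, keep the crawler."""

    def __init__(self, xslt_text):
        # Compile the extraction rule once; MS would generate xslt_text.
        self.transform = etree.XSLT(etree.XML(xslt_text))

    def extract(self, html_text):
        # Turn raw HTML into structured XML using the injected rule.
        return str(self.transform(etree.HTML(html_text)))

# Usage (names hypothetical):
#   extractor = XsltExtractor(rule_generated_by_ms)
#   xml_result = extractor.extract(downloaded_html)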
4. Read more
The method described in this article is usually used to capture static webpage content, that is, content present in the so-called HTML document. Currently, much website content is dynamically generated by JavaScript: the HTML source does not contain it at first, and it is loaded afterwards, so dynamic-page techniques are required. For details, refer to "Python crawler uses Selenium + PhantomJS to capture Ajax and dynamic HTML content".
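For orientation only, a minimal sketch of the dynamic-page approach that the referenced article covers. It assumes an older Selenium release in which webdriver.PhantomJS still exists (it was removed in Selenium 4 in favor of headless Chrome/Firefox); the URL is the forum used earlier.

from selenium import webdriver
from lxml import etree

driver = webdriver.PhantomJS()        # available only in older Selenium releases
driver.get("http://www.gooseeker.com/cn/forum/7")
doc = etree.HTML(driver.page_source)  # page_source reflects the JS-rendered DOM
driver.quit()
# The XSLT transform from section 2 can then be applied to doc as before.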
5. GooSeeker open-source code download source
1. GooSeeker open-source Python web crawler GitHub Source
6. Document modification history
V2.0: added text explanations; added the posted code
V2.1: added the source code download source in the last chapter