Python uses XSLT to extract webpage data

Source: Internet
Author: User
Tags: xslt, python, web crawler

1. Introduction

In the article on the Python web crawler content extraction tool, we explained in detail the core component: the pluggable content extractor class gsExtractor. This article records the programming experiments performed while settling on the technical route for gsExtractor. This is the first part: we use XSLT to extract static webpage content and convert it into XML format.

2. Use the lxml library to extract webpage content

lxml is a Python library for processing XML quickly and flexibly. It supports XML Path Language (XPath) and Extensible Stylesheet Language Transformations (XSLT), and implements the common ElementTree API.
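
To make the mechanism concrete, here is a minimal sketch (illustrative only, not from the original experiment) of how etree.XSLT compiles a stylesheet and applies it to a document:

from lxml import etree

# A tiny XML document and a stylesheet that rewrites it
doc = etree.XML("<root><item>hello</item></root>")
xslt = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<greeting><xsl:value-of select="/root/item/text()"/></greeting>
</xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)  # compile the stylesheet into a callable
result = transform(doc)       # apply it to the document
print(result)                 # serialized result: <greeting>hello</greeting>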

Over the past two days, I tested how to extract webpage content through XSLT in Python. My notes are as follows:

2.1 Capture target

Suppose we want to extract the titles and reply counts of posts on the old forum of the GooSeeker official website, capturing the entire list and saving it in XML format, as in the screenshot below:

[Screenshot: the target forum topic list with titles and reply counts]

2.2 Source code 1: capture only the current page and print the result to the console

The advantage of Python is that it solves a problem with a small amount of code. The code below may look long, but there are actually only a few Python function calls; the XSLT script occupies most of the space, and in this program it is just one long string. For why we chose XSLT rather than discrete XPath expressions or headache-inducing regular expressions, see the Python Real-Time Web Crawler Project startup instructions; we hope this architecture saves programmers more than half of their time.
You can copy and run the following code (tested on Windows 10 with Python 3.2):

from urllib import request
from lxml import etree

# Download the page and parse it into an HTML document tree
url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

# The whole XSLT stylesheet is just one long string in this program
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>

<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'">
<xsl:value-of select="a/text()"/>
</xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'">
<xsl:value-of select="text()"/>
</xsl:if>
</replies>
</item>
</xsl:template>

<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

# Compile the stylesheet, apply it to the document, and print the result
transform = etree.XSLT(xslt_root)
result_tree = transform(doc)
print(result_tree)

Please download the source code from the GitHub source at the end of this article.
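
For contrast with the "discrete XPath" approach mentioned in 2.2, the same two fields could also be pulled with individual XPath calls. This is a rough sketch only (the selectors are copied from the stylesheet above, not from the original article), reusing the doc variable from the code above:

# Discrete-XPath sketch: one query per field, paired up afterwards
titles = doc.xpath("//*[@id='forum']//tr//*[@class='topic']/a/text()")
replies = doc.xpath("//*[@id='forum']//tr//*[@class='replies']/text()")
for title, n in zip(titles, replies):
    print(title, n)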

2.3 Capture results

The results are as follows:

[Screenshot: the extracted XML output shown on the console]

2.4 Source code 2: capture multiple pages and save the results to files

We can further modify the code in section 2.2 to add pagination capture and save the results to files. The code is as follows:

from urllib import request
from lxml import etree
import time

# The same XSLT stylesheet as in section 2.2
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>

<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'">
<xsl:value-of select="a/text()"/>
</xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'">
<xsl:value-of select="text()"/>
</xsl:if>
</replies>
</item>
</xsl:template>

<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

baseurl = "http://www.gooseeker.com/cn/forum/7"
basefilebegin = "jsk_bbs_"
basefileend = ".xml"
count = 1
while count < 12:
    # Construct the URL of each page and download it
    url = baseurl + "?page=" + str(count)
    conn = request.urlopen(url)
    doc = etree.HTML(conn.read())
    # Apply the XSLT transformation
    transform = etree.XSLT(xslt_root)
    result_tree = transform(doc)
    print(str(result_tree))
    # Save the result of this page to its own XML file
    file_obj = open(basefilebegin + str(count) + basefileend, 'w', encoding='utf-8')
    file_obj.write(str(result_tree))
    file_obj.close()
    count += 1
    time.sleep(2)

Compared with section 2.2, we added code that writes files and a loop that constructs the URL of each page. But what if the URL stays the same while paging through? That is in fact dynamic webpage content, which is discussed below.
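
As a side note, the query string could also be built with urllib.parse instead of string concatenation; a small sketch, not part of the original code:

from urllib import parse

baseurl = "http://www.gooseeker.com/cn/forum/7"
for page in range(1, 12):
    # urlencode handles escaping if the parameters ever contain special characters
    url = baseurl + "?" + parse.urlencode({"page": page})
    print(url)  # http://www.gooseeker.com/cn/forum/7?page=1 ... ?page=11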

3. Summary

This is part of the verification process for the open-source Python general crawler project. In a crawler framework, the other parts are easy to make generic; the hard part is extracting webpage content and transforming it into structured data, which we call the extractor. With the help of GooSeeker's visual extraction rule generator, MS, generating an extractor becomes very convenient and it can be plugged in in a standardized way, achieving a general-purpose crawler. Subsequent articles will explain how MS works together with Python.

4. Read more

The method described in this article is usually used to capture static webpage content, that is, content present in the so-called HTML document itself. Nowadays much website content is generated dynamically with JavaScript: the initial HTML does not contain it, and it is loaded afterwards, so dynamic techniques are required. For details, refer to "Python crawler uses Selenium + PhantomJS to capture Ajax and dynamic HTML content".
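
For reference, a minimal sketch of that dynamic approach, assuming the selenium package and a PhantomJS binary are installed (recent Selenium releases have removed PhantomJS support, so a headless Chrome or Firefox driver may be needed instead):

from selenium import webdriver
from lxml import etree

# Render the page in a headless browser so JavaScript-generated content is present
driver = webdriver.PhantomJS()
driver.get("http://www.gooseeker.com/cn/forum/7")
html = driver.page_source  # the HTML after scripts have run
driver.quit()

# The rendered HTML can then be fed into the same XSLT pipeline as above
doc = etree.HTML(html)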

5. GooSeeker open-source code download source

1. GooSeeker open-source Python web crawler GitHub Source

6. Document modification history

V2.0: added text instructions and the accompanying code

V2.1: added the source code download source in the last chapter

The above is all the content of this article. I hope it is helpful for your learning.
