Python uses XSLT to extract webpage data

Source: Internet
Author: User
Tags: xslt, python, web crawler

1. Introduction

In the article on the Python web crawler content extraction tool, we explained in detail the core component: the pluggable content extractor class gsExtractor. This article records the programming experiments performed while settling on the technical route for gsExtractor. This is the first part: we use XSLT to extract static webpage content and convert it into XML format.

2. Use the lxml library to extract webpage content

lxml is a Python library for processing XML quickly and flexibly. It supports XML Path Language (XPath) and Extensible Stylesheet Language Transformations (XSLT), and implements the common ElementTree API.
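
To make the mechanism concrete, here is a minimal sketch (illustrative only, not from the original experiment) of how etree.XSLT compiles a stylesheet and applies it to a document:

from lxml import etree

# A tiny XML document and a stylesheet that rewrites it
doc = etree.XML("<root><item>hello</item></root>")
xslt = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<greeting><xsl:value-of select="/root/item/text()"/></greeting>
</xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)  # compile the stylesheet into a callable
result = transform(doc)       # apply it to the document
print(result)                 # serialized result: <greeting>hello</greeting>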

Over the past two days, I tested how to extract webpage content through XSLT in Python. My notes are as follows:

2.1 Capture target

Suppose we want to extract the titles and reply counts of posts on the old forum of the GooSeeker official website, capturing the entire list and saving it in XML format, as in the screenshot below:

[Screenshot: the target forum topic list with titles and reply counts]

2.2 Source code 1: capture only the current page and print the result to the console

The advantage of Python is that it solves a problem with a small amount of code. The code below may look long, but there are actually only a few Python function calls; the XSLT script occupies most of the space, and in this program it is just one long string. For why we chose XSLT rather than discrete XPath expressions or headache-inducing regular expressions, see the Python Real-Time Web Crawler Project startup instructions; we hope this architecture saves programmers more than half of their time.
You can copy and run the following code (tested on Windows 10 with Python 3.2):

from urllib import request
from lxml import etree

# Download the page and parse it into an HTML document tree
url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

# The whole XSLT stylesheet is just one long string in this program
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>

<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'">
<xsl:value-of select="a/text()"/>
</xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'">
<xsl:value-of select="text()"/>
</xsl:if>
</replies>
</item>
</xsl:template>

<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

# Compile the stylesheet, apply it to the document, and print the result
transform = etree.XSLT(xslt_root)
result_tree = transform(doc)
print(result_tree)

Please download the source code from the GitHub source at the end of this article.
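
For contrast with the "discrete XPath" approach mentioned in 2.2, the same two fields could also be pulled with individual XPath calls. This is a rough sketch only (the selectors are copied from the stylesheet above, not from the original article), reusing the doc variable from the code above:

# Discrete-XPath sketch: one query per field, paired up afterwards
titles = doc.xpath("//*[@id='forum']//tr//*[@class='topic']/a/text()")
replies = doc.xpath("//*[@id='forum']//tr//*[@class='replies']/text()")
for title, n in zip(titles, replies):
    print(title, n)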

2.3 Capture results

The results are as follows:

[Screenshot: the extracted XML output shown on the console]

2.4 Source code 2: capture multiple pages and save the results to files

We can further modify the code in section 2.2 to add pagination capture and save the results to files. The code is as follows:

from urllib import request
from lxml import etree
import time

# The same XSLT stylesheet as in section 2.2
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>

<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'">
<xsl:value-of select="a/text()"/>
</xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'">
<xsl:value-of select="text()"/>
</xsl:if>
</replies>
</item>
</xsl:template>

<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

baseurl = "http://www.gooseeker.com/cn/forum/7"
basefilebegin = "jsk_bbs_"
basefileend = ".xml"
count = 1
while count < 12:
    # Construct the URL of each page and download it
    url = baseurl + "?page=" + str(count)
    conn = request.urlopen(url)
    doc = etree.HTML(conn.read())
    # Apply the XSLT transformation
    transform = etree.XSLT(xslt_root)
    result_tree = transform(doc)
    print(str(result_tree))
    # Save the result of this page to its own XML file
    file_obj = open(basefilebegin + str(count) + basefileend, 'w', encoding='utf-8')
    file_obj.write(str(result_tree))
    file_obj.close()
    count += 1
    time.sleep(2)

Compared with section 2.2, we added code that writes files and a loop that constructs the URL of each page. But what if the URL stays the same while paging through? That is in fact dynamic webpage content, which is discussed below.
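
As a side note, the query string could also be built with urllib.parse instead of string concatenation; a small sketch, not part of the original code:

from urllib import parse

baseurl = "http://www.gooseeker.com/cn/forum/7"
for page in range(1, 12):
    # urlencode handles escaping if the parameters ever contain special characters
    url = baseurl + "?" + parse.urlencode({"page": page})
    print(url)  # http://www.gooseeker.com/cn/forum/7?page=1 ... ?page=11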

3. Summary

This is part of the verification process for the open-source Python general crawler project. In a crawler framework, the other parts are easy to make generic; the hard part is extracting webpage content and transforming it into structured data, which we call the extractor. With the help of GooSeeker's visual extraction rule generator, MS, generating an extractor becomes very convenient and it can be plugged in in a standardized way, achieving a general-purpose crawler. Subsequent articles will explain how MS works together with Python.

4. Read more

The method described in this article is usually used to capture static webpage content, that is, content present in the so-called HTML document itself. Nowadays much website content is generated dynamically with JavaScript: the initial HTML does not contain it, and it is loaded afterwards, so dynamic techniques are required. For details, refer to "Python crawler uses Selenium + PhantomJS to capture Ajax and dynamic HTML content".
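
For reference, a minimal sketch of that dynamic approach, assuming the selenium package and a PhantomJS binary are installed (recent Selenium releases have removed PhantomJS support, so a headless Chrome or Firefox driver may be needed instead):

from selenium import webdriver
from lxml import etree

# Render the page in a headless browser so JavaScript-generated content is present
driver = webdriver.PhantomJS()
driver.get("http://www.gooseeker.com/cn/forum/7")
html = driver.page_source  # the HTML after scripts have run
driver.quit()

# The rendered HTML can then be fed into the same XSLT pipeline as above
doc = etree.HTML(html)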

5. GooSeeker open-source code download source

1. GooSeeker open-source Python web crawler GitHub Source

6. Document modification history

V2.0: added text instructions and the accompanying code

V2.1: added the source code download source in the last chapter

The above is all the content of this article. I hope it is helpful for your learning.
