Python Web Crawler Project: Definition of the Content Extractor

Source: Internet
Author: User
Tags: xslt, python, web crawler

1. Project background

In the Python Instant Web Crawler Project launch note we discussed a figure: programmers waste too much time debugging content-extraction rules. That is why we launched this project: to free programmers from tedious rule debugging so they can move on to higher-value data-processing work.

2. The solution

To solve this problem, we isolate the extractor, the component that most affects generality and efficiency, and describe the data-processing flow in the following diagram:

The "Pluggable Extractor" in the diagram must be highly modular, and the key interfaces are:

    1. Standardized input: accepts a standard HTML DOM object as input
    2. Standardized content extraction: uses standard XSLT templates to extract web content
    3. Standardized output: returns the content extracted from a web page in a standard XML format
    4. Explicit extractor plug-in interface: the extractor is a well-defined class that interacts with the crawler engine module through class methods (see the sketch after this list)
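
To make the plug-in interface concrete, here is a minimal sketch, not taken from the project code, of how a crawler engine might drive any extractor that follows the contract above: a standard HTML DOM goes in, an XML result comes out. The fetch_dom helper and run_engine loop are illustrative assumptions only.

# Illustrative sketch: a crawler engine that accepts any pluggable extractor
# exposing an extract(html_dom) method, as described in the list above.
from urllib import request
from lxml import etree

def fetch_dom(url):
    # Hypothetical helper: download a page and parse it into an HTML DOM object
    with request.urlopen(url) as conn:
        return etree.HTML(conn.read())

def run_engine(urls, extractor):
    # The engine knows nothing about extraction rules; it only relies on the
    # standardized extract() interface and collects the XML results it returns.
    results = []
    for url in urls:
        dom = fetch_dom(url)                    # standardized input: HTML DOM
        results.append(extractor.extract(dom))  # standardized output: XML result
    return results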

3. Extractor code

The pluggable extractor is the core component of the instant web crawler project. It is defined as a class: GsExtractor.

The Python source file and its documentation can be downloaded from GitHub.

The usage pattern is this:

    1. Instantiate a GsExtractor object
    2. Set the XSLT extractor for the object, which amounts to configuring it (using one of the three setXsltFromXxx() methods)
    3. Input the HTML DOM to it and get the XML output (using the extract() method); a minimal sketch follows the class source below

Here is the source code for the GsExtractor class:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Module name: gooseeker
# Class name: GsExtractor
# Version: 2.0
# Description: HTML content extractor
# Features: uses XSLT as the template to quickly extract content from the HTML DOM
# Released by GooSeeker (http://www.gooseeker.com) on May, 2016
# github: https://github.com/fullerhua/jisou/core/gooseeker.py

from urllib import request
from urllib.parse import quote
from lxml import etree
import time

class GsExtractor(object):
    def __init__(self):
        self.xslt = ""

    # Read the XSLT from a file
    def setXsltFromFile(self, xsltFilePath):
        file = open(xsltFilePath, 'r', encoding='UTF-8')
        try:
            self.xslt = file.read()
        finally:
            file.close()

    # Get the XSLT from a string
    def setXsltFromMem(self, xsltStr):
        self.xslt = xsltStr

    # Obtain the XSLT through the GooSeeker API
    def setXsltFromAPI(self, APIKey, theme, middle=None, bname=None):
        apiurl = "http://www.gooseeker.com/api/getextractor?key=" + APIKey + "&theme=" + quote(theme)
        if (middle):
            apiurl = apiurl + "&middle=" + quote(middle)
        if (bname):
            apiurl = apiurl + "&bname=" + quote(bname)
        apiconn = request.urlopen(apiurl)
        self.xslt = apiconn.read()

    # Return the current XSLT
    def getXslt(self):
        return self.xslt

    # Extraction method: the input parameter is an HTML DOM object, the return value is the extraction result
    def extract(self, html):
        xslt_root = etree.XML(self.xslt)
        transform = etree.XSLT(xslt_root)
        result_tree = transform(html)
        return result_tree
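
To make the three-step usage pattern concrete, here is a minimal sketch that configures the extractor from a string with setXsltFromMem() and then serializes the result. The XSLT template below is a made-up placeholder, not one of GooSeeker's real templates. Note that extract() returns an lxml XSLT result tree, which can be turned into text with str() or etree.tostring().

from lxml import etree
from gooseeker import GsExtractor

# Placeholder XSLT template (illustration only): copies every <title> element
# into the output document. Real templates come from the GooSeeker platform.
SAMPLE_XSLT = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <titles><xsl:copy-of select="//title"/></titles>
  </xsl:template>
</xsl:stylesheet>"""

extractor = GsExtractor()                # step 1: instantiate
extractor.setXsltFromMem(SAMPLE_XSLT)    # step 2: configure with an XSLT template
doc = etree.HTML("<html><head><title>demo</title></head><body/></html>")
result = extractor.extract(doc)          # step 3: feed in the DOM, get XML out

print(str(result))                                      # serialize to a string
xml_bytes = etree.tostring(result, pretty_print=True)   # or serialize with lxml
with open("output.xml", "wb") as f:
    f.write(xml_bytes)                                  # save the extracted XML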

4. Usage examples
Here is an example program that demonstrates how to use the GsExtractor class to extract the list of BBS posts on GooSeeker's official website. This example has the following characteristics:

    1. The XSLT template used by the extractor is placed in a file in advance: xslt_bbs.xml
    2. This is only one example; in real usage scenarios there are multiple sources of XSLT, the most mainstream being the API of the GooSeeker platform
    3. The extraction results are printed on the console

The following is the source code, which can also be downloaded from GitHub:

# -*- coding: utf8 -*-
# Sample program using the GsExtractor class
# Visits the GooSeeker forum and extracts the forum content with an XSLT template
# The XSLT is saved in xslt_bbs.xml
from urllib import request
from lxml import etree
from gooseeker import GsExtractor

# Access and read the web page content
url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

# Generate a GsExtractor object
bbsExtra = GsExtractor()
# Call the set method to set the XSLT content
bbsExtra.setXsltFromFile("xslt_bbs.xml")
# Call the extract method to extract the desired content
result = bbsExtra.extract(doc)
# Show the extraction result
print(str(result))
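
As mentioned in the characteristics above, the most common source of XSLT in real scenarios is the GooSeeker platform API. A variant of the same program that uses the setXsltFromAPI() method defined in the class above could look like the following; the API key and theme name are placeholders you would replace with your own values.

from urllib import request
from lxml import etree
from gooseeker import GsExtractor

# Access and read the web page content
url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

bbsExtra = GsExtractor()
# Placeholder credentials: substitute your own GooSeeker API key and theme name
bbsExtra.setXsltFromAPI("YOUR_API_KEY", "YOUR_THEME_NAME")
result = bbsExtra.extract(doc)
print(str(result))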

The extraction results are shown in the following illustration:

That is the entire content of this article. I hope it is helpful for your study, and I hope you will continue to support the community.
