1. Project background
In the Python Instant Web Crawler project launch note we discussed a telling statistic: programmers spend a large share of their time debugging content-extraction rules. That is why we launched this project: to free programmers from tedious rule debugging so they can move on to higher-value data-processing work.
2. The solution
To solve this problem, we isolate the extractor, the component that determines both generality and efficiency, and organize processing as shown in the following data-flow diagram:
The "pluggable extractor" in the diagram must be highly modular. Its key interfaces are:
- Standardized input: accepts a standard HTML DOM object as input
- Standardized content extraction: uses standard XSLT templates to extract web content
- Standardized output: returns the content extracted from a web page as standard XML
- Explicit extractor plug-in interface: the extractor is a well-defined class that interacts with the crawler engine module through class methods (a minimal sketch of this contract follows the list)
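To make the plug contract concrete, here is a minimal sketch from the crawler engine's point of view. This is illustrative code, not part of the project: the function `crawl_one_page` and its wiring are hypothetical; only the `extract()` method and the HTML-DOM-in / XML-out conventions come from the interface list above.

```python
# Hypothetical engine-side wiring (illustration only): any extractor object that
# exposes extract(html_dom) and returns XML satisfies the plug interface.
from lxml import etree

def crawl_one_page(raw_html, extractor):
    dom = etree.HTML(raw_html)           # standardized input: a standard HTML DOM object
    xml_result = extractor.extract(dom)  # standardized extraction behind the plug interface
    return str(xml_result)               # standardized output: extracted content as XML text
```

Because the engine depends only on this narrow contract, extractors can be swapped without touching the crawler code.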
3. Extractor code
The pluggable extractor is the core component of the instant web crawler project. It is defined as a class: GsExtractor.
The Python source code file and its documentation can be downloaded from GitHub.
The usage pattern is as follows (a minimal sketch follows this list):
- Instantiate a GsExtractor object
- Set the XSLT extractor for the object, which amounts to configuring it (using one of the three setXsltFromXxx() methods)
- Feed it the HTML DOM and receive the XML output (using the extract() method)
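Here is a minimal sketch of that three-step pattern. It assumes the gooseeker.py module from the GitHub link is on the import path; the inline template is a made-up example that simply pulls out the page title.

```python
from lxml import etree
from gooseeker import GsExtractor

# step 1: instantiate the extractor
extractor = GsExtractor()

# step 2: configure it (setXsltFromFile and setXsltFromAPI are the other two options)
extractor.setXsltFromMem("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <titles><xsl:value-of select="//title"/></titles>
  </xsl:template>
</xsl:stylesheet>""")

# step 3: feed it an HTML DOM and receive XML output
dom = etree.HTML("<html><head><title>hello</title></head></html>")
print(str(extractor.extract(dom)))  # prints the XML result containing <titles>hello</titles>
```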
Here is the source code of the GsExtractor class:
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
# module name: gooseeker
# class name: GsExtractor
# version: 2.0
# description: HTML content extractor
# features: uses an XSLT template to quickly extract content from an HTML DOM
# released by GooSeeker (http://www.gooseeker.com) in May 2016
# github: https://github.com/fullerhua/jisou/core/gooseeker.py

from urllib import request
from urllib.parse import quote
from lxml import etree


class GsExtractor(object):
    def __init__(self):
        self.xslt = ""

    # read the XSLT from a file
    def setXsltFromFile(self, xsltFilePath):
        file = open(xsltFilePath, 'r', encoding='UTF-8')
        try:
            self.xslt = file.read()
        finally:
            file.close()

    # take the XSLT from a string
    def setXsltFromMem(self, xsltStr):
        self.xslt = xsltStr

    # obtain the XSLT through the GooSeeker API
    def setXsltFromAPI(self, APIKey, theme, middle=None, bname=None):
        apiurl = ("http://www.gooseeker.com/api/getextractor?key=" + APIKey
                  + "&theme=" + quote(theme))
        if middle:
            apiurl = apiurl + "&middle=" + quote(middle)
        if bname:
            apiurl = apiurl + "&bname=" + quote(bname)
        apiconn = request.urlopen(apiurl)
        self.xslt = apiconn.read()

    # return the current XSLT
    def getXslt(self):
        return self.xslt

    # extraction method: takes an HTML DOM object and returns the extraction result
    def extract(self, html):
        xslt_root = etree.XML(self.xslt)
        transform = etree.XSLT(xslt_root)
        result_tree = transform(html)
        return result_tree
```
4. Usage examples
Below is an example program that demonstrates how to use the GsExtractor class to extract the list of posts from the BBS on GooSeeker's official website. This example has the following characteristics:
- The XSLT template used by the extractor is prepared in advance in a file: xslt_bbs.xml (a hypothetical sketch of such a template appears after this list)
- This is only one option; in real usage scenarios there are several sources of XSLT, the most mainstream being the APIs on the GooSeeker platform
- The extraction result is printed to the console
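The article does not reproduce xslt_bbs.xml itself. For orientation, a template of this kind has roughly the following shape. This is a hypothetical sketch, not the real GooSeeker template: the select paths and element names (`list`, `item`, `title`) are invented placeholders.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <list>
      <!-- placeholder paths; a real template targets the forum's actual markup -->
      <xsl:for-each select="//ul[@class='topiclist']/li">
        <item>
          <title><xsl:value-of select=".//a"/></title>
        </item>
      </xsl:for-each>
    </list>
  </xsl:template>
</xsl:stylesheet>
```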
Below is the source code, which can also be downloaded from GitHub:
```python
# -*- coding: utf-8 -*-
# sample program using the GsExtractor class
# visits the GooSeeker forum and extracts the forum content with an XSLT template
# the XSLT is saved in xslt_bbs.xml

from urllib import request
from lxml import etree
from gooseeker import GsExtractor

# fetch and read the web page content
url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

# create a GsExtractor object
bbsExtra = GsExtractor()
# call a set method to load the XSLT content
bbsExtra.setXsltFromFile("xslt_bbs.xml")
# call the extract method to obtain the desired content
result = bbsExtra.extract(doc)

# show the extraction result
print(str(result))
```
The extraction results are shown in the following illustration:
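As noted earlier, the most mainstream XSLT source in practice is the GooSeeker platform API. For completeness, here is a sketch of the same example configured that way. The API key and theme name are placeholders; the real values come from your account on the platform.

```python
from urllib import request
from lxml import etree
from gooseeker import GsExtractor

# fetch the same forum page
url = "http://www.gooseeker.com/cn/forum/7"
doc = etree.HTML(request.urlopen(url).read())

bbsExtra = GsExtractor()
# placeholders: substitute a real API key and extractor theme name
bbsExtra.setXsltFromAPI("YOUR_API_KEY", "YOUR_THEME_NAME")
print(str(bbsExtra.extract(doc)))
```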
That is the entire content of this article. I hope it helps with your learning, and I hope you will continue to support the Yunqi Community.