Python Instant Web Crawler Project: Definition of the Content Extractor

Source: Internet
Author: User
Tags: xslt

1. Project Background
In the Python Instant Web Crawler Project Launch Instructions we discussed a pain point: programmers waste too much time debugging content extraction rules. We launched this project to free programmers from cumbersome rule debugging so they can put their effort into higher-end data processing.

2. Solution
To solve this problem, we isolate the extractor, the component that most affects universality and efficiency, and describe the data processing flow in the following chart:

[Figure: data processing flowchart — http://www.gooseeker.com/doc/data/attachment/forum/201605/19/165346f6xui8rox8oo68uf.png]

The "Pluggable extractor" in the figure must be highly modular, so the key interfaces are:

    • Normalized input: As input to standard HTML DOM objects

    • Standardized content extraction: extracting Web content using standard XSLT templates

    • Normalized output: Output extracted from a Web page in a standard XML format

    • Explicit extractor plug-in interface: Extractor is a well-defined class that interacts with the Crawler engine module through class methods

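To make the plug-in relationship concrete, here is a minimal sketch (not part of the project source) of how a crawler engine module might drive an extractor through these interfaces. It uses the GsExtractor class defined in section 3 below; the function name crawl_page and the template file my_template.xml are illustrative placeholders.

# Minimal sketch: a crawler engine driving a pluggable extractor.
# GsExtractor is the class defined in section 3; crawl_page and
# "my_template.xml" are illustrative placeholders.
from urllib import request
from lxml import etree
from gooseeker import GsExtractor

def crawl_page(url, extractor):
    # normalized input: build a standard HTML DOM object from the page
    dom = etree.HTML(request.urlopen(url).read())
    # standardized extraction / normalized output: XSLT template in, XML out
    return extractor.extract(dom)

if __name__ == '__main__':
    extractor = GsExtractor()
    extractor.setXsltFromFile("my_template.xml")  # placeholder XSLT template file
    print(str(crawl_page("http://www.gooseeker.com/cn/forum/7", extractor)))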

3. Extractor code
The pluggable extractor is the core component of the instant web crawler project. It is defined as a class: GsExtractor.
The Python source code file and its documentation can be downloaded from GitHub.

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Module Name: gooseeker
# Class Name: GsExtractor
# Version: 1.0
# Description: HTML content extractor
# Feature: uses XSLT as a template to quickly extract content from an HTML DOM
# Released by GooSeeker (http://www.gooseeker.com) on May 18, 2016
# GitHub: https://github.com/FullerHua/jisou/core/gooseeker.py

from urllib import request
from urllib.parse import quote
from lxml import etree


class GsExtractor(object):
    def __init__(self):
        self.xslt = ""

    # read the XSLT from a file
    def setXsltFromFile(self, xsltFilePath):
        file = open(xsltFilePath, 'r', encoding='UTF-8')
        try:
            self.xslt = file.read()
        finally:
            file.close()

    # get the XSLT from a string
    def setXsltFromMem(self, xsltStr):
        self.xslt = xsltStr

    # obtain the XSLT through the GooSeeker API interface
    def setXsltFromAPI(self, apiKey, theme):
        apiurl = "http://test.gooseeker.com/api/getextractor?key=" + apiKey + "&theme=" + quote(theme)
        apiconn = request.urlopen(apiurl)
        self.xslt = apiconn.read()

    # return the current XSLT
    def getXslt(self):
        return self.xslt

    # extraction method: the entry parameter is an HTML DOM object, the return value is the extraction result
    def extract(self, html):
        xslt_root = etree.XML(self.xslt)
        transform = etree.XSLT(xslt_root)
        result_tree = transform(html)
        return result_tree


4. Usage example
Below is an example program that shows how to use the GsExtractor class to extract a list of BBS posts from the GooSeeker website. This example has the following characteristics:

    • The extractor's XSLT is stored in a file in advance: xslt_bbs.xml

    • This is only an example; in actual usage scenarios there are multiple XSLT sources, the most mainstream being the GooSeeker API (a sketch of the alternative sources appears after the extraction results below)

    • The extraction results are printed on the console interface


Download from GitHub.

# -*- coding: utf-8 -*-
# Sample program using the GsExtractor class
# Accesses the GooSeeker forum and extracts the forum content, with XSLT as the template
# The XSLT is saved in xslt_bbs.xml

from urllib import request
from lxml import etree
from gooseeker import GsExtractor

# access and read the web page content
url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

# generate a GsExtractor object
bbsExtra = GsExtractor()
# call the set method to load the XSLT content
bbsExtra.setXsltFromFile("xslt_bbs.xml")
# call the extract method to extract the required content
result = bbsExtra.extract(doc)

# show the extraction result
print(str(result))
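As a small supplementary sketch (not part of the original example), the value returned by extract() is an lxml result tree, so it can also be written to a file rather than only printed to the console; this assumes the result variable from the program above, and the output file name is an illustrative placeholder.

# Sketch only: save the extraction result to a file instead of the console.
# Assumes the `result` variable produced by the example program above;
# "bbs_result.xml" is an illustrative placeholder file name.
with open("bbs_result.xml", "w", encoding="UTF-8") as f:
    f.write(str(result))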


The extraction results are as follows:
[Figure: screenshot of the extraction results — http://www.gooseeker.com/doc/data/attachment/forum/201605/20/115959xzwhowcbownoaco9.png]
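The example loads the XSLT from a file, but as noted in the characteristics list above, there are other XSLT sources. The following minimal sketch (not from the original article) shows the alternatives using the setXsltFromMem and setXsltFromAPI methods defined in section 3; the inline template, API key, and theme name are illustrative placeholders rather than working values.

# Sketch only: alternative ways to load the XSLT template.
from gooseeker import GsExtractor

extractor = GsExtractor()

# 1) From a string in memory, via setXsltFromMem.
#    The template below is an illustrative placeholder, not the real xslt_bbs.xml:
#    it simply collects the text of every link on the page.
minimal_xslt = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <titles>
      <xsl:for-each select="//a">
        <title><xsl:value-of select="."/></title>
      </xsl:for-each>
    </titles>
  </xsl:template>
</xsl:stylesheet>"""
extractor.setXsltFromMem(minimal_xslt)

# 2) From the GooSeeker API, via setXsltFromAPI (the mainstream source).
#    The key and theme below are placeholders.
# extractor.setXsltFromAPI("your-api-key", "your-theme-name")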

5. Further Reading
This article has explained the value and usage of the extractor, but has not yet said how to generate one; only rapid generation of extractors achieves the goal of saving developers' time. That question will be explained in another article: see 1-Minute Fast Generation of XSLT Templates for Web Page Content Extraction.

6. History of Document Modification
2016-05-27: v2.0, added the project background introduction and value statement
2016-05-27: v2.1, implemented the Extractor class method for obtaining the XSLT from the GooSeeker API interface


This article is from the "Fullerhua" blog; reproduction is declined.
