Python Instant Web Crawler Project: Definition of the Content Extractor


1. Project background

In the Python Instant Web Crawler Project Launch Note we discussed a problem: programmers waste too much time debugging content extraction rules. We therefore launched this project to free programmers from cumbersome rule debugging and let them focus on higher-value data processing.

The project has attracted a lot of attention since it was open-sourced, because developers can build on ready-made source code. However, Python 3 and Python 2 are different: the source code published in "Python Instant Web Crawler Project: Definition of the Content Extractor" cannot be used under Python 2.7, so this article publishes a Python 2.7 version of the content extractor.

2. Solution

To solve this problem, we isolate the extractor, the component that determines both generality and efficiency, behind a pluggable interface, as described by the following data processing flowchart:

[Figure: data processing flowchart showing the crawler engine with a pluggable extractor]

The "Pluggable extractor" in the figure must be highly modular, so the key interfaces are:

Normalized input: accepts a standard HTML DOM object as input

Standardized content extraction: extracts web content using a standard XSLT template (a minimal illustrative template follows this list)

Normalized output: outputs the content extracted from a web page as standard XML

Explicit extractor plug-in interface: the extractor is a well-defined class that interacts with the crawler engine module through class methods
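
For readers unfamiliar with XSLT, here is a minimal hand-written template of the kind this interface consumes. It is only an illustrative sketch, not a template generated by the GooSeeker platform; the element names and the class selector are made-up assumptions:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- collect every link whose class is 'title' (hypothetical selector) -->
  <xsl:template match="/">
    <topics>
      <xsl:for-each select="//a[@class='title']">
        <topic>
          <title><xsl:value-of select="."/></title>
          <link><xsl:value-of select="@href"/></link>
        </topic>
      </xsl:for-each>
    </topics>
  </xsl:template>
</xsl:stylesheet>

Applied to an HTML DOM, this template produces a standard XML document with one <topic> element per matched link, which is exactly the normalized output described above.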

3. Extractor code

The pluggable extractor is the core component of the instant web crawler project. It is defined as a class named GsExtractor; the Python 2.7 source code file and its documentation can be downloaded from GitHub.

The usage pattern is this:

Instantiate a GsExtractor object

Set an XSLT extractor on the object, which is equivalent to configuring it (using one of the three setXsltFromXXX() methods)

Feed it the HTML DOM and get the XML output back (using the extract() method)

Here is the source code of the GsExtractor class (for Python 2.7):

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Module name: gooseeker_py2
# Class name: GsExtractor
# Version: 2.0
# Adapted Python version: 2.7
# Description: HTML content extraction
# Function: use an XSLT template to quickly extract content from an HTML DOM
# Released by GooSeeker (http://www.gooseeker.com) in May 2016
# GitHub: https://github.com/fullerhua/jisou/core/gooseeker_py2.py

from urllib2 import urlopen
from urllib import quote
from lxml import etree

class GsExtractor(object):
    def __init__(self):
        self.xslt = ""

    # read the XSLT from a file
    def setXsltFromFile(self, xsltFilePath):
        file = open(xsltFilePath, 'r')
        try:
            self.xslt = file.read()
        finally:
            file.close()

    # get the XSLT from a string
    def setXsltFromMem(self, xsltStr):
        self.xslt = xsltStr

    # obtain the XSLT through the GooSeeker API interface
    def setXsltFromAPI(self, APIKey, theme, middle=None, bname=None):
        apiurl = "http://www.gooseeker.com/api/getextractor?key=" + APIKey + "&theme=" + quote(theme)
        if (middle):
            apiurl = apiurl + "&middle=" + quote(middle)
        if (bname):
            apiurl = apiurl + "&bname=" + quote(bname)
        apiconn = urlopen(apiurl)
        self.xslt = apiconn.read()

    # return the current XSLT
    def getXslt(self):
        return self.xslt

    # extract method: the input parameter is an HTML DOM object, the return value is the extraction result
    def extract(self, html):
        xslt_root = etree.XML(self.xslt)
        transform = etree.XSLT(xslt_root)
        result_tree = transform(html)
        return result_tree
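
To see the three-step usage pattern above end to end, here is a minimal self-contained sketch; the inline XSLT and the sample HTML are made up for illustration and are not GooSeeker-generated:

# -*- coding: utf-8 -*-
# Minimal usage sketch for GsExtractor (Python 2.7); XSLT and HTML are illustrative only.
from lxml import etree
from gooseeker_py2 import GsExtractor

MINI_XSLT = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <items>
      <xsl:for-each select="//a">
        <item><xsl:value-of select="."/></item>
      </xsl:for-each>
    </items>
  </xsl:template>
</xsl:stylesheet>"""

extractor = GsExtractor()              # step 1: instantiate
extractor.setXsltFromMem(MINI_XSLT)    # step 2: configure with an XSLT template
dom = etree.HTML("<html><body><a href='#'>hello</a></body></html>")
print(extractor.extract(dom))          # step 3: HTML DOM in, XML out

Running this prints <items><item>hello</item></items>, the normalized XML output.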

4. Usage examples

Here is an example program that shows how to use the GsExtractor class to extract Douban group discussion topics. This example has the following characteristics:

The XSLT content of the extractor is obtained through the API of the GooSeeker platform

The result files are saved to the current folder

Here is the source code, which can also be downloaded from GitHub:

# _*_ coding: utf8 _*_
# douban_py2.py
# Crawl Douban group discussion topics
# Python version: 2.7

from lxml import etree
from gooseeker_py2 import GsExtractor
from selenium import webdriver
import time

class PhantomSpider:
    def getContent(self, url):
        browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
        browser.get(url)
        time.sleep(3)
        html = browser.execute_script("return document.documentElement.outerHTML")
        output = etree.HTML(html)
        return output

    def saveContent(self, filepath, content):
        file_obj = open(filepath, 'w')
        file_obj.write(content)
        file_obj.close()

doubanExtra = GsExtractor()
# The next line calls the GooSeeker API to set the XSLT crawl rule.
# The first parameter is the app key; apply for it at the GooSeeker member center.
# The second parameter is the rule name, generated with GooSeeker's graphical tool (MS);
# it must match the rule name defined on the platform.
doubanExtra.setXsltFromAPI("ffd5273e213036d812ea298922e2627b", "Douban group discussion topic")

url = "https://www.douban.com/group/haixiuzu/discussion?start="
totalPages = 5
doubanSpider = PhantomSpider()
print("Crawl start")
for pageNumber in range(1, totalPages + 1):  # crawl pages 1 through totalPages
    currentUrl = url + str((pageNumber - 1) * 25)
    print("Crawling " + currentUrl)
    content = doubanSpider.getContent(currentUrl)
    outputXml = doubanExtra.extract(content)
    outputFile = "result" + str(pageNumber) + ".xml"
    doubanSpider.saveContent(outputFile, str(outputXml))
print("Crawl end")
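
A note on the design: the example drives PhantomJS through Selenium rather than fetching the page with urllib2, presumably because the target page builds part of its content with JavaScript and the DOM must be captured after the browser has run it. The time.sleep(3) is a simple fixed wait for the page to finish rendering, and the executable_path points at a Windows PhantomJS install, so adjust both for your own environment.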

The extraction results are saved as result1.xml through result5.xml in the current folder.

This article has explained the value and usage of the extractor, but has not explained how to generate it; only the rapid generation of extractors achieves the goal of saving developers' time.
