Python real-time web crawler project: definition of the content extractor
1. Project Background
In the kickoff notes for the Python real-time web crawler project, we cited a figure: programmers waste too much of their time debugging content extraction rules. That is why we started this project: to free programmers from tedious rule debugging and let them focus on higher-value data processing.
The project has attracted a lot of attention since launch. Because the code is open source, you can develop it further on the basis of the ready-made source. However, Python 3 and Python 2 differ: the source code published in "Python real-time web crawler project: definition of the content extractor" cannot be used under Python 2.7. This article therefore publishes a Python 2.7 version of the content extraction tool.
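As one concrete example of that incompatibility (a minimal sketch; the extractor class below imports urlopen from urllib2, a module that Python 3 moved to urllib.request):

# one Python 2 / Python 3 difference that breaks the shared source:
try:
    from urllib2 import urlopen            # Python 2.7
except ImportError:
    from urllib.request import urlopen     # Python 3.x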
2. Solution
To solve this problem, we isolate the extractor, the part that affects universality and working efficiency, and describe the data processing flow in the following flowchart:
In the flowchart, the pluggable extractor must be highly modular, so its key interfaces are (a minimal sketch follows the list):
Standardized input: a standard HTML DOM object is used as the input.
Standardized content extraction: standard XSLT templates are used to extract web page content.
Standardized output: the content extracted from the web page is output in standard XML format.
Explicit extractor plug-in interface: the extractor is a clearly defined class that interacts with the crawler engine module through class methods.
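To make these interfaces concrete, here is a minimal sketch of the input/extraction/output flow using lxml alone; the inline HTML and the XSLT template are made-up examples, not GooSeeker-generated rules:

from lxml import etree

# a made-up xslt template (standardized extraction)
xslt_str = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<titles>
<xsl:for-each select="//h2">
<title><xsl:value-of select="."/></title>
</xsl:for-each>
</titles>
</xsl:template>
</xsl:stylesheet>"""

dom = etree.HTML("<html><body><h2>Topic A</h2><h2>Topic B</h2></body></html>")  # standardized input: html dom
transform = etree.XSLT(etree.XML(xslt_str))  # compile the xslt template
print(str(transform(dom)))                   # standardized output: xml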
3. Extractor code
The pluggable extractor is the core component of the real-time web crawler project. It is defined as a class: GsExtractor. The Python 2.7 source file and its instructions can be downloaded from GitHub.
The usage mode is as follows (a minimal sketch appears after the list):
Instantiate a GsExtractor object.
Set the XSLT extractor for this object, which amounts to configuring it (using one of the three setXsltFromXXX() methods).
Feed in the HTML DOM to get the XML output (using the extract() method).
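For example, a minimal sketch of these three steps (douban.xsl and page.html are hypothetical local files; gooseeker_py2 is the module published below):

from lxml import etree
from gooseeker_py2 import GsExtractor

extractor = GsExtractor()                   # step 1: instantiate
extractor.setXsltFromFile("douban.xsl")     # step 2: configure with an xslt template (hypothetical file)
dom = etree.HTML(open("page.html").read())  # build the html dom input (hypothetical file)
print(str(extractor.extract(dom)))          # step 3: extract and print the xml output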
The following is the source code of the GsExtractor class (applicable to Python 2.7):
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Module name: gooseeker_py2
# Class name: GsExtractor
# Version: 2.0
# Python version: 2.7
# Description: html content extraction tool
# Function: uses xslt as a template to quickly extract content from an html dom
# Released by GooSeeker (http://www.gooseeker.com) on May 18, 2016
# github: https://github.com/FullerHua/jisou/core/gooseeker_py2.py

from urllib2 import urlopen
from urllib import quote
from lxml import etree

class GsExtractor(object):
    def __init__(self):
        self.xslt = ""

    # read the xslt from a file
    def setXsltFromFile(self, xsltFilePath):
        file = open(xsltFilePath, 'r')
        try:
            self.xslt = file.read()
        finally:
            file.close()

    # obtain the xslt from a string in memory
    def setXsltFromMem(self, xsltStr):
        self.xslt = xsltStr

    # use the GooSeeker API to obtain the xslt
    def setXsltFromAPI(self, APIKey, theme, middle=None, bname=None):
        apiurl = "http://www.gooseeker.com/api/getextractor?key=" + APIKey + "&theme=" + quote(theme)
        if (middle):
            apiurl = apiurl + "&middle=" + quote(middle)
        if (bname):
            apiurl = apiurl + "&bname=" + quote(bname)
        apiconn = urlopen(apiurl)
        self.xslt = apiconn.read()

    # return the current xslt
    def getXslt(self):
        return self.xslt

    # extraction method: the input parameter is an html dom object, the return value is the extraction result
    def extract(self, html):
        xslt_root = etree.XML(self.xslt)
        transform = etree.XSLT(xslt_root)
        result_tree = transform(html)
        return result_tree
4. Usage example
The following example program demonstrates how to use the GsExtractor class to extract topics from a Douban discussion group. This example has the following features:
The extractor's content (the XSLT) is obtained through the API of the GooSeeker platform.
The result files are saved to the current folder.
The following is the source code, which can also be downloaded from GitHub:
# _*_ coding: utf8 _*_
# douban_py2.py
# Crawl topics from a Douban discussion group
# Python version: 2.7

from lxml import etree
from gooseeker_py2 import GsExtractor
from selenium import webdriver
import time

class PhantomSpider:
    def getContent(self, url):
        browser = webdriver.PhantomJS(executable_path=r'C:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
        browser.get(url)
        time.sleep(3)
        html = browser.execute_script("return document.documentElement.outerHTML")
        output = etree.HTML(html)
        return output

    def saveContent(self, filepath, content):
        file_obj = open(filepath, 'w')
        file_obj.write(content)
        file_obj.close()

doubanExtra = GsExtractor()
# The following line calls the GooSeeker API to set the xslt crawling rule
# The first parameter is the app key; apply for one at GooSeeker's Member Center
# The second parameter is the rule name, generated with GooSeeker's graphical tool MS (谋数台)
doubanExtra.setXsltFromAPI("ffd5273e213036d812ea298922e2627b", "Douban group discussion")

url = "https://www.douban.com/group/haixiuzu/discussion?start="
totalpages = 5
doubanSpider = PhantomSpider()
print("crawling start")
for pagenumber in range(1, totalpages):
    currenturl = url + str((pagenumber - 1) * 25)
    print("crawling", currenturl)
    content = doubanSpider.getContent(currenturl)
    outputxml = doubanExtra.extract(content)
    outputfile = "result" + str(pagenumber) + ".xml"
    doubanSpider.saveContent(outputfile, str(outputxml))
print("crawling ends")
[Figure: extraction result of the Douban example]
This article has explained the value and usage of the extractor, but not how to generate one; only rapid generation of extractors can truly save developers' time.