Python real-time web crawler project: definition of the content extractor


1. Project Background

In the startup instructions of the Python real-time web crawler project, we discussed a figure: programmers waste too much time debugging content extraction rules. That is why we launched this project: to free programmers from tedious debugging rules so they can put their energy into higher-value data processing.

This project has received a lot of attention since its launch because the code is open source, so you can develop further on the basis of the ready-made source code. However, Python 3 and Python 2 are different: the source code in "Python real-time web crawler project: definition of the content extractor" cannot be used under Python 2.7. Therefore, this article publishes a Python 2.7 version of the content extractor.

2. Solution

To solve this problem, we isolate the part that affects universality and work efficiency into a pluggable extractor, described by the following data processing flowchart:

In the figure, the pluggable extractor must be highly modular, so the key interfaces are:

Standardized input: a standard html dom object is used as the input.

Standardized content extraction: a standard xslt template is used to extract the webpage content.

Standardized output: the content extracted from the webpage is output in standard XML format.

Explicit extractor plug-in interface: the extractor is a clearly defined class that interacts with the crawler engine module through class methods.
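To make these interfaces concrete, here is a minimal sketch of the kind of xslt template the extractor consumes. The selectors and element names are made-up illustrations, not a rule generated by GooSeeker; applied to an html dom, the template emits one xml record per matched node:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- match the whole document and wrap every result in one root element -->
    <xsl:template match="/">
        <topics>
            <!-- one <topic> record per hypothetical title link on the page -->
            <xsl:for-each select="//td[@class='title']/a">
                <topic>
                    <title><xsl:value-of select="text()"/></title>
                    <link><xsl:value-of select="@href"/></link>
                </topic>
            </xsl:for-each>
        </topics>
    </xsl:template>
</xsl:stylesheet>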

3. Extractor Code

The pluggable extractor is the core component of the real-time web crawler project and is defined as a class: GsExtractor. The Python 2.7 source code file and its usage instructions can be downloaded from github.

The usage mode is as follows:

Instantiate a GsExtractor object.

Set the xslt extractor for the object, which is equivalent to configuring it (using one of the three setXsltFromXXX() methods).

Input an html dom and obtain the xml output (using the extract() method).

The following is the source code of the GsExtractor class (applicable to Python 2.7):

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Module name: gooseeker_py2
# Class name: GsExtractor
# Version: 2.0
# Adapted for Python version: 2.7
# Description: html content extraction tool
# Function: uses xslt as a template to quickly extract content from an html dom
# Released by GooSeeker (http://www.gooseeker.com) on May 18, 2016
# github: https://github.com/FullerHua/jisou/core/gooseeker_py2.py
from urllib2 import urlopen
from urllib import quote
from lxml import etree
import time

class GsExtractor(object):
    def __init__(self):
        self.xslt = ""

    # Read the xslt from a file
    def setXsltFromFile(self, xsltFilePath):
        file = open(xsltFilePath, 'r')
        try:
            self.xslt = file.read()
        finally:
            file.close()

    # Obtain the xslt from a string in memory
    def setXsltFromMem(self, xsltStr):
        self.xslt = xsltStr

    # Obtain the xslt through the GooSeeker API
    def setXsltFromAPI(self, APIKey, theme, middle=None, bname=None):
        apiurl = "http://www.gooseeker.com/api/getextractor?key=" + APIKey + "&theme=" + quote(theme)
        if (middle):
            apiurl = apiurl + "&middle=" + quote(middle)
        if (bname):
            apiurl = apiurl + "&bname=" + quote(bname)
        apiconn = urlopen(apiurl)
        self.xslt = apiconn.read()

    # Return the current xslt
    def getXslt(self):
        return self.xslt

    # Extraction method: the input parameter is an html dom object, the return value is the extraction result
    def extract(self, html):
        xslt_root = etree.XML(self.xslt)
        transform = etree.XSLT(xslt_root)
        result_tree = transform(html)
        return result_tree
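To see the three steps end to end, here is a minimal sketch (Python 2.7) that configures the object from a string in memory instead of a file or the GooSeeker API; the inline xslt template and the html snippet are made-up illustrations:

# -*- coding: utf-8 -*-
# minimal_demo.py: exercise GsExtractor with an inline xslt template
from lxml import etree
from gooseeker_py2 import GsExtractor

# a trivial xslt that collects the text of every <h1> into a <titles> document
simple_xslt = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
    <titles>
        <xsl:for-each select="//h1">
            <title><xsl:value-of select="text()"/></title>
        </xsl:for-each>
    </titles>
</xsl:template>
</xsl:stylesheet>"""

extractor = GsExtractor()              # step 1: instantiate
extractor.setXsltFromMem(simple_xslt)  # step 2: configure with an xslt
dom = etree.HTML("<html><body><h1>Hello crawler</h1></body></html>")
result = extractor.extract(dom)        # step 3: html dom in, xml out
print(str(result))                     # serialized xml containing <title>Hello crawler</title>

setXsltFromFile() and setXsltFromAPI() work the same way; the three setters differ only in where the xslt comes from.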

4. Usage Example

The following example program demonstrates how to use the GsExtractor class to extract the discussion topics of a Douban group. This example has the following features:

The extractor (the xslt) is obtained through the API of the GooSeeker platform.

The result files are saved to the current folder.

The following is the source code, which can also be downloaded from github:

# -*- coding: utf-8 -*-
# douban_py2.py
# Crawl the discussion topics of a Douban group
# Python version: 2.7
from lxml import etree
from gooseeker_py2 import GsExtractor
from selenium import webdriver
import time

class PhantomSpider:
    def getContent(self, url):
        browser = webdriver.PhantomJS(executable_path=r'C:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
        browser.get(url)
        time.sleep(3)
        html = browser.execute_script("return document.documentElement.outerHTML")
        # close this PhantomJS instance before parsing
        browser.quit()
        output = etree.HTML(html)
        return output

    def saveContent(self, filepath, content):
        file_obj = open(filepath, 'w')
        file_obj.write(content)
        file_obj.close()

doubanExtra = GsExtractor()
# The following line calls the GooSeeker API to set the xslt crawling rule.
# The first parameter is the app key; apply for one at GooSeeker's member center.
# The second parameter is the rule name, generated with GooSeeker's graphical tool MS.
doubanExtra.setXsltFromAPI("ffd5273e213036d812ea298922e2627b", "Douban group discussion")

url = "https://www.douban.com/group/haixiuzu/discussion?start="
totalpages = 5
doubanSpider = PhantomSpider()
print("crawling start")
for pagenumber in range(1, totalpages + 1):
    currenturl = url + str((pagenumber - 1) * 25)
    print("crawling " + currenturl)
    content = doubanSpider.getContent(currenturl)
    outputxml = doubanExtra.extract(content)
    outputfile = "result" + str(pagenumber) + ".xml"
    doubanSpider.saveContent(outputfile, str(outputxml))
print("crawling ends")
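Two practical notes on running this script: the executable_path in getContent() points at a Windows PhantomJS 2.1.1 installation, so adjust that path (and install selenium and PhantomJS) to match your environment; and when the loop finishes, result1.xml through result5.xml are written to the current folder, one per crawled page.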

The extraction result: each crawled page is saved as an xml file in the current folder.

This article has explained the value and usage of the extractor, but not how to generate one. Only when an extractor can be generated quickly can developers' time really be saved.
