Python Crawler in Action (4): Douban Group Topic Data Collection (Dynamic Web Pages)


1, Introduction

Note: The pages accessed in the previous article, "Python Crawler in Action (3): Real-Estate Broker Information Collection", were static web pages. A reader who imitated that walkthrough to collect the dynamically loaded Douban group pages was unsuccessful. This article is a programming walkthrough of data collection from dynamic web pages.

At the start of the GooSeeker open source Python web crawler project, we divided web crawlers into two categories: instant crawlers and harvesting crawlers. To accommodate a variety of application scenarios, GooSeeker's entire web crawler product line contains four products, as shown below:

[Figure: positioning of the four products in the GooSeeker web crawler product line]

This article takes the "standalone Python crawler" as an example: it collects the discussion topics of a Douban group (https://www.douban.com/group/haixiuzu/discussion?start=0) and records the whole collection process, including the installation of Python and the dependent libraries. Even a Python beginner can run it successfully by following this article.

2, Installation of Python and dependent libraries
    • Operating environment: Windows 10

2.1, Install Python 3.5.2
    • Official website Download Link: https://www.python.org/ftp/python/3.5.2/python-3.5.2.exe

    • When the download is complete, double-click the installer to install.

    • This version automatically installs pip and setuptools, which make it easy to install other libraries; a quick verification snippet follows this list.
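To confirm the bundled tools are available, you can run a quick check from the Python interpreter (a minimal sketch; it only prints the installed versions):

import pip
import setuptools

# Print the versions that shipped with the installer
print(pip.__version__, setuptools.__version__)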

2.2, lxml 3.6.0
    • lxml website Address: http://lxml.de/

    • Windows Edition installation package download: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml

    • The installation file for Python 3.5 on Windows is lxml-3.6.0-cp35-cp35m-win32.whl

    • After the download is complete, open a command window under Windows, switch to the directory where you saved the .whl file, and run: pip install lxml-3.6.0-cp35-cp35m-win32.whl
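To verify that lxml is importable after the install, this short check (not part of the article's crawler) prints the library version:

from lxml import etree

# LXML_VERSION is a tuple such as (3, 6, 0, 0)
print(etree.LXML_VERSION)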

2.3, Download the web content extractor program

The web content extractor program is a class published by GooSeeker for the open source Python instant web crawler project. Using this class can greatly reduce the debugging time of data collection rules; for details, see "Python Instant Web Crawler Project: Definition of the Content Extractor". A minimal sketch of the underlying XSLT idea follows the list below.

    • Download address: https://github.com/FullerHua/gooseeker/blob/master/core/gooseeker.py

    • Save gooseeker.py in the project directory
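For background, the extractor works by applying an XSLT rule to the parsed page tree. The sketch below illustrates that idea using plain lxml; the function name is hypothetical and this is not the actual gooseeker.py API:

from lxml import etree

def apply_xslt_rule(html_doc, xslt_text):
    # html_doc: an lxml element tree of the page (e.g. from etree.HTML)
    # xslt_text: the crawl rule, an XSLT stylesheet as a string
    transform = etree.XSLT(etree.XML(xslt_text.encode('utf-8')))
    return transform(html_doc)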

2.4, Install selenium
    • pip install selenium

2.5, PhantomJS download
    • http://phantomjs.org/download.html

    • Unzip the downloaded phantomjs-2.1.1-windows.zip to a folder on this machine

    • Record the full path and file name of phantomjs.exe in the extracted folder; it will replace the content between the two single quotes in the line browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe') in the code below. A short smoke test follows this list.
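Before writing the full crawler, it may help to smoke-test the selenium + PhantomJS setup (a minimal sketch; the executable path below is this article's example path, so replace it with the one you recorded):

from selenium import webdriver

# Launch headless PhantomJS; the path is an assumption -- use your own
browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get('https://www.douban.com/group/haixiuzu/discussion?start=0')
print(browser.title)  # a non-empty title indicates the page rendered
browser.quit()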

3, Source code of the web crawler
# _*_coding:utf8_*_
# douban.py
# Crawl the discussion topics of a Douban group

import time

from lxml import etree
from gooseeker import GsExtractor
from selenium import webdriver

class PhantomSpider:
    def getContent(self, url):
        # Render the page in headless PhantomJS so that JavaScript-loaded
        # content is present in the DOM (replace the path with your own)
        browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
        browser.get(url)
        time.sleep(3)  # wait for the dynamic content to finish loading
        html = browser.execute_script("return document.documentElement.outerHTML")
        output = etree.HTML(html)
        return output

    def saveContent(self, filepath, content):
        file_obj = open(filepath, 'w', encoding='UTF-8')
        file_obj.write(content)
        file_obj.close()

doubanExtra = GsExtractor()
# The following line calls GooSeeker's API to set the XSLT crawl rule.
# The first parameter is the app key; apply for one at the GooSeeker member center.
# The second parameter is the rule name ("Douban group discussion topics"),
# generated with GooSeeker's graphical tool MS.
doubanExtra.setXsltFromAPI("ffd5273e213036d812ea298922e2627b", "豆瓣小组讨论话题")

url = "https://www.douban.com/group/haixiuzu/discussion?start="
totalPages = 5
doubanSpider = PhantomSpider()
print("Crawl start")

for pagenumber in range(1, totalPages):
    # Douban shows 25 topics per page, so the start parameter advances in steps of 25
    currenturl = url + str((pagenumber - 1) * 25)
    print("Crawling", currenturl)
    content = doubanSpider.getContent(currenturl)
    outputxml = doubanExtra.extract(content)
    outputfile = "result" + str(pagenumber) + ".xml"
    doubanSpider.saveContent(outputfile, str(outputxml))

print("Crawl End")

Run the following procedure:

    • Save the above code as douban.py, in the same folder as the extractor class gooseeker.py downloaded in step 2.3

    • Open a Windows CMD window and switch the current directory to the path that holds douban.py (cd xxxxxxx)

    • Run: python douban.py
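If everything is in place, the run looks roughly like this (the project path is hypothetical; the progress lines come from the script's print statements):

C:\> cd C:\myproject
C:\myproject> python douban.py
Crawl start
Crawling https://www.douban.com/group/haixiuzu/discussion?start=0
Crawling https://www.douban.com/group/haixiuzu/discussion?start=25
...
Crawl End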

Please note: to keep the source code tidy and to make the crawler more universal, the crawl rule is injected into the content extractor doubanExtra through the API. This has another benefit: if the structure of the target page changes, you only need to re-edit the crawl rule with MS (GooSeeker's graphical rule tool); the web crawler code itself does not need to be modified. For how to download collection rules into the content extractor, see "Python Instant Web Crawler: API Description - Download Content Extractor".

4, Crawler results

You can see multiple result*.xml files in the project directory; their contents are shown in the figure below:
[Figure: contents of one of the result*.xml files]
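To confirm from Python which result files were produced, a small helper like the following works (it is not part of the original script):

import glob

# List the generated result files in order
for path in sorted(glob.glob('result*.xml')):
    print(path)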

5, Summary

Because the information collection rules are downloaded through the API, the source code of this case is very concise. At the same time, the whole program framework becomes universal, since the part that changes most often, the collection rules, is injected from the outside.

6, GooSeeker source code download
    1. GooSeeker open source Python instant web crawler, GitHub source

7, Document modification history

2016-07-15: V1.0


This article is from the "Fullerhua" blog; reprinting is declined.

