1. Introduction
Note: the previous article, "Python Crawler Combat (3): real-estate broker information collection", accessed static web pages. A reader who imitated that example to collect the dynamically loaded Douban group pages was unsuccessful. This article is therefore a programming walk-through of data collection from dynamic web pages.
When the open-source Python web crawler project was started, we divided web crawlers into two categories: instant crawlers and harvester crawlers. To cover a variety of application scenarios, GooSeeker's web crawler product line contains four products, as shown in the figure below:
[Figure: positioning of the four products in GooSeeker's web crawler product line]
This article is an example of an "independent Python crawler". It collects the discussion topics of a Douban group (https://www.douban.com/group/haixiuzu/discussion?start=0) and records the whole collection process, including the installation of Python and the dependent libraries, so that even a Python beginner can run it successfully by following the article.
2. Installation of Python and associated dependent libraries
2.1 Install Python 3.5.2
Official download link: https://www.python.org/ftp/python/3.5.2/python-3.5.2.exe
When the download is complete, double-click the installer.
This version automatically installs pip and setuptools, which makes it easy to install other libraries later.
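To confirm that pip was indeed installed along with Python, you can open a Windows command window and run:

    pip --version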
2.2 lxml 3.6.0
lxml official site: http://lxml.de/
Windows installation package download: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
The installation file for Python 3.5 on Windows is lxml-3.6.0-cp35-cp35m-win32.whl.
After the download is complete, open a command window under Windows, switch to the directory where you saved the .whl file, and run: pip install lxml-3.6.0-cp35-cp35m-win32.whl
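A quick way to verify the installation from Python (a small sanity check I suggest; lxml.etree exposes its version as a tuple):

    import lxml.etree
    print(lxml.etree.LXML_VERSION)  # expect (3, 6, 0, 0) for this wheel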
2.3 Download the web content extractor program
The web content extractor program is a class published by GooSeeker for the open-source Python instant web crawler project. Using this class can greatly reduce the time spent debugging data collection rules. For details, see "Python Instant Web Crawler Project: Definition of Content Extractor".
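A minimal usage sketch of the extractor class, inferred from the crawler source in section 3 below (the app key and rule name here are placeholders; substitute your own values):

    from lxml import etree
    from gooseeker import GsExtractor

    doc = etree.HTML("<html><body><p>sample</p></body></html>")  # any lxml document
    extractor = GsExtractor()
    extractor.setXsltFromAPI("your-app-key", "your-rule-name")  # downloads the XSLT rule
    result = extractor.extract(doc)  # applies the rule and returns the extraction result
    print(str(result))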
2.4 Install Selenium
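Since pip is already available (section 2.1), Selenium can be installed from a command window. Note that recent Selenium releases have dropped support for the PhantomJS driver used below, so a pre-4.x version may be needed to reproduce this example:

    pip install selenium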
2.5 Download PhantomJS
http://phantomjs.org/download.html
Unzip the downloaded phantomjs-2.1.1-windows.zip to a folder on your machine.
Record the full path and file name of phantomjs.exe inside the extracted folder; it will replace the contents between the two single quotes in the line browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe') in the code below.
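A note on the path string: in an ordinary Python string literal the backslashes must be doubled, or a raw string can be used instead. Either of these two forms works:

    browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
    browser = webdriver.PhantomJS(executable_path=r'C:\phantomjs-2.1.1-windows\bin\phantomjs.exe')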
3. Web crawler source code
# _*_coding:utf8_*_
# douban.py
# Crawl Douban group discussion topics

import time
from lxml import etree
from gooseeker import GsExtractor
from selenium import webdriver

class PhantomSpider:
    def getContent(self, url):
        # PhantomJS renders the page so that dynamically loaded content is present
        browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
        browser.get(url)
        time.sleep(3)  # give the page time to finish loading
        html = browser.execute_script("return document.documentElement.outerHTML")
        output = etree.HTML(html)
        return output

    def saveContent(self, filepath, content):
        file_obj = open(filepath, 'w', encoding='UTF-8')
        file_obj.write(content)
        file_obj.close()

doubanExtra = GsExtractor()
# The next line calls GooSeeker's API to set the XSLT crawl rule.
# The first parameter is the app key; apply for one at the GooSeeker member center.
# The second parameter is the rule name, generated with GooSeeker's graphical tool MS;
# it must match the rule name defined there.
doubanExtra.setXsltFromAPI("ffd5273e213036d812ea298922e2627b", "Watercress Group Discussion topic")

url = "https://www.douban.com/group/haixiuzu/discussion?start="
totalpages = 5
doubanSpider = PhantomSpider()
print("Crawl start")

# Note: range(1, totalpages) visits pages 1 through totalpages-1 (4 pages here)
for pagenumber in range(1, totalpages):
    currenturl = url + str((pagenumber - 1) * 25)  # Douban lists 25 topics per page
    print("Crawling", currenturl)
    content = doubanSpider.getContent(currenturl)
    outputxml = doubanExtra.extract(content)
    outputfile = "result" + str(pagenumber) + ".xml"
    doubanSpider.saveContent(outputfile, str(outputxml))

print("Crawl End")
To run it:
Save the code above as douban.py, in the same folder as the extractor class gooseeker.py downloaded in step 2.3.
Open a Windows CMD window and switch the current directory to the folder containing douban.py (cd xxxxxxx).
Run python douban.py.
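If everything is set up correctly, the console output should look roughly like this (the URLs follow from the loop in the source above):

    Crawl start
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=0
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=25
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=50
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=75
    Crawl End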
Please note: to keep the source code tidy and to make the crawler more general-purpose, the crawl rule is injected into the content extractor doubanExtra through the API. This has another benefit: if the structure of the target page changes, you only need to re-edit the crawl rule in MS; the web crawler code itself needs no modification. For how to download collection rules into the content extractor, see "Python Instant Web Crawler: API Description - Download Content Extractor".
4. Crawler results
You can see multiple result*.xml files (result1.xml, result2.xml, ...) in the project directory; a sample of their contents is shown below:
[Figure: sample contents of a result XML file]
5. Summary
Because the information collection rules are downloaded through the API, the source code of this case is very concise. At the same time, the whole program framework becomes general-purpose, since the part that changes most often, the collection rules, is injected from outside.
6. GooSeeker source code download
GooSeeker open-source Python instant web crawler: GitHub source
7. Document revision history
2016-07-15: v1.0
This article is from the "Fullerhua" blog; reprinting is declined.
Python Crawler Combat (4): Douban Group Topic Data Collection - Dynamic Web Pages