1. Introduction
Note: the previous article, "Python Crawler Combat (3): real-estate broker information collection", accessed static web pages. A reader who imitated that example to collect the dynamically loaded Douban group pages was unsuccessful. This article is therefore a programming walk-through of data collection from dynamic web pages.
When the open-source Python web crawler project was started, we divided web crawlers into two categories: instant crawlers and harvester crawlers. To cover a variety of application scenarios, GooSeeker's web crawler product line contains four products, as shown in the figure below:
[Figure: positioning of the four products in GooSeeker's web crawler product line]
This article is an example of an "independent Python crawler". It collects the discussion topics of a Douban group (https://www.douban.com/group/haixiuzu/discussion?start=0) and records the whole collection process, including the installation of Python and the dependent libraries, so that even a Python beginner can run it successfully by following the article.
2. Installation of Python and associated dependent libraries
2.1 Install Python 3.5.2
Official download link: https://www.python.org/ftp/python/3.5.2/python-3.5.2.exe
When the download is complete, double-click the installer.
This version automatically installs pip and setuptools, which makes it easy to install other libraries later.
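To confirm that pip was indeed installed along with Python, you can open a Windows command window and run:

    pip --version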
2.2 lxml 3.6.0
lxml official site: http://lxml.de/
Windows installation package download: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
The installation file for Python 3.5 on Windows is lxml-3.6.0-cp35-cp35m-win32.whl.
After the download is complete, open a command window under Windows, switch to the directory where you saved the .whl file, and run: pip install lxml-3.6.0-cp35-cp35m-win32.whl
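A quick way to verify the installation from Python (a small sanity check I suggest; lxml.etree exposes its version as a tuple):

    import lxml.etree
    print(lxml.etree.LXML_VERSION)  # expect (3, 6, 0, 0) for this wheel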
2.3 Download the web content extractor program
The web content extractor program is a class published by GooSeeker for the open-source Python instant web crawler project. Using this class can greatly reduce the time spent debugging data collection rules. For details, see "Python Instant Web Crawler Project: Definition of Content Extractor".
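A minimal usage sketch of the extractor class, inferred from the crawler source in section 3 below (the app key and rule name here are placeholders; substitute your own values):

    from lxml import etree
    from gooseeker import GsExtractor

    doc = etree.HTML("<html><body><p>sample</p></body></html>")  # any lxml document
    extractor = GsExtractor()
    extractor.setXsltFromAPI("your-app-key", "your-rule-name")  # downloads the XSLT rule
    result = extractor.extract(doc)  # applies the rule and returns the extraction result
    print(str(result))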
2.4 Install Selenium
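Since pip is already available (section 2.1), Selenium can be installed from a command window. Note that recent Selenium releases have dropped support for the PhantomJS driver used below, so a pre-4.x version may be needed to reproduce this example:

    pip install selenium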
2.5 Download PhantomJS
http://phantomjs.org/download.html
Unzip the downloaded phantomjs-2.1.1-windows.zip to a folder on your machine.
Record the full path and file name of phantomjs.exe inside the extracted folder; it will replace the contents between the two single quotes in the line browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe') in the code below.
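A note on the path string: in an ordinary Python string literal the backslashes must be doubled, or a raw string can be used instead. Either of these two forms works:

    browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
    browser = webdriver.PhantomJS(executable_path=r'C:\phantomjs-2.1.1-windows\bin\phantomjs.exe')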
3. Web crawler source code
# _*_coding:utf8_*_
# douban.py
# Crawl Douban group discussion topics

import time
from lxml import etree
from gooseeker import GsExtractor
from selenium import webdriver

class PhantomSpider:
    def getContent(self, url):
        # PhantomJS renders the page so that dynamically loaded content is present
        browser = webdriver.PhantomJS(executable_path='C:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
        browser.get(url)
        time.sleep(3)  # give the page time to finish loading
        html = browser.execute_script("return document.documentElement.outerHTML")
        output = etree.HTML(html)
        return output

    def saveContent(self, filepath, content):
        file_obj = open(filepath, 'w', encoding='UTF-8')
        file_obj.write(content)
        file_obj.close()

doubanExtra = GsExtractor()
# The next line calls GooSeeker's API to set the XSLT crawl rule.
# The first parameter is the app key; apply for one at the GooSeeker member center.
# The second parameter is the rule name, generated with GooSeeker's graphical tool MS;
# it must match the rule name defined there.
doubanExtra.setXsltFromAPI("ffd5273e213036d812ea298922e2627b", "Watercress Group Discussion topic")

url = "https://www.douban.com/group/haixiuzu/discussion?start="
totalpages = 5
doubanSpider = PhantomSpider()
print("Crawl start")

# Note: range(1, totalpages) visits pages 1 through totalpages-1 (4 pages here)
for pagenumber in range(1, totalpages):
    currenturl = url + str((pagenumber - 1) * 25)  # Douban lists 25 topics per page
    print("Crawling", currenturl)
    content = doubanSpider.getContent(currenturl)
    outputxml = doubanExtra.extract(content)
    outputfile = "result" + str(pagenumber) + ".xml"
    doubanSpider.saveContent(outputfile, str(outputxml))

print("Crawl End")
To run it:
Save the code above as douban.py, in the same folder as the extractor class gooseeker.py downloaded in step 2.3.
Open a Windows CMD window and switch the current directory to the folder containing douban.py (cd xxxxxxx).
Run python douban.py.
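If everything is set up correctly, the console output should look roughly like this (the URLs follow from the loop in the source above):

    Crawl start
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=0
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=25
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=50
    Crawling https://www.douban.com/group/haixiuzu/discussion?start=75
    Crawl End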
Please note: to keep the source code tidy and to make the crawler more general-purpose, the crawl rule is injected into the content extractor doubanExtra through the API. This has another benefit: if the structure of the target page changes, you only need to re-edit the crawl rule in MS; the web crawler code itself needs no modification. For how to download collection rules into the content extractor, see "Python Instant Web Crawler: API Description - Download Content Extractor".
4. Crawler results
You can see multiple result*.xml files (result1.xml, result2.xml, ...) in the project directory; a sample of their contents is shown below:
[Figure: sample contents of a result XML file]
5. Summary
Because the information collection rules are downloaded through the API, the source code of this case is very concise. At the same time, the whole program framework becomes general-purpose, since the part that changes most often, the collection rules, is injected from outside.
6. GooSeeker source code download
GooSeeker open-source Python instant web crawler: GitHub source
7. Document revision history
2016-07-15: v1.0
This article is from the "Fullerhua" blog; reprinting is declined.
Python Crawler Combat (4): Douban Group Topic Data Collection - Dynamic Web Pages