The Art of Data Grabbing (i): Selenium + PhantomJS Data Crawl Environment Configuration 2013-05-15 15:08:14
Category: Python/Ruby
Data fetching is an art. Unlike other kinds of software, there is no perfect, consistent, universal crawler in the world; different purposes call for different custom code. However, we do not have to start from scratch: there are many basic tools, methods, and frameworks to draw on, each with its own characteristics. The first priority is to understand these tools, methods, and frameworks; then to understand how they differ and which context each one suits; and finally to distill rules of thumb, write the code, and run the program to fetch the data. So the learning path for data crawling is not only long but also very miscellaneous.
My specific purpose is to crawl Google's search result counts, which differs from the usual case: other people take a specific keyword and crawl the results page by page; I have many keywords, search them one at a time, and only need the number of results returned. There are 153 keywords in total, but each keyword has to be paired ("handshake" style) with every other keyword to form a test phrase. So imagine a large table of 153 rows and 153 columns with every cell waiting to be filled: that is 153*153 = 23409 cells, and because the table is symmetric only about half of them need a search, roughly 23409/2 ≈ 11704 searches. In my tests, crawling the results page for one keyword pair and storing the count in Excel takes 4 seconds. This means a single-threaded run takes 11704*4/3600 ≈ 13 hours.
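The estimate above can be checked with a few lines of Python. This is only a sketch: the keyword names are hypothetical placeholders (the real 153 keywords are not listed here), and the helper names `count_keyword_pairs` and `estimated_hours` are made up for illustration. It also shows that the exact number of unordered pairs of distinct keywords is slightly smaller than the rough "half the table" figure, which does not change the conclusion of about 13 hours.

```python
from itertools import combinations


def count_keyword_pairs(n_keywords):
    """Number of unordered pairs of distinct keywords: n * (n - 1) / 2."""
    return n_keywords * (n_keywords - 1) // 2


def estimated_hours(n_queries, seconds_per_query=4.0):
    """Single-threaded running time in hours for n_queries searches."""
    return n_queries * seconds_per_query / 3600.0


# Hypothetical stand-ins for the real 153 keywords.
keywords = ["kw%03d" % i for i in range(153)]

# Every "handshake": each keyword paired once with every other keyword.
pairs = list(combinations(keywords, 2))

exact_pairs = count_keyword_pairs(len(keywords))
rough_pairs = 153 * 153 // 2  # the rough estimate: half of the full table

print("exact pairs: %d" % exact_pairs)
print("rough pairs: %d" % rough_pairs)
print("hours (rough estimate): %.1f" % estimated_hours(rough_pairs))
print("hours (exact pairs):    %.1f" % estimated_hours(exact_pairs))
```

The list `pairs` is also the natural work queue for the crawl itself: one search per tuple.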
I'll go over the details in the next blog post; for now, I'll start with the technical framework I'm using and the installation and configuration process.
First, the technical framework
[Python 2.7 + pip + Selenium + PhantomJS]
Selenium + PhantomJS: originally these two were not one family, but they later found that they got along well and took a liking to each other, so they became sworn brothers and moved into the Selenium house together. (This account is debatable.)
Here is a brief introduction to each:
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE, Mozilla Firefox, Chrome, and more.
PhantomJS is a headless WebKit with a server-side JavaScript API. It supports a variety of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.
Second, setting up the environment
(1) Installing Python is not covered here; the version I use is 2.7.4 (on WinXP and Win7 32-bit platforms).
(2) Since I find pip easier to use than easy_install, I used my existing easy_install to install pip.
- easy_install pip
(3) Install PhantomJS.
Go to the official PhantomJS website, http://phantomjs.org/download.html, and download "phantomjs-1.9.0-windows.zip (7.1 MB)". Then open the archive and put phantomjs.exe somewhere on the system path so that it can be found. Because I had already added the "C:\Python27\Scripts" directory to the path, I extracted the file directly into that directory.
At this point, the environment is fully configured on Windows.
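Before moving on, a quick sanity check of the setup can save some head-scratching. The following is a minimal sketch that uses only the standard library; the helper names `find_on_path` and `module_available` are made up for this illustration. It simply looks for a phantomjs executable in the directories listed in PATH and tries to import the selenium package.

```python
import os
import sys


def find_on_path(exe_name, path=None):
    """Return the full path of exe_name if it sits in a PATH directory, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, also try the name with ".exe" appended (phantomjs -> phantomjs.exe).
    candidates = [exe_name]
    if os.name == "nt" and not exe_name.lower().endswith(".exe"):
        candidates.append(exe_name + ".exe")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for name in candidates:
            full = os.path.join(directory.strip('"'), name)
            if os.path.isfile(full):
                return full
    return None


def module_available(module_name):
    """True if module_name can be imported by the current interpreter."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


print("Python   : %s" % sys.version.split()[0])
print("selenium : %s" % ("ok" if module_available("selenium") else "missing"))
print("phantomjs: %s" % (find_on_path("phantomjs") or "not found on PATH"))
```

If the selenium line reports "missing", the package still has to be installed into the same Python that runs the script; if phantomjs is not found, the directory holding phantomjs.exe is not on the path.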
Third, testing
Create a new file anywhere and add the following code:
- from selenium import webdriver
- driver = webdriver.PhantomJS()
- driver.get('http://www.baidu.com')
- data = driver.find_element_by_id('cp').text
- print(data)
- driver.quit()
Run it and check whether the text of the page element with id 'cp' is printed.
In fact, I have never liked the black DOS console window popping up; I find it visually distracting and it may cost extra time. But after reading the official documentation, I found that there is no direct way to hide the console window. So it has to stay this way.
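Since the PhantomJS process is launched separately for each driver, it is worth making sure it is shut down even when a page fails to load or an element is missing; otherwise a stray phantomjs.exe may be left running. Below is a minimal sketch of that idea, assuming the same Selenium 2.x-style driver methods used in the test above (`get`, `find_element_by_id`, `quit`). The function name `fetch_element_text` and the `StubDriver` class are hypothetical; the stub exists only so the sketch runs without a browser.

```python
def fetch_element_text(driver, url, element_id):
    """Load url, return the text of the element with the given id, then quit the driver."""
    try:
        driver.get(url)
        return driver.find_element_by_id(element_id).text
    finally:
        # Always shut the browser down, even if loading or the lookup fails.
        driver.quit()


class StubElement(object):
    """Hypothetical stand-in for a page element."""
    text = "stub text"


class StubDriver(object):
    """Hypothetical stand-in for a real driver, for trying the function offline."""

    def __init__(self):
        self.visited = None
        self.closed = False

    def get(self, url):
        self.visited = url

    def find_element_by_id(self, element_id):
        if element_id != "cp":
            raise LookupError("no element with id %r" % element_id)
        return StubElement()

    def quit(self):
        self.closed = True


stub = StubDriver()
print(fetch_element_text(stub, "http://www.baidu.com", "cp"))
print("driver closed: %s" % stub.closed)

# With the setup from this post (needs selenium and phantomjs installed):
#   from selenium import webdriver
#   print(fetch_element_text(webdriver.PhantomJS(), 'http://www.baidu.com', 'cp'))
```

Passing the driver in as an argument keeps the cleanup logic independent of which browser backend is used.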
OK, it's time to "enjoy yourself"...
See also the next post: "The Art of Data Capture (ii): Data Capture program optimization and capturing Google's experience"