I have recently been preparing to dig into Python crawler topics. For relatively conventional web pages, "urllib2 + BeautifulSoup + regular expressions" will do the job; for pages whose content is generated dynamically (Ajax, JavaScript, and so on), you need "PhantomJS + CasperJS + Selenium". So let's start with installation and a feature introduction; later articles will cover some Python crawler applications.
I. Introduction
PhantomJS
PhantomJS is a scriptable, server-side JavaScript API on top of WebKit (an open-source browser engine). It supports a variety of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. PhantomJS can be used for page automation, network monitoring, web page screenshots, and headless testing.
Selenium
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9), Mozilla Firefox, Mozilla Suite, and more. Its main uses include testing browser compatibility and testing system functionality; Selenium was written by ThoughtWorks specifically as an acceptance-testing tool for web applications.
pip
Before introducing them, we need to install pip. As the blogger Xifeijian put it: "As a Python enthusiast, if you know neither easy_install nor pip, then...".
easy_install plays a role similar to CPAN in Perl or gem in Ruby: it provides a convenient way to install modules online. pip is an improved version of easy_install, with better prompts and the ability to uninstall packages. Old versions of Python ship only easy_install, not pip. Common usage is as follows:
easy_install usage:
1) Install a package
   $ easy_install <package_name>
   $ easy_install "<package_name>==<version>"
2) Upgrade a package
   $ easy_install -U "<package_name>>=<version>"
pip usage:
1) Install a package
   $ pip install <package_name>
   $ pip install <package_name>==<version>
2) Upgrade a package (if no version number is given, upgrade to the latest version)
   $ pip install --upgrade <package_name>>=<version>
3) Remove a package
   $ pip uninstall <package_name>
II. Installing pip
Step 1: Download pip
You can download it from the official site http://pypi.python.org/pypi/pip#downloads, then cd into the extracted pip directory and run python setup.py install. I instead downloaded pip-Win_1.7.exe and installed from it, available at:
https://sites.google.com/site/pydatalog/python/pip-for-windows
Step 2: Install pip
When the prompt "pip and virtualenv installed" appears, the installation succeeded. But then, how do we verify that pip was installed successfully?
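A quick way to check is simply to ask pip for its version from the command line (assuming the installer put pip on your PATH; if it did not, see the next step):

```shell
pip --version
pip list
```

If both commands print output instead of an error, pip is working.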
Step 3: Configure environment variables
Entering the pip command in cmd prompts the error "'pip' is not recognized as an internal or external command".
So you need to add it to the PATH environment variable. After pip is installed, a Scripts directory is added under the Python installation directory; add that Scripts directory to the environment variable. The process is as follows:
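For example, a cmd session can pick up the Scripts directory like this (a sketch only: C:\Python27 is an assumed default install path, so substitute your own Python directory):

```shell
:: Append the pip Scripts directory to PATH for the current cmd session only
:: (C:\Python27 is an assumption -- use your actual Python install path)
set PATH=%PATH%;C:\Python27\Scripts
:: pip should now be found
pip --version
```

To make the change permanent, add the directory through Control Panel > System > Advanced > Environment Variables instead.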
Step 4: Use the pip command
Now use the pip command in cmd; for example, "pip list --outdated" enumerates the installed Python libraries for which a newer version is available.
Commonly used pip commands are as follows (see pip's own help for detailed instructions):
Usage: pip <command> [options]

Commands:
  install      Install packages.
  uninstall    Uninstall packages.
  freeze       Output installed packages in requirements format.
  list         List installed packages.
  show         Show information about installed packages.
  search       Search PyPI for packages (like search in yum).
  wheel        Build wheels from your requirements.
  zip          DEPRECATED. Zip individual packages.
  unzip        DEPRECATED. Unzip individual packages.
  bundle       DEPRECATED. Create pybundles.
  help         Show help for commands.

General Options:
  -h, --help                Show help.
  -v, --verbose             Give more output; can be used up to 3 times.
  -V, --version             Show version information and exit.
  -q, --quiet               Give less output.
  --log-file <path>         Log verbose errors to this file, overwriting it (default: /root/.pip/pip.log).
  --log <path>              Log verbose output to this file without overwriting it.
  --proxy <proxy>           Specify a proxy in the form [user:passwd@]proxy.server:port.
  --timeout <sec>           Set the socket timeout (default 15 seconds).
  --exists-action <action>  Default action when a path already exists: (s)witch, (i)gnore, (w)ipe, (b)ackup.
  --cert <path>             Path to alternate CA bundle.
III. Installing PhantomJS + Selenium
Install Selenium with the pip command:
Then download PhantomJS from the official site http://phantomjs.org/download and extract it, as shown:
Calling it may then raise the error "Unable to start phantomjs with ghostdriver".
At this point you can set the path to phantomjs.exe explicitly; alternatively, if you have already added the Scripts directory to the environment variable, you can extract PhantomJS into that folder. Reference: Selenium with GhostDriver in Python on Windows - Stack Overflow.
IV. Test Code
The code, after setting the executable_path, is as follows:

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path="F:\Python\phantomjs-1.9.1-windows\phantomjs.exe")
driver.get("http://www.baidu.com")
data = driver.title
print data
The results of the operation are as follows:
It retrieves "Baidu a bit, you know" (Baidu's homepage title); the corresponding HTML source is:
<title> Baidu a bit, you know </title>
But a black PhantomJS console window always pops up; what can we do about that? And how can we call PhantomJS directly from Python to run JavaScript?
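One way to sidestep Selenium entirely is to launch phantomjs.exe yourself with Python's subprocess module and have it execute a JavaScript file. This is only a sketch under assumptions: the executable path and the hello.js script name are hypothetical, and build_phantomjs_cmd / run_phantomjs_script are helper names made up for illustration.

```python
# Sketch: driving PhantomJS directly from Python via subprocess.
# The executable path and script name below are assumptions; adjust to your setup.
import subprocess

def build_phantomjs_cmd(phantomjs_path, js_file, *args):
    """Assemble the command line: phantomjs <script.js> [script args...]."""
    return [phantomjs_path, js_file] + list(args)

def run_phantomjs_script(phantomjs_path, js_file, *args):
    """Run a PhantomJS script and return its stdout as text."""
    cmd = build_phantomjs_cmd(phantomjs_path, js_file, *args)
    return subprocess.check_output(cmd).decode("utf-8")

if __name__ == "__main__":
    # Hypothetical paths -- change them to match your installation.
    print(build_phantomjs_cmd(
        r"F:\Python\phantomjs-1.9.1-windows\phantomjs.exe", "hello.js"))
```

Inside hello.js you would use PhantomJS's own APIs (page.open, page.render, and so on), then console.log the result so that Python can read it from stdout.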
PS: I am going to use C # to invoke PhantomJS.exe to complete the page function, but not successful, and use the WebBrowser DrawToBitmap function to get the picture, because the ActiveX control does not support the DrawToBitmap method, gets always blank, various problems.
Reference:
The Art of Data Capture (I): Selenium + PhantomJS data-capture environment configuration (highly recommended)
The Art of Data Capture (II): Data crawler optimization
Python: using JS extensions for Selenium/PhantomJS (highly recommended)
Python Selenium
Crawling dynamic pages with Python + PhantomJS
Full-page screenshots with PhantomJS
pyspider crawler tutorial (III): using PhantomJS to render pages with JS
"PHP" ".NET" "JS" "AJAX": on crawling web source code
Writing the ultimate crawler with Python/CasperJS: client app crawling
How Python crawlers get the URLs and web content generated by JS
Getting web pages via WebBrowser (C#)
The Control.DrawToBitmap method does not support AJAX (official site)
IE browser full-screen screenshot program (II) (C#)
C# network programming: the simplest browser implementation (my own)
Finally, I hope this basic article helps you! If there are shortcomings, please bear with me ~
(By:eastmount 2015-8-19 night 8 o'clock http://blog.csdn.net/eastmount/)
Copyright notice: this is an original article by the blogger; reproduction without the blogger's permission is prohibited.
[Python crawler] Installing pip + PhantomJS + Selenium under Windows