Python+pyspider+phantomjs for simple crawler functions

Source: Internet
Author: User

This article has two purposes:
1. Document the process of building a crawler environment
2. Summarize the experience gained from the crawler project

First, the system environment
The setup was tested on 32-bit Ubuntu 10.04 and 64-bit CentOS 6.9. The required software is as follows:
1. Ubuntu 10.04 or CentOS 6.9; the steps below mainly use CentOS 6.9.
2. The pyspider source code, which can be downloaded from http://download.csdn.net/detail/king_bingge/8582249 or from the official repository at https://github.com/binux/pyspider
3. PhantomJS. The crawler project needs Flash, and the latest PhantomJS release has removed Flash plugin support (to make PhantomJS more efficient), so I chose a PhantomJS build that keeps Flash plugin support. It can be downloaded from http://download.csdn.net/detail/king_bingge/8582351
4. Python. Since we use pyspider, the system must have Python. My system's Python version is 2.7.7, which can be downloaded from http://download.csdn.net/detail/king_bingge/8582267
5. A Flash plugin, so that PhantomJS can render Flash. It can be downloaded from http://download.csdn.net/detail/king_bingge/8582273, although the official website may be a better choice.
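
Before going further, it can help to see which of these tools are already present on the machine. The following is a minimal Python sketch and not part of the original setup; the helper name find_on_path and the tool names it looks for are assumptions made for illustration.

```python
import os
import sys


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None if absent."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    print("python version: %d.%d.%d" % sys.version_info[:3])
    for tool in ("phantomjs", "pip"):  # illustrative tool names
        print("%s -> %s" % (tool, find_on_path(tool) or "not found"))
```

Anything reported as "not found" still needs to be installed using the steps that follow.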

The main software we need is now in place. The environment setup follows.

Second, installing PhantomJS

1. Before installing PhantomJS, install its dependencies:

yum install -y xorg-x11-server-Xvfb xorg-x11-server-Xorg xorg-x11-fonts* dbus-x11 xulrunner.x86_64 nspr.x86_64 nss.x86_64
yum install -y flash-plugin nspluginwrapper

2. Then run rpm -ivh phantomjs-1.9-1.x86_64.rpm. After it completes, check the installed version:

[root@localhost pyspider]# phantomjs -v
1.10.0 (development)
[root@localhost pyspider]#

The version information is printed.
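
If the version check needs to be scripted, for example as part of a setup script, the output can be parsed from Python. This is a minimal sketch under the assumption that the binary is called phantomjs and is on the PATH; the function names are hypothetical.

```python
import re
import subprocess


def parse_version(text):
    """Extract (major, minor, patch) from output such as '1.10.0 (development)'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def phantomjs_version(binary="phantomjs"):
    """Run `<binary> -v` and return the parsed version tuple."""
    output = subprocess.check_output([binary, "-v"])
    return parse_version(output.decode("utf-8", "replace"))


if __name__ == "__main__":
    # Parses the sample output shown above without running phantomjs.
    print(parse_version("1.10.0 (development)"))  # prints (1, 10, 0)
```

Calling phantomjs_version() runs the real binary and raises an error if it is missing, which makes a failed installation visible early.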

3. Then run the following command to view the command-line arguments:

[root@localhost pyspider]# phantomjs -h
Usage: phantomjs [switchs] [options] [script] [argument [argument [...]]]
Options:
  --cookies-file=<val>       Sets the file name to store the persistent cookies
  --config=<val>             Specifies JSON-formatted configuration file
  --debug=<val>              Prints additional warning and debug message: 'true' or 'false' (default)
  --disk-cache=<val>         Enables disk cache: 'true' or 'false' (default)
  --ignore-ssl-errors=<val>  Ignores SSL errors (expired/self-signed certificate errors): 'true' or 'false' (default)
  --load-images=<val>        Loads all inlined images: 'true' (default) or 'false'
  --load-plugins=<val>       Loads all plugins: 'true' (default) or 'false'

Note the --load-plugins= option: this is the one we need to pass so that PhantomJS loads the Flash plugin.
4. Don't forget that the Flash plugin has not been installed yet. Extract the archive and copy the files into place:
tar -xf install_flash_player_11_linux.x86_64.tar.gz
cp libflashplayer.so /usr/lib/mozilla/plugins
cp -r ./usr/* /usr
After this the installation is complete and Adobe Flash Player can be used.
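
A quick way to confirm that the plugin file ended up in the plugin directory is sketched below in Python. The function name is hypothetical; the default directory and file name are the ones used in the copy command of this step.

```python
import os


def flash_plugin_installed(plugin_dir="/usr/lib/mozilla/plugins",
                           filename="libflashplayer.so"):
    """Return True if the Flash plugin file exists in the plugin directory."""
    return os.path.isfile(os.path.join(plugin_dir, filename))


if __name__ == "__main__":
    print("Flash plugin present: %s" % flash_plugin_installed())
```

This only checks that the file exists; whether PhantomJS actually loads it is verified by the rendering test.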

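5. How do we test whether the installed PhantomJS works? A test script is given below:

var url = 'PAGE_URL';  // placeholder: the original omits the URL; use a page with Flash content
var page = require('webpage').create();
page.open(url, function () {
    // Wait 10 seconds so the Flash content has time to load, then take a screenshot
    window.setTimeout(function () {
        page.render('video.png');
        phantom.exit();
    }, 10000);
});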

Save the above as test.js and then run: ./phantomjs --load-plugins=yes test.js
Check whether video.png now exists in the current directory. If it does, it is a screenshot of the Flash content playing.
At this point, the installation of PhantomJS is complete.
For more detailed information: http://www.ryanbridges.org/2013/05/21/putting-the-flash-back-in-phantomjs/
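
To check the result of the rendering test from a script rather than by eye, one option is to verify that video.png exists and starts with the PNG file signature. This is a minimal Python sketch; the function name is hypothetical, and it says nothing about whether the Flash content itself was drawn.

```python
import os

# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(path):
    """Return True if `path` exists and begins with the PNG signature."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        return handle.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


if __name__ == "__main__":
    print("video.png written: %s" % looks_like_png("video.png"))
```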

Third, installing pyspider
1. The packages pyspider needs are installed with pip, so first make sure pip is installed on the system. If it is not, download the pip source code and install it.
The source can be downloaded from https://pypi.python.org/packages/source/p/pip or from http://download.csdn.net/detail/king_bingge/8582399
To install:

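# Extract
tar -zxvf pip-6.1.1.tar.gz
cd pip-6.1.1
# Install
python setup.py install

At this point the installation fails with an error saying that setuptools is missing. Download the setuptools source from the setuptools page at https://pypi.python.org/pypi/setuptools and install it:

# Extract
tar -zxvf setuptools-3.6.tar.gz
cd setuptools-3.6
# Install
python setup.py install

Then run the pip installation again.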
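
To confirm that both packages can now be imported by the Python interpreter, a check along the following lines can be used. It is a minimal sketch; the function name missing_modules is hypothetical.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both packages are importable.
    print(missing_modules(["setuptools", "pip"]))
```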

2. After pip is installed, run the following command:

[root@localhost pyspider-master]# pip install -r requirements.txt

3. After that command finishes, install pyspider.

4. Run ./run.py and visit http://192.168.1.1:5000/. If the pyspider web interface appears, the setup is complete.
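
The same check can be made without a browser. The sketch below, written to work on both Python 2 and Python 3, sends a single HTTP request and reports whether the server answered with status 200. The function name is hypothetical, and the URL is the one from this step; adjust it to the address of your own machine.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def webui_is_up(url, timeout=5):
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        response = urlopen(url, timeout=timeout)
        try:
            return response.getcode() == 200
        finally:
            response.close()
    except (IOError, ValueError):
        # IOError covers refused connections, timeouts and HTTP error statuses;
        # ValueError covers malformed URLs.
        return False


if __name__ == "__main__":
    print("web interface reachable: %s" % webui_is_up("http://192.168.1.1:5000/"))
```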

So far, the environment has been built.

A follow-up article will walk through an example that uses this environment to implement a crawler project.
