Python + pyspider + phantomjs: implementing a simple crawler

Source: Internet
Author: User

This article has two purposes:
1. Record the process of building a crawler environment.
2. Summarize the experience gained from the crawler project.

I. System Environment
This setup was tested on 32-bit Ubuntu 10.04 and 64-bit CentOS 6.9. The required software is as follows:
1. Ubuntu 10.04 or CentOS 6.9. The steps below use CentOS 6.9.
2. The pyspider source code.
3. phantomjs. This crawler project relies on Flash, but the latest phantomjs releases have dropped support for Flash plug-ins (this was done to improve phantomjs's efficiency), so I chose a phantomjs build that still supports the Flash plug-in. It can be downloaded from http://download.csdn.net/detail/king_bingge/8582351
4. Python. Since we are using pyspider, the system must have Python available; mine is version 2.7.7, which can be downloaded from http://download.csdn.net/detail/king_bingge/8582267
5. A Flash plug-in, which phantomjs needs in order to handle Flash. It can be downloaded from http://download.csdn.net/detail/king_bingge/8582273, although the official website may be a better choice.

By now, all the major software we need is ready. We will start to build the environment.
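
Before going further, it can help to confirm which of the tools listed above are already on the PATH. The following is a minimal sketch, not part of the original setup: the helper names `find_executable` and `report_environment` are hypothetical, and the code is written to run on both Python 2.7 and Python 3.

```python
import os


def find_executable(name, path=None):
    """Return the full path of `name` if it is found on PATH, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # A usable tool must be a regular file with the execute bit set.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def report_environment(tools=("python", "phantomjs", "pip")):
    """Map each tool name to its location on PATH (None when it is missing)."""
    return dict((tool, find_executable(tool)) for tool in tools)
```

Calling `report_environment()` returns a dictionary such as `{"python": "/usr/bin/python", ...}`; any value of `None` points at a tool that still has to be installed in the sections below.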

II. Install phantomjs

1. Install the dependencies before installing phantomjs.

yum install -y xorg-x11-server-Xvfb xorg-x11-server-Xorg xorg-x11-fonts* dbus-x11 xulrunner.x86_64 nspr.x86_64 nss.x86_64
yum install -y flash-plugin nspluginwrapper

2. Then run rpm -ivh phantomjs-1.9-1.x86_64.rpm. Once it completes, check the installed version:

[root@localhost pyspider]# phantomjs -v
1.10.0 (development)
[root@localhost pyspider]#
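
The same version check can be done from Python, which is handy if the crawler should refuse to start against an unexpected phantomjs build. This is only a sketch: `parse_version` and `phantomjs_version` are hypothetical helper names, and the only phantomjs feature relied on is the `-v` switch shown above.

```python
import re
import subprocess


def parse_version(text):
    """Extract (major, minor, patch) from output such as '1.10.0 (development)'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def phantomjs_version(binary="phantomjs"):
    """Run `<binary> -v` and return the parsed version, or None on failure."""
    try:
        output = subprocess.check_output([binary, "-v"])
    except (OSError, subprocess.CalledProcessError):
        # The binary is missing, not executable, or exited with an error.
        return None
    return parse_version(output.decode("utf-8", "replace"))
```

With the installation above, `phantomjs_version()` would be expected to return `(1, 10, 0)`; a return value of `None` means the binary could not be run.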

3. Run the following command to view the command line parameters:

[root@localhost pyspider]# phantomjs -h
Usage:
   phantomjs [switchs] [options] [script] [argument [argument [...]]]

Options:
  --cookies-file=<val>                 Sets the file name to store the persistent cookies
  --config=<val>                       Specifies JSON-formatted configuration file
  --debug=<val>                        Prints additional warning and debug message: 'true' or 'false' (default)
  --disk-cache=<val>                   Enables disk cache: 'true' or 'false' (default)
  --ignore-ssl-errors=<val>            Ignores SSL errors (expired/self-signed certificate errors): 'true' or 'false' (default)
  --load-images=<val>                  Loads all inlined images: 'true' (default) or 'false'
  --load-plugins=<val>                 Loads all plugins: 'true' (default) or 'false'

Note the --load-plugins=<val> option: this is the one we need in order to use the Flash plug-in.
4. Do not forget that the Flash plug-in itself has not been installed yet.
Unpack the archive: tar -xf install_flash_player_11_linux.x86_64.tar.gz
cp libflashplayer.so /usr/lib/mozilla/plugins
cp -r ./usr/* /usr
The plug-in is now installed and Adobe Flash Player can be used.
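
A quick way to confirm that the copy step worked is to look for the shared library in the plug-in directory. The sketch below assumes the directory and file name used in the commands above; `flash_plugin_installed` is a hypothetical helper, not part of phantomjs or pyspider.

```python
import os

# Location and file name taken from the copy step above.
PLUGIN_DIR = "/usr/lib/mozilla/plugins"
PLUGIN_NAME = "libflashplayer.so"


def flash_plugin_installed(plugin_dir=PLUGIN_DIR, name=PLUGIN_NAME):
    """Return True when the plug-in exists as a non-empty file in plugin_dir."""
    path = os.path.join(plugin_dir, name)
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

This only checks that the file is in place; whether phantomjs actually loads it is what the test script in the next step is for.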

5. How can we test whether phantomjs was installed successfully? The following is a test script:

var page = require('webpage').create();
page.open('http://www.dhs.state.il.us/accessibility/tests/flash/video.html', function () {
    // Wait 10 seconds so the Flash content has time to load before rendering.
    window.setTimeout(function () {
        page.render('video.png');
        phantom.exit();
    }, 10000);
});

Save the script above as test.js.
Then run: ./phantomjs --load-plugins=yes test.js
Check whether video.png has appeared in the current directory; it should show the Flash video being played.
phantomjs is now installed.
For more details, refer to: http://www.ryanbridges.org/2013/05/21/putting-the-flash-back-in-phantomjs/
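
The run-and-check step can also be scripted. The sketch below assumes the file names used above (test.js, video.png, a phantomjs binary in the current directory); `build_command` and `render_succeeded` are hypothetical helper names.

```python
import os
import subprocess


def build_command(script, binary="./phantomjs", load_plugins=True):
    """Build the argument list for running `script` with the plug-in switch."""
    flag = "--load-plugins=%s" % ("yes" if load_plugins else "no")
    return [binary, flag, script]


def render_succeeded(script="test.js", output="video.png", binary="./phantomjs"):
    """Run the script and report whether a non-empty screenshot was written."""
    if os.path.exists(output):
        os.remove(output)  # avoid mistaking an old screenshot for a new one
    try:
        subprocess.call(build_command(script, binary))
    except OSError:
        return False  # the phantomjs binary could not be started
    return os.path.isfile(output) and os.path.getsize(output) > 0
```

A `True` result only shows that a screenshot was produced. Whether the Flash video is visible in it still has to be checked by opening video.png.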

III. Install pyspider
Now for the main part: installing pyspider.
1. All the packages that pyspider needs will be installed through pip, so pip must already be present on the system. If it is not, install pip from source now.
Download the pip source archive (pip-6.1.1.tar.gz is used here).
Install it as follows:

# Unpack
tar -zxvf pip-6.1.1.tar.gz
cd pip-6.1.1
# Install
python setup.py install
# At this point an error is reported saying that setuptools is missing.
# Download setuptools from the setuptools official website, then unpack it
tar -zxvf setuptools-3.6.tar.gz
cd setuptools-3.6
# Install
python setup.py install
# Then go back to the pip-6.1.1 directory and run python setup.py install again
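
To confirm that both packages ended up importable by the interpreter that will run pyspider, a check along these lines can be used. It is a sketch with hypothetical helper names (`importable`, `missing_modules`) and relies only on the built-in import machinery.

```python
def importable(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


def missing_modules(names=("setuptools", "pip")):
    """List the modules from `names` that cannot be imported."""
    return [name for name in names if not importable(name)]
```

An empty list from `missing_modules()` suggests that the pip and setuptools installation steps above succeeded for this Python installation.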

2. After pip is installed, run the following command:

[root@localhost pyspider-master]# pip  install  -r requirements.txt 

3. Then execute

pip install pyspider

4. Run ./run.py and open http://192.168.1.1:5000/ in a browser. If the pyspider web interface appears, the installation is complete.
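
On a server without a browser, the same check can be made with a small HTTP probe. The sketch below assumes the address and port from the step above; `webui_reachable` is a hypothetical helper, and the import fallback is there so that it runs on both Python 2.7 and Python 3.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as used in this article


def webui_reachable(url="http://192.168.1.1:5000/", timeout=5):
    """Return True if an HTTP request to `url` answers with status 200."""
    try:
        response = urlopen(url, timeout=timeout)
    except (IOError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False
    try:
        return response.getcode() == 200
    finally:
        response.close()
```

A `False` result does not say why the request failed; checking that run.py is still running and that port 5000 is not blocked by a firewall is a reasonable next step.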

The environment is now set up.

A later article will walk through an example: using this environment to implement a crawler project.
