Pyspider Introduction
Pyspider is a powerful web crawler system, written by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, and supports a variety of database backends; its WebUI provides a script editor, task monitor, project manager, and result viewer.
Pyspider grew out of a crawler backend previously built for a vertical search engine. That system had to collect data from 200 sites (not all at the same time, due to site failures; 100+ were running at any given moment) and push a site's updates into the database within 5 minutes, so flexible crawl control was a must.
At the same time, with 100+ sites, on any given day some site may fail or be redesigned, so the system must be able to detect template invalidation and expose the crawl status.
To meet the 5-minute update target, the crawler fetches each site's most recently updated page and uses the last update time to decide whether a page needs to be crawled again.
Clearly, this project places very high demands on crawler monitoring and scheduling.
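The last-update-time check described above can be sketched as a small helper. This is an illustrative assumption of the logic, not pyspider's actual scheduler code; the names and the 5-minute budget are chosen for the demo:

```python
import time

def needs_recrawl(last_crawled_at, site_updated_at, max_age=5 * 60):
    """Decide whether a page must be fetched again.

    last_crawled_at  -- unix timestamp of our last successful crawl
    site_updated_at  -- unix timestamp the site reports for its newest update
    max_age          -- freshness budget in seconds (5 minutes here)
    """
    if site_updated_at > last_crawled_at:
        return True  # the site changed since we last fetched the page
    # otherwise re-crawl only once the copy is older than the budget
    return (time.time() - last_crawled_at) > max_age

# Page crawled 10 minutes ago, site unchanged for an hour -> stale, re-crawl
print(needs_recrawl(time.time() - 600, time.time() - 3600))
```

A real scheduler layers retries and priorities on top of this, but the core decision is just this comparison.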
Pyspider Chinese web: http://www.pyspider.cn
Pyspider Official website: http://docs.pyspider.org
Pyspider Demo: http://demo.pyspider.org
Pyspider Source: https://github.com/binux/pyspider
Pyspider characteristics
- Python script control; you can use any HTML parsing package you like (pyquery is built in)
- Web interface for writing and debugging scripts, starting and stopping scripts, monitoring execution status, viewing activity history, and getting result output
- Data storage supports MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (via SQLAlchemy)
- Message queues support RabbitMQ, Beanstalk, Redis, and Kombu
- Support for crawling JavaScript pages
- Replaceable components; supports standalone/distributed deployment and Docker deployment
- Powerful scheduling control, with support for re-crawling by age and priority settings
- Supports Python 2.{6, 7} and 3.{3, 4, 5, 6}
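The first feature above means the parsing layer is entirely up to you: pyquery ships with pyspider, but any parser works. As a neutral illustration (the HTML sample is made up for the demo), here is a link extractor using only the standard library's html.parser, mirroring the a[href^="http"] pyquery selector used in the examples later in this article:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values that start with 'http'."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.startswith('http'):
                self.links.append(href)

html = '<a href="http://mimvp.com">home</a> <a href="/local">skip</a>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)   # -> ['http://mimvp.com']
```

Inside a pyspider script you would normally use the bundled response.doc (pyquery) instead, as the examples below do.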
Pyspider Installation
1) PIP Installation
pip runs on the following CPython versions: 2.6, 2.7, 3.1, 3.2, 3.3, 3.4, and also on PyPy.
pip runs on Unix/Linux, Mac OS X, and Windows systems.
a) Script installation
python get-pip.py
If setuptools (or distribute) is not installed, get-pip.py will install it for you automatically.
If you need to upgrade setuptools (or distribute), run: pip install -U setuptools
b) Command installation
sudo apt-get install python-pip //Debian, Ubuntu
sudo yum install python-pip //CentOS, Redhat, Fedora
2) PhantomJS Installation
PhantomJS is a WebKit-based, server-side JavaScript API. It fully renders the web without needing a browser, and it is fast, with native support for a variety of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. PhantomJS can be used for page automation, network monitoring, web page screenshots, and headless testing. It supports multiple operating systems such as Windows, Linux, and Mac OS X.
PhantomJS Download: http://phantomjs.org/download.html
PhantomJS does not need to be installed: after unpacking it and configuring the environment variables, it can be used directly; see PhantomJS Installation and development.
PhantomJS installation commands:
sudo apt-get install phantomjs //Debian, Ubuntu
sudo pkg install phantomjs //FreeBSD
brew install phantomjs //Mac OS X
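Since PhantomJS only needs to be on the PATH, a quick way to confirm pyspider will be able to find it is to ask the binary for its version. This helper is a convenience sketch, not part of pyspider; it returns None when the binary is missing:

```python
import subprocess

def phantomjs_version(binary='phantomjs'):
    """Return the PhantomJS version string, or None if the binary
    is not installed or not on the PATH."""
    try:
        out = subprocess.run([binary, '--version'],
                             capture_output=True, text=True, timeout=10)
    except FileNotFoundError:
        return None
    return out.stdout.strip() or None

print(phantomjs_version() or 'PhantomJS not found on PATH')
```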
3) Pyspider Installation
Pyspider's dependency packages are listed in requirements.txt:
flask>=0.10
jinja2>=2.7
chardet>=2.2
cssselect>=0.9
lxml
pycurl
pyquery
requests>=2.2
tornado>=3.2
mysql-connector-python>=1.2.2
pika>=0.9.14
pymongo>=2.7.2
unittest2>=0.5.1
flask-login>=0.2.11
u-msgpack-python>=1.6
click>=3.3
sqlalchemy>=0.9.7
six>=1.5.0
amqp>=1.3.0,<2.0
redis
redis-py-cluster
kombu
psycopg2
elasticsearch
tblib
Pyspider installation command:
pip install pyspider
Ubuntu users, please install the following Support class library in advance:
sudo apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml
4) Verify the installation is successful
Enter the command in the console:
pyspider all
Then open http://localhost:5000 in a browser.
If the pyspider page appears normally, everything is OK.
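The same check can be scripted. This small probe is an illustrative helper, not part of pyspider; it reports whether the WebUI answers on the default port 5000:

```python
from urllib.request import urlopen
from urllib.error import URLError

def dashboard_up(url='http://localhost:5000', timeout=3):
    """Return True if the pyspider WebUI responds with HTTP 200 at url."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False  # not running, refused, or timed out

print('pyspider WebUI reachable:', dashboard_up())
```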
Pyspider Example
1) Example 1: Crawl the mimvp.com home page
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-28 13:44:53
# Project: pyspiderdemo
# mimvp.com

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('mimvp.com', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
Operation Result:
2) Example 2: Set up an agent crawl Web page
Pyspider supports crawling web pages through a proxy in two ways:
Mode 1:
--phantomjs-proxy TEXT    phantomjs proxy ip:port
Start commands such as:
pyspider --phantomjs-proxy "188.226.141.217:8080" all
Mode 2:
Set the proxy in the crawl_config global variable, for example:
crawl_config = {
    'proxy': '188.226.141.217:8080'
}
Example code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-28 14:13:14
# Project: mimvp_proxy_pyspider
#
# mimvp.com

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'http://188.226.141.217:8080',     # http
        # 'proxy': 'https://182.253.32.65:3128',    # https
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://proxy.mimvp.com/exist.php', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
Operation Result: