Python Pyspider Installation and development


Pyspider Introduction

Pyspider is a powerful web crawler system written in Python by a Chinese developer (binux). It has a distributed architecture, supports a variety of database backends, and provides a powerful WebUI with a script editor, task monitor, project manager, and result viewer.

Pyspider grew out of a crawler backend previously built for a vertical search engine. We needed to collect data from 200 sites (because of site failures, not all ran at the same time; 100+ were running), and updates on those sites had to reach our database within 5 minutes. Flexible crawl control was therefore a must.

At the same time, with 100+ sites, on any given day some site may fail or be redesigned, so we needed to be able to detect template invalidation and view the crawl status.

To achieve the 5-minute update, we crawl the most recently updated pages and use each page's last update time to decide whether it needs to be crawled again.
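The freshness check described above can be sketched in plain Python. The function name and fields below are illustrative, not part of pyspider's API:

```python
import time

def needs_recrawl(last_crawled_at, max_age_seconds):
    """Return True if the page was last crawled longer ago than max_age_seconds."""
    return (time.time() - last_crawled_at) >= max_age_seconds

# A page crawled 10 minutes ago, with a 5-minute freshness window, is stale:
print(needs_recrawl(time.time() - 600, 300))   # True
```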

Clearly, this project placed very high demands on crawler monitoring and scheduling.

Pyspider Chinese web: http://www.pyspider.cn

Pyspider Official website: http://docs.pyspider.org

Pyspider Demo: http://demo.pyspider.org

Pyspider Source: https://github.com/binux/pyspider

Pyspider Characteristics

    • Scripts are written in Python; you can use any HTML parsing package you like (pyquery is built in)
    • Web interface for writing and debugging scripts, starting and stopping projects, monitoring execution status, viewing activity history, and getting result output
    • Data storage supports MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (via SQLAlchemy)
    • Message queues support RabbitMQ, Beanstalk, Redis, and Kombu
    • Support for crawling JavaScript pages
    • Replaceable components; supports standalone/distributed deployment and Docker deployment
    • Powerful scheduling control, with support for age-based re-crawls and priority settings
    • Supports Python 2.{6, 7} and 3.{3, 4, 5, 6}

Pyspider Installation

1) PIP Installation

pip runs on the following CPython versions: 2.6, 2.7, 3.1, 3.2, 3.3, and 3.4, as well as on PyPy.

pip runs on Unix/Linux, Mac OS X, and Windows.

a) Script installation

python get-pip.py

If setuptools (or distribute) is not installed, get-pip.py will automatically install setuptools for you.

If you need to upgrade setuptools (or distribute), run pip install -U setuptools

b) Command installation

sudo apt-get install python-pip //Debian, Ubuntu

sudo yum install python-pip //CentOS, Redhat, Fedora

2) PhantomJS Installation

PhantomJS is a WebKit-based, headless (server-side) JavaScript API. It fully supports the web without needing a browser, it is fast, and it has native support for a variety of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. PhantomJS can be used for page automation, network monitoring, web page screenshots, and headless testing. It supports Windows, Linux, Mac OS X, and other operating systems.

PhantomJS Download: http://phantomjs.org/download.html

PhantomJS does not need to be installed: after decompressing the package and configuring the environment variables, it can be used directly; see PhantomJS Installation and Development.

PhantomJS installation commands:

sudo apt-get install phantomjs //Debian, Ubuntu

sudo pkg install phantomjs //FreeBSD

brew install phantomjs //Mac OS X
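Once PhantomJS is on the PATH, pyspider uses it to render JavaScript pages via the fetch_type='js' argument of self.crawl(). A minimal handler sketch (the class name and URL are placeholders; this fragment needs a running pyspider instance to do anything):

```python
from pyspider.libs.base_handler import *

class JsHandler(BaseHandler):
    def on_start(self):
        # fetch_type='js' tells the fetcher to render the page with PhantomJS
        self.crawl('http://example.com/', fetch_type='js',
                   callback=self.detail_page)

    def detail_page(self, response):
        # response.doc is the rendered DOM, so JavaScript-generated
        # content is visible to the selectors
        return {'title': response.doc('title').text()}
```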

3) Pyspider Installation

The dependency packages pyspider installs are listed in requirements.txt:

flask>=0.10
jinja2>=2.7
chardet>=2.2
cssselect>=0.9
lxml
pycurl
pyquery
requests>=2.2
tornado>=3.2
mysql-connector-python>=1.2.2
pika>=0.9.14
pymongo>=2.7.2
unittest2>=0.5.1
flask-login>=0.2.11
u-msgpack-python>=1.6
click>=3.3
sqlalchemy>=0.9.7
six>=1.5.0
amqp>=1.3.0,<2.0
redis
redis-py-cluster
kombu
psycopg2
elasticsearch
tblib

Pyspider installation command:

pip install pyspider

Ubuntu users should install the following support libraries in advance:

sudo apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml

4) Verify the installation is successful

Enter the following command in the console:

pyspider all

Then access http://localhost:5000 with a browser.

If the pyspider page appears normally, everything is OK.
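The manual browser check can also be done programmatically. A stdlib-only sketch (the URL assumes the default WebUI port 5000):

```python
from urllib.request import urlopen

def webui_is_up(url="http://localhost:5000", timeout=3):
    """Return True if an HTTP server answers with status 200 at the given URL."""
    try:
        return urlopen(url, timeout=timeout).status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
```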

Pyspider Examples

1) Example 1: Crawl a technology homepage (mimvp.com)

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-28 13:44:53
# Project: pyspiderdemo
# mimvp.com

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('mimvp.com', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

Run result: (screenshot omitted)

2) Example 2: Crawl web pages through a proxy

Pyspider supports two ways of crawling web pages through a proxy:

Mode 1:

--phantomjs-proxy TEXT    PhantomJS proxy ip:port

For example, start with:

pyspider --phantomjs-proxy "188.226.141.217:8080" all

Mode 2:

Set the proxy as a global variable, for example:

crawl_config = {
    'proxy': '188.226.141.217:8080'
}

Example code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-28 14:13:14
# Project: mimvp_proxy_pyspider
#
# mimvp.com

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'http://188.226.141.217:8080',    # HTTP proxy
        # 'proxy': 'https://182.253.32.65:3128',   # HTTPS proxy (only one 'proxy' key can be active)
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://proxy.mimvp.com/exist.php', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

Run result: (screenshot omitted)
