Pyspider Introduction
Pyspider is a powerful web crawler system, written by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, and supports a variety of database backends; its WebUI provides a script editor, task monitor, project manager, and result viewer.
Pyspider grew out of a crawler backend previously built for a vertical search engine. That system had to collect data from 200 sites (not all at the same time, due to site failures; 100+ were running at any given moment) and push a site's updates into the database within 5 minutes, so flexible crawl control was a must.
At the same time, with 100+ sites, on any given day some site may fail or be redesigned, so the system must be able to detect template invalidation and expose the crawl status.
To meet the 5-minute update target, the crawler fetches each site's most recently updated page and uses the last update time to decide whether a page needs to be crawled again.
Clearly, this project places very high demands on crawler monitoring and scheduling.
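The last-update-time check described above can be sketched as a small helper. This is an illustrative assumption of the logic, not pyspider's actual scheduler code; the names and the 5-minute budget are chosen for the demo:

```python
import time

def needs_recrawl(last_crawled_at, site_updated_at, max_age=5 * 60):
    """Decide whether a page must be fetched again.

    last_crawled_at  -- unix timestamp of our last successful crawl
    site_updated_at  -- unix timestamp the site reports for its newest update
    max_age          -- freshness budget in seconds (5 minutes here)
    """
    if site_updated_at > last_crawled_at:
        return True  # the site changed since we last fetched the page
    # otherwise re-crawl only once the copy is older than the budget
    return (time.time() - last_crawled_at) > max_age

# Page crawled 10 minutes ago, site unchanged for an hour -> stale, re-crawl
print(needs_recrawl(time.time() - 600, time.time() - 3600))
```

A real scheduler layers retries and priorities on top of this, but the core decision is just this comparison.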
Pyspider Chinese web: http://www.pyspider.cn
Pyspider Official website: http://docs.pyspider.org
Pyspider Demo: http://demo.pyspider.org
Pyspider Source: https://github.com/binux/pyspider
Pyspider characteristics
- Python script control; you can use any HTML parsing package you like (pyquery is built in)
- Web interface for writing and debugging scripts, starting and stopping scripts, monitoring execution status, viewing activity history, and getting result output
- Data storage supports MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (via SQLAlchemy)
- Message queues support RabbitMQ, Beanstalk, Redis, and Kombu
- Support for crawling JavaScript pages
- Replaceable components; supports standalone/distributed deployment and Docker deployment
- Powerful scheduling control, with support for re-crawling by age and priority settings
- Supports Python 2.{6, 7} and 3.{3, 4, 5, 6}
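The first feature above means the parsing layer is entirely up to you: pyquery ships with pyspider, but any parser works. As a neutral illustration (the HTML sample is made up for the demo), here is a link extractor using only the standard library's html.parser, mirroring the a[href^="http"] pyquery selector used in the examples later in this article:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values that start with 'http'."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.startswith('http'):
                self.links.append(href)

html = '<a href="http://mimvp.com">home</a> <a href="/local">skip</a>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)   # -> ['http://mimvp.com']
```

Inside a pyspider script you would normally use the bundled response.doc (pyquery) instead, as the examples below do.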
Pyspider Installation
1) PIP Installation
pip runs on the following CPython versions: 2.6, 2.7, 3.1, 3.2, 3.3, 3.4, and also on PyPy.
pip runs on Unix/Linux, Mac OS X, and Windows systems.
a) Script installation
python get-pip.py
If setuptools (or distribute) is not installed, get-pip.py will install it for you automatically.
If you need to upgrade setuptools (or distribute), run: pip install -U setuptools
b) Command installation
sudo apt-get install python-pip //Debian, Ubuntu
sudo yum install python-pip //CentOS, Redhat, Fedora
2) PhantomJS Installation
PhantomJS is a WebKit-based, server-side JavaScript API. It fully renders the web without needing a browser, and it is fast, with native support for a variety of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. PhantomJS can be used for page automation, network monitoring, web page screenshots, and headless testing. It supports multiple operating systems such as Windows, Linux, and Mac OS X.
PhantomJS Download: http://phantomjs.org/download.html
PhantomJS does not need to be installed: after unpacking it and configuring the environment variables, it can be used directly; see PhantomJS Installation and development.
PhantomJS installation commands:
sudo apt-get install phantomjs //Debian, Ubuntu
sudo pkg install phantomjs //FreeBSD
brew install phantomjs //Mac OS X
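Since PhantomJS only needs to be on the PATH, a quick way to confirm pyspider will be able to find it is to ask the binary for its version. This helper is a convenience sketch, not part of pyspider; it returns None when the binary is missing:

```python
import subprocess

def phantomjs_version(binary='phantomjs'):
    """Return the PhantomJS version string, or None if the binary
    is not installed or not on the PATH."""
    try:
        out = subprocess.run([binary, '--version'],
                             capture_output=True, text=True, timeout=10)
    except FileNotFoundError:
        return None
    return out.stdout.strip() or None

print(phantomjs_version() or 'PhantomJS not found on PATH')
```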
3) Pyspider Installation
Pyspider's dependency packages are listed in requirements.txt:
flask>=0.10
jinja2>=2.7
chardet>=2.2
cssselect>=0.9
lxml
pycurl
pyquery
requests>=2.2
tornado>=3.2
mysql-connector-python>=1.2.2
pika>=0.9.14
pymongo>=2.7.2
unittest2>=0.5.1
flask-login>=0.2.11
u-msgpack-python>=1.6
click>=3.3
sqlalchemy>=0.9.7
six>=1.5.0
amqp>=1.3.0,<2.0
redis
redis-py-cluster
kombu
psycopg2
elasticsearch
tblib
Pyspider installation command:
pip install pyspider
Ubuntu users, please install the following Support class library in advance:
sudo apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml
4) Verify the installation is successful
Enter the command in the console:
pyspider all
Then open http://localhost:5000 in a browser.
If the pyspider page appears normally, everything is OK.
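The same check can be scripted. This small probe is an illustrative helper, not part of pyspider; it reports whether the WebUI answers on the default port 5000:

```python
from urllib.request import urlopen
from urllib.error import URLError

def dashboard_up(url='http://localhost:5000', timeout=3):
    """Return True if the pyspider WebUI responds with HTTP 200 at url."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False  # not running, refused, or timed out

print('pyspider WebUI reachable:', dashboard_up())
```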
Pyspider Example
1) Example 1: Crawl the mimvp.com home page
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-28 13:44:53
# Project: pyspiderdemo
# mimvp.com

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('mimvp.com', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
Operation Result:
2) Example 2: Set up an agent crawl Web page
Pyspider supports crawling web pages through a proxy in two ways:
Mode 1:
--phantomjs-proxy TEXT    phantomjs proxy ip:port
Start commands such as:
pyspider --phantomjs-proxy "188.226.141.217:8080" all
Mode 2:
Set the proxy in the crawl_config global variable, for example:
crawl_config = {
    'proxy': '188.226.141.217:8080'
}
Example code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-28 14:13:14
# Project: mimvp_proxy_pyspider
#
# mimvp.com

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'http://188.226.141.217:8080',     # http
        # 'proxy': 'https://182.253.32.65:3128',    # https
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://proxy.mimvp.com/exist.php', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
Operation Result: