Python Crawler from Getting Started to Giving Up (21): Scrapy Distributed Deployment


In the previous article we got our code onto the remote host by copying it or pulling it with Git. That works, but once there are many remote hosts it becomes cumbersome. Is there an easier way? Yes: Scrapyd. GitHub address: https://github.com/scrapy/scrapyd

Once Scrapyd is installed and started on the remote host, it runs a web service, listening on port 6800 by default, through which we can manage our Scrapy projects with HTTP requests. There is no longer any need to copy code between machines or pull it with Git. Scrapyd's official documentation: http://scrapyd.readthedocs.io/en/stable/

Installing Scrapyd

Install Scrapyd: pip install scrapyd

Here I also install the scrapy and scrapyd packages in another Ubuntu Linux virtual machine, making sure all the required packages are present. Together with the Linux machine from the previous article, we now have two Linux hosts.

One small thing to note: Scrapyd can be started simply by running scrapyd, but by default it binds to the IP address 127.0.0.1 on port 6800. To allow access from the other virtual machine, set the bind address to 0.0.0.0.
Scrapyd's configuration file is /usr/local/lib/python3.5/dist-packages/scrapyd/default_scrapyd.conf
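
For reference, this is roughly the relevant part of default_scrapyd.conf after the change (a sketch: other options in the file stay at their defaults, and the exact defaults can differ slightly between Scrapyd versions):

[scrapyd]
# listen on all interfaces instead of only localhost
bind_address = 0.0.0.0
http_port    = 6800

Restart Scrapyd after editing the file so the new bind address takes effect.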

With the bind address changed and Scrapyd restarted, we can access it through a browser:
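
From another machine you can also do a quick check on the command line; recent Scrapyd versions expose a daemonstatus.json endpoint (the IP below is the remote host used throughout this article):

curl http://192.168.1.9:6800/daemonstatus.json

It should return a small JSON object with "status": "ok" and the current pending/running/finished job counts.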

About deployment

How do we deploy a project through Scrapyd? This is done with scrapyd-client; the official documentation points to: https://github.com/scrapy/scrapyd-client

scrapyd-client mainly does the following:

    1. Packages our local code into an egg file
    2. Uploads the egg to the remote server at the configured URL

We edit the scrapy.cfg configuration file in our local Scrapy project:

We can also set a username and password here, but there is no need to.
One thing to pay attention to when setting the url: url = http://192.168.1.9:6800/addversion.json
The trailing addversion.json must not be left out.
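
The original screenshot of the configuration is not reproduced here, but the [deploy] section of scrapy.cfg looks roughly like this (a sketch: the project name zhihu_user matches the deploy output below, and the username/password lines are the optional ones mentioned above):

[deploy]
url = http://192.168.1.9:6800/addversion.json
project = zhihu_user
#username = your_username
#password = your_password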

Locally we install scrapyd-client with pip install scrapyd-client, and after installation we run scrapyd-deploy in the project directory:

zhaofandembp:zhihu_user zhaofan$ scrapyd-deploy
Packing version 1502177138
Deploying to project "zhihu_user" in http://192.168.1.9:6800/addversion.json
Server response (200):
{"node_name": "fan-virtualbox", "status": "ok", "version": "1502177138", "spiders": 1, "project": "zhihu_user"}
zhaofandembp:zhihu_user zhaofan$

A status of 200 means the deployment succeeded.

About the commonly used API operations

listprojects.json: lists the projects that have been uploaded

zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/listprojects.json
{"node_name": "fan-virtualbox", "status": "ok", "projects": ["zhihu_user"]}
zhaofandembp:zhihu_user zhaofan$

listversions.json: lists the versions of an uploaded project

zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/listversions.json\?project\=zhihu_user
{"node_name": "fan-virtualbox", "status": "ok", "versions": ["1502177138"]}
zhaofandembp:zhihu_user zhaofan$

schedule.json: starts a task on the remote host

Below we call it three times, which starts three jobs, i.e. three scheduled tasks running the zhihu spider.

zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/schedule.json -d project=zhihu_user -d spider=zhihu
{"node_name": "fan-virtualbox", "status": "ok", "jobid": "97f1b5027c0e11e7b07a080027bbde73"}
zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/schedule.json -d project=zhihu_user -d spider=zhihu
{"node_name": "fan-virtualbox", "status": "ok", "jobid": "99595aa87c0e11e7b07a080027bbde73"}
zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/schedule.json -d project=zhihu_user -d spider=zhihu
{"node_name": "fan-virtualbox", "status": "ok", "jobid": "9abb1ba27c0e11e7b07a080027bbde73"}
zhaofandembp:zhihu_user zhaofan$

Once the jobs have started we can view them on the Jobs page. Because my remote server did not have scrapy_redis installed, the tasks were shown as finished right away; opening the log shows the details:

The reason for the error is that I forgot to install the scrapy_redis and pymongo modules on the Ubuntu virtual machine.
After installing them with pip install scrapy_redis pymongo and restarting, you can see the tasks running, and by opening the log you can also see the content being crawled:

listjobs.json: lists all jobs.
The page above displays all the tasks; here is the command-line way to get the same result:

zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/listjobs.json\?project\=zhihu_user
{"node_name": "fan-virtualbox", "status": "ok", "running": [], "pending": [], "finished": [
  {"start_time": "2017-08-08 15:53:00.510050", "spider": "zhihu", "id": "97f1b5027c0e11e7b07a080027bbde73", "end_time": "2017-08-08 15:53:01.416139"},
  {"start_time": "2017-08-08 15:53:05.509337", "spider": "zhihu", "id": "99595aa87c0e11e7b07a080027bbde73", "end_time": "2017-08-08 15:53:06.627125"},
  {"start_time": "2017-08-08 15:53:10.509978", "spider": "zhihu", "id": "9abb1ba27c0e11e7b07a080027bbde73", "end_time": "2017-08-08 15:53:11.542001"}
]}
zhaofandembp:zhihu_user zhaofan$

cancel.json: cancels a running task.
Each of the jobs started above can be cancelled:

zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/cancel.json -d project=zhihu_user -d job=0f5cdabc7c1011e7b07a080027bbde73
{"node_name": "fan-virtualbox", "status": "ok", "prevstate": "running"}
zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/cancel.json -d project=zhihu_user -d job=63f8e12e7c1011e7b07a080027bbde73
{"node_name": "fan-virtualbox", "status": "ok", "prevstate": "running"}
zhaofandembp:zhihu_user zhaofan$ curl http://192.168.1.9:6800/cancel.json -d project=zhihu_user -d job=63f8e12f7c1011e7b07a080027bbde73
{"node_name": "fan-virtualbox", "status": "ok", "prevstate": "running"}

When we look at the page again, all the tasks are in the finished state:

You will probably find the methods above inconvenient to use and tedious to type, so someone has done us a favour and wrapped these APIs once more: https://github.com/djm/python-scrapyd-api

About python-scrapyd-api

This module lets us call the above APIs directly from Python code.
First install the module: pip install python-scrapyd-api
Here is a simple example of how to use it; the other methods follow the same pattern:

from scrapyd_api import ScrapydAPI

# connect to the Scrapyd instance on the remote host
scrapyd = ScrapydAPI('http://192.168.1.9:6800')
res = scrapyd.list_projects()
res2 = scrapyd.list_jobs('zhihu_user')
print(res)
print(res2)

Cancel a scheduled job:
scrapyd.cancel('project_name', '14a6599ef67111e38a0e080027880ca6')

Delete a project and all sibling versions:
scrapyd.delete_project('project_name')

Delete a version of a project:
scrapyd.delete_version('project_name', 'version_name')

Request the status of a job:
scrapyd.job_status('project_name', '14a6599ef67111e38a0e080027880ca6')

List all jobs registered:
scrapyd.list_jobs('project_name')

List all projects registered:
scrapyd.list_projects()

List all spiders available to a given project:
scrapyd.list_spiders('project_name')

List all versions registered to a given project:
scrapyd.list_versions('project_name')

Schedule a job to run with a specific spider:
scrapyd.schedule('project_name', 'spider_name')

Schedule a job to run while passing override settings:
settings = {'DOWNLOAD_DELAY': 2}
scrapyd.schedule('project_name', 'spider_name', settings=settings)

Schedule a job to run while passing extra attributes to the spider's initialisation:
scrapyd.schedule('project_name', 'spider_name', extra_attribute='value')
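
Putting a few of these together, here is a minimal end-to-end sketch (assuming the same Scrapyd host and the zhihu_user project used above): it schedules one run of the zhihu spider and polls its status until it finishes. job_status returns 'pending', 'running', 'finished' or an empty string if the job is unknown.

import time

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://192.168.1.9:6800')

# start one run of the zhihu spider and remember the job id Scrapyd assigns
job_id = scrapyd.schedule('zhihu_user', 'zhihu')

# poll until Scrapyd reports the job as finished
while scrapyd.job_status('zhihu_user', job_id) != 'finished':
    time.sleep(5)

print('job %s is finished' % job_id)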
