51 Python Distributed Crawler: Building a Search Engine with Scrapy - Deploying a Scrapy Project with Scrapyd

Source: Internet
Author: User

The scrapyd module is dedicated to deploying Scrapy projects; it can both deploy and manage Scrapy projects.

GitHub: https://github.com/scrapy/scrapyd

Recommended installation:

pip3 install scrapyd

Install the scrapyd module first. After installation, a scrapyd.exe startup file is generated in the Scripts folder of the Python installation directory; if this file exists, the installation was successful and we can execute the command.

Start the scrapyd service

At the command prompt, enter: scrapyd

This indicates that the service started successfully. Close or exit the command window, because in practice we start the service from a specified startup directory.

Starting the service from a specified service directory

Reopen the command window, cd into the directory designated for the service, and run the scrapyd command to start the service.

At this point you can see that a dbs directory has been generated in the startup directory.

The dbs directory is empty at this point; it contains nothing yet.
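To confirm the service is actually reachable, a quick check against the daemonstatus.json endpoint works well. Below is a minimal sketch using the requests library, assuming scrapyd is running on the default port 6800:

import requests

# Query the scrapyd daemon status endpoint (default port is 6800)
r = requests.get('http://localhost:6800/daemonstatus.json')
# A healthy service answers with JSON such as {"status": "ok", "running": 0, "pending": 0, "finished": 0, ...}
print(r.json())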

At this point we need to install the scrapyd-client module.

The scrapyd-client module is used specifically to package Scrapy crawler projects and deploy them to the scrapyd service.

Download: https://github.com/scrapy/scrapyd-client

Recommended installation:

pip3 install scrapyd-client

After installation, a scrapyd-deploy file (with no suffix) is generated in the Scripts folder of the Python installation directory; if this file exists, the installation was successful.

Key note: this suffix-less scrapyd-deploy file is the startup file. It can be run directly on Linux, but not on Windows, so we need to edit it so that it can run on Windows.

Create a new scrapyd-deploy.bat file in that same directory; note that the name must match scrapyd-deploy. We edit this .bat file so that scrapyd-deploy can run on Windows.

Editing the scrapyd-deploy.bat file

Set the path to the Python executable and the path to the suffix-less scrapyd-deploy file:

@echo off
"C:\Users\admin\AppData\Local\Programs\Python\Python35\python.exe" "C:\Users\admin\AppData\Local\Programs\Python\Python35\Scripts\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9

After editing the scrapyd-deploy.bat file, open a command window, cd into the directory of the Scrapy project that contains the scrapy.cfg file, and run the scrapyd-deploy command to check whether the scrapyd-deploy.bat file we edited can be executed.

If usage information is printed, the file can be executed.

Configure the scrapy.cfg file in the Scrapy project; this file is used by scrapyd-deploy.

scrapy.cfg File

Note: the comments shown below must not actually be written into the file, or it will raise an error; they are included here only to explain the settings.

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = ADC.settings

[deploy:bobby]                 # deployment name: bobby
url = http://localhost:6800/   # scrapyd service URL
project = ADC                  # project name

In the command window, run: scrapyd-deploy -l. This lists the deploy targets, so you can see the deployment name we set.
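For reference, with the scrapy.cfg above the target listing would look roughly like this (illustrative output, assuming only the bobby target is configured):

bobby                http://localhost:6800/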

Before packaging, run the command: scrapy list. If this command succeeds, the project can be packaged; if it does not succeed, the preceding setup is not complete.

Note that the scrapy list command is quite likely to raise an error. If Python cannot find the Scrapy project, you need to add a path that Python can recognize in the project's settings.py configuration file:

# settings.py
# Add the project's first-level ADC directory to the paths Python can import from
import os
import sys

BASE_DIR = os.path.dirname(os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, os.path.join(BASE_DIR, 'ADC'))

If the error says something like "the remote computer refused the connection", your Scrapy project connects to a remote service such as a database or Elasticsearch (the search engine), and you need to start that server first.

If the scrapy list command returns the crawler names, everything is OK, for example:
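An illustrative run, assuming the project contains a spider named lagou (the spider used in the curl examples later in this article):

scrapy list
lagou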

Now we can package the Scrapy project and deploy it to scrapyd. The packaging command uses the [deploy] settings from the project's scrapy.cfg file shown above.

Run the packaging command: scrapyd-deploy <deployment name> -p <project name>

Example: scrapyd-deploy bobby -p ADC

Output like the following indicates that the Scrapy project was packaged successfully.
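The exact text varies by version, but successful output is roughly of this shape (illustrative, with a made-up version number and node name):

Packing version 1506838255
Deploying to project "ADC" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "ADC", "version": "1506838255", "spiders": 1, "node_name": "DESKTOP-EXAMPLE"}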

What happens to the Scrapy project after successful packaging

After the Scrapy project is packaged successfully, the corresponding files are generated in the directory from which the scrapyd service was started, as follows:

1. A <Scrapy project name>.db file is generated in the dbs folder under the scrapyd service startup directory.

2. A folder named after the Scrapy project is generated in the eggs folder under the scrapyd service startup directory; it contains the .egg file produced by scrapyd-deploy.

3. When the Scrapy crawler project is packaged, two folders are generated inside the Scrapy project itself: a build folder and a project.egg-info folder.

The build folder contains the packaged crawler project; this packaged copy is what scrapyd runs afterwards.

The project.egg-info folder contains the packaging configuration and metadata.

Note: scrapyd-deploy is only responsible for packaging the Scrapy crawler project and deploying it to scrapyd, and packaging needs to be done only once. After packaging, starting crawlers, stopping crawlers, and other project management tasks are handled by scrapyd.
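Under the hood, scrapyd-deploy essentially uploads the generated .egg file to scrapyd's addversion.json endpoint. The sketch below only illustrates that idea with the requests library; the egg path and version string are made up, and in practice you would simply use scrapyd-deploy itself.

import requests

# Hypothetical path to an egg produced by a previous packaging run
egg_path = 'eggs/ADC/1506838255.egg'

with open(egg_path, 'rb') as egg:
    r = requests.post(
        'http://localhost:6800/addversion.json',
        data={'project': 'ADC', 'version': '1506838255'},
        files={'egg': egg},
    )
# On success scrapyd answers with {"status": "ok", "spiders": <count>, ...}
print(r.json())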

Managing Scrapy projects with scrapyd

Note: scrapyd is managed through the curl command. curl is not available by default at the Windows command prompt and is normally used on Linux, so on Windows we use Cmder to run these commands.

1. Run a crawler: run a specified crawler under a specified Scrapy project

curl http://localhost:6800/schedule.json -d project=<Scrapy project name> -d spider=<crawler name>

Example: curl http://localhost:6800/schedule.json -d project=ADC -d spider=lagou

2. Stop the crawler

curl http://localhost:6800/cancel.json -d project=<Scrapy project name> -d job=<job ID>

Example: curl http://localhost:6800/cancel.json -d project=ADC -d job=5454948c93bf11e7af0040167eb10a7b
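If you prefer not to install Cmder, the same two calls can be made from Python with requests. A minimal sketch, assuming the ADC project and lagou spider from the examples above; the jobid returned by schedule.json is exactly the value that cancel.json expects:

import requests

# Start the lagou spider of the ADC project; scrapyd returns the job id it assigned
r = requests.post('http://localhost:6800/schedule.json',
                  data={'project': 'ADC', 'spider': 'lagou'})
jobid = r.json()['jobid']
print('started job:', jobid)

# Later, stop that same run by passing the job id back to cancel.json
r = requests.post('http://localhost:6800/cancel.json',
                  data={'project': 'ADC', 'job': jobid})
print(r.json())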

3. Delete Scrapy Project

Note: before deleting a Scrapy project, you generally need to first run the command that stops any running crawlers under that project.

Deleting a project removes the .egg file that was generated in the eggs folder under the scrapyd service startup directory; the project then needs to be re-packaged with scrapyd-deploy before it can be run again.

curl http://localhost:6800/delproject.json -d project=<Scrapy project name>

Example: curl http://localhost:6800/delproject.json -d project=ADC

4. See how many Scrapy projects are registered with the API

curl http://localhost:6800/listprojects.json

5. See how many crawlers are in a specified Scrapy project

curl http://localhost:6800/listspiders.json?project=<Scrapy project name>

Example: curl http://localhost:6800/listspiders.json?project=ADC

Introduction to the APIs supported by scrapyd

Scrapyd supports a series of APIs, which are demonstrated below with a .py file.

# -*- coding: utf-8 -*-
import requests
import json

baseUrl = 'http://127.0.0.1:6800/'
daemUrl = 'http://127.0.0.1:6800/daemonstatus.json'
listproUrl = 'http://127.0.0.1:6800/listprojects.json'
listspdUrl = 'http://127.0.0.1:6800/listspiders.json?project=%s'
listspdvUrl = 'http://127.0.0.1:6800/listversions.json?project=%s'
listjobUrl = 'http://127.0.0.1:6800/listjobs.json?project=%s'

# http://127.0.0.1:6800/daemonstatus.json
# Check the running status of the scrapyd server
r = requests.get(daemUrl)
print('1.stats:\n%s\n' % r.text)

# http://127.0.0.1:6800/listprojects.json
# Get the list of projects that have been published to the scrapyd server
r = requests.get(listproUrl)
print('1.1.listprojects:[%s]\n\n' % r.text)
if len(json.loads(r.text)["projects"]) > 0:
    project = json.loads(r.text)["projects"][0]

# http://127.0.0.1:6800/listspiders.json?project=myproject
# Get the list of spiders under the given project on the scrapyd server
r = requests.get(listspdUrl % project)
print('2.listspiders:[%s]\n\n' % r.text)
if "spiders" in json.loads(r.text):
    spider = json.loads(r.text)["spiders"][0]

# http://127.0.0.1:6800/listversions.json?project=myproject
# Get the list of versions of the given project on the scrapyd server
r = requests.get(listspdvUrl % project)
print('3.listversions:[%s]\n\n' % r.text)
if len(json.loads(r.text)["versions"]) > 0:
    version = json.loads(r.text)["versions"][0]

# http://127.0.0.1:6800/listjobs.json?project=myproject
# Get the list of all jobs for the project, including finished, running and pending ones
r = requests.get(listjobUrl % project)
print('4.listjobs:[%s]\n\n' % r.text)

# schedule.json
# http://127.0.0.1:6800/schedule.json -d project=myproject -d spider=myspider
# Start the spider myspider of project myproject so it runs immediately; note this must be a POST request
schUrl = baseUrl + 'schedule.json'
dictdata = {"project": project, "spider": spider}
r = requests.post(schUrl, data=dictdata)
print('5.1.schedule:[%s]\n\n' % r.text)

# http://127.0.0.1:6800/delversion.json -d project=myproject -d version=r99
# Delete the given version of project myproject on the scrapyd server; note this must be a POST request
delVerUrl = baseUrl + 'delversion.json'
dictdata = {"project": project, "version": version}
r = requests.post(delVerUrl, data=dictdata)
print('6.1.delversion:[%s]\n\n' % r.text)

# http://127.0.0.1:6800/delproject.json -d project=myproject
# Delete the project on the scrapyd server; this automatically removes all spiders under the project and must also be a POST request
delProUrl = baseUrl + 'delproject.json'
dictdata = {"project": project}
r = requests.post(delProUrl, data=dictdata)
print('6.2.delproject:[%s]\n\n' % r.text)

To summarize:

1. Get the daemon status:
http://127.0.0.1:6800/daemonstatus.json
2. Get the list of projects:
http://127.0.0.1:6800/listprojects.json
3. Get the list of published crawlers under a project:
http://127.0.0.1:6800/listspiders.json?project=myproject
4. Get the list of published crawler versions under a project:
http://127.0.0.1:6800/listversions.json?project=myproject
5. Get the crawler running state:
http://127.0.0.1:6800/listjobs.json?project=myproject
6. Start a crawler on the server (it must already be published to the server):
http://localhost:6800/schedule.json (POST, data={"project": myproject, "spider": myspider})
7. Delete a specific version of a crawler:
http://127.0.0.1:6800/delversion.json (POST, data={"project": myproject, "version": myversion})
8. Delete a project, including all crawler versions under it:
http://127.0.0.1:6800/delproject.json (POST, data={"project": myproject})

With that, this tutorial on publishing crawlers with scrapyd is finished.

Some people may say that they can also run crawlers directly with the scrapy crawl command. In my view, managing crawlers with a scrapyd server has at least the following advantages:

1. It keeps the crawler source code from being exposed.

2. It provides version control.

3. Crawlers can be started, stopped, and deleted remotely; it is precisely because of this that scrapyd also works as part of a distributed crawler solution.
