The Scrapyd module is dedicated to deploying Scrapy projects; it can both deploy and manage them.
Download: https://github.com/scrapy/scrapyd
Recommended installation
pip3 install scrapyd
Install the scrapyd module first. After installation, a scrapyd.exe launcher is generated in the Scripts folder of the Python installation directory. If this file exists, the installation succeeded and we can run the scrapyd command.
Start the scrapyd service
In the command window, enter: scrapyd
If the output shows that the service started successfully, close or exit the command window, because in practice we start the service from a directory of our own choosing.
Create a directory for the service first; the service is then started from that directory.
Reopen the command window, cd into the service directory, and run the scrapyd command to start the service.
At this point you can see that a dbs directory has been generated in the startup directory.
The dbs directory is empty for now.
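The same start-up step can be scripted. The following is a minimal sketch, assuming the scrapyd command is on the PATH; the helper names start_scrapyd and dbs_exists are made up for this illustration and are not part of Scrapyd.

```python
import subprocess
from pathlib import Path


def start_scrapyd(service_dir, popen=subprocess.Popen):
    """Start the `scrapyd` command with service_dir as its working directory.

    `popen` can be replaced with a stand-in, so the function can be tried
    without actually launching a server.
    """
    service_dir = Path(service_dir)
    service_dir.mkdir(parents=True, exist_ok=True)
    # The article starts scrapyd after cd-ing into the service directory,
    # which is what the cwd argument reproduces here.
    return popen(["scrapyd"], cwd=str(service_dir))


def dbs_exists(service_dir):
    """Return True once the dbs folder has appeared in the service directory."""
    return (Path(service_dir) / "dbs").is_dir()
```

Calling start_scrapyd with a directory of your choice and then checking dbs_exists on the same directory mirrors the manual check described above.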
Next we need to install the scrapyd-client module.
The scrapyd-client module is dedicated to packaging Scrapy crawler projects and deploying them to the Scrapyd service.
Download: https://github.com/scrapy/scrapyd-client
Recommended installation
pip3 install scrapyd-client
After installation, a file named scrapyd-deploy with no extension is generated in the Scripts folder of the Python installation directory. If this file exists, the installation succeeded.
Key note: this extensionless scrapyd-deploy file is the launcher. It can run on Linux but not on Windows, so we need a small wrapper to make it run on Windows.
In the same directory, create a new scrapyd-deploy.bat file. Note that the name must match scrapyd-deploy. We edit this .bat file so that the command can run on Windows.
Editing the scrapyd-deploy.bat file
Set the path of the Python executable and the path of the extensionless scrapyd-deploy file:
@echo off
"C:\Users\admin\AppData\Local\Programs\Python\Python35\python.exe" "C:\Users\admin\AppData\Local\Programs\Python\Python35\Scripts\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9
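If you prefer not to type the wrapper by hand, its text can be generated. This is only a sketch: bat_contents is a hypothetical helper, and the paths are the example paths from this article, so adjust them to your own installation.

```python
from pathlib import PureWindowsPath


def bat_contents(python_exe, deploy_script):
    """Return the text of a scrapyd-deploy.bat wrapper.

    Both paths are quoted so that directories containing spaces still work,
    and %1 ... %9 forward up to nine command-line arguments.
    """
    args = " ".join("%{}".format(i) for i in range(1, 10))
    return '@echo off\n"{}" "{}" {}\n'.format(python_exe, deploy_script, args)


# Example paths taken from the article (an assumption about your machine).
scripts = PureWindowsPath(
    r"C:\Users\admin\AppData\Local\Programs\Python\Python35\Scripts"
)
text = bat_contents(scripts.parent / "python.exe", scripts / "scrapyd-deploy")
```

Saving text as scrapyd-deploy.bat in the Scripts folder gives the same two-line file as the one written by hand.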
After editing the scrapyd-deploy.bat file, open a command window, cd to the directory in the Scrapy project that contains the scrapy.cfg file, and run the scrapyd-deploy command to check whether the scrapyd-deploy.bat file we edited can be executed.
If the command runs and prints its output, the .bat file works.
Configure the scrapy.cfg file in the Scrapy project; this file is used by scrapyd-deploy.
The scrapy.cfg file
Note: the explanatory notes after the setting values below must not be written into the actual file, or an error will occur. They are shown here only to explain each setting.
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = adc.settings

[deploy:bobby]                  # set the deployment name to bobby
url = http://localhost:6800/    # enable the url line
project = adc                   # project name
In the command window, run: scrapyd-deploy -l. It lists the deployment names we configured.
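To see how the deployment name relates to the [deploy:bobby] section, the file can be read with Python's standard configparser. This is an illustrative sketch and not how scrapyd-deploy itself is implemented; deploy_targets and SAMPLE_CFG are names invented here.

```python
import configparser

# The settings from this article, without the explanatory notes.
SAMPLE_CFG = """
[settings]
default = adc.settings

[deploy:bobby]
url = http://localhost:6800/
project = adc
"""


def deploy_targets(cfg_text):
    """Map each deployment name to its (url, project) pair."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section == "deploy" or section.startswith("deploy:"):
            # A plain "[deploy]" section has no name; call it "default" here.
            name = section.partition(":")[2] or "default"
            targets[name] = (
                parser.get(section, "url", fallback=None),
                parser.get(section, "project", fallback=None),
            )
    return targets
```

With configparser's default settings, a trailing "# note" on a value line is kept as part of the value, which is consistent with the warning above about not leaving the notes in the real file.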
Before packaging, run the command scrapy list. If it succeeds, the project can be packaged; if it fails, the preparation is not yet complete.
Note that scrapy list may well report an error. If Python cannot find the Scrapy project, set a path that Python can recognize in the project's settings.py configuration file.
import os
import sys

# Add the adc directory under the project's top-level directory to Python's search path
BASE_DIR = os.path.dirname(os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, os.path.join(BASE_DIR, 'adc'))
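To make it clearer which directory this ends up adding, here is a small sketch that performs the same computation for an arbitrary settings file. The function name python_path_entry and the example layout are assumptions made for illustration.

```python
import os


def python_path_entry(settings_file, package="adc"):
    """Return the directory the settings.py snippet would add to sys.path."""
    # dirname(settings_file) is the package folder; one level up is the
    # project's top-level directory (the one that holds scrapy.cfg).
    base_dir = os.path.dirname(os.path.abspath(os.path.dirname(settings_file)))
    return os.path.join(base_dir, package)
```

For a project laid out as adc/adc/settings.py, the result is the inner adc folder, so modules inside the package can be imported by their bare names.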
If the error says that the remote computer refused the connection, your Scrapy project connects to a remote machine, such as a database or Elasticsearch (a search engine), and you need to start that server first.
If the scrapy list command returns the crawler names, everything is OK.
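The check can also be automated. The sketch below assumes that scrapy list prints one crawler name per line and that the scrapy command is on the PATH; parse_spider_names and list_spiders are hypothetical helper names.

```python
import subprocess


def parse_spider_names(stdout):
    """Turn the output of `scrapy list` into a list of crawler names."""
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def list_spiders(project_dir, run=subprocess.run):
    """Run `scrapy list` in project_dir and return the crawler names.

    With check=True a non-zero exit status raises CalledProcessError,
    which signals that the project is not ready to be packaged.
    """
    result = run(
        ["scrapy", "list"],
        cwd=project_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_spider_names(result.stdout)
```

Run it against the directory that contains scrapy.cfg; an empty list or an exception means the preparation is not finished.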
Now we can package the Scrapy project and deploy it to Scrapyd. The packaging command works together with the settings in the project's scrapy.cfg file.
Run the packaging command: scrapyd-deploy <deployment name> -p <project name>
Example: scrapyd-deploy bobby -p adc
If the command prints a server response reporting success, the Scrapy project was packaged successfully.
What the Scrapy project looks like after successful packaging
After the Scrapy project is packaged successfully, the corresponding files are generated in the directory where Scrapyd was started, as follows:
1. A file named <Scrapy project name>.db is generated in the dbs folder under the Scrapyd service directory.
2. A folder named after the Scrapy project is generated in the eggs folder under the Scrapyd service directory; it contains the .egg file packaged by scrapyd-deploy.
3. Packaging the Scrapy crawler project also generates two folders inside the Scrapy project itself: the build folder and the project.egg-info folder.
The build folder holds the packaged crawler project; from then on, what Scrapyd runs is this packaged project.
The project.egg-info folder holds some configuration for the package.
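The first two items can be checked with a few lines of Python. This is a sketch based only on the layout described above (dbs/<project>.db and eggs/<project>/*.egg); packaged_artifacts is a made-up helper name.

```python
from pathlib import Path


def packaged_artifacts(service_dir, project):
    """Return the .db file and any .egg files found for a deployed project."""
    service_dir = Path(service_dir)
    db_file = service_dir / "dbs" / (project + ".db")
    egg_dir = service_dir / "eggs" / project
    eggs = sorted(egg_dir.glob("*.egg")) if egg_dir.is_dir() else []
    return {"db": db_file if db_file.is_file() else None, "eggs": eggs}
```

For the example in this article, packaged_artifacts(service_dir, "adc") should report adc.db and at least one .egg file once scrapyd-deploy has succeeded.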
Note: scrapyd-deploy is only responsible for packaging the Scrapy crawler project and deploying it to Scrapyd, and this only needs to be done once. After packaging, starting crawlers, stopping crawlers, and other project management tasks are handled by Scrapyd.
Managing Scrapy projects with Scrapyd
Note: Scrapyd is managed with curl commands. curl comes from the Linux world and is not available in the Windows command window used here, so on Windows we use Cmder to run the commands.
1. Run a crawler: run the specified crawler under the specified Scrapy project
curl http://localhost:6800/schedule.json -d project=<Scrapy project name> -d spider=<crawler name>
For example: curl http://localhost:6800/schedule.json -d project=adc -d spider=lagou
2. Stop a crawler
curl http://localhost:6800/cancel.json -d project=<Scrapy project name> -d job=<job ID>
For example: curl http://localhost:6800/cancel.json -d project=adc -d job=5454948c93bf11e7af0040167eb10a7b
3. Delete a Scrapy project
Note: before deleting a Scrapy project, you generally need to run the command that stops the crawlers under that project.
Deleting a project removes the egg file generated in the eggs folder under the Scrapyd service directory. To run the project again, it must be repackaged with scrapyd-deploy.
curl http://localhost:6800/delproject.json -d project=<Scrapy project name>
For example: curl http://localhost:6800/delproject.json -d project=adc
4. List the Scrapy projects available through the API
curl http://localhost:6800/listprojects.json
5. List the crawlers in a specified Scrapy project
curl http://localhost:6800/listspiders.json?project=<Scrapy project name>
For example: curl http://localhost:6800/listspiders.json?project=adc
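The "stop the crawlers first, then delete the project" order from item 3 can be sketched in Python. The code only builds the requests and sends nothing. It assumes that a listjobs.json reply contains "pending" and "running" lists whose entries carry an "id" field; the helper names are invented for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:6800/"


def post_request(endpoint, **fields):
    """Build (not send) a form-encoded POST, the equivalent of curl -d."""
    body = urlencode(fields).encode("ascii")
    return Request(BASE_URL + endpoint, data=body, method="POST")


def active_job_ids(listjobs_reply):
    """Collect the ids of pending and running jobs from a listjobs.json reply."""
    jobs = listjobs_reply.get("pending", []) + listjobs_reply.get("running", [])
    return [job["id"] for job in jobs]


def removal_plan(project, listjobs_reply):
    """Cancel every active job first, then delete the project."""
    plan = [
        post_request("cancel.json", project=project, job=job_id)
        for job_id in active_job_ids(listjobs_reply)
    ]
    plan.append(post_request("delproject.json", project=project))
    return plan
```

Each request in the returned list can be passed to urllib.request.urlopen in order, which corresponds to running the cancel.json commands followed by the delproject.json command by hand.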
Introduction to the APIs supported by Scrapyd
Scrapyd supports a series of APIs, which the following .py file walks through.
# -*- coding: utf-8 -*-
import json

import requests

baseUrl = 'http://127.0.0.1:6800/'
daemUrl = 'http://127.0.0.1:6800/daemonstatus.json'
listproUrl = 'http://127.0.0.1:6800/listprojects.json'
listspdUrl = 'http://127.0.0.1:6800/listspiders.json?project=%s'
listspdvUrl = 'http://127.0.0.1:6800/listversions.json?project=%s'
listjobUrl = 'http://127.0.0.1:6800/listjobs.json?project=%s'
delspdvUrl = 'http://127.0.0.1:6800/delversion.json'

# The steps below assume that at least one project with one crawler has been deployed.

# http://127.0.0.1:6800/daemonstatus.json
# Check the running status of the Scrapyd server
r = requests.get(daemUrl)
print('1.stats:\n%s\n\n' % r.text)

# http://127.0.0.1:6800/listprojects.json
# Get the list of projects that have been published to the Scrapyd server
r = requests.get(listproUrl)
print('1.1.listprojects: [%s]\n\n' % r.text)
if len(json.loads(r.text)["projects"]) > 0:
    project = json.loads(r.text)["projects"][0]

# http://127.0.0.1:6800/listspiders.json?project=myproject
# Get the list of crawlers under the project on the Scrapyd server
listspdUrl = listspdUrl % project
r = requests.get(listspdUrl)
print('2.listspiders: [%s]\n\n' % r.text)
if "spiders" in json.loads(r.text):
    spider = json.loads(r.text)["spiders"][0]

# http://127.0.0.1:6800/listversions.json?project=myproject
# Get the list of versions of the project on the Scrapyd server
listspdvUrl = listspdvUrl % project
r = requests.get(listspdvUrl)
print('3.listversions: [%s]\n\n' % r.text)
if len(json.loads(r.text)["versions"]) > 0:
    version = json.loads(r.text)["versions"][0]

# http://127.0.0.1:6800/listjobs.json?project=myproject
# Get the list of all jobs on the Scrapyd server: finished, running, and pending
listjobUrl = listjobUrl % project
r = requests.get(listjobUrl)
print('4.listjobs: [%s]\n\n' % r.text)

# schedule.json
# http://127.0.0.1:6800/schedule.json -d project=myproject -d spider=myspider
# Start the myspider crawler of the myproject project so that it runs immediately.
# This must be a POST with form fields (data=), matching curl -d.
schUrl = baseUrl + 'schedule.json'
dictdata = {"project": project, "spider": spider}
r = requests.post(schUrl, data=dictdata)
print('5.schedule: [%s]\n\n' % r.text)

# http://127.0.0.1:6800/delversion.json -d project=myproject -d version=r99
# Delete the given version of the project on the Scrapyd server. This must be a POST.
dictdata = {"project": project, "version": version}
r = requests.post(delspdvUrl, data=dictdata)
print('6.1.delversion: [%s]\n\n' % r.text)

# http://127.0.0.1:6800/delproject.json -d project=myproject
# Delete the project on the Scrapyd server; this also removes all crawlers under it.
# This must be a POST.
delProUrl = baseUrl + 'delproject.json'
dictdata = {"project": project}
r = requests.post(delProUrl, data=dictdata)
print('6.2.delproject: [%s]\n\n' % r.text)
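The script above prints each reply without checking it. A small checker could look like the sketch below. It assumes that replies are JSON objects with a "status" field and, on failure, a "message" field; ScrapydError and parse_reply are names invented for this example.

```python
import json


class ScrapydError(Exception):
    """Raised when a Scrapyd reply does not report status "ok"."""


def parse_reply(text):
    """Decode a reply body and raise ScrapydError unless its status is "ok"."""
    reply = json.loads(text)
    if reply.get("status") != "ok":
        raise ScrapydError(reply.get("message", "unknown error"))
    return reply
```

For instance, parse_reply(r.text)["projects"] could replace json.loads(r.text)["projects"] in the script, so that a failed call stops the run instead of causing a confusing KeyError later.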
Summary:
1. Get the status
http://127.0.0.1:6800/daemonstatus.json
2. Get the list of projects
http://127.0.0.1:6800/listprojects.json
3. Get the list of published crawlers under a project
http://127.0.0.1:6800/listspiders.json?project=myproject
4. Get the list of published crawler versions under a project
http://127.0.0.1:6800/listversions.json?project=myproject
5. Get the running state of the crawlers
http://127.0.0.1:6800/listjobs.json?project=myproject
6. Start a crawler on the server (it must already be published to the server)
http://localhost:6800/schedule.json (POST, data={"project": myproject, "spider": myspider})
7. Delete a version of the crawler
http://127.0.0.1:6800/delversion.json (POST, data={"project": myproject, "version": myversion})
8. Delete a project, including all crawler versions under it
http://127.0.0.1:6800/delproject.json (POST, data={"project": myproject})
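The GET/POST split in this summary can be captured in a small lookup table. The following sketch builds the right kind of request for each endpoint without sending it; ENDPOINTS and build_request are illustrative names, and the base URL is the one used in this article.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint name -> HTTP method, following the summary above.
ENDPOINTS = {
    "daemonstatus": "GET",
    "listprojects": "GET",
    "listspiders": "GET",
    "listversions": "GET",
    "listjobs": "GET",
    "schedule": "POST",
    "delversion": "POST",
    "delproject": "POST",
}


def build_request(name, base_url="http://127.0.0.1:6800/", **params):
    """Build a urllib Request for one of the endpoints listed in ENDPOINTS."""
    method = ENDPOINTS[name]
    url = base_url + name + ".json"
    query = urlencode(params)
    if method == "GET":
        # GET endpoints take their parameters in the query string.
        return Request(url + ("?" + query if query else ""), method="GET")
    # POST endpoints take form-encoded fields in the body, like curl -d.
    return Request(url, data=query.encode("ascii"), method="POST")
```

A request such as build_request("schedule", project="myproject", spider="myspider") can then be sent with urllib.request.urlopen.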
This concludes the tutorial on publishing crawlers with Scrapyd.
Some people may say: I can run a crawler directly with the scrapy crawl command. In my understanding, managing crawlers with a Scrapyd server has at least the following advantages:
1. The crawler source code is not exposed.
2. There is version control.
3. Crawlers can be started, stopped, and deleted remotely. For this reason Scrapyd is also one of the solutions for distributed crawlers.
51. Building a search engine with a Python distributed crawler, Scrapy explained: deploying Scrapy projects with Scrapyd