Summary of scrapyd Crawlers
I. Version Information
Python owes much of its popularity among programmers to its wide range of third-party libraries, but that also means many different library versions are in circulation. This article is based on the latest library versions at the time of writing.
1. scrapy version: 1.1.0
D:\python\Spider-master\ccpmess> scrapy version -v
Scrapy    : 1.1.0
lxml      : 2.9.0
Twisted   : 16.1.1
Python    : 2.7.11rc1 (Nov 21 2015, 23:25:27) [MSC v.1500 64 bit (AMD64)]
pyOpenSSL : 16.0.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Windows-10-10.0.10586
2. scrapyd version: 1.1.0
D:\> scrapyd
2016-06-30 15:21:14+0800 [-] Log opened.
2016-06-30 15:21:14+0800 [-] twistd 16.1.1 (C:\Python27\python.exe 2.7.11) starting up.
2016-06-30 15:21:14+0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2016-06-30 15:21:14+0800 [-] Site starting on 6800
2016-06-30 15:21:14+0800 [-] Starting factory <twisted.web.server.Site instance at 0x00000000046D6808>
2016-06-30 15:21:14+0800 [Launcher] Scrapyd 1.1.0 started: max_proc=16, runner='scrapyd.runner'
II. Official Documentation
1. scrapy
http://scrapy.readthedocs.io/en/latest/
http://scrapy-chs.readthedocs.io/zh_CN/latest/
2. scrapyd
http://scrapyd.readthedocs.io/en/latest
Note in particular that many of the scrapy and scrapyd articles found through Baidu are out of date: these third-party libraries keep being updated, so many earlier write-ups no longer apply.
Although both sets of documentation are in English (scrapy has a Chinese translation, but it is incomplete), which makes them a bit of a challenge, they are still perfectly usable for looking things up.
III. Purpose
1. scrapy
Scrapy is the well-known crawler framework; it is modular and well structured. Developing crawlers on top of scrapy is far more efficient and stable than writing them from scratch.
2. scrapyd
Scrapyd is a service for running Scrapy spiders.
It allows you to deploy your Scrapy projects and control their spiders using a http json api.
Scrapyd can manage multiple projects and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders.
In short, scrapyd is a service program that runs scrapy crawlers. It supports publishing, deleting, starting, and stopping crawlers through HTTP requests, it can manage multiple projects at the same time, and each project can keep multiple uploaded versions (only the latest one is used to launch new spiders).
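For a quick taste of that HTTP JSON API, here is a minimal sketch. It assumes scrapyd is already running locally on the default port 6800 and that a project called myproject with a spider called myspider has already been deployed; the names are only placeholders.
import requests
# Check that the scrapyd daemon is up (default address 127.0.0.1:6800)
print(requests.get('http://127.0.0.1:6800/daemonstatus.json').text)
# Start a spider; project and spider names are placeholders for something already deployed
print(requests.post('http://127.0.0.1:6800/schedule.json',
                    data={'project': 'myproject', 'spider': 'myspider'}).text)
The individual endpoints are walked through in detail below.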
IV. Instructions for Use
1. Install scrapyd
D:\> pip install scrapyd
With the pip command, scrapyd and its dependencies are installed automatically.
Note, however, that the scrapyd package on the pip index is currently not the latest release, and some commands may not be supported.
It is therefore recommended to fetch the latest source and install it manually with python setup.py install:
C:\Python27\Lib\site-packages\scrapyd-master> dir setup.py
 Volume in drive C is Windows
 Volume Serial Number is 9C3D-C0EC
 Directory of C:\Python27\Lib\site-packages\scrapyd-master
             1,538 setup.py
               1 file(s),          1,538 bytes
               0 dir(s), 26,489,679,872 bytes free
C:\Python27\Lib\site-packages\scrapyd-master> python setup.py install
......
2. Run scrapyd
D:\> scrapyd
23:12:16+0800 [-] Log opened.
23:12:16+0800 [-] twistd 16.1.1 (C:\Python27\python.exe 2.7.11) starting up.
23:12:16+0800 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
23:12:16+0800 [-] Site starting on 6800
23:12:16+0800 [-] Starting factory <twisted.web.server.Site instance at 0x0000000004688808>
23:12:16+0800 [Launcher] Scrapyd 1.1.0 started: max_proc=16, runner='scrapyd.runner'
Once running, scrapyd listens on port 6800 by default. Note that the web console of the latest version has only three menu items: jobs, logs, and documentation; the version installed from pip shows four. The following Python script walks through the main scrapyd JSON API endpoints:
# -*- coding: utf-8 -*-
import requests
import json

baseUrl     = 'http://127.0.0.1:6800/'
daemUrl     = 'http://127.0.0.1:6800/daemonstatus.json'
listproUrl  = 'http://127.0.0.1:6800/listprojects.json'
listspdUrl  = 'http://127.0.0.1:6800/listspiders.json?project=%s'
listspdvUrl = 'http://127.0.0.1:6800/listversions.json?project=%s'
listjobUrl  = 'http://127.0.0.1:6800/listjobs.json?project=%s'
delspdvUrl  = 'http://127.0.0.1:6800/delversion.json'

# http://127.0.0.1:6800/daemonstatus.json
# Check the running status of the scrapyd server
r = requests.get(daemUrl)
print '1.stats:\n%s\n\n' % r.text

# http://127.0.0.1:6800/listprojects.json
# Get the list of projects deployed on the scrapyd server
r = requests.get(listproUrl)
print '1.1.listprojects:[%s]\n\n' % r.text
if len(json.loads(r.text)["projects"]) > 0:
    project = json.loads(r.text)["projects"][0]

# http://127.0.0.1:6800/listspiders.json?project=myproject
# Get the spider list of the project named myproject on the scrapyd server
listspdUrl = listspdUrl % project
r = requests.get(listspdUrl)
print '2.listspiders:[%s]\n\n' % r.text
if "spiders" in json.loads(r.text):
    spider = json.loads(r.text)["spiders"][0]

# http://127.0.0.1:6800/listversions.json?project=myproject
# Get the versions of each spider under the project named myproject on the scrapyd server
listspdvUrl = listspdvUrl % project
r = requests.get(listspdvUrl)
print '3.listversions:[%s]\n\n' % r.text
if len(json.loads(r.text)["versions"]) > 0:
    version = json.loads(r.text)["versions"][0]

# http://127.0.0.1:6800/listjobs.json?project=myproject
# Get the list of all jobs on the scrapyd server, including finished, running and pending ones
listjobUrl = listjobUrl % project
r = requests.get(listjobUrl)
print '4.listjobs:[%s]\n\n' % r.text

# schedule.json
# http://127.0.0.1:6800/schedule.json -d project=myproject -d spider=myspider
# Start the myspider spider of the myproject project on the scrapyd server; it begins running immediately
schUrl = baseUrl + 'schedule.json'
dictdata = {"project": project, "spider": spider}
r = requests.post(schUrl, data=dictdata)
print '5.1.schedule:[%s]\n\n' % r.text

# http://127.0.0.1:6800/delversion.json -d project=myproject -d version=r99
# Delete the given version of the myproject project on the scrapyd server. Note that this must be a POST request
delverUrl = baseUrl + 'delversion.json'
dictdata = {"project": project, "version": version}
r = requests.post(delverUrl, data=dictdata)
print '6.1.delversion:[%s]\n\n' % r.text

# http://127.0.0.1:6800/delproject.json -d project=myproject
# Delete the myproject project on the scrapyd server. Note that this automatically deletes all spiders under the project
delProUrl = baseUrl + 'delproject.json'
dictdata = {"project": project}
r = requests.post(delProUrl, data=dictdata)
print '6.2.delproject:[%s]\n\n' % r.text
Summary of the API endpoints:
1. Get service status
http://127.0.0.1:6800/daemonstatus.json
2. Get the project list
http://127.0.0.1:6800/listprojects.json
3. Get the list of spiders published under a project
http://127.0.0.1:6800/listspiders.json?project=myproject
4. Get the list of published spider versions under a project
http://127.0.0.1:6800/listversions.json?project=myproject
5. Get the running status of spider jobs
http://127.0.0.1:6800/listjobs.json?project=myproject
6. Start a spider on the server (it must already have been published to the server)
http://127.0.0.1:6800/schedule.json (POST, data={"project": myproject, "spider": myspider})
7. Delete a specific spider version
http://127.0.0.1:6800/delversion.json (POST, data={"project": myproject, "version": myversion})
8. Delete a project, including all of its spider versions
http://127.0.0.1:6800/delproject.json (POST, data={"project": myproject})
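Since the write operations in this summary are all POST requests with form-encoded parameters, they can be wrapped in a small helper. The following is only a sketch, assuming the default server address and no error handling; the example names are placeholders.
import requests

SCRAPYD = 'http://127.0.0.1:6800'

def scrapyd_post(endpoint, **params):
    # POST form-encoded parameters to a scrapyd endpoint such as schedule.json or delversion.json
    return requests.post('%s/%s' % (SCRAPYD, endpoint), data=params).json()

# Example usage (placeholder names):
# scrapyd_post('schedule.json', project='myproject', spider='myspider')
# scrapyd_post('delversion.json', project='myproject', version='myversion')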
Notice that there are APIs for starting and deleting crawlers, but none for publishing them. Why is publishing missing?
Because publishing crawlers requires another dedicated tool: scrapyd-client.
V. The crawler publishing tool: scrapyd-client
scrapyd-client is a tool specifically for publishing scrapy crawlers. When it is installed, a script named scrapyd-deploy is placed in the Python Scripts directory (here C:\Python27\Scripts). Like setup.py, it is a plain Python script, so it can be run as python scrapyd-deploy.
1. Installation
C:\> pip install scrapyd-client
......
C:\Python27\Scripts> dir sc*
 Volume in drive C is Windows
 Volume Serial Number is 9C3D-C0EC
 Directory of C:\Python27\Scripts
             313 scrapy-script.py
          74,752 scrapy.exe
           9,282 scrapyd-deploy
             318 scrapyd-script.py
          74,752 scrapyd.exe
               5 file(s),        159,417 bytes
2. Usage
1) Copy scrapyd-deploy into the crawler project directory:
D:\python\Spider-master\ccpmess> dir
 Volume in drive D has no label
 Volume Serial Number is 36D9-CDC7
 Directory of D:\python\Spider-master\ccpmess
          <DIR>    .
          <DIR>    ..
          <DIR>    ccpmess
             662 ccpmess-main.py
           1,022 ccpmess.wpr
          78,258 ccpmess.wpu
             324 scrapy.cfg
           9,282 scrapyd-deploy
2) Modify the crawler's scrapy.cfg file.
First, uncomment the url line; this url is the address of your scrapyd server.
Second, [deploy:127] means the crawler is published to the deploy target named 127.
Target names are mainly useful when publishing to several servers at once; you choose the server by specifying its name.
D:\python\Spider-master\ccpmess> type scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = ccpmess.settings

[deploy:127]
url = http://127.0.0.1:6800/
project = ccpmess

[deploy:141]
url = http://138.0.0.141:6800/
project = ccpmess
3) Check the configuration
Verify that the deploy configuration is correct:
D:\python\Spider-master\ccpmess> python scrapyd-deploy -l
141                  http://138.20.1.141:6800/
127                  http://127.0.0.1:6800/
4) Publish Crawlers
scrapyd-deploy <target> -p <project> --version <version>
target is the name that follows deploy: in the configuration file above.
project can be chosen freely; it does not have to match the crawler project's name.
version is a custom version number; if it is not specified, the current timestamp is used by default.
Note: do not keep unrelated .py files in the crawler directory; publishing with unrelated .py files present will cause the deployment to fail.
D:\python\Spider-master\ccpmess> python scrapyd-deploy 127 -p projectccp --version ver20160702
Packing version ver20160702
Deploying to project "projectccp" in http://127.0.0.1:6800/addversion.json
Server response (200):
{"status": "ok", "project": "projectccp", "version": "ver20160702", "spiders": 1, "node_name": "compter......"}
At this point, the scrapyd-based crawler publishing tutorial is complete.
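For reference, what scrapyd-deploy does under the hood is package the project into an egg and upload it to the addversion.json endpoint. The following is only a rough sketch of that step, assuming an egg has already been built (for example with python setup.py bdist_egg); the file name and path are placeholders.
import requests

# Upload a pre-built egg to scrapyd's addversion.json (project, version and egg path are placeholders)
with open('dist/projectccp-ver20160702.egg', 'rb') as egg:
    r = requests.post('http://127.0.0.1:6800/addversion.json',
                      data={'project': 'projectccp', 'version': 'ver20160702'},
                      files={'egg': egg})
print(r.text)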
Some may say they could simply run crawlers directly with the scrapy crawl command. In my view, managing crawlers through a scrapyd server has at least the following advantages:
1. The crawler source code does not have to be exposed directly.
2. You get version control.
3. Crawlers can be started, stopped, and deleted remotely. This is also why scrapyd is one of the building blocks of distributed crawling solutions.