Scrapyd is a tool for deploying and running Scrapy projects. With it, you can upload a finished Scrapy project to a cloud host and control its execution through an HTTP API.
Since Scrapyd deployments almost always target Linux hosts, the installation steps in this section are written for Linux.
1. Related links
- GitHub: https://github.com/scrapy/scrapyd
- PyPI: https://pypi.python.org/pypi/scrapyd
- Official documentation: https://scrapyd.readthedocs.io
2. pip installation
Installation via pip is recommended here; the command is as follows:
pip3 install scrapyd
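To confirm that the package is importable, a quick check like the one below can be run. This is a minimal sketch; it assumes the install above succeeded and that the scrapyd package exposes a __version__ attribute (recent releases do):

# Confirm that Scrapyd installed correctly by importing it and printing its version
import scrapyd
print(scrapyd.__version__)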
3. Configuration
After installation, you need to create a configuration file at /etc/scrapyd/scrapyd.conf; Scrapyd reads this file when it runs.
Since Scrapyd 1.2, this file is no longer created automatically, so we have to add it ourselves.
First, create the directory and the file by executing the following commands:
sudo mkdir /etc/scrapyd
sudo vi /etc/scrapyd/scrapyd.conf
Then write the following content:
[scrapyd]
eggs_dir          = eggs
logs_dir          = logs
items_dir         =
jobs_to_keep      = 5
dbs_dir           = dbs
max_proc          = 0
max_proc_per_cpu  = 10
finished_to_keep  = 100
poll_interval     = 5.0
bind_address      = 0.0.0.0
http_port         = 6800
debug             = off
runner            = scrapyd.runner
application       = scrapyd.app.application
launcher          = scrapyd.launcher.Launcher
webroot           = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
The contents of the configuration file are explained in the official documentation at https://scrapyd.readthedocs.io/en/stable/config.html#example-configuration-file. The version here modifies two settings. One is max_proc_per_cpu: the official default is 4, meaning each CPU on a host runs at most four Scrapy tasks, and it is increased here to 10. The other is bind_address: the default is the local address 127.0.0.1, and it is changed to 0.0.0.0 so that the service can be accessed from the external network.
4. Running in the background
Scrapyd is a pure Python project and can be run by calling the program directly. To keep it running in the background, on Linux and macOS you can use the following command:
(scrapyd > /dev/null &)
Scrapyd will then keep running in the background, with the console output discarded. If you want to log the output instead, you can change the redirection target, for example:
(scrapyd > ~/scrapyd.log &)
Of course, you can also use screen, tmux, Supervisor, or other tools to daemonize the process.
Once it is running, you can visit the web UI on port 6800 in a browser to see the currently running Scrapyd tasks, their logs, and so on.
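Besides the web UI, the service can be checked programmatically. For instance, the daemonstatus.json endpoint registered in the configuration file above reports the current task counts. Below is a minimal sketch, assuming the requests library is installed and Scrapyd is listening locally on the default port 6800:

# Query Scrapyd's daemonstatus.json endpoint to confirm the service is up
import requests

resp = requests.get('http://127.0.0.1:6800/daemonstatus.json')
# Typical response: {'status': 'ok', 'pending': 0, 'running': 0, 'finished': 0, ...}
print(resp.json())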
Of course, a better way to run Scrapyd is with the Supervisor daemon; if you are interested, see http://supervisord.org/.
In addition, Scrapyd also supports Docker; later we will show how to build and run a Scrapyd Docker image.
5. Access authentication
With the configuration above, Scrapyd and its interfaces are open to the public. If you want to add access authentication, you can set up a reverse proxy with Nginx, which requires installing the Nginx server first.
Ubuntu is used as an example here, with the following installation command:
sudo apt-get install nginx
Then modify the Nginx configuration file nginx.conf, adding the following configuration:
http {
    server {
        listen 6801;
        location / {
            proxy_pass http://127.0.0.1:6800/;
            auth_basic "Restricted";
            auth_basic_user_file /etc/nginx/conf.d/.htpasswd;
        }
    }
}
The username and password file used here is placed in the /etc/nginx/conf.d directory, and we need to create it with the htpasswd command. For example, to create a file with the username admin, run the following command:
htpasswd -c .htpasswd admin
We will then be prompted to enter the password twice, after which the password file is generated. Its contents now look like this:
cat .htpasswd
admin:5ZBXQR0RCQWBC
After the configuration is complete, restart the Nginx service with the following command:
sudo nginx -s reload
Access authentication for Scrapyd is now successfully configured.
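From now on, requests to the proxied port must carry the credentials, otherwise Nginx rejects them with a 401 response. Below is a minimal sketch of an authenticated API call using the requests library; 'admin' matches the htpasswd user created above, and 'your-password' is a placeholder for whatever was entered at the prompt:

# Call the Nginx-proxied Scrapyd API on port 6801 with HTTP Basic authentication
import requests

resp = requests.get('http://127.0.0.1:6801/daemonstatus.json',
                    auth=('admin', 'your-password'))  # replace with the real password
print(resp.status_code, resp.json())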