Building on the earlier Scrapy crawler project, we now extend it so that the scraped data is stored in MongoDB.
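A storage pipeline along the following lines would do the job. This is a minimal sketch following the standard pattern from the Scrapy documentation; the class name, settings keys, and per-spider collection layout are assumptions, not the original project's code:

```python
# pipelines.py -- illustrative MongoDB storage pipeline (names are assumptions)
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection details added to settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'crawler'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one collection per spider keeps results from different crawls apart
        self.db[spider.name].insert_one(dict(item))
        return item
```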
Next, we configure the crawler in settings.py.
Then add the pipeline to ITEM_PIPELINES, with one entry left commented out. The reason for the comment: if that entry were enabled, then after a crawl finished and the results were stored locally on each slave, every item would also be sent to the master host for storage, putting extra pressure on the master.
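Assuming the commented-out entry is scrapy_redis's RedisPipeline (which copies every item back into the master's Redis), the relevant part of settings.py might look roughly like this; the project path, priorities, and host addresses are placeholders:

```python
# settings.py -- a minimal sketch of the distributed setup
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
    # 'scrapy_redis.pipelines.RedisPipeline': 400,  # commented out: would also
    # push every item to the master's Redis, stressing the master
}

# route scheduling and request deduplication through the shared Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://master-host:6379'

# connection details read by the MongoDB pipeline sketched above
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'crawler'
```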
Once these settings are in place, start the Redis service on the master host and copy the project code to the other hosts, taking each host's operating system and configuration into account.
Then start the crawl on every host. Crawling is faster, and each host fetches different pages, because all of them pull requests from (and are deduplicated against) the same shared Redis queue.
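Concretely, and assuming the spider is named myspider, the run looks roughly like this:

```
# on the master host: start Redis so all crawlers share one request queue
redis-server

# on each crawler host: run the same spider against the shared queue
scrapy crawl myspider
```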
Adding two more settings ensures the crawl queue is not emptied when the crawler stops, so a crawl can be paused and resumed. The second setting decides whether the queue is flushed when a crawl is restarted; it is usually left as False.
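As far as I can tell, the two settings being described are scrapy_redis's SCHEDULER_PERSIST and SCHEDULER_FLUSH_ON_START:

```python
# settings.py -- queue persistence (setting names from scrapy_redis)
SCHEDULER_PERSIST = True          # keep the Redis queue and dupefilter after the spider closes
SCHEDULER_FLUSH_ON_START = False  # do not empty the queue when the crawl restarts
```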
At this point we still have to start the crawl on each host separately. To control all the crawlers from a single machine, we introduce Scrapyd, which starts a web service for managing all the crawler projects.
Let's take a look at the steps:
1. Start Scrapyd.
2. Enable remote access to it.
3. Use scrapyd-client to package the project.
4. Modify the crawler project's scrapy.cfg file: change the deploy address to the remote Scrapyd service's address, then execute the deploy command to complete the deployment (the commands are sketched after this list).
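A sketch of steps 1 through 4 on the command line; the host name remote-host and project name myproject are placeholders:

```
# step 1: start the Scrapyd service on the server (API and web UI on port 6800)
scrapyd

# step 2: allow remote access by setting bind_address = 0.0.0.0 in Scrapyd's
# configuration file (by default it binds only to 127.0.0.1)

# step 4: in the crawler project's scrapy.cfg, point the deploy target
# at the remote Scrapyd service:
#
#   [deploy]
#   url = http://remote-host:6800/
#   project = myproject

# step 3 + deployment: package and upload the project with scrapyd-client
scrapyd-deploy -p myproject
```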
Finally, start crawl processes remotely. Each instruction you issue starts one process, so issuing a few instructions starts a few processes. Every job gets its own ID, and when the tasks run on multiple machines, the IDs differ from machine to machine.
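For example, a job can be scheduled through Scrapyd's schedule.json endpoint; the Python requests call below is one way to do it, and the host, project, and spider names are assumptions:

```python
# start one crawl process on the remote Scrapyd; each call returns a new job ID
import requests

resp = requests.post(
    'http://remote-host:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'myspider'},
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '6e4b...'}
```

The running jobs and their IDs can then be checked in Scrapyd's web UI on port 6800.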
Python3 Scrapy Crawler (Part 14: building a distributed crawler with scrapy + scrapy_redis + scrapyd)