Until now our crawler has been a stand-alone crawl: a single machine does the fetching and also maintains its own request queue. Take a look at the single-machine flowchart:
One host controls one queue. If we simply run the same crawler on several machines, each keeps its own queue, and the machines end up crawling the same pages repeatedly, which is pointless work. So the first difficulty of a distributed crawler appears: the machines must share a single request queue. Look at the architecture:
Three hosts are fed from one queue, which means some host must also be responsible for controlling that queue. We typically use Redis to hold the queue, giving the following distributed architecture:
The slaves do the fetching, while the master (the storage host) is responsible for controlling the queue.
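To make the shared-queue idea concrete, here is a minimal sketch outside of Scrapy, using the redis-py client; the host name and key name are illustrative. Every machine pushes to and pops from the same Redis list on the master, so no URL is handed out twice:

```python
import redis

# The queue lives in Redis on the master; every slave connects to it.
r = redis.Redis(host="master-host", port=6379)

def push_request(url):
    # Any machine may enqueue work into the shared list.
    r.lpush("crawler:requests", url)

def pop_request():
    # Each slave blocks on the same list; Redis pops atomically,
    # so a given URL is handed to exactly one slave.
    item = r.brpop("crawler:requests", timeout=5)
    return item[1].decode() if item else None
```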
The scrapy_redis plugin solves the problem that Scrapy by itself cannot crawl in a distributed fashion.
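Enabling it is mostly a matter of settings. A minimal settings.py sketch, assuming the master's Redis is reachable at master-host:6379:

```python
# Replace Scrapy's scheduler with the Redis-backed one so every
# machine pulls requests from the same shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through a fingerprint set shared in Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprint set in Redis between runs,
# so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Store scraped items in Redis on the master (the storage host).
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Point every slave at the master's Redis instance.
REDIS_URL = "redis://master-host:6379"
```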
Internally, its connection.py establishes the Redis connection to the master.
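Conceptually that amounts to little more than the following redis-py call (the URL is illustrative); every scrapy_redis component then shares this client:

```python
import redis

# Build one client for the master's Redis from the configured URL.
server = redis.from_url("redis://master-host:6379")
server.ping()  # raises ConnectionError if the master is unreachable
```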
Its dupefilter.py handles deduplication: it computes a fingerprint for each request, adds the fingerprint to a shared set, and uses that set to judge whether a request has already been seen.
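A simplified sketch of that logic, assuming Scrapy's request_fingerprint helper and an illustrative key name; the real RFPDupeFilter works the same way, one sadd per request:

```python
import redis
from scrapy.utils.request import request_fingerprint

class SimpleRedisDupeFilter:
    def __init__(self, server, key="crawler:dupefilter"):
        self.server = server  # shared redis-py client (see above)
        self.key = key        # Redis set holding all fingerprints

    def request_seen(self, request):
        fp = request_fingerprint(request)
        # sadd returns 0 when the member already exists, i.e. when
        # this request has been scheduled before on any machine.
        return self.server.sadd(self.key, fp) == 0
```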
Now that the whole framework is understood, it's time to build it: Python3 Scrapy Crawler (Volume 13: building a distributed crawler with Scrapy + scrapy_redis + scrapyd).
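As a taste of what that configuration enables, here is a minimal sketch of a spider built on scrapy_redis; the spider name, redis_key, and parsing logic are illustrative:

```python
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    # Instead of start_urls, the spider blocks on this Redis list
    # and begins crawling whenever a URL is pushed to it.
    redis_key = "myspider:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Run `scrapy crawl myspider` on every slave, then seed the crawl once from anywhere with `redis-cli lpush myspider:start_urls http://example.com`; all idle slaves pick up work from the shared queue.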