Distributed Crawler in Practice
In essence, the problem with a distributed crawler is that multiple spiders handle multiple URLs at the same time: how do we schedule those URLs, and how do we aggregate the data the spiders crawl? The simplest approach is to shard the URLs, hand each shard to a different machine, and finally merge the data crawled by the different machines. However, each spider can then only deduplicate its own URLs, not the overall set, and the load is hard to balance: one machine may finish very early while the others keep running for a long time. Another idea is to put the URLs in a shared location accessible to all machines and have a central scheduler allocate the requests: it checks whether a spider is idle and, if so, keeps handing it tasks until every URL has been crawled. This approach solves the deduplication problem (described below) and also improves throughput. Scrapy-redis implements exactly such a framework, and it is, in general, better suited to breadth-first crawls.
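To make the shared-pool idea concrete before going through the tooling options, here is a minimal sketch (independent of any of them) that uses a plain Redis list as the shared URL pool. It assumes the redis-py client, a local Redis server, and made-up names such as crawler:urls and fetch():

    import redis

    # A Redis list acts as the URL pool that every machine shares.
    r = redis.Redis(host="localhost", port=6379)

    def seed(urls):
        """Run once on the scheduling machine: push the start URLs into the shared pool."""
        for url in urls:
            r.lpush("crawler:urls", url)

    def worker(fetch):
        """Run on every crawler machine: pop URLs until the pool is empty."""
        while True:
            url = r.rpop("crawler:urls")
            if url is None:
                break  # nothing left to crawl
            fetch(url.decode())  # fetch() stands in for the actual download/parse logic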
Scrapyd
Scrapy does not provide built-in distributed crawling, but there are many ways to do it.
If you have a lot of spiders, the simplest way is to start multiple Scrapyd instances and distribute the spiders across machines.
If you want multiple machines to run the same spider, you can shard the URL list and give one shard to the spider on each machine. For example, split the URL list into 3 parts:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
Then run 3 Scrapyd instances, start the spider on each of them, and pass the part parameter:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
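Each spider then has to read its part argument and fetch its own shard; Scrapyd passes the extra -d parameters of schedule.json to the spider as constructor keyword arguments. A minimal sketch, using the hypothetical URL pattern from the example above:

    import scrapy

    class Spider1(scrapy.Spider):
        name = "spider1"

        def __init__(self, part=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.part = part  # "1", "2" or "3", supplied via schedule.json -d part=N

        def start_requests(self):
            # Fetch this machine's shard of the URL list.
            url = "http://somedomain.com/urls-to-crawl/spider1/part%s.list" % self.part
            yield scrapy.Request(url, callback=self.parse_url_list)

        def parse_url_list(self, response):
            # Each line of the .list file is assumed to be one URL to crawl.
            for line in response.text.splitlines():
                if line.strip():
                    yield scrapy.Request(line.strip(), callback=self.parse)

        def parse(self, response):
            # The real extraction logic goes here.
            pass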
Crawlera
This is a paid service, so the problem can simply be solved with money.
Scrapy-redis
Redis is a high-performance key-value database. Whereas MongoDB keeps its data on disk, the magic of Redis is that it keeps data in memory, which gives it much higher performance.
Distributed Principle
The way Scrapy-redis implements distribution is actually very simple in principle. For convenience, we will call our core server the Master and the machines used to run the crawlers the Slaves.
Recall the Scrapy workflow: we provide some start_urls; the spider first visits the URLs in start_urls and then, according to our parse function, extracts elements from those pages or follows links to second- and third-level pages. To make this distributed, we only need to work on start_urls. In more detail:
The Master generates the start_urls; each URL is wrapped in a request and placed in the Redis key spider:requests. The central scheduler allocates requests from there, and once those requests have all been handed out, it continues to allocate the URLs in start_urls.
The Slaves fetch requests from the Master's Redis, download the pages, and send the page content back to the Master's Redis under the key spider:items. Scrapy can be configured through settings so that a spider does not shut down automatically when it finishes, but instead keeps asking the queue whether there are new URLs; if there are, it fetches them and keeps crawling, so this process loops indefinitely.
The Master's Redis also holds a key, spider:dupefilter, that stores the fingerprints of crawled URLs (the result of running the URL through a hash function) to prevent repeat crawls. As long as this Redis data is not cleared, the crawl can also be resumed from where it left off.
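In practice this Master/Slave wiring is mostly done through the Scrapy settings. A minimal sketch of a scrapy-redis settings.py might look like the following; exact class paths can differ slightly between scrapy-redis versions, and the Redis host is an assumption:

    # settings.py (sketch): point the scheduler, dupefilter and item pipeline at Redis.

    # Use scrapy-redis' scheduler and duplicate filter instead of Scrapy's defaults,
    # so the request queue and the URL-fingerprint set live in the Master's Redis.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Keep the Redis queues and the fingerprint set when the spider closes,
    # which is what makes resuming from a breakpoint possible.
    SCHEDULER_PERSIST = True

    # Push scraped items back into Redis (the spider:items idea described above).
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }

    # Location of the Master's Redis server (assumed values).
    REDIS_HOST = "master.example.com"
    REDIS_PORT = 6379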
For an existing Scrapy project, extending it into a distributed one is relatively easy. In general, the steps are:
1. Find a high-performance server for maintaining the Redis queues and storing the data.
2. Extend the Scrapy project so that it fetches its start_urls from the server's Redis, and rewrite the storage part of the pipeline so that the data is written to the server's address instead.
3. Write some scripts that generate URLs on the server and run them periodically.
Steps 2 and 3 are sketched below.
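As a hedged sketch of steps 2 and 3: scrapy-redis ships a RedisSpider base class that reads its start URLs from a Redis list named by redis_key, and the seeding "script" can be as simple as pushing URLs into that list with the redis-py client. Names such as myspider, the key myspider:start_urls and the server address are illustrative assumptions:

    # spiders/myspider.py -- the spider pulls its start URLs from Redis instead of start_urls.
    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = "myspider"
        redis_key = "myspider:start_urls"  # the Redis list this spider reads its seeds from

        def parse(self, response):
            # Yield items here; with RedisPipeline enabled they are stored back in Redis.
            yield {"url": response.url, "title": response.css("title::text").get()}


    # seed_urls.py -- run this on the server (e.g. from cron) to feed the crawl.
    import redis

    r = redis.Redis(host="master.example.com", port=6379)  # assumed Master address
    for url in ["http://somedomain.com/page1", "http://somedomain.com/page2"]:
        r.lpush("myspider:start_urls", url)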
Exactly how the scheduler dispatches requests is something we need to analyse from the source code.
Source Analysis
Perhaps the description above is still not clear enough, so let's simply look at the source. Scrapy-redis consists of just a few files.
Module Analysis
connection.py
Instantiates the Redis connection according to the configuration in settings; it is used by the dupefilter and the scheduler.
dupefilter.py
Deduplicates requests, using a Redis set.
queue.py
Implements three kinds of queue: SpiderQueue (FIFO), SpiderPriorityQueue, and SpiderStack (LIFO). The second is used by default; the queue type can be switched through a setting, as sketched after this list.
pipelines.py
Handles distributed item processing by storing items in Redis.
scheduler.py
Replaces Scrapy's own scheduler to implement distributed scheduling; its data structures come from queue.py.
spider.py
Defines RedisSpider, which inherits from RedisMixin and CrawlSpider.
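As mentioned under queue.py, the queue type is selected through a setting. A sketch, assuming an older scrapy-redis release where these class names are used (newer releases alias them to FifoQueue, PriorityQueue and LifoQueue):

    # settings.py (sketch): picking one of the three queue types listed above.
    SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"  # the default
    # SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"        # FIFO
    # SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"        # LIFO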
From the above, Scrapy-redis implements distributed crawling and distributed item processing through the scheduler and pipelines modules; the other modules serve as auxiliary, supporting code.
Scheduling Process
Initialization
When a spider is initialized, a corresponding scheduler object is initialized along with it. By reading the settings, the scheduler configures its own scheduling container, the queue, and its deduplication tool, the dupefilter.
Deduplication and Entering the Scheduling Pool
Whenever the spider produces a request, the Scrapy engine hands it to that spider's scheduler object for dispatch. The scheduler object checks against Redis to deduplicate the request and, if it is not a duplicate, adds it to the scheduling pool in Redis.
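The deduplication check amounts to computing a fingerprint of the request and trying to add it to a Redis set, which is roughly what scrapy-redis' RFPDupeFilter does. A simplified sketch (not the actual source), assuming an already-created redis-py connection called server and an older Scrapy version that still exposes request_fingerprint:

    from scrapy.utils.request import request_fingerprint  # newer Scrapy replaces this with a request fingerprinter

    def request_seen(server, request, key="spider:dupefilter"):
        """Return True if this request's fingerprint is already in the Redis set."""
        fp = request_fingerprint(request)  # hash over method, URL and body
        added = server.sadd(key, fp)       # SADD returns 0 if the member was already present
        return added == 0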
Scheduling
When the scheduling conditions are met, the scheduler object pops a request from the Redis scheduling pool and hands it to the spider to crawl. If more URLs are produced during the crawl, the cycle continues until every request is finished. Throughout this process the crawler's state is monitored by connecting to the signals.spider_idle signal: when the scheduler object finds that the spider has crawled all of the currently available URLs and the corresponding Redis scheduling pool is empty, the spider_idle signal is triggered. On receiving it, the spider connects directly to Redis, reads the start_urls pool, takes a new batch of URLs, returns new make_requests_from_url(url) requests to the engine, and the engine passes them on to the scheduler.
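Here is a simplified sketch (not the actual scrapy-redis source) of how a spider can hook spider_idle to pull fresh start URLs from Redis instead of closing. The key name and Redis settings are assumptions, and since make_requests_from_url has been removed from recent Scrapy versions, plain Requests are used instead:

    import redis
    import scrapy
    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class RedisFedSpider(scrapy.Spider):
        name = "redis_fed"
        redis_key = "redis_fed:start_urls"  # assumed Redis list holding the seed URLs

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            spider.server = redis.Redis(
                host=crawler.settings.get("REDIS_HOST", "localhost"),
                port=crawler.settings.getint("REDIS_PORT", 6379),
            )
            # Fires whenever the scheduler runs dry, so we can top it up from Redis.
            crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
            return spider

        def on_idle(self):
            url = self.server.lpop(self.redis_key)
            if url:
                # Older Scrapy expects the spider argument here; newer versions drop it.
                self.crawler.engine.crawl(scrapy.Request(url.decode()), spider=self)
            # Keep the spider alive, waiting for new URLs, instead of letting it close.
            raise DontCloseSpider

        def parse(self, response):
            pass  # the real extraction logic goes here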
Once you are familiar with the principle, you can actually write your own scheduler and define your own scheduling priority and ordering.