Chapter 2: scrapy-redis Distributed Crawler
9-1 Key Points of distributed crawling
1. Advantages of distributed architecture
- Make full use of the bandwidth of multiple machines to accelerate crawling
- Make full use of the IP addresses of multiple hosts to accelerate crawling
Q: Why does scrapy not support distributed deployment?
A: In Scrapy, the scheduler's request queue lives in the memory of a single host, so crawler processes on other machines cannot share that queue. That is why Scrapy does not support distributed crawling out of the box.
2. Distributed Problems to be Solved
- Centralized management of the request queue
- Centralized management of request deduplication
Both problems are solved by moving the queue and the dedup data into Redis.
9-2~3 Basic Knowledge of Redis
I. Installing Redis (Windows 64-bit)
1. Search (e.g. on Baidu) for "redis for windows"; the installation package can be found on GitHub.
2. Download it.
3. Open cmd and switch to the download directory.
4. Run the server start command (redis-server.exe in the Windows build).
The server is now running, and you can enter commands to test it; a quick connectivity check from Python is shown below.
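Since scrapy-redis talks to Redis from Python (via the redis-py package), a minimal sketch of such a check might look like this; it assumes redis-py is installed (`pip install redis`) and the server is running locally on the default port 6379:

```python
import redis

# Connect to the local Redis server on the default port.
r = redis.Redis(host="localhost", port=6379, db=0)

# ping() returns True if the server is reachable.
print(r.ping())
```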
II. Redis Data Types
- String
- Hash
- List
- Set
- Sorted set (zset)
1. String commands
set mykey "cnblogs": create the key and set its value
get mykey: get the value
getrange mykey start end: get a substring; e.g. getrange name 2 5 returns characters 2 through 5 of name's value
strlen mykey: get the length of the value
incr/decr mykey: increment/decrement by one (the value must be an integer)
append mykey "com": append a string to the end of the value
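For reference, here is a rough redis-py equivalent of the string commands above; it assumes a local server on the default port and uses decode_responses=True so values come back as Python strings rather than bytes:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

r.set("mykey", "cnblogs")         # create/overwrite the key
print(r.get("mykey"))             # 'cnblogs'
print(r.getrange("mykey", 2, 5))  # 'blog' (characters 2 through 5)
print(r.strlen("mykey"))          # 7
r.append("mykey", ".com")         # value is now 'cnblogs.com'

r.set("counter", 10)
r.incr("counter")                 # 11 (value must be an integer)
r.decr("counter")                 # back to 10
```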
2. Hash command
Hset myhash name "cnblogs": create a hash; myhash is like the variable name, name is the field (key), and "cnblogs" is the value
Hgetall myhash: get all fields and values
Hget myhash name: get the value of the field
Hexists myhash name: check whether the field exists
Hdel myhash name: delete the field
Hkeys myhash: list the fields
Hvals myhash: list the values
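A redis-py sketch of the same hash operations, under the same local-server assumption:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

r.hset("myhash", "name", "cnblogs")  # set field 'name' in hash 'myhash'
print(r.hgetall("myhash"))           # {'name': 'cnblogs'}
print(r.hget("myhash", "name"))      # 'cnblogs'
print(r.hexists("myhash", "name"))   # True
print(r.hkeys("myhash"))             # ['name']
print(r.hvals("myhash"))             # ['cnblogs']
r.hdel("myhash", "name")             # remove the field
```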
3. List commands
Lpush/rpush mylist "cnblogs": push a value onto the left/right end of the list
Lrange mylist 0 10: return the elements from index 0 to 10
Blpop/brpop key1 [key2] timeout: pop an element from the left/right, blocking; if nothing arrives within timeout seconds, it gives up and returns empty
Lpop/rpop key: pop an element from the left/right without blocking
Llen key: get the length of the list
Lindex key index: get the element at the given index (indices start from 0)
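The list commands map to redis-py roughly like this (same assumptions as above; note that brpop returns a (key, value) tuple, or None on timeout):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

r.lpush("mylist", "left")            # push onto the left end
r.rpush("mylist", "right")           # push onto the right end
print(r.lrange("mylist", 0, 10))     # ['left', 'right']
print(r.llen("mylist"))              # 2
print(r.lindex("mylist", 0))         # 'left' (indices start at 0)
print(r.lpop("mylist"))              # 'left', returns immediately
print(r.brpop("mylist", timeout=5))  # ('mylist', 'right'), or None after 5 s
```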
4. Set commands (members are unique)
Sadd myset "cnblogs": add a member; returns 1 if the member was newly added, 0 if it already existed
Scard key: get the number of members in the set
Sdiff key1 [key2]: set difference, i.e. the members of key1 that are not in key2
Sinter key1 [key2]: set intersection, i.e. the members present in both sets
Spop key: remove and return a random member
Srandmember key [count]: return one or more random members without removing them
Smembers key: return all members
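The corresponding redis-py calls for sets, again as a rough sketch (set1 and set2 are just example keys):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

print(r.sadd("myset", "cnblogs"))  # 1 = newly added, 0 = already present
print(r.scard("myset"))            # number of members
print(r.smembers("myset"))         # all members as a Python set
print(r.srandmember("myset"))      # random member, not removed
print(r.spop("myset"))             # random member, removed

r.sadd("set1", "a", "b")
r.sadd("set2", "b", "c")
print(r.sdiff("set1", "set2"))     # {'a'}  (in set1 but not in set2)
print(r.sinter("set1", "set2"))    # {'b'}  (in both sets)
```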
5. Sorted set (zset) commands
Zadd myset 0 'project1' [1 'project2']: add members with scores; the bracketed part is optional, so several score/member pairs can be added at once
Zrangebyscore myset 0 100: return the members whose score is between 0 and 100
Zcount key min max: count the members whose score is between min and max
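A short redis-py sketch for sorted sets; note that in redis-py 3.x zadd takes a mapping of member to score:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

r.zadd("myzset", {"project1": 0, "project2": 1})  # member -> score
print(r.zrangebyscore("myzset", 0, 100))          # ['project1', 'project2']
print(r.zcount("myzset", 0, 100))                 # 2
```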
III. Redis Documentation
9-4~9 These sections mainly explain scrapy-redis
You can see how to use scrapy-redis in its README on GitHub; a minimal configuration sketch follows below.
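As a rough illustration of what the README describes, a typical setup swaps Scrapy's scheduler and dupefilter for the Redis-backed ones in settings.py and reads start URLs from a Redis key; the spider name and redis_key below are placeholders:

```python
# settings.py -- route scheduling and deduplication through Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                  # keep the queue between runs
REDIS_URL = "redis://localhost:6379"

# myspider.py -- every crawler process pulls requests from the shared queue
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    redis_key = "myspider:start_urls"     # start URLs are read from this key

    def parse(self, response):
        yield {"url": response.url}
```

Start URLs are then injected with something like lpush myspider:start_urls http://example.com from redis-cli, and every machine running the spider consumes requests from the same Redis queue.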
A Bloom filter is integrated into scrapy-redis.
I have not fully worked through the source code yet, so I will not explain it here for now.
For the code, see the scrapy-redis application project on my GitHub.
Author: Jin Xiao
Source: http://www.cnblogs.com/jinxiao-pu/p/6838011.html
The copyright of this article is shared by the author and the blog site. Reposting is welcome, but without the author's consent you must keep this statement and provide a clear link to the original article on the page.