Chapter 2: scrapy-redis Distributed Crawler

9-1 Key Points of distributed crawling

1. Advantages of distributed architecture

  • Make full use of the bandwidth of multiple machines to accelerate crawling
  • Make full use of the IP addresses of multiple hosts to accelerate crawling

Q: Why doesn't Scrapy support distributed crawling out of the box?

A: In Scrapy, the scheduler keeps its request queue in the memory of a single machine. Crawlers running on other hosts have no way to read from or write to that in-memory queue, so Scrapy by itself cannot be distributed.

2. Distributed Problems to be Solved

  • Centralized management of the request queue
  • Centralized management of request deduplication

Both problems can be solved with Redis: move the queue and the deduplication set into Redis, which every host can share.
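To make the shared-queue idea concrete, here is a minimal sketch in pure Python. The class and names are hypothetical, and a `deque` stands in for the Redis list; in a real scrapy-redis deployment every crawler host would issue LPUSH/RPOP against the same Redis key over the network instead of touching local memory.

```python
from collections import deque

class SharedRequestQueue:
    """Stand-in for a Redis list shared by all crawler hosts.

    In real scrapy-redis, each host would call LPUSH/RPOP on one
    Redis key, so the queue lives outside any single host's memory.
    """

    def __init__(self):
        self._items = deque()

    def lpush(self, url):
        # Push a new request URL onto the left end of the queue.
        self._items.appendleft(url)

    def rpop(self):
        # Pop the oldest request from the right end (None if empty).
        return self._items.pop() if self._items else None

# One central queue, drained by two different "hosts".
queue = SharedRequestQueue()
for url in ["http://example.com/1", "http://example.com/2", "http://example.com/3"]:
    queue.lpush(url)

host_a = [queue.rpop()]   # first host takes the oldest request
host_b = [queue.rpop()]   # second host takes the next one
```

Because both hosts pop from the same queue, each URL is crawled exactly once, which is the core of the scrapy-redis design.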

9-2 ~ 3 Basic Knowledge of Redis

I. Installing Redis (Windows 64-bit)

1. Search for "redis for windows" and find the installation package on GitHub.

2. Click to download it.

3. Open a cmd window and switch to the download directory.

4. Start the server (for the Windows build this is typically redis-server.exe with its config file).

Once the server is running, you can open redis-cli and enter commands to test it.

II. Redis Data Types
  • String
  • Hash
  • List
  • Set
  • Sorted set (zset)

1. String commands

set mykey "cnblogs" — create a key with a string value

get mykey — read the value back

getrange mykey start end — get a substring; for example, getrange mykey 2 5 returns characters 2 through 5 (inclusive)

strlen mykey — get the length of the value

incr/decr mykey — increment/decrement by one (the value must be an integer)

append mykey "com" — append a string to the end of the value
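As a rough illustration of these semantics in Python (a plain dict standing in for the Redis keyspace; this is not real client code):

```python
# A dict plays the role of the Redis keyspace.
store = {"mykey": "cnblogs"}               # set mykey "cnblogs"

value = store["mykey"]                     # get mykey
sub = store["mykey"][2:6]                  # getrange mykey 2 5 (end is inclusive)
length = len(store["mykey"])               # strlen mykey

store["counter"] = "0"
store["counter"] = str(int(store["counter"]) + 1)   # incr counter

store["mykey"] += "com"                    # append mykey "com"
```

Note that Redis stores counters as strings but interprets them as integers for incr/decr, which the `int(...)` round-trip above mimics.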

2. Hash commands

hset myhash name "cnblogs" — create a hash; myhash is like a variable name, name is the field (key), and "cnblogs" is the value

hgetall myhash — get all fields and values

hget myhash name — get the value of one field

hexists myhash name — check whether the field exists

hdel myhash name — delete the field

hkeys myhash — list the fields

hvals myhash — list the values
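The same hash semantics, sketched with a dict of dicts (again a stand-in, not client code):

```python
# Outer dict = keyspace, inner dict = one Redis hash.
hashes = {}
hashes.setdefault("myhash", {})["name"] = "cnblogs"   # hset myhash name "cnblogs"

all_pairs = dict(hashes["myhash"])         # hgetall myhash
name = hashes["myhash"]["name"]            # hget myhash name
exists = "name" in hashes["myhash"]        # hexists myhash name
keys = list(hashes["myhash"].keys())       # hkeys myhash
vals = list(hashes["myhash"].values())     # hvals myhash
del hashes["myhash"]["name"]               # hdel myhash name
```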

3. List commands

lpush/rpush mylist "cnblogs" — push a value onto the left/right end

lrange mylist 0 10 — view the elements at positions 0 through 10

blpop/brpop key1 [key2] timeout — pop one element from the left/right; if the list is empty, block for up to timeout seconds before giving up

lpop/rpop key — pop from the left/right without waiting

llen key — get the length of the list

lindex key index — get the element at index (indexes start from 0)
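A `collections.deque` behaves much like a Redis list, which makes the left/right push and pop semantics easy to illustrate (a stand-in, not client code):

```python
from collections import deque

mylist = deque()
mylist.appendleft("a")      # lpush mylist "a"
mylist.append("b")          # rpush mylist "b"
mylist.appendleft("c")      # lpush mylist "c"  -> ["c", "a", "b"]

view = list(mylist)[0:11]   # lrange mylist 0 10
first = mylist.popleft()    # lpop mylist
last = mylist.pop()         # rpop mylist
length = len(mylist)        # llen mylist
head = mylist[0]            # lindex mylist 0
```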

4. Set commands (no duplicate members)

sadd myset "cnblogs" — add a member; returns 1 if the member was new, 0 if it already existed

scard key — get the number of members in the set

sdiff key1 [key2] — set difference: members of key1 that are not in key2

sinter key1 [key2] — set intersection: members present in both sets

spop key — remove and return a random member

srandmember key [count] — return random member(s) without removing them

smembers key — get all members
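Python's built-in `set` maps almost one-to-one onto these commands, which makes a quick sketch possible (a stand-in, not client code):

```python
myset = set()
added = "cnblogs" not in myset   # sadd returns 1 only when the member is new
myset.add("cnblogs")             # sadd myset "cnblogs"
myset.add("python")

count = len(myset)               # scard myset
other = {"python", "redis"}
diff = myset - other             # sdiff myset other
inter = myset & other            # sinter myset other
members = set(myset)             # smembers myset
popped = myset.pop()             # spop: remove an arbitrary member
```

This no-duplicates property is exactly why scrapy-redis uses a Redis set of request fingerprints for deduplication.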

5. Sorted set commands

zadd myset 0 "project1" [1 "project2"] — add members with scores; the brackets just mean further score/member pairs are optional (you don't type them)

zrangebyscore myset 0 100 — get the members whose score is between 0 and 100

zcount key min max — count the members whose score is between min and max
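A sorted set is essentially a member-to-score mapping kept ordered by score. A rough sketch of the two range commands (hypothetical helper names, not client code):

```python
# member -> score, as in: zadd myset 0 project1 1 project2 150 project3
zset = {"project1": 0, "project2": 1, "project3": 150}

def zrangebyscore(z, lo, hi):
    # Members whose score falls in [lo, hi], ordered by score.
    return [m for m, s in sorted(z.items(), key=lambda kv: kv[1]) if lo <= s <= hi]

def zcount(z, lo, hi):
    # Number of members whose score falls in [lo, hi].
    return len(zrangebyscore(z, lo, hi))
```

Real Redis keeps this ordering incrementally (a skip list plus a hash) rather than re-sorting on every query, but the observable semantics match.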

III. Redis Documentation

9-4 ~ 9: these sections mainly explain scrapy-redis.

Usage instructions for scrapy-redis are on its GitHub page.

A Bloom filter can be integrated into scrapy-redis for request deduplication.

I haven't fully worked through the source code yet, so I won't explain it here.

For more information about the code, see my GitHub: scrapy-redis application project.

Author: Jin Xiao
Source: http://www.cnblogs.com/jinxiao-pu/p/6838011.html
The copyright of this article is shared by the author and the blog. You are welcome to repost it, but without the author's consent you must keep this statement and provide a link to the original article on the reposted page.

If you think it is good, click a recommendation!
