Use a Bloom filter to optimize Scrapy-redis deduplication

Source: Internet
Author: User
Tags: session id, ord, redis, sha1, redis server
1. Background
As anyone who writes crawlers knows, Scrapy is a very useful crawler framework, but it can be very heavy on memory. One of the key points is deduplication. Deduplication has to balance three concerns: the speed of the dedup check, the volume of data to deduplicate, and persistent storage so the crawler can resume where it left off.
Deduplication speed: to keep the check fast, deduplication is normally done in memory, for example with Python's built-in set() or with a Redis set. But when the data volume becomes very large, reaching the tens of millions, memory is limited, so you have to deduplicate at the level of individual bits. This is where the Bloom filter comes in: it maps the string to be deduplicated directly onto bit positions, greatly reducing the memory footprint.

Volume of data to deduplicate: when the data volume is large, we can use digest or compression algorithms (such as md5 or other hashes) to turn a long string into a short 16/32-character string, and then deduplicate with a set or similar structure (a minimal sketch follows below).

Persistent storage: Scrapy enables deduplication by default and is designed for resumable crawls; when the crawler terminates, a state file records which requests have been crawled and their status. Scrapy-redis moves the dedup queue into Redis, and Redis can provide persistent storage.
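As a minimal illustration of the second point (the URL below is made up for the example), hashing a long URL with md5 yields a fixed 32-character hex string that can be stored in a set instead of the full URL:

# sketch: shorten a long string to a fixed-length digest before deduplicating
from hashlib import md5

url = 'http://www.example.com/query?id=111&cat=222'   # hypothetical URL
digest = md5(url.encode('utf-8')).hexdigest()
print(len(digest), digest)   # 32, a short fixed-length stand-in for the long URL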
A Bloom filter maps an object onto a handful of "bits" in memory and uses the 0/1 values of those bits to decide whether the object already exists. A Bloom filter kept in the memory of a single machine is awkward to persist: once the crawler terminates, the data is lost. So, as mentioned above, using a Bloom filter to optimize a Scrapy-redis distributed crawler inevitably raises two problems:

First, find a way to persist the Bloom filter's storage. Second, the spiders of a distributed crawl are spread across several different machines, while the Bloom filter is memory-based; how can all of the crawler machines share the same Bloom filter so that deduplication is unified? Putting the two together: mount the Bloom filter on Redis. That gives persistent storage and lets every spider share one dedup structure, and both problems are solved.

2. Environment
System: Win7, scrapy-redis, redis 3.0.5, python 3.6.1

3. Bloom filter basic concepts and principles
For more information, please refer to: http://blog.csdn.net/jiaomeng/article/details/1495500

In short: a Bloom filter is a highly space-efficient randomized data structure that uses a bit array to represent a set and to test whether an element belongs to that set. This efficiency has a price: when testing membership, it may mistakenly report that an element belongs to the set when it does not (a false positive). However, if an element does belong to the set, it will never be reported as absent. Bloom filters are therefore unsuitable for applications that require zero error.

To understand how a Bloom filter works, you need to be familiar with the following basic elements:

3.1. The bit array
A Bloom filter represents a set with a bit array. In its initial state, the Bloom filter is an array of m bits {1, ..., m}, each set to 0. The underlying representation can be a blank block of memory, a long string, or any data structure that occupies memory space.
3.2. The elements to deduplicate
For a crawler this is the request queue, which we write as a set of n elements S = {r1, r2, ..., rn}.

3.3. k independent hash functions
A Bloom filter uses k independent hash functions, which we write as H = {h1(), h2(), ..., hk()}. Each element of S = {r1, r2, ..., rn} is run through these hash functions and mapped onto positions of the Bloom filter's bit array {1, ..., m}, and the mapped positions are set to 1. So for r1, the result of the mapping is {h1(r1), h2(r1), ..., hk(r1)}.

Note that if a position is set to 1 several times, only the first time has any effect; the later ones change nothing.
From this you can see where false positives come from: an element that does not belong to the set may be judged as belonging to it, because all of the bits it maps to may already have been set to 1 by other elements.

3.4. Error rate
The Bloom filter algorithm has a false positive probability: a string that has never been seen has some probability of being judged as already present. The size of this probability depends on the number of seeds (hash functions), the amount of memory allocated, and the number of objects to deduplicate. The reference above has a table where m is the memory size in bits, n is the number of objects to deduplicate, and k is the number of seeds. For example, in my code I allocate 256 MB, that is 1 << 31 bits (m = 2^31, about 2.15 billion), and set 7 seeds. Looking at the k = 7 column, when the false positive rate is 8.56e-05, the m/n value is 23, so n = 2.15 billion / 23 ≈ 93 million. In other words, with a false positive probability of 8.56e-05, 256 MB of memory can deduplicate about 93 million strings. Similarly, at a false positive rate of 0.000112, 256 MB can handle about 98 million strings.
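These numbers can be checked against the standard approximation for the Bloom filter false positive rate, p ≈ (1 - e^(-kn/m))^k. A small sketch of that check (my own verification, not part of the original table):

import math

def false_positive_rate(m, n, k):
    # standard approximation: p = (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k * n / m)) ** k

m = 1 << 31        # 2^31 bits = 256 MB
k = 7              # seven seeds, as in the code in section 5.1
n = m // 23        # m/n = 23, i.e. about 93 million strings
print(false_positive_rate(m, n, k))   # roughly 8.5e-05, matching the table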
4. The SETBIT command in Redis
4.1. Official description

SETBIT key offset value: sets or clears the bit at the specified offset in the string value stored at key.

Whether the bit is set or cleared depends on the value argument, which can be 0 or 1.

When key does not exist, a new string value is automatically created. The string is stretched (grown) to make sure it can hold a bit at the specified offset; when the string is stretched, the blank positions are filled with 0.

The offset argument must be greater than or equal to 0 and less than 2^32 (the bitmap is limited to 512 MB).
4.2. Use case

In Redis, a string is stored in binary form.

Step one: set a key-value pair, with the string teststr = 'ab'.

We know that the ASCII code for 'a' is 97, which in binary is 01100001.
The ASCII code for 'b' is 98, which in binary is 01100010.
So 'ab' converted to binary is 0110000101100010.

Step two: set a bit at an offset.
The offset counts from 0, from left to right, that is, from the high-order bit to the low-order bit.
For example, to turn 0110000101100010 ('ab') into 0110000101100011 ('ac'), we set the bit at offset 15 from 0 to 1 (SETBIT teststr 15 1); at that point 'b' becomes 'c'.

SETBIT returns (integer) 0 or (integer) 1: the value of the bit at that offset before the SETBIT was executed.
That is the basic use of SETBIT in Redis.

Redis also has a related command, BITCOUNT, which counts the number of 1 bits in the binary encoding of a string. After the SETBIT above, teststr is 'ac' (0110000101100011), so the result of BITCOUNT teststr is 7.
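The whole example can be reproduced from Python with the redis-py client (a sketch assuming a local Redis instance; the key name teststr is the one from the example above):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
r.set('teststr', 'ab')               # 0110000101100010
print(r.setbit('teststr', 15, 1))    # 0 -> the old bit value at offset 15
print(r.get('teststr'))              # b'ac' -> 0110000101100011
print(r.bitcount('teststr'))         # 7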

5. Detailed deployment
Combining the Bloom filter described above with the SETBIT feature of Redis, we now know how to mount a Bloom filter on Redis: yes, it is just one big string. The following is the detailed procedure for hooking a Bloom filter into a Scrapy-redis distributed crawler.

5.1. Write the Bloom filter algorithm

# file: bloomfilter.py
# encoding=utf-8
import redis
from hashlib import md5


# Generate different hash functions from the memory size and a seed.
# In other words: the Bloom filter uses k independent hash functions,
# which we write as H = {h1(), h2(), ..., hk()}.
class SimpleHash(object):
    def __init__(self, bitSize, seed):
        self.bitSize = bitSize
        self.seed = seed

    def hash(self, value):
        ret = 0
        for i in range(len(value)):
            ret = self.seed * ret + ord(value[i])
        # keep the hash value inside this memory space
        hashValue = (self.bitSize - 1) & ret
        return hashValue


# Initializes a big string in Redis; you can think of it as opening up a block
# of memory inside Redis.
# You need to specify the database, for example db=2 here.
# blockNum is the number of blocks to use, i.e. how many such big strings to
# open. When the data volume gets very large, 512 MB is definitely not enough
# (every bit might end up set to 1), so several big strings may be needed.
# The name of each big string is key + str(block index).
class BloomFilter(object):
    def __init__(self, host='localhost', port=6379, db=2, blockNum=1, key='bloomfilter'):
        """
        :param host: the host of Redis
        :param port: the port of Redis
        :param db: which db in Redis
        :param blockNum: one blockNum for about 90,000,000 strings;
                         if you have more strings to filter, increase it.
        :param key: the key's name in Redis
        """
        self.server = redis.Redis(host=host, port=port, db=db)
        # 2^31 bits = 256 MB. This is a limit per block, because a Redis string
        # value can be stretched (blank positions filled with 0) only up to a
        # maximum of 512 MB; 256 MB is used here.
        self.bit_size = 1 << 31
        self.seeds = [5, 7, 11, 13, 31, 37, 61]
        self.key = key
        self.blockNum = blockNum
        self.hashfunc = []
        for seed in self.seeds:
            # k = 7 independent hash functions, one per seed
            self.hashfunc.append(SimpleHash(self.bit_size, seed))

    # Check whether the element is (probably) already in the set.
    def isContains(self, str_input):
        if not str_input:
            return False
        m5 = md5()
        m5.update(str_input.encode('utf-8'))
        # first take the md5 of the target string
        str_input = m5.hexdigest()
        ret = True
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            ret = ret & self.server.getbit(name, loc)
        return ret

    # Write the bits that str_input maps to into the big string,
    # i.e. set the corresponding flag bits.
    def insert(self, str_input):
        m5 = md5()
        m5.update(str_input.encode('utf-8'))
        str_input = m5.hexdigest()
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            self.server.setbit(name, loc, 1)


if __name__ == '__main__':
    # The first run prints "url not exists!"; later runs print "url exists!"
    bf = BloomFilter()
    if bf.isContains('http://www.sina.com.cn/'):   # check whether the string exists
        print('url exists!')
    else:
        print('url not exists!')
        bf.insert('http://www.sina.com.cn/')
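A quick way to exercise the class above (a sketch, assuming a local Redis with db=2 available; the example URLs are made up): insert a batch of strings and confirm none of them is lost, since a Bloom filter never produces false negatives.

from bloomfilter import BloomFilter

bf = BloomFilter(blockNum=1, key='bloomfilter')
urls = ['http://www.example.com/page/%d' % i for i in range(1000)]
for url in urls:
    bf.insert(url)
# every inserted URL must be reported as present
print(all(bf.isContains(url) for url in urls))   # True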
5.2. Modify the Scrapy-redis source
5.2.1. Analyze the scheduling flow in the source
# Scheduling flow:
# 1. Step one, the scheduler file scheduler.py:
#    open() --> self.df = load_object(self.dupefilter_cls)
#           --> dupefilter_cls = defaults.SCHEDULER_DUPEFILTER_CLASS
#           --> SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

    # add the request to the scheduling queue
    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        if self.stats:
            self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
        self.queue.push(request)
        return True

As you can see, deduplication goes through the request_seen() method of the RFPDupeFilter class in the dupefilter file.
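For reference, this scheduler and dupefilter are activated through the project's settings.py. This is the typical scrapy-redis configuration (standard setting names from the scrapy-redis documentation; the host and port values are placeholders):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True      # keep the Redis queue and dupefilter keys on close
REDIS_HOST = 'localhost'
REDIS_PORT = 6379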

    # 2. Step two, the dedup file dupefilter.py:
    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

From this we can see that scrapy_redis deduplicates with the Redis set data structure, and the object being deduplicated is the request fingerprint.
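The SADD return value is what makes this work: it returns the number of members actually added, so a duplicate fingerprint yields 0. A minimal check (assuming a local Redis; the key and value are made up):

import redis

server = redis.Redis()
server.delete('demo:dupefilter')
print(server.sadd('demo:dupefilter', 'some-fingerprint'))   # 1 -> newly added
print(server.sadd('demo:dupefilter', 'some-fingerprint'))   # 0 -> already seen, i.e. duplicate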

def request_fingerprint(request, include_headers=None):
    """
    Return the request fingerprint.

    The request fingerprint is a hash that uniquely identifies the resource the
    request points to. For example, take the following two urls:

    http://www.example.com/query?id=111&cat=222
    http://www.example.com/query?cat=222&id=111

    Even though those are two different URLs both point to the same resource
    and are equivalent (ie. they should return the same response).

    Another example are cookies used to store session ids. Suppose the
    following page is only accesible to authenticated users:

    http://www.example.com/members/offers.html

    Lot of sites use a cookie to store the session id, which adds a random
    component to the HTTP Request and thus should be ignored when calculating
    the fingerprint.

    For this reason, request headers are ignored by default when calculating
    the fingeprint. If you want to include specific headers use the
    include_headers argument, which is a list of Request headers to include.

    """
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

From request_fingerprint() we can see what the fingerprint really is: hashlib.sha1() is used to compress a few fields of the request object (the method, the canonicalized URL, the body and, optionally, selected headers). Debugging also shows that fp is simply the request object condensed into a 40-character hex string (characters 0-f).
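This is easy to verify: the two "equivalent" URLs from the docstring produce the same 40-character fingerprint. A quick check using Scrapy's own API (my own verification, not from the original article):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

fp1 = request_fingerprint(Request('http://www.example.com/query?id=111&cat=222'))
fp2 = request_fingerprint(Request('http://www.example.com/query?cat=222&id=111'))
print(fp1 == fp2, len(fp1))   # True 40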
To summarize: from the scheduling flow above, the point to modify is the RFPDupeFilter.request_seen() method in dupefilter.py.

5.2.2. Modify the source code
Original file: dupefilter.py
# original file .\lib\site-packages\scrapy_redis\dupefilter.py
import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from . import defaults
from .connection import get_redis_from_settings


logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.

        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

 
Modified file: dupefilter.py
# modified file .\lib\site-packages\scrapy_redis\dupefilter.py
import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from . import defaults
from .connection import get_redis_from_settings

isUseBloomfilter = False
try:
    from .bloomfilter import BloomFilter
except Exception as e:
    print("There is no bloomfilter, use the default Redis set to dedupe.")
else:
    isUseBloomfilter = True


logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.

        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True
        # use the Bloom filter to deduplicate URLs
        if isUseBloomfilter:
            self.bf = BloomFilter()

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        if isUseBloomfilter:
            # use the Bloom filter to deduplicate the URL
            fp = self.request_fingerprint(request)
            if self.bf.isContains(fp):
                # already exists
                return True
            else:
                self.bf.insert(fp)
                return False
        else:
            # fall back to the default Redis set for deduplication
            fp = self.request_fingerprint(request)
            # This returns the number of values added, zero if already exists.
            added = self.server.sadd(self.key, fp)
            return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
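After a crawl, one quick way to confirm that the Bloom filter is actually being used (a sketch, assuming the defaults above: db=2, key prefix 'bloomfilter' and blockNum=1, so the block key is 'bloomfilter0'):

import redis

server = redis.Redis(db=2)
print(server.type('bloomfilter0'))     # b'string' -> the Bloom filter bitmap is in use
print(server.strlen('bloomfilter0'))   # grows toward 2^28 bytes (256 MB) as bits are set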