Project Introduction
Contains a Redis-based implementation of a Bloom filter, along with a demo of applying it to Scrapy.
Address: BloomFilterRedis

Bloom Filter
There are many introductions to Bloom filters online; The Beauty of Mathematics covers them in great detail, so the basics are not repeated here.

Hash Functions
A Bloom filter requires n hash functions; I use the general-purpose hash functions published by Arash Partow.
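For illustration, here is a Python sketch of two of the classic string hashes from Partow's collection (BKDRHash and DJBHash); treat these as representative examples, since the exact functions the project selects are not specified here.

def bkdr_hash(s, seed=131):
    # BKDRHash: multiply-and-add string hash (seed is conventionally 131)
    h = 0
    for ch in s:
        h = h * seed + ord(ch)
    return h & 0xFFFFFFFF  # truncate to 32 bits

def djb_hash(s):
    # DJBHash: Bernstein's hash, h = h * 33 + c
    h = 5381
    for ch in s:
        h = ((h << 5) + h) + ord(ch)
    return h & 0xFFFFFFFF  # truncate to 32 bits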
A Bloom Filter Built on Redis

Redis has a data structure called the bitmap (the official documentation's explanation is translated below), which provides a bit array of up to 512MB (2^32 bits). We can use it as the Bloom filter's bit array.
According to the figures given in The Beauty of Mathematics, with 8 hash functions a 512MB bit array can deduplicate about 200 million URLs with a false positive rate of roughly five in ten thousand. If we instead deduplicated naively with a set, then at 64 bytes per URL, 200 million URLs would need about 12.8GB of memory (not counting the set's own overhead), orders of magnitude more than the 512MB bitmap.
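As a quick sanity check, the textbook false positive estimate p = (1 - e^(-kn/m))^k (standard Bloom filter theory, not a formula from the project) can be evaluated for these parameters:

import math

m = 2 ** 32        # bits in a 512MB bitmap
n = 200 * 10 ** 6  # 200 million URLs
k = 8              # hash functions

p = (1 - math.exp(-k * n / m)) ** k
print(p)  # roughly 8.7e-05

The result is on the order of one in ten thousand, consistent in magnitude with the figure quoted from the book.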
The strategy I use is to take each hash value modulo 2^32 and set the corresponding bit in the bitmap.
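A minimal sketch of this strategy with the redis-py client, reusing the two hash functions sketched above (the class name, key name, and use of only two hashes are illustrative, not the repository's actual code; a real deployment would use the full set of 8 hash functions):

import redis

class RedisBloomFilter(object):
    def __init__(self, client, key='bloomfilter', hash_funcs=(bkdr_hash, djb_hash)):
        self.client = client          # a redis.Redis connection
        self.key = key                # the Redis string used as the bit array
        self.hash_funcs = hash_funcs

    def _offsets(self, url):
        # Take each hash value modulo 2^32 to get a position in the bitmap
        return [h(url) % (2 ** 32) for h in self.hash_funcs]

    def __contains__(self, url):
        # Considered present only if every corresponding bit is set
        return all(self.client.getbit(self.key, off) for off in self._offsets(url))

    def add(self, url):
        for off in self._offsets(url):
            self.client.setbit(self.key, off, 1)

bf = RedisBloomFilter(redis.Redis(host='localhost', port=6379))
if 'http://example.com/' not in bf:
    bf.add('http://example.com/')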
The Bitmap of Redis

The following is translated from the official documentation: http://www.redis.cn/topics/data-types-intro.html#bitmaps.
My English is limited and some passages are loosely translated; corrections from more experienced readers are very welcome, thanks in advance ~
A bitmap is not an actual data type, but rather a set of bit-oriented operations defined on the string type. Since strings are binary safe and their maximum length is 512MB, the string type is suitable to act as a bit array of up to 2^32 bits.
Bit operations fall into two groups: first, operations on a single bit, such as setting a bit to 1 or 0, or getting its value; second, operations on groups of bits, such as counting the number of set bits within a given range (population counting).
One of the biggest advantages of bitmaps is that they often save a great deal of space when storing information. For example, a system that identifies users by incremental IDs can record whether each of 4 billion users wants to receive a notification using just 512MB of space.
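As a sketch of that use case with redis-py (the key and function names here are made up for illustration):

import redis

r = redis.Redis()

def set_notify(user_id, wants_notification):
    # One bit per user, addressed directly by the incremental user ID
    r.setbit('users:notify', user_id, 1 if wants_notification else 0)

def get_notify(user_id):
    return bool(r.getbit('users:notify', user_id))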
Bits are set and retrieved with the SETBIT and GETBIT commands:
> setbit key 10 1
(integer) 1
> getbit key 10
(integer) 1
> getbit key 11
(integer) 0
SETBIT, as shown above, takes the bit offset as its first argument and the value, which can only be 0 or 1, as its second; here it sets the bit at offset 10 to 1. If the addressed bit lies beyond the current string length, the string is grown automatically (to at most 2^32 bits, likewise below).
GETBIT just returns the value of the bit at the given offset; above, offsets 10 and 11 return 1 and 0 respectively. If the addressed bit lies beyond the current string length, 0 is returned.
Next are three commands for manipulating a set of bits:
BITOP performs bitwise operations between different strings; the supported operations are AND, OR, XOR, and NOT.
BITCOUNT performs population counting, returning the number of bits set to 1.
BITPOS finds the first bit with the specified value of 0 or 1.
BITPOS and BITCOUNT can operate not only on the whole bitmap but also within a given range. A trivial example of BITCOUNT follows:
> setbit key 0 1
(integer) 0
> setbit key 100 1
(integer) 0
> bitcount key
(integer) 2
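The same group operations are available from redis-py; here is a small sketch, including the ranged form of BITCOUNT mentioned above (key names are illustrative):

import redis

r = redis.Redis()
r.setbit('a', 0, 1)
r.setbit('b', 0, 1)
r.setbit('b', 100, 1)

r.bitop('AND', 'dest', 'a', 'b')  # dest = a AND b
print(r.bitcount('b'))            # 2 -> set bits in the whole string
print(r.bitcount('b', 0, 0))      # 1 -> set bits within byte 0 only
print(r.bitpos('b', 1))           # 0 -> offset of the first 1 bit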
The documentation's application examples are omitted here ...

Integration into Scrapy
In Scrapy, the filter is enabled through the DUPEFILTER_CLASS setting in settings.py; an example project is given on GitHub.
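That configuration looks roughly like this (the module path below is a hypothetical placeholder; substitute the dupefilter class from the example project):

# settings.py
# Hypothetical path -- use the actual class from the example project
DUPEFILTER_CLASS = 'bloomfilterredis.dupefilter.RedisBloomDupeFilter'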
Validation

The initial validation strategy: starting from one Baidu Encyclopedia page, use Scrapy to extract the links to other encyclopedia entries on each page and pass the URLs through the filter. URLs judged to be duplicates are recorded in a local file, filted.txt, while successfully requested results are stored in MongoDB.
Initial testing found that 99% of the filtered URLs recorded in filted.txt did not exist in MongoDB: about 10,000 URLs had been filtered out, while MongoDB held only about 300 records.
Running the BITCOUNT key command to check the number of bits set to 1 in the Redis bitmap gave about 100,000, whereas the 300 records in MongoDB could account for at most 300*8 = 2400 bits, so something was clearly off. Analysis showed the cause: a large number of the requests whose duplicates had been filtered were still sitting in Scrapy's request queue, not yet issued, so MongoDB had no corresponding records.
Therefore, to verify the Bloom filter's reliability directly, every URL was written to the file allurl.txt before entering the filter. Once the file reached a certain size, it and filted.txt were processed by a script that computes the false positive rate (a sketch of such a script is given at the end of this section). In this verification, out of about 700,000 Baidu Encyclopedia URLs processed, about 400,000 were filtered as duplicates, and the number of false positives was 0. The scale of this verification is admittedly small, but the author has no concrete large-scale deduplication need at the moment; anyone with such a need or interest is welcome to use the filter and give feedback.
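For reference, a minimal sketch of such a checking script (the file names come from the text above; the original script's exact logic is an assumption). It counts a URL as a false positive when the filter rejected it even though the URL appears only once in allurl.txt, since such a URL cannot have been a true duplicate:

from collections import Counter

def count_false_positives(all_path='allurl.txt', filtered_path='filted.txt'):
    with open(all_path) as f:
        occurrences = Counter(line.strip() for line in f if line.strip())
    with open(filtered_path) as f:
        filtered = set(line.strip() for line in f if line.strip())
    # Filtered despite appearing only once -> never a duplicate -> false positive
    return sum(1 for url in filtered if occurrences[url] == 1)

if __name__ == '__main__':
    print('false positives:', count_false_positives())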