Principles and Applications of the Bloom Filter
Principle of the Bloom filter
A Bloom filter is a space-efficient probabilistic data structure. When an element is added to a set, k independent hash functions map it to k positions in a bit array, and those bits are set to 1. To look an element up, we only need to check whether those k bits are all 1 to know (approximately) whether it is in the set. If any of the bits is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set.
This efficiency comes at a price: when testing whether an element belongs to a set, an element that is not in the set may be mistakenly reported as present (a false positive). A Bloom filter is therefore unsuitable for applications that require zero errors. In applications that can tolerate a low error rate, it trades a small number of errors for a huge saving in storage space.
Suppose you want to write a web crawler. Because the links between pages are complex, a crawler following them may end up going around in a loop. To avoid such loops, you need to know which URLs the crawler has already visited. How can you tell whether a URL has been visited? Consider the following approaches:
- Save the visited URLs to a database.
- Save the visited URLs in a hash set, so that checking whether a URL has been visited costs close to O(1).
- Hash each URL with MD5 or SHA-1 and store the digest in a hash set or database.
- Bitmap method: create a bitset and map each URL to one bit through a hash function.
Methods 1 to 3 store every visited URL (or its digest); method 4 only marks the bit that each URL maps to. All of these work well when the data volume is small, but problems appear when it becomes very large:
Method 1: when the data volume becomes very large, queries against a relational database become slow. And launching a database query for every single URL, isn't that overkill?
Method 2: memory consumption is too high and grows with the number of URLs. Even with only 100 million URLs of 50 characters each, about 5 GB of memory is needed.
Method 3: an MD5 digest is only 128 bits and a SHA-1 digest only 160 bits, so method 3 uses several times less memory than method 2.
Method 4: memory consumption is relatively small, but with a single hash function the collision probability is too high. (Recall the various ways of resolving hash table collisions from a data structures course.) To reduce the collision probability to 1%, the bitset must be 100 times as long as the number of URLs. The difference between a Bloom filter and a bitmap is that a Bloom filter uses k hash functions, so each string corresponds to k bits. This reduces the probability of collision.
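As a rough check on the memory figures for methods 2 and 3, the sketch below uses Python's standard hashlib module to compare the size of a raw URL with the size of its MD5 and SHA-1 digests. The example URL and the helper raw_storage_gb are illustrative assumptions, and the estimate ignores the per-entry overhead of a real hash set.

```python
import hashlib

url = "http://example.com/some/path/page-0001.html"  # hypothetical URL

# Digests have a fixed size regardless of the URL length.
md5_digest = hashlib.md5(url.encode("utf-8")).digest()
sha1_digest = hashlib.sha1(url.encode("utf-8")).digest()


def raw_storage_gb(num_items, bytes_per_item):
    """Payload size in GB (10**9 bytes), ignoring container overhead."""
    return num_items * bytes_per_item / 10**9


num_urls = 100_000_000  # 100 million URLs, as in the text
print("MD5 digest bits:", len(md5_digest) * 8)
print("SHA-1 digest bits:", len(sha1_digest) * 8)
print("raw 50-char URLs (GB):", raw_storage_gb(num_urls, 50))
print("MD5 digests (GB):", raw_storage_gb(num_urls, len(md5_digest)))
print("SHA-1 digests (GB):", raw_storage_gb(num_urls, len(sha1_digest)))
```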
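To make the comparison between a single-hash bitmap and a Bloom filter concrete, here is a small sketch based on the standard approximation (1 - e^(-k*n/m))^k for the false positive rate. It assumes ideal, uniformly distributed hash functions; the function name and the sample values (10 bits per element with 7 hash functions) are illustrative choices, not taken from the article.

```python
import math


def false_positive_rate(bits_per_element, k):
    """Approximate false positive rate (1 - e^(-k*n/m))^k, with bits_per_element = m/n."""
    return (1.0 - math.exp(-k / bits_per_element)) ** k


# Bitmap with a single hash function and m = 100 * n.
single_hash = false_positive_rate(100, 1)
# Bloom filter with m = 10 * n and k = 7 (assumed sample values).
bloom = false_positive_rate(10, 7)

print(f"single hash, 100 bits per URL: {single_hash:.4f}")
print(f"7 hashes, 10 bits per URL:     {bloom:.4f}")
```

Under this approximation, the single-hash bitmap needs about 100 bits per URL to reach roughly a 1% error rate, while the k-hash variant reaches a comparable rate with about a tenth of the space.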
Create an m-bit bitset, initialize all bits to 0, and then choose k different hash functions. The result of the i-th hash function on a string str is written hi(str), and it satisfies:
0 <= hi(str) < m (1 <= i <= k)
(1) Adding a string str to the bitset: compute h1(str), h2(str), ..., hk(str), and set the corresponding bits in the bitset to 1.
(2) Checking whether a string str has been recorded: compute h1(str), h2(str), ..., hk(str), and check whether the corresponding bits are all 1. If any of them is not 1, str has definitely never been recorded. If all of them are 1, str is considered to exist. Note that this can be a misjudgment, because all of the string's bits may have been set by other strings. Such a wrong classification is called a false positive.
(3) Deleting a string: a string cannot be deleted once it has been added, because clearing its bits would affect other strings.
To support deletion, you can use a counting Bloom filter (CBF), a variant of the basic Bloom filter in which each bit is replaced by a counter.
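Steps (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name and the way the k hash functions are derived (SHA-256 salted with the index i) are assumptions made for the example, and the article does not prescribe any particular hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                            # number of bits in the bitset
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # all bits start at 0

    def _positions(self, s):
        # Assumed construction: the i-th hash is SHA-256 of "i:s", reduced mod m,
        # so that 0 <= position < m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{s}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, s):
        # Step (1): set the k corresponding bits to 1.
        for pos in self._positions(s):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, s):
        # Step (2): s may be present only if all k bits are 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(s))


visited = BloomFilter(m=1024, k=3)
visited.add("http://example.com/a")            # hypothetical URL
print("http://example.com/a" in visited)       # an added string is always found
```

There is deliberately no remove method: clearing a bit could erase evidence of another string that maps to the same position.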
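A counting variant can be sketched by replacing each bit with an integer counter. As before, the class name and the salted SHA-256 hashing are assumptions made for illustration; a real implementation would typically use small fixed-width counters rather than a Python list of integers.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m   # one counter per position instead of one bit

    def _positions(self, s):
        # Assumed construction: the i-th hash is SHA-256 of "i:s", reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{s}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, s):
        for pos in self._positions(s):
            self.counters[pos] += 1

    def __contains__(self, s):
        return all(self.counters[pos] > 0 for pos in self._positions(s))

    def remove(self, s):
        # Only strings that appear to be present may be removed; removing a
        # string that was never added would corrupt the counters.
        if s not in self:
            raise KeyError(s)
        for pos in self._positions(s):
            self.counters[pos] -= 1


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("http://example.com/a")      # hypothetical URLs
cbf.add("http://example.com/b")
cbf.remove("http://example.com/a")
print("http://example.com/b" in cbf)  # still present after removing the other URL
```

The price of supporting deletion is memory: each position now holds a counter rather than a single bit.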
Bloom filter parameter selection
Q: How should we choose m (the number of bits in the bitmap), n (the number of strings to be processed), and k (the number of hash functions)?
When the number of hash functions is k = (ln 2) * (m/n), the error rate is minimized.
If the error rate must be no greater than E, then m >= n * log2(1/E) * log2(e).
Only the conclusions are given here. If you are interested in how these formulas are derived, see the references.
For example, if the error rate is 0.001, then m should be about 14 times n. In this case, k = (ln 2) * 14 is about 10.
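The two formulas can be turned into a small helper for sizing a filter. This is a sketch under the same assumptions as the formulas; the function name bloom_parameters and the sample value n = 1,000,000 are illustrative, and rounding m up and k to the nearest integer is a choice made here, not something the article specifies.

```python
import math


def bloom_parameters(n, error_rate):
    """Return (m, k) for n elements and a target error rate E.

    m = n * log2(1/E) * log2(e), rounded up
    k = (ln 2) * (m / n), rounded to the nearest integer (at least 1)
    """
    m = math.ceil(n * math.log2(1 / error_rate) * math.log2(math.e))
    k = max(1, round(math.log(2) * m / n))
    return m, k


n = 1_000_000
m, k = bloom_parameters(n, 0.001)
print("bits per element:", m / n)
print("hash functions:", k)
```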
Bloom filter applications
Finally, to summarize the advantages of the Bloom filter:
- Saves cache space, because null values do not need to be cached;
- Reduces the number of database or cache requests;
- Improves service processing efficiency and business isolation.
Disadvantages:
- There is a probability of misjudgment (false positives);
- The traditional Bloom filter does not support deletion (a counting Bloom filter can be used to support it).
A Bloom filter can be used to implement a data dictionary, to detect duplicate data, or to compute the intersection of data sets.
References:
http://blog.csdn.net/v_july_v/article/details/6685894/
http://blog.csdn.net/v_july_v/article/details/7382693
http://www.dbafree.net/?p=36
https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py