Concepts and Principles of the Bloom Filter

Last Update:2017-02-27 Source: Internet

Author: User


The Bloom Filter is a space-efficient randomized data structure. It uses a bit array to represent a set compactly and to test whether an element belongs to that set. This efficiency comes at a price: when testing membership, an element that does not belong to the set may be mistakenly reported as belonging to it (a false positive). The Bloom Filter is therefore unsuitable for applications that require zero errors. In applications that can tolerate a low error rate, it trades a small number of errors for a large saving in storage space.

Set Representation and Element Query

Next, let's take a look at how the Bloom Filter represents a set with a bit array. In the initial state, the Bloom Filter is an array containing m bits, and each bit is set to 0.

To represent a set S = {x1, x2, ..., xn} of n elements, the Bloom Filter uses k hash functions, each of which maps every element of the set into the range {1, ..., m}. For any element x, the position h_i(x) produced by the i-th hash function is set to 1 (1 ≤ i ≤ k). Note that if a position is set to 1 more than once, only the first time has any effect; later writes change nothing. In the original article's illustration (not reproduced here), k = 3, and two hash functions select the same position (the fifth bit from the left).

To decide whether y belongs to the set, we apply the k hash functions to y. If every position h_i(y) is 1 (1 ≤ i ≤ k), we consider y an element of the set; otherwise we conclude that y is not an element of the set. In the original illustration, y1 is not an element of the set, while y2 either belongs to the set or is a false positive.
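The insert and query procedures described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the use of salted SHA-256 digests as a stand-in for k hash functions, and the one-byte-per-bit storage are all assumptions made for readability. Positions are 0-based here, whereas the text uses the range {1, ..., m}.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, all initially 0

    def _positions(self, item: str):
        # Assumption: salting SHA-256 with the index i stands in for
        # k separate hash functions h_1 .. h_k.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # setting an already-set bit changes nothing

    def __contains__(self, item: str) -> bool:
        # A single 0 bit proves the item was never added.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, k=3)
for word in ("x1", "x2", "x3"):
    bf.add(word)

print("x1" in bf)  # True: an added element is always reported as present
print("y1" in bf)  # False, unless all 3 positions collide (a false positive)
```

Because bits are only ever set and never cleared, an element that was added can never be reported as absent; only the opposite error is possible.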

Error Rate Estimation

As mentioned above, the Bloom Filter has a certain error rate (false positive rate) when deciding whether an element belongs to the set it represents. We now estimate that rate. To simplify the model, we assume that kn < m and that each hash function is completely random. Once all elements of the set S = {x1, x2, ..., xn} have been mapped into the m-bit array by the k hash functions, the probability that a given bit of the array is still 0 is:

$$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

Here 1/m is the probability that any one hash function selects this bit (given that the hash functions are completely random), and (1 − 1/m) is the probability that a single hash operation does not select it. Mapping S into the array takes kn hash operations in total. A bit is still 0 only if none of the kn operations selected it, so the probability is (1 − 1/m) raised to the power kn. Setting p = e^(−kn/m) simplifies the calculations. It relies on the standard limit used to compute e:

$$\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} = e$$
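To get a feel for how close the approximation is, the exact probability (1 − 1/m)^(kn) can be compared numerically with e^(−kn/m). The parameter triples below are arbitrary illustrative values, and the function names are made up for this sketch.

```python
import math


def zero_prob_exact(m: int, n: int, k: int) -> float:
    """Probability that a given bit is still 0 after kn random hash operations."""
    return (1 - 1 / m) ** (k * n)


def zero_prob_approx(m: int, n: int, k: int) -> float:
    """The simplified form p = e^(-kn/m)."""
    return math.exp(-k * n / m)


# Illustrative (m, n, k) values; any setting with kn < m would do.
for m, n, k in [(100, 10, 3), (10_000, 1_000, 7), (1_000_000, 100_000, 7)]:
    exact = zero_prob_exact(m, n, k)
    approx = zero_prob_approx(m, n, k)
    print(f"m={m:>9} n={n:>7} k={k}  exact={exact:.6f}  approx={approx:.6f}")
```

The gap between the two values shrinks as m grows, which is why the simplified form is normally used in analysis.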

Let ρ be the proportion of 0 bits in the bit array; its expectation is E(ρ) = p'. Given ρ, the error rate (false positive rate) is:

$$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k$$

(1 − ρ) is the proportion of 1 bits in the array (ρ being the proportion of 0 bits), and (1 − ρ)^k is the probability that all k hash operations land on 1 bits, which is exactly a false positive. The second approximation in the formula above was explained earlier; consider now the first. p' is only the expectation of ρ, and in practice ρ may deviate from it. M. Mitzenmacher has proved [2] that the proportion of 0 bits is very tightly concentrated around its expectation, so the first approximation holds. Substituting p' and p respectively gives:

$$f' = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k = (1 - p')^k$$

$$f = \left(1 - e^{-kn/m}\right)^k = (1 - p)^k$$
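The concentration claim and the resulting false positive estimate can be checked with a small Monte Carlo experiment. The sketch below is an illustration under the same idealized assumption as the analysis: every hash operation is modeled as a uniformly random position, so no real hash function is involved. The function name `simulate`, the fixed seed, and the parameter values are arbitrary choices.

```python
import math
import random


def simulate(m: int, n: int, k: int, trials: int, seed: int = 0):
    """Fill an m-bit array with n*k random hits, then probe with random queries."""
    rng = random.Random(seed)
    bits = [0] * m
    # Inserting n elements = kn uniformly random positions set to 1.
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    zero_ratio = bits.count(0) / m
    # Each query for a non-member probes k random positions;
    # it is a false positive only if all of them are 1.
    false_positives = sum(
        all(bits[rng.randrange(m)] for _ in range(k)) for _ in range(trials)
    )
    return zero_ratio, false_positives / trials


m, n, k = 10_000, 1_000, 7
zero_ratio, observed = simulate(m, n, k, trials=50_000)
p = math.exp(-k * n / m)
f = (1 - p) ** k

print(f"proportion of 0 bits: observed {zero_ratio:.4f}, predicted {p:.4f}")
print(f"false positive rate:  observed {observed:.4f}, predicted {f:.4f}")
```

With these values the observed proportion of 0 bits should sit close to p, and the observed false positive rate close to f, in line with the concentration result cited above.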

Compared with p' and f', p and f are usually more convenient to work with in analysis.

Optimal Number of Hash Functions

Since the Bloom Filter relies on multiple hash functions to map a set into a bit array, how many hash functions should be used to minimize the error rate of element queries? Two competing factors are at play. With more hash functions, a query for an element outside the set tests more bits and so has a better chance of hitting a 0. With fewer hash functions, however, more bits in the array remain 0. To find the optimal number of hash functions, we work from the error rate formula of the previous section.

We use p and f for the calculation. Note that f = exp(k ln(1 − e^(−kn/m))). Let g = k ln(1 − e^(−kn/m)); f is minimal exactly when g is minimal. Since p = e^(−kn/m), we have k = −(m/n) ln p, and g can be written as:

$$g = k \ln(1 - p) = -\frac{m}{n} \ln(p) \ln(1 - p)$$

By symmetry, it is easy to see that g reaches its minimum when p = 1/2, that is, when k = ln 2 · (m/n). The minimum error rate f is then (1/2)^k ≈ (0.6185)^(m/n). Note also that p is the probability that a given bit of the array is still 0, so p = 1/2 corresponds to a bit array that is half 0s and half 1s. In other words, to keep the error rate low, it is best to leave about half of the bit array empty.
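As a quick numerical check of this result, one can evaluate f = (1 − e^(−kn/m))^k for every integer k in a range and see where the minimum falls. The values m = 10,000 and n = 1,000 (10 bits per element) are assumed purely for illustration.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """f = (1 - e^(-kn/m))^k, the approximate false positive rate."""
    return (1 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000            # assumed example: 10 bits per element
k_real = math.log(2) * m / n    # real-valued optimum, ln 2 * (m/n)
best_k = min(range(1, 21), key=lambda k: false_positive_rate(m, n, k))

for k in range(4, 11):
    print(f"k={k:2d}  f={false_positive_rate(m, n, k):.6f}")

print(f"real-valued optimum k = {k_real:.3f}, best integer k = {best_k}")
print(f"(1/2)^k at the optimum = {0.5 ** k_real:.6f}")
print(f"(0.6185)^(m/n)         = {0.6185 ** (m / n):.6f}")
```

In practice k must be an integer, so the best achievable rate is that of the integer nearest the real-valued optimum, which is marginally higher than (0.6185)^(m/n).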

It should be emphasized that the result that the error rate is minimal at p = 1/2 does not depend on the approximations p and f. Likewise, for f' = exp(k ln(1 − (1 − 1/m)^(kn))), g' = k ln(1 − (1 − 1/m)^(kn)) and p' = (1 − 1/m)^(kn), we can write g' as:

$$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$

By the same symmetry argument, g' reaches its minimum when p' = 1/2.

Bit Array Size

Next, consider how many bits the Bloom Filter needs in order to represent any set of n elements from the universe without exceeding a given error rate. Suppose the universe contains u elements in total and the maximum allowed error rate is ε. We now calculate the required number of bits m in the bit array.

Let X be any set of n elements from the universe, and let F(X) be the bit array that represents X. For any element x of X, querying x against s = F(X) yields a positive result; that is, s accepts x. Because the Bloom Filter introduces errors, s accepts not only the elements of X but also up to ε(u − n) false positives. A given bit array can therefore accept at most n + ε(u − n) elements in total. Of these n + ε(u − n) elements, s actually represents only n, so a given bit array can represent

$$\binom{n + \epsilon(u - n)}{n}$$

different sets. An m-bit array has 2^m distinct states, so m bits can represent at most

$$2^m \binom{n + \epsilon(u - n)}{n}$$

sets. The universe contains

$$\binom{u}{n}$$

sets of n elements in total, so for an m-bit array to be able to represent every set of n elements we must have

$$2^m \binom{n + \epsilon(u - n)}{n} \ge \binom{u}{n}$$

That is:

$$m \ge \log_2 \frac{\binom{u}{n}}{\binom{n + \epsilon(u - n)}{n}} \approx \log_2 \frac{\binom{u}{n}}{\binom{\epsilon u}{n}} \ge \log_2 \epsilon^{-n} = n \log_2(1/\epsilon)$$

In the derivation above, the approximation assumes that n is small compared with εu, which is usually the case in practice. We can conclude that, for the error rate not to exceed ε, m must be at least n log2(1/ε) in order to represent any set of n elements.
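To see how much the approximation matters, the exact counting bound log2 of C(u, n) / C(n + ε(u − n), n) can be evaluated directly and compared with the simplified bound n log2(1/ε). The sketch below uses Python's exact integer binomials; the values of u, n and ε are assumed for illustration, and ε(u − n) is rounded to an integer so that the binomial coefficient is defined.

```python
import math


def exact_lower_bound(u: int, n: int, eps: float) -> float:
    """log2( C(u, n) / C(n + eps*(u - n), n) ), from the counting argument."""
    accepted = n + round(eps * (u - n))  # elements one bit array can accept
    return math.log2(math.comb(u, n)) - math.log2(math.comb(accepted, n))


def simplified_lower_bound(n: int, eps: float) -> float:
    """The simplified bound n * log2(1/eps)."""
    return n * math.log2(1 / eps)


u, eps = 1_000_000, 0.01  # assumed universe size and error rate
for n in (100, 1_000):
    exact = exact_lower_bound(u, n, eps)
    simple = simplified_lower_bound(n, eps)
    print(f"n={n:5d}  exact bound = {exact:9.1f} bits  "
          f"n*log2(1/eps) = {simple:9.1f} bits")
```

The two bounds agree more closely for the smaller n, which matches the stated premise that n should be small compared with εu.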

In the previous section we found that the error rate f is smallest when k = ln 2 · (m/n), in which case f = (1/2)^k = (1/2)^(m ln 2 / n). Requiring f ≤ ε, we can derive:

$$m \ge n \frac{\log_2(1/\epsilon)}{\ln 2} = n \log_2 e \cdot \log_2(1/\epsilon)$$
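As a worked example of sizing a filter with the optimal number of hash functions, the sketch below computes m = n log2(1/ε) / ln 2 and k = ln 2 · (m/n) for an assumed target of one million elements at a 1% error rate. The function name `size_bloom_filter` and the input values are hypothetical.

```python
import math


def size_bloom_filter(n: int, eps: float):
    """Return (m, k): bits needed at the optimal k, and k rounded to an integer."""
    m = math.ceil(n * math.log2(1 / eps) / math.log(2))
    k = max(1, round(math.log(2) * m / n))
    return m, k


n, eps = 1_000_000, 0.01  # assumed: one million elements, 1% target error rate
m, k = size_bloom_filter(n, eps)
lower_bound = n * math.log2(1 / eps)
achieved = (1 - math.exp(-k * n / m)) ** k

print(f"bits needed      m = {m} ({m / n:.2f} bits per element)")
print(f"hash functions   k = {k}")
print(f"lower bound        = {lower_bound:.0f} bits")
print(f"ratio m / bound    = {m / lower_bound:.4f}")
# Rounding k to an integer can leave the achieved rate marginally above eps.
print(f"achieved rate      = {achieved:.5f}")
```

Since the real-valued optimum for k is rarely an integer, an implementation that must stay strictly below ε would typically allocate slightly more bits than this formula gives.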

This result is log2 e ≈ 1.44 times the lower bound n log2(1/ε) derived above. In other words, even with the optimal number of hash functions, keeping the error rate at or below ε requires m to be at least 1.44 times that lower bound.

Summary

In computer science we often trade time for space or space for time, sacrificing one to improve the other. The Bloom Filter introduces a third factor beyond these two: the error rate. When a Bloom Filter is used to test whether an element belongs to a set, there is a certain error rate: it may mistake an element that is not in the set for a member (a false positive), but it never mistakes an element that is in the set for a non-member (a false negative). By accepting this error rate, the Bloom Filter tolerates a small number of errors in exchange for a large saving in storage space.

Since Burton Bloom proposed the Bloom Filter in 1970, it has been widely used in spell checking and database systems. Over the past ten to twenty years, with the spread and growth of computer networks, the Bloom Filter has found new life in the networking field, and many variants and new applications have emerged. As network applications continue to develop, new variants and applications can be expected to keep appearing.

References

[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4): 485-509, 2005.

[2] M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking (2002), 604-612.

[3] www.cs.jhu.edu/~fabian/courses/CS600.624/slides/bloomslides.pdf

[4] http://166.111.248.20/seminar/2006_11_23/hash_2_yaxuan.ppt

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
