Bloom Filter Concepts and Principles

Source: Internet
Author: User

A Bloom filter is a space-efficient probabilistic data structure that uses a bit array to represent a set very compactly and to test whether an element belongs to that set. This efficiency comes at a cost: when testing membership, a Bloom filter may report an element that is not in the set as belonging to it (a false positive). Bloom filters are therefore unsuitable for applications that require zero errors. Where a low error rate can be tolerated, a Bloom filter trades a small number of errors for a large saving in storage space.

Set representation and element queries

Let's look at how a Bloom filter uses a bit array to represent a set. In its initial state, a Bloom filter is an array of m bits, each set to 0.

To represent a set S = {x1, x2, ..., xn} of n elements, a Bloom filter uses k mutually independent hash functions, each of which maps every element of the set into the range {1, ..., m}. For any element x, the position $h_i(x)$ given by the i-th hash function is set to 1 ($1 \le i \le k$). Note that if a position is set to 1 more than once, only the first time has any effect; later attempts change nothing. In the original article's figure, k = 3 and two of the hash functions select the same position (the fifth bit from the left).

To test whether y belongs to the set, we apply the k hash functions to y. If every position $h_i(y)$ ($1 \le i \le k$) holds a 1, we take y to be an element of the set; otherwise y is not an element of the set. In the original article's figure, y1 is not an element of the set, while y2 either belongs to the set or is a false positive.
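The insert and query steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `BloomFilter`, the use of SHA-256, and the double-hashing trick that derives k positions from one digest are choices made here, not part of the original article, and positions run from 0 to m − 1 rather than 1 to m.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k hash positions per element."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: two 64-bit halves of one SHA-256 digest are combined
        # (double hashing) to stand in for k independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            # Setting a bit that is already 1 changes nothing.
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Present only if every one of the k positions holds a 1.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=3)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print("apple" in bf)   # True: an inserted element is always found
print("durian" in bf)  # normally False, but a false positive is possible
```

A query can only err in one direction: an element that was added always finds its k bits set, whereas an element that was never added may happen to land on bits set by others.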

Error Rate Estimation

As mentioned earlier, a Bloom filter has a certain error rate (false positive rate) when testing whether an element belongs to the set it represents. We now estimate the size of that rate. To simplify the model, assume that kn < m and that the hash functions are completely random. After all elements of the set S = {x1, x2, ..., xn} have been mapped into the m-bit array by the k hash functions, the probability that a given bit of the array is still 0 is:

$$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$$

Here 1/m is the probability that any one hash evaluation selects this bit (given that the hash functions are completely random), and (1 − 1/m) is the probability that one hash evaluation does not select it. Mapping all of S into the array takes kn hash evaluations. A bit is still 0 only if none of the kn evaluations selected it, so the probability is $(1 - 1/m)^{kn}$. Writing $p = e^{-kn/m}$ simplifies later calculations; the approximation uses the limit that defines e:

$$\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} = e$$

Let ρ be the proportion of 0 bits in the bit array; its expectation is E(ρ) = p'. Given ρ, the error rate (false positive rate) is:

$$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k$$

(1 − ρ) is the proportion of 1 bits in the array, and $(1 - \rho)^k$ is the probability that all k hash evaluations land on 1 bits, which is exactly a false positive. The second approximation was explained in the previous step; the first needs justification, because p' is only the expectation of ρ and in practice ρ may deviate from it. M. Mitzenmacher has shown [2] that the proportion of 0 bits in the array is very tightly concentrated around its expectation, so the first approximation holds. Substituting p' and p in turn gives:

$$f' = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k = (1 - p')^k$$

$$f = \left(1 - e^{-kn/m}\right)^k = (1 - p)^k$$

Working with p and f is usually more convenient for analysis than working with p' and f'.
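To get a feel for how close the approximation is, the two error rate formulas, $f' = (1 - (1 - 1/m)^{kn})^k$ and $f = (1 - e^{-kn/m})^k$, can be evaluated side by side. The function names and the sample values of m, n and k below are illustrative assumptions, not figures from the article.

```python
import math


def false_positive_exact(m: int, n: int, k: int) -> float:
    """f' = (1 - (1 - 1/m)^(kn))^k, using the exact zero-bit probability p'."""
    p_prime = (1.0 - 1.0 / m) ** (k * n)
    return (1.0 - p_prime) ** k


def false_positive_approx(m: int, n: int, k: int) -> float:
    """f = (1 - e^(-kn/m))^k, using the approximation p = e^(-kn/m)."""
    p = math.exp(-k * n / m)
    return (1.0 - p) ** k


m, n, k = 10_000, 1_000, 7  # illustrative values only
print(false_positive_exact(m, n, k))
print(false_positive_approx(m, n, k))
```

For arrays of realistic size the two values agree to several decimal places, which is why the simpler f is normally used.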

Optimal number of hash functions

Since a Bloom filter relies on multiple hash functions to map a set into the bit array, how many hash functions should be used to minimize the error rate of a query? Two effects pull in opposite directions: with more hash functions, a query for an element outside the set has more chances to hit a 0 bit; with fewer hash functions, more of the bit array stays 0. To find the optimal number of hash functions, we work from the error rate formulas of the previous section.

Start with p and f. Note that $f = \exp(k \ln(1 - e^{-kn/m}))$. Let $g = k \ln(1 - e^{-kn/m})$; minimizing g also minimizes f. Since $p = e^{-kn/m}$, we have $k = -(m/n) \ln p$, so g can be written as

$$g = -\frac{m}{n} \ln(p) \ln(1 - p)$$

By symmetry, g reaches its minimum when p = 1/2, that is, when k = (ln 2)(m/n). The minimum error rate is then $f = (1/2)^k \approx (0.6185)^{m/n}$. Note also that p is the probability that a given bit of the array is still 0, so p = 1/2 means that half of the bits are 0 and half are 1. In other words, to keep the error rate low, it is best to keep the bit array half empty.

It is worth emphasizing that the result "the error rate is minimized at p = 1/2" does not depend on the approximations p and f. Likewise, for $f' = \exp(k \ln(1 - (1 - 1/m)^{kn}))$, $g' = k \ln(1 - (1 - 1/m)^{kn})$ and $p' = (1 - 1/m)^{kn}$, we can write g' as

$$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$

Again by symmetry, g' reaches its minimum when p' = 1/2.
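The optimum k = (ln 2)(m/n) can be checked numerically by scanning integer values of k and comparing the best error rate with $(0.6185)^{m/n}$. The helper names and the sample m and n are assumptions chosen for illustration.

```python
import math


def false_positive(m: int, n: int, k: int) -> float:
    """Approximate error rate f = (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued optimum k = ln(2) * m / n."""
    return math.log(2) * m / n


m, n = 10_000, 1_000  # illustrative values: 10 bits per element
k_star = optimal_k(m, n)

# Brute-force search over integer k confirms the analytic optimum.
best_k = min(range(1, 21), key=lambda k: false_positive(m, n, k))

print(k_star, best_k)
print(false_positive(m, n, best_k), 0.6185 ** (m / n))
```

Because k must be an integer in practice, the real-valued optimum is rounded; the error rate changes only slightly between neighbouring values of k near the minimum.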

Size of the bit array

Now let's look at the minimum number of bits a Bloom filter needs to represent any set of n elements drawn from a universe without exceeding a given error rate. Suppose the universe has u elements and the maximum allowed error rate is ε; we want to find the number of bits m in the bit array.

Let X be any set of n elements taken from the universe, and let F(X) be the bit array that represents X. For any element x of X, querying x in s = F(X) must give a positive result; that is, s accepts x. Because a Bloom filter introduces errors, s accepts not only the elements of X but also up to ε(u − n) false positives. A given bit array therefore accepts at most n + ε(u − n) elements in total. Of these n + ε(u − n) elements, only n are really represented by s, so a given bit array can represent

$$\binom{n + \epsilon(u - n)}{n}$$

sets. An m-bit array has $2^m$ distinct states, so m-bit arrays can represent

$$2^m \binom{n + \epsilon(u - n)}{n}$$

sets. The universe contains

$$\binom{u}{n}$$

sets of n elements. For m-bit arrays to be able to represent every set of n elements, we must have

$$2^m \binom{n + \epsilon(u - n)}{n} \ge \binom{u}{n}$$

That is,

$$m \ge \log_2 \frac{\binom{u}{n}}{\binom{n + \epsilon(u - n)}{n}} \approx \log_2 \frac{\binom{u}{n}}{\binom{\epsilon u}{n}} \ge \log_2 \epsilon^{-n} = n \log_2 (1/\epsilon)$$

The approximation above assumes that n is very small compared with εu, which is usually the case in practice. From this we conclude that, for the error rate to be no greater than ε, m must be at least $n \log_2 (1/\epsilon)$ to represent any set of n elements.

In the previous section we found that the error rate f is smallest when k = (ln 2)(m/n), in which case $f = (1/2)^k = (1/2)^{(m \ln 2)/n}$. Requiring f ≤ ε then gives

$$m \ge n \frac{\log_2 (1/\epsilon)}{\ln 2} = n \log_2 e \cdot \log_2 (1/\epsilon)$$

This is $\log_2 e \approx 1.44$ times the lower bound $n \log_2 (1/\epsilon)$ computed above. So even with the optimal number of hash functions, keeping the error rate at or below ε requires m to be at least 1.44 times the theoretical minimum.
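The two bounds are easy to turn into a sizing helper. The sketch below computes the lower bound $n \log_2 (1/\epsilon)$, the size $n \log_2 e \cdot \log_2 (1/\epsilon)$ needed with the optimal number of hash functions, and the matching k. The function names and the sample n and ε are assumptions made for illustration.

```python
import math


def min_bits_lower_bound(n: int, eps: float) -> float:
    """Information-theoretic lower bound: m >= n * log2(1/eps)."""
    return n * math.log2(1.0 / eps)


def bits_with_optimal_k(n: int, eps: float) -> float:
    """Bits needed by a Bloom filter with optimal k: n * log2(e) * log2(1/eps)."""
    return n * math.log2(math.e) * math.log2(1.0 / eps)


n, eps = 1_000_000, 0.01  # illustrative values only
lower = min_bits_lower_bound(n, eps)
needed = bits_with_optimal_k(n, eps)
k = round(math.log(2) * needed / n)  # optimal k = ln(2) * m / n, rounded

print(lower / n)       # bits per element at the lower bound
print(needed / n)      # bits per element for a Bloom filter
print(needed / lower)  # ratio between the two, log2(e)
print(k)
```

The ratio printed on the third line is independent of n and ε, which is the 1.44 factor derived above.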

Summary

In computer science we often trade time for space or space for time, sacrificing one aspect to optimize another. A Bloom filter introduces a third factor besides time and space: the error rate. When a Bloom filter is used to test whether an element belongs to a set, there is a certain error rate: it may report an element that is not in the set as belonging to it (a false positive), but it never reports an element that is in the set as absent (a false negative). By accepting this error rate, a Bloom filter saves a great deal of storage space at the cost of a small number of errors.

Since Burton Bloom introduced the Bloom filter in the 1970s, it has been widely used in spell checking and database systems. Over the past ten to twenty years, with the spread and development of networks, the Bloom filter has found new life in networking, and new variants and applications keep appearing. It is foreseeable that as network applications deepen, new variants and applications will continue to emerge and the Bloom filter will develop further.

II. Scope of application

A Bloom filter can be used to implement a data dictionary, to deduplicate data, or to compute the intersection of sets.

Basic principle and key points

The principle is very simple: a bit array plus k independent hash functions. On insertion, the bits at the positions given by the hash functions are set to 1. On lookup, if all of those bits are 1, the element is reported as present; clearly this process does not guarantee a 100% correct result. Deleting an already inserted keyword is also not supported, because the bits of that keyword may be shared with other keywords. A simple improvement is the counting Bloom filter, which supports deletion by replacing the bit array with an array of counters.

A more important question is how to choose the bit array size m and the number of hash functions k from the number of input elements n. The error rate is minimized when k = (ln 2) * (m/n). For the error rate to be no greater than E, m must be at least n * lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array should remain 0: m should be >= n * lg(1/E) * lg(e), which is about 1.44 times n * lg(1/E) (lg denotes the base-2 logarithm).

For example, with an error rate of 0.01, lg(1/E) is about 6.64, so m should be about 6.64 * 1.44, or roughly 9.6 times n, and k is then about 7.

Note that m and n have different units: m is a number of bits, while n is a number of elements (more precisely, the number of distinct elements). A single element usually takes many bits, so a Bloom filter usually saves memory.

Extension

A Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits (k being the number of hash functions) are 1 to indicate whether an element is in the set. A counting Bloom filter (CBF) expands each bit of the bit array into a counter, which makes deleting elements possible. A spectral Bloom filter (SBF) associates the counters with the number of occurrences of each set element.
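The counter-array idea behind the counting Bloom filter can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name, the SHA-256 double hashing, and the unbounded Python integers used as counters are all assumptions (real implementations typically use small fixed-width counters).

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter sketch: one counter per position instead of one bit."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: double hashing over one SHA-256 digest stands in for
        # k independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> None:
        # Callers should only remove items that were really added; removing a
        # false positive would decrement counters that belong to other items.
        if item not in self:
            raise KeyError(item)
        for pos in self._positions(item):
            self.counters[pos] -= 1


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("apple")
cbf.add("banana")
cbf.remove("apple")

print("banana" in cbf)  # True: banana's counters are still positive
print("apple" in cbf)   # False unless all of apple's positions collide with banana's
```

Deletion works because decrementing only undoes the increments made by the removed item, so positions shared with other items stay above zero.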
An SBF uses the minimum value among an element's counters to approximate how often the element appears.

Problems

You are given two files, A and B, each storing 5 billion URLs. Each URL occupies 64 bytes and the memory limit is 4 GB. Find the URLs common to files A and B. What if there are three files, or even n files?

Let's work out the memory footprint for this problem. 4 GB = 2^32 bytes, about 4 billion bytes, which times 8 is about 34 billion bits. With n = 5 billion and a required error rate of 0.01, the formula n * lg(1/E) * lg(e) gives about 9.6 bits per element, or roughly 48 billion bits. Only 34 billion bits are available, which is not far off, so the error rate would rise somewhat. In addition, if these URLs correspond one-to-one with IP addresses, they can be converted to IPs, which makes the problem much simpler.

References

[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4): 485–509, 2005.

[2] M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking, 10(5): 604–612, 2002.

[3] www.cs.jhu.edu/~fabian/courses/CS600.624/slides/bloomslides.pdf

[4] http://166.111.248.20/seminar/2006_11_23/hash_2_yaxuan.ppt

http://blog.csdn.net/jiaomeng/article/details/1495500
