An Introduction to the Bloom Filter Algorithm (with Counting Bloom Filters)

Source: Internet
Author: User

The Bloom filter was proposed by Burton Bloom in 1970. It consists of a very long binary vector and a series of random mapping (hash) functions. A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements. As the title says, this article is only a brief introduction, in the spirit of a popular science piece.

Application Scenarios
Before formally introducing the Bloom filter algorithm, let's take a look at when to use the Bloom filter algorithm.
1. HTTP cache servers, web crawlers, etc.
The main task is to determine whether a URL is in an existing set of URLs (think of data on the order of billions of entries).
For an HTTP cache server, when a PC on the local area network issues an HTTP request, the cache server first checks whether the URL is already in the cache. If it is, there is no need to fetch the data from the origin server (for simplicity, assume the data has not changed). This saves traffic and speeds up access, which improves the user experience.
For a web crawler, deciding whether the current page has already been processed likewise means checking whether the current URL is in the list of URLs already processed.

2. Spam filtering
Suppose the mail server filters spam by the sender's mail domain or IP address. It then has to determine whether the current mail domain or IP address is on the blacklist. If the mail server handles a very large volume of mail (again, think of data on the order of billions), the Bloom filter algorithm can be used here as well.

Some terminology
It is necessary to introduce the concepts of false positive and false negative (the 4th reference gives a fuller description).
A false positive is, put more vividly, a "false alarm". As discussed below, a Bloom filter can raise false alarms. They happen in real life too: you go for a physical examination and the doctor tells you that some test is positive when you are actually negative. A false alarm from antivirus software is the same concept.
A false negative is, put more vividly, a "miss". The doctor tells you that some test is negative when in fact you are positive and you are ill (sorry, just a joke). Antivirus software can likewise miss a real virus.

Bloom Filter algorithm
Now, finally, the formal introduction of the Bloom filter algorithm.
In its initial state, a Bloom filter is a bit array of m bits, all set to 0. We also need to define k different hash functions, each of which maps an input element uniformly at random to one of the m positions in the array. So each input yields k indexes.

Insert element: hashing the element with the k hash functions gives k indexes. We set the bits at all k positions to 1, regardless of whether each bit was 0 or 1 before.

Query element: hashing the input element with the k hash functions gives k indexes. If the bit at any of these k indexes is 0, the element is definitely not in the set, because if it were, all k bits would have been set to 1 when it was inserted. But if the bits at all k indexes are 1, must the queried element be in the set? Not necessarily: this is the false positive case (a Bloom filter never produces false negatives, however).

For example, suppose the three elements x, y, and z have been inserted and we then query w, using three hash functions. If at least one of the bits at w's indexes is 0, the query correctly finds that w is not in the set. But if the bits at all three indexes computed for w happen to be 1, the Bloom filter will report that w is in the set. That is a false alarm: w is not in the set.
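The insert and query steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the class name `BloomFilter` and its method names are made up for this example, and the k indexes are derived from a single SHA-256 digest by double hashing, which is an assumption made here for brevity rather than something the description above prescribes.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k hash-derived indexes."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _indexes(self, item):
        # Assumption: simulate k hash functions by double hashing
        # (index_i = h1 + i * h2 mod m) over one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set all k bits to 1, whatever their previous value.
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # Any 0 bit means "definitely absent"; all 1 bits means "probably present".
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(item)
        )


bf = BloomFilter(m=1024, k=3)
for element in ("x", "y", "z"):
    bf.add(element)
print("x" in bf)  # True: inserted elements are always found
print("w" in bf)  # almost certainly False; True here would be a false positive
```

Note that there is no `remove` method: clearing a bit could also erase evidence of other elements that share it, which is why deletion is hard in a standard Bloom filter.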

False Positive Rate
How large is the false positive rate of a Bloom filter? Here is the mathematical derivation. Assume that the index produced by each hash function is equally likely to be any of the m positions in the array. Then, for a given hash function, the probability that a particular bit is not set to 1 by one hashing operation is

$$1 - \frac{1}{m}$$

So, over all k hash functions, the probability that this bit is not set to 1 when one element is inserted is

$$\left(1 - \frac{1}{m}\right)^{k}$$

If n elements have been inserted, the probability that a given bit is still 0 is

$$\left(1 - \frac{1}{m}\right)^{kn}$$

So, after inserting n elements, the probability that this bit is 1 is

$$1 - \left(1 - \frac{1}{m}\right)^{kn}$$

A false positive occurs for an element that is not in the set when the bits at all k of its indexes are 1. Treating the bits as independent, the probability of this is

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$$

By the definition of the constant e, $\left(1 - \frac{1}{m}\right)^{m} \approx e^{-1}$ for large m, so this can be approximated as

$$\left(1 - e^{-kn/m}\right)^{k}$$
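To get a feel for these numbers, the following sketch evaluates the false positive probability both in its form with $(1 - 1/m)^{kn}$ and in its approximation with $e^{-kn/m}$. The function names and the sample parameters (m = 10000 bits, n = 1000 elements, k = 7 hash functions) are chosen here purely for illustration.

```python
import math


def false_positive_exact(m, n, k):
    """(1 - (1 - 1/m)^(kn))^k, assuming independent, uniform hash functions."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def false_positive_approx(m, n, k):
    """(1 - e^(-kn/m))^k, the approximation valid for large m."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n, k = 10_000, 1_000, 7  # illustrative values: 10 bits per element
print(false_positive_exact(m, n, k))
print(false_positive_approx(m, n, k))
```

With 10 bits per element and 7 hash functions, both expressions come out at roughly 0.8%, and they differ only far beyond the third significant digit, which is why the approximation is normally used.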

About false positives
Sometimes false positives have little impact in practice. Take the HTTP cache server: if a URL is mistakenly reported as being in the cache, the lookup for the data simply fails, the data is fetched from the origin server after all, and the record is then added to the cache. That cost is perfectly acceptable.
For security software there is a saying: "better a false alarm than a miss". If normal software is mistaken for a virus, the impact on the user is limited (if the user believes it is a virus, the file is simply deleted; if the user insists on running it, the consequences are the user's own responsibility). If a real virus is missed, the consequences for the user can be unthinkable. What's more, false alarms can even make some users feel that the product is professional...

Optimal number of hash functions

Since a Bloom filter relies on multiple hash functions to map a set into a bit array, how many hash functions should be chosen to minimize the error rate of a query? Two effects pull in opposite directions. With more hash functions, a query for an element that is not in the set checks more bits and so has more chances of hitting a 0. With fewer hash functions, more of the bits in the array remain 0. To find the optimal number of hash functions, we work with the error rate formula derived above.

Work first with the approximations $p = e^{-kn/m}$, the probability that a given bit is still 0, and $f = (1 - e^{-kn/m})^{k} = (1 - p)^{k}$, the false positive rate. Note that $f = \exp(k \ln(1 - e^{-kn/m}))$. Let $g = k \ln(1 - e^{-kn/m})$; minimizing g also minimizes f. Since $p = e^{-kn/m}$, we have $k = -\frac{m}{n} \ln p$, so g can be written as

$$g = -\frac{m}{n} \ln(p) \ln(1 - p)$$

By symmetry, it is easy to see that g attains its minimum at p = 1/2, that is, at $k = \ln 2 \cdot (m/n)$. In this case the minimum error rate is $f = (1/2)^{k} \approx (0.6185)^{m/n}$. Note also that p is the probability that a bit in the array is still 0, so p = 1/2 corresponds to a bit array that is half 0s and half 1s. In other words, to keep the error rate low, it is best to leave half of the bit array empty.

One point worth emphasizing is that the minimum at p = 1/2 does not depend on the approximations used for p and f. Likewise, for $f' = \exp(k \ln(1 - (1 - 1/m)^{kn}))$, $g' = k \ln(1 - (1 - 1/m)^{kn})$, and $p' = (1 - 1/m)^{kn}$, we can write g' as

$$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$

Again by symmetry, g' attains its minimum at p' = 1/2.
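The result k = ln 2 · (m/n) can be checked numerically. The sketch below, with function names and sample sizes chosen here for illustration, computes the real-valued optimum and then searches the integers for the k that minimizes the approximate error rate f = (1 − e^(−kn/m))^k.

```python
import math


def optimal_k(m, n):
    """Real-valued optimum k = ln 2 * (m / n)."""
    return math.log(2) * m / n


def false_positive_approx(m, n, k):
    """Approximate error rate f = (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def best_integer_k(m, n, k_max=64):
    """Brute-force search over integer k (k_max is an arbitrary cap)."""
    return min(range(1, k_max + 1), key=lambda k: false_positive_approx(m, n, k))


m, n = 10_000, 1_000
k_star = optimal_k(m, n)
print(k_star)                               # about 6.93
print(best_integer_k(m, n))                 # 7, the nearest integer here
print(false_positive_approx(m, n, k_star))  # minimum error rate
print(0.6185 ** (m / n))                    # (0.6185)^(m/n), for comparison
```

In practice k has to be an integer, so the real-valued optimum is rounded; for m/n = 10 the neighbours k = 6 and k = 8 both give a slightly higher error rate than k = 7.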

The size of the bit array

Now let us look at the minimum number of bits a Bloom filter needs in order to represent any set of n elements from the universe without exceeding a given error rate. Suppose the universe has u elements and the maximum allowable error rate is $\epsilon$. We want to find the number of bits m in the bit array.

Let X be any set of n elements from the universe, and let F(X) be the bit array that represents X. Then for any element x in X, querying x in s = F(X) gives a positive result; that is, s accepts x. Because a Bloom filter introduces errors, s accepts more than just the elements of X: it can also accept up to $\epsilon(u - n)$ false positives. So a given bit array accepts $n + \epsilon(u - n)$ elements in total. Of these $n + \epsilon(u - n)$ elements, only n are truly represented by s, so a given bit array can represent

$$\binom{n + \epsilon(u - n)}{n}$$

sets. An m-bit array has $2^{m}$ distinct states, so m-bit arrays can represent

$$2^{m} \binom{n + \epsilon(u - n)}{n}$$

sets. The universe contains a total of

$$\binom{u}{n}$$

sets of n elements. So for the m-bit arrays to be able to represent every set of n elements, we must have

$$2^{m} \binom{n + \epsilon(u - n)}{n} \ge \binom{u}{n}$$

That is,

$$m \ge \log_2 \frac{\binom{u}{n}}{\binom{n + \epsilon(u - n)}{n}} \approx \log_2 \frac{\binom{u}{n}}{\binom{\epsilon u}{n}} \ge \log_2 \epsilon^{-n} = n \log_2 \frac{1}{\epsilon}$$

The approximation above assumes that n is very small compared with $\epsilon u$, which is usually the case in practice. From this formula we conclude that, for the error rate to be no greater than $\epsilon$, m must be at least $n \log_2(1/\epsilon)$ in order to represent any set of n elements.

In the previous section we worked out that the error rate f is smallest when $k = \ln 2 \cdot (m/n)$, in which case $f = (1/2)^{k} = (1/2)^{(m \ln 2)/n}$. Requiring $f \le \epsilon$ gives

$$m \ge n \frac{\log_2(1/\epsilon)}{\ln 2} = n \log_2 e \cdot \log_2 \frac{1}{\epsilon}$$

This result is $\log_2 e \approx 1.44$ times the lower bound $n \log_2(1/\epsilon)$ computed above. It shows that, when the number of hash functions is optimal, keeping the error rate at or below $\epsilon$ requires m to be at least 1.44 times the minimum value.
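As a worked example of these sizing formulas, the sketch below computes the information-theoretic lower bound, the number of bits a Bloom filter with the optimal number of hash functions needs, and the corresponding rounded k. The function names and the sample values (one million elements, 1% error rate) are assumptions made for illustration.

```python
import math


def lower_bound_bits(n, eps):
    """Lower bound n * log2(1/eps) on the number of bits."""
    return n * math.log2(1.0 / eps)


def bloom_bits(n, eps):
    """Bits needed with optimal k: n * log2(e) * log2(1/eps), rounded up."""
    return math.ceil(n * math.log2(math.e) * math.log2(1.0 / eps))


def bloom_hashes(m, n):
    """Optimal k = ln 2 * (m / n), rounded to the nearest integer (at least 1)."""
    return max(1, round(math.log(2) * m / n))


n, eps = 1_000_000, 0.01
m = bloom_bits(n, eps)
print(lower_bound_bits(n, eps))      # about 6.64 million bits
print(m)                             # about 9.59 million bits, roughly 1.2 MB
print(m / lower_bound_bits(n, eps))  # about 1.44
print(bloom_hashes(m, n))            # 7
```

So a 1% error rate costs a little under 10 bits per element, regardless of how large the elements themselves (for example, URLs) are.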

Summary

In computer science we often trade time for space or space for time, sacrificing one aspect to optimize another. The Bloom filter introduces a further factor besides time and space: the error rate. When a Bloom filter is used to decide whether an element belongs to a set, there is a certain error rate. It may mistakenly report that an element outside the set belongs to it (a false positive), but it never reports that an element in the set does not belong to it (a false negative). By accepting this small error rate, the Bloom filter saves a great deal of storage space.

Since Burton Bloom introduced the Bloom filter in the 1970s, it has been widely used in spell checking and database systems. Over the past ten or twenty years, with the spread and growth of networking, the Bloom filter has found new life in the network field, and Bloom filter variants and new applications keep appearing. It is foreseeable that, as network applications deepen, new variants and applications will continue to emerge and the Bloom filter will develop further.

Counting Bloom Filter

As the introduction above shows, the standard Bloom filter is a very simple data structure that supports only two operations: insert and lookup. It works well when the set to be represented is static, but if the set changes frequently, its weakness becomes apparent: it does not support deletion.

The counting Bloom filter solves this problem. It extends each bit of the standard Bloom filter's bit array into a small counter. When an element is inserted, each of the corresponding k counters (k being the number of hash functions) is incremented by 1; when an element is deleted, each of the corresponding k counters is decremented by 1. The counting Bloom filter thus adds a delete operation to the Bloom filter at the cost of several times more storage space. The next question is: how many times more?
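A minimal sketch of this idea in Python follows. The class and method names are invented for this example, the k indexes are again derived by double hashing from one SHA-256 digest (an assumption made for brevity), and each counter is stored as a plain Python integer capped at the largest value that fits in `counter_bits` bits, where a real implementation would pack the counters into a compact array.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose bits are replaced by small counters, so it supports delete."""

    def __init__(self, m, k, counter_bits=4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _indexes(self, item):
        # Assumption: double hashing over one SHA-256 digest stands in for k hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            # Saturate at max_count instead of wrapping around on overflow.
            if self.counters[idx] < self.max_count:
                self.counters[idx] += 1

    def remove(self, item):
        # Only remove items that were actually added; removing a false
        # positive would corrupt the counters of other elements.
        if item not in self:
            raise ValueError("item is not in the filter")
        for idx in self._indexes(item):
            # A saturated counter has lost its true value, so leave it alone.
            if 0 < self.counters[idx] < self.max_count:
                self.counters[idx] -= 1

    def __contains__(self, item):
        return all(self.counters[idx] > 0 for idx in self._indexes(item))


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("x")
print("x" in cbf)  # True
cbf.remove("x")
print("x" in cbf)  # False: all of x's counters are back to 0
```

Leaving saturated counters untouched is one possible design choice: it can leave stale entries behind, but it preserves the guarantee that there are no false negatives.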

We first compute the probability that the i-th counter has been incremented exactly j times, where n is the number of elements in the set, k is the number of hash functions, and m is the number of counters (corresponding to the size of the original bit array):

$$P(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk - j}$$

On the right-hand side, the first factor is the number of ways to choose j of the nk hash operations, the middle factor is the probability that those j hashes select the i-th counter, and the last factor is the probability that the other nk − j hashes do not select the i-th counter. Therefore, the probability that the value of the i-th counter is at least j can be bounded by:

$$P(c(i) \ge j) \le \binom{nk}{j} \frac{1}{m^{j}} \le \left(\frac{e\,nk}{jm}\right)^{j}$$

The second step of this bound uses Stirling's formula for estimating factorials:

$$j! \approx \sqrt{2\pi j} \left(\frac{j}{e}\right)^{j}$$

which gives $j! \ge (j/e)^{j}$ and hence $\binom{nk}{j} \le \frac{(nk)^{j}}{j!} \le \left(\frac{e\,nk}{j}\right)^{j}$.

Earlier we noted that the optimal value of k is $(\ln 2)\, m/n$. If we now restrict $k \le (\ln 2)\, m/n$, we obtain the following conclusion:

$$P(c(i) \ge j) \le \left(\frac{e \ln 2}{j}\right)^{j}$$

If each counter is assigned 4 bits, it overflows when its value reaches 16. The probability of this is:

$$P(c(i) \ge 16) \le \left(\frac{e \ln 2}{16}\right)^{16} \approx 1.37 \times 10^{-15}$$

This value is small enough, so for most applications 4 bits per counter are sufficient.
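The overflow bound is easy to evaluate numerically. The sketch below computes the per-counter bound (e ln 2 / j)^j for a few counter widths. The second function multiplies that bound by m, which is a union bound over all m counters; that step goes beyond the per-counter derivation above and is included only to show how the estimate scales with filter size. Both function names are invented for this example.

```python
import math


def counter_overflow_bound(j):
    """Bound (e * ln 2 / j)^j on P(c(i) >= j), valid when k <= (ln 2) * m / n."""
    return (math.e * math.log(2) / j) ** j


def any_counter_overflow_bound(m, counter_bits=4):
    """Union bound over m counters on the chance that any counter overflows."""
    j = 1 << counter_bits  # a counter of b bits overflows on reaching 2^b
    return min(1.0, m * counter_overflow_bound(j))


for bits in (2, 3, 4):
    print(bits, counter_overflow_bound(1 << bits))

print(any_counter_overflow_bound(m=10**9))  # a billion 4-bit counters
```

For 4-bit counters the per-counter bound is about 1.37e-15, so even with a billion counters the union bound stays on the order of one in a million, whereas 2-bit or 3-bit counters would overflow far too often.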

Reference documents


http://zh.wikipedia.org/wiki/Bloom_filter
http://en.wikipedia.org/wiki/Bloom_filter
http://www.cnblogs.com/yuyijq/archive/2012/02/08/2343374.html

http://simon.blog.51cto.com/80/73395/

A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4): 485–509, 2005.

M. Mitzenmacher. Compressed Bloom Filters. IEEE/ACM Transactions on Networking, 10(5): 604–612, 2002.

www.cs.jhu.edu/~fabian/courses/CS600.624/slides/bloomslides.pdf

http://my.oschina.net/kiwivip/blog/133498
