Bloom Filter

Last Update: 2015-08-03 Source: Internet

Author: User

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a very long bit vector and a series of random mapping (hash) functions, and it is used to test whether an element is a member of a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose structures; its disadvantages are a nonzero false positive rate and the difficulty of deleting elements.
To determine whether an element is in a set, the obvious approach is to store every element and then compare against them. Linked lists, trees, and hash tables all work this way. But as the set grows, they need more and more storage space, and lookups also get slower: the lookup time complexity of these three structures is O(n), O(log n), and O(n/k) respectively (here k presumably denotes the number of hash-table buckets, not the number of hash functions used below).

The principle of the Bloom filter is this: when an element is added to the set, k hash functions map it to k positions in a bit array, and those positions are set to 1. To query, we check the same k positions: if any of them is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set. This is the basic idea of the Bloom filter.

If the bits corresponding to a string are not all 1, the string is definitely not in the set.
If the bits corresponding to a string are all 1, the string is not necessarily in the set, because every one of its bits may have been set by other strings. This case is known as a false positive.
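The add and query steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `BloomFilter`, the use of one byte per bit, and the way k positions are derived from a single SHA-256 digest (double hashing) are all assumptions made for the example. Any k reasonably independent hash functions would serve.

```python
import hashlib


class BloomFilter:
    """Bit array of m positions probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str) -> list:
        # Assumption: derive k indexes from two slices of a SHA-256 digest
        # (double hashing) instead of k separate hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # Any 0 proves absence; all 1s only means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=6)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print("banana" in bf)  # True: an inserted item is always reported present
```

A query for a string that was never added usually returns False, but it may return True if other strings happen to have set all of its positions.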

Note: once a string has been added it cannot be deleted, because clearing its bits would affect other strings that share them.
If you need deletion, you can use a counting Bloom filter (CBF), a variant of the basic Bloom filter that replaces each bit with a counter, which makes deleting strings possible.
　　
From the above we can see that a Bloom filter has three main parameters:
1) the number of stored strings, n
2) the number of hash functions, k
3) the size of the bit array, m

Let's analyze the false positive rate:
After one string is inserted, the probability that bit j of the bit array is still 0 is $(1 - 1/m)^{k}$.
After n strings are inserted, the probability that bit j is still 0 is
$p = (1 - 1/m)^{kn} \approx e^{-kn/m}$
So the probability that a given bit is 1 is $1 - p = 1 - e^{-kn/m}$, and the probability f that all k bits selected by the k hash functions are 1 (a false positive) is $f = (1 - e^{-kn/m})^{k}$.
Setting the derivative with respect to k to zero gives the minimum at $k = (\ln 2)\, m/n$, where $p = 1/2$, that is, about m/2 bits are 1 and m/2 bits are 0. At that point:
$f = (1 - p)^{k} = (1/2)^{k} = (1/2)^{(\ln 2)\, m/n} \approx (0.6185)^{m/n}$
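The differentiation step is only stated above. One way to fill it in, using the same symbols and writing p = e^(-kn/m) for the probability that a bit is still 0, is the following sketch. It minimises ln f rather than f, which has the same minimiser:

```latex
\begin{align*}
p &= e^{-kn/m} \;\Longrightarrow\; k = -\frac{m}{n}\,\ln p \\
\ln f &= k\,\ln(1-p) = -\frac{m}{n}\,\ln p\,\ln(1-p) \\
\frac{d}{dp}\bigl[\ln p\,\ln(1-p)\bigr]
  &= \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
  \;\Longrightarrow\; p = \tfrac{1}{2} \quad \text{on } (0,1) \\
k &= \frac{m}{n}\,\ln 2, \qquad
f = \left(\tfrac{1}{2}\right)^{k}
  = \left(2^{-\ln 2}\right)^{m/n} \approx 0.6185^{\,m/n}
\end{align*}
```

Because ln p ln(1-p) is symmetric in p and 1-p and vanishes at both ends of the interval, p = 1/2 is where it is largest, and hence where ln f is smallest.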

If m = 8n, then
$k = 8 \ln 2 \approx 5.545$ (use 6 hash functions)
$f \approx (0.6185)^{m/n} = (0.6185)^{8} \approx 0.02$ (2% false positives)
Compare to a hash table (one hash function, k = 1): $f \approx 1 - e^{-n/m} = 1 - e^{-1/8} \approx 0.12$
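The arithmetic in this example can be checked directly. The short sketch below evaluates the formulas from the analysis for m = 8n; the function names and the concrete value of n are hypothetical and chosen only for illustration.

```python
import math


def optimal_k(m: int, n: int) -> float:
    """Real-valued k that minimises f: k = (ln 2) * m / n."""
    return math.log(2) * m / n


def false_positive_rate(m: int, n: int, k: int) -> float:
    """f = (1 - e^(-kn/m))^k for an m-bit filter holding n strings."""
    return (1.0 - math.exp(-k * n / m)) ** k


n = 1_000_000  # hypothetical number of stored strings
m = 8 * n      # 8 bits per string, as in the example

print(round(optimal_k(m, n), 3))               # 5.545
print(round(false_positive_rate(m, n, 6), 4))  # rate with k rounded up to 6
print(round(false_positive_rate(m, n, 1), 4))  # single hash function, k = 1
```

Only the ratio m/n matters here, so any other n with m = 8n gives the same rates.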

// array: bit array of size m; h(i, item): the i-th of the k hash functions
void add(T item) {
    for (int i = 0; i < k; i++)
        array[h(i, item)] = 1;
}

boolean contains(T item) {
    for (int i = 0; i < k; i++)
        if (array[h(i, item)] == 0)
            return false;
    return true;
}

By tolerating a small number of errors, a Bloom filter saves a great deal of storage space, but its drawbacks are as obvious as its advantages. The false positive rate is one of them: as more elements are stored, the false positive rate rises. And if the number of elements is small, an ordinary hash table is sufficient.
In addition, elements generally cannot be removed from a Bloom filter. An obvious idea is to turn the bit array into an array of integers: inserting an element increments each corresponding counter by 1, and deleting an element decrements it. However, removing elements safely is not that simple. First, we must be sure that the element being deleted really is in the Bloom filter, which the filter alone cannot guarantee. In addition, counter wraparound can also cause problems.

Counting Bloom Filter

Insertion: increment counter
Deletion: decrement counter
Overflow: keep the counter at its maximum, so the position stays set forever
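The three rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `CountingBloomFilter`, the 4-bit counter limit `MAX_COUNT = 15`, and the SHA-256 based position function are choices made for this example, not something the original specifies.

```python
import hashlib


class CountingBloomFilter:
    """m small counters instead of m bits, so items can be removed."""

    MAX_COUNT = 15  # assumption: 4-bit counters

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str) -> list:
        # Assumption: k indexes derived from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.MAX_COUNT:
                self.counters[pos] += 1  # otherwise stay saturated

    def remove(self, item: str) -> None:
        # Only safe for items known to have been added. Saturated
        # counters are left alone because their true count is unknown.
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.MAX_COUNT:
                self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=1024, k=6)
cbf.add("apple")
cbf.add("banana")
cbf.remove("apple")
print("banana" in cbf)  # True: removing "apple" only decrements, never clears
```

Saturating instead of wrapping trades a few permanently set positions for the guarantee that an overflowed counter never drops to zero while items still depend on it.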

To avoid counter overflow, each counter must have enough bits.
We first calculate the probability that the i-th counter is incremented j times, where n is the number of elements in the set, k is the number of hash functions, and m is the number of counters (corresponding to the size of the original bit array):

$P(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk - j}$

On the right-hand side, the first factor is the number of ways to choose j of the nk hash evaluations, the middle factor is the probability that those j evaluations all select counter i, and the last factor is the probability that the other nk - j evaluations do not select counter i. Therefore, the probability that the value of counter i is at least j can be bounded by:

$P(c(i) \ge j) \le \binom{nk}{j} \frac{1}{m^{j}} \le \left(\frac{e\,nk}{j\,m}\right)^{j}$

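// array: array of m counters; h(i, item): the i-th of the k hash functions
void add(T item) {
    for (int i = 0; i < k; i++)
        array[h(i, item)]++;
}

boolean contains(T item) {
    for (int i = 0; i < k; i++)
        if (array[h(i, item)] == 0)
            return false;
    return true;
}

// Only call this for items known to have been added
void remove(T item) {
    for (int i = 0; i < k; i++)
        array[h(i, item)]--;
}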
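To get a feel for how many bits per counter count as "sufficient", the tail bound on a single counter can be evaluated numerically. The sketch below assumes the bound takes the common form P(c(i) >= j) <= (e n k / (j m))^j and that k is chosen at the optimum (ln 2) m/n, so that nk/m = ln 2. The function name is hypothetical.

```python
import math


def counter_tail_bound(load: float, j: int) -> float:
    """Upper bound (e * load / j)^j on P(c(i) >= j), where load = n*k/m."""
    return (math.e * load / j) ** j


# Assumption: k = (ln 2) * m / n, so load = n*k/m = ln 2.
load = math.log(2)

# A 4-bit counter overflows when it would reach j = 16.
per_counter = counter_tail_bound(load, 16)
print(per_counter)               # on the order of 1e-15

# Union bound over m counters (m is a hypothetical filter size).
m = 8_000_000
print(m * per_counter)           # still far below 1
```

Under these assumptions, 4 bits per counter already make overflow very unlikely for filters of practical size.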

Application examples:
1) HTTP cache servers, web crawlers, etc.
The main task is to determine whether a URL is in an existing set of URLs (picture data on the order of billions of entries).
For an HTTP cache server, when a PC on the local area network issues an HTTP request, the cache server checks whether the URL already exists in the cache. If it does, there is no need to fetch the data from the origin server (for simplicity, we assume the data has not changed), which saves traffic and speeds up access, improving the user experience.
For a web crawler, deciding whether the current page has already been processed likewise comes down to checking whether the current URL is in the list of URLs already processed.

2) Spam filter
Suppose the mail server filters spam by the sender's mail domain or IP address, so it must determine whether the current mail domain or IP address is on a blacklist. If the mail server handles a very large volume of mail (again, picture data on the order of billions), a Bloom filter can be used here as well.
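For workloads of this size, the analysis can be turned around to estimate how large the filter must be. Solving f = (1/2)^k with k = (ln 2) m/n for m gives m = -n ln f / (ln 2)^2. The sketch below applies that to a hypothetical set of one billion URLs with a 1% target false positive rate; the function name and the chosen numbers are illustrative assumptions.

```python
import math


def size_for(n: int, f: float) -> tuple:
    """Return (m, k): bits and hash functions for n items at target rate f.

    Derived from f = (1/2)^k with k = (ln 2) * m / n.
    """
    m = math.ceil(-n * math.log(f) / (math.log(2) ** 2))
    k = max(1, round(math.log(2) * m / n))
    return m, k


n_urls = 1_000_000_000  # hypothetical: one billion URLs
m, k = size_for(n_urls, 0.01)

print(m / n_urls)        # bits needed per URL
print(m / 8 / 2**30)     # total size in GiB
print(k)                 # number of hash functions
```

At roughly ten bits per URL the whole filter fits in a little over 1 GiB, far less than storing the URLs themselves would require.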

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
