Bloom Filter

Source: Internet
Author: User

The Bloom filter was proposed by Burton Howard Bloom in 1970. It is essentially a very long bit vector together with a series of random mapping (hash) functions. A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false-positive rate and the difficulty of deleting elements.
To determine whether an element is in a set, the obvious approach is to store all the elements of the set and check membership by comparison. Linked lists, trees, and hash tables all work this way. But as the number of elements grows, they need more and more storage space, and lookups also become slower: the lookup time complexity of these three structures is O(n), O(log n), and O(n/k), respectively (here k is the number of hash-table buckets, not the number of hash functions used below).

The principle of the Bloom filter is as follows. When an element is added to the set, k hash functions map it to k positions in a bit array, and those positions are set to 1. To look an element up, we only need to check whether all k of its positions are 1: if any of them is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set. This is the basic idea of the Bloom filter.

  

If the bits corresponding to a string are not all 1, the string is definitely not in the set.
If the bits corresponding to a string are all 1, the string is not necessarily in the set, because all of those bits may have been set by other strings. This is known as a false positive.

Note: once a string has been added, it cannot be deleted, because clearing its bits would affect other strings that share them.
If you need to delete strings, you can use a counting Bloom filter (CBF), a variant of the basic Bloom filter that replaces every bit with a counter so that deletion becomes possible.
  
From the above we can see that a Bloom filter has three main parameters:
1) n, the number of stored strings
2) k, the number of hash functions
3) m, the size of the bit array in bits

Let's analyze the false-positive rate, assuming each hash function picks a position uniformly at random:
After one string is inserted, the probability that bit j of the bit array is still 0 is (1 - 1/m)^k.
After n strings are inserted, the probability that bit j is still 0 is
p = (1 - 1/m)^(kn) ≈ e^(-kn/m)
So the probability that a given bit is 1 is 1 - p = 1 - e^(-kn/m), and the probability f that all k bits selected by the k hash functions are 1 (a false positive) is f = (1 - e^(-kn/m))^k.
Differentiating with respect to k gives the extremum k = (ln 2) * m/n, where f reaches its minimum. At that point p = 1/2, that is, about m/2 bits are 1 and m/2 bits are 0, and
f = (1 - p)^k ≈ (1/2)^k = (1/2)^((ln 2) * m/n) ≈ (0.6185)^(m/n)
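
The differentiation step is only stated above. One way to spell it out, as a sketch of the usual argument rather than the author's own working, is to rewrite f in terms of p = e^(-kn/m), the probability that a bit is still 0:

```latex
\begin{align*}
f &= \left(1 - e^{-kn/m}\right)^{k}, \qquad
p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\,\ln p \\
\ln f &= k\,\ln(1-p) = -\frac{m}{n}\,\ln p\,\ln(1-p) \\
\frac{d}{dp}\bigl[\ln p\,\ln(1-p)\bigr]
  &= \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
  \;\Rightarrow\; p = \tfrac{1}{2} \\
k &= \frac{m}{n}\,\ln 2, \qquad
f = \left(\tfrac{1}{2}\right)^{(\ln 2)\,m/n} \approx 0.6185^{\,m/n}
\end{align*}
```

Minimizing f is the same as maximizing ln p * ln(1-p), which is symmetric in p and 1-p, so the optimum sits at p = 1/2.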

If m = 8n, then
k = 8 (ln 2) = 5.545 (use 6 hash functions)
f ≈ (0.6185)^(m/n) = (0.6185)^8 ≈ 0.02 (2% false positives)
Compare with a single hash function (k = 1), as in a hash table: f ≈ 1 - e^(-n/m) = 1 - e^(-1/8) ≈ 0.12
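
The arithmetic for m = 8n can be checked with a few lines of Python. This is a minimal sketch that evaluates the approximate formulas from the analysis above; the function names and the choice n = 1,000,000 are illustrative assumptions, not part of the original.

```python
import math


def optimal_k(m: int, n: int) -> float:
    """Real-valued k that minimizes the false-positive rate: k = (ln 2) * m / n."""
    return math.log(2) * m / n


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false-positive rate f = (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


n = 1_000_000          # illustrative number of stored strings
m = 8 * n              # 8 bits per stored string
k_real = optimal_k(m, n)
k = round(k_real)      # the number of hash functions must be an integer
f_bloom = false_positive_rate(m, n, k)
f_single = false_positive_rate(m, n, 1)   # one hash function only

print(f"optimal k = {k_real:.3f}, rounded to {k}")
print(f"false-positive rate with k = {k}: {f_bloom:.4f}")
print(f"false-positive rate with k = 1: {f_single:.4f}")
```

Only the ratio m/n matters here, so any n gives the same rates.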

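// array is the m-bit array; h_i is the i-th of the k hash functions.
add<T>(T item) {
    for (int i = 0; i < k; i++)
        array[h_i(item)] = 1;
}

contains<T>(T item) {
    for (int i = 0; i < k; i++)
        if (!array[h_i(item)]) return false;  // any 0 bit: definitely absent
    return true;                              // all bits 1: probably present
}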
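
The add and contains operations above can be sketched as runnable Python. The original does not say which hash functions to use, so this sketch makes an assumption: it derives the k positions from a single SHA-256 digest by double hashing, h_i = (h1 + i * h2) mod m. The class name and the example URLs are likewise hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings: m bits, k hash positions per item."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)   # m bits, all initially 0

    def _positions(self, item: str) -> list:
        # Assumption: double hashing stands in for k independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False means definitely absent; True means probably present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=8000, k=6)                  # m = 8n with n = 1000
urls = [f"http://example.com/page/{i}" for i in range(1000)]
for url in urls:
    bf.add(url)

false_positives = sum(f"http://example.com/other/{i}" in bf for i in range(1000))
print("false positives among 1000 unseen URLs:", false_positives)
```

Every added URL is reported as present; a small fraction of unseen URLs may also be reported as present, which is the false-positive behavior described earlier.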

A Bloom filter saves a lot of storage space by tolerating a small number of errors, but its drawbacks are just as obvious as its advantages. The false-positive rate is one of them: as the number of stored elements increases, the false-positive rate increases. And if the number of elements is small, a hash table is sufficient.
In addition, elements generally cannot be removed from a Bloom filter. An obvious idea is to turn the bit array into an array of integer counters: inserting an element increments each of its counters by 1, and deleting an element decrements them. However, deleting elements safely is not that easy. First, we must be sure that the element being deleted really is in the Bloom filter, which the filter alone cannot guarantee. In addition, counter wraparound can also cause problems.

Counting Bloom Filter

   

Insertion: increment the counter
Deletion: decrement the counter
Overflow: leave the counter at its maximum, so the position stays set forever

To avoid counter overflow, each counter must have enough bits.
We first calculate the probability that the i-th counter is incremented j times, where n is the number of elements in the set, k is the number of hash functions, and m is the number of counters (corresponding to the size of the original bit array):
P(c(i) = j) = C(nk, j) * (1/m)^j * (1 - 1/m)^(nk - j)

   

On the right-hand side of the equation, the first factor, the binomial coefficient C(nk, j), counts the ways to choose j of the nk hash evaluations; the middle factor is the probability that those j evaluations select counter i; and the last factor is the probability that the other nk - j evaluations do not select counter i. Therefore, the probability that the value of the i-th counter is greater than or equal to j can be bounded by:
P(c(i) >= j) <= C(nk, j) * (1/m)^j <= (e*n*k / (j*m))^j
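
To get a feel for how many bits a counter needs, the standard bound P(c(i) >= j) <= (e*n*k / (j*m))^j can be evaluated numerically. The sketch below assumes 4-bit counters (which overflow at j = 16) and the m = 8n, k = 6 configuration used earlier; these values and the function name are illustrative assumptions.

```python
import math


def counter_overflow_bound(n: int, m: int, k: int, j: int) -> float:
    """Upper bound (e*n*k / (j*m))^j on the probability that one counter reaches j."""
    return (math.e * n * k / (j * m)) ** j


n = 1_000_000
m = 8 * n
k = 6
per_counter = counter_overflow_bound(n, m, k, 16)   # 4-bit counter overflows at 16
any_counter = m * per_counter                       # union bound over all m counters

print(f"bound for a single counter: {per_counter:.3e}")
print(f"bound for any of the {m} counters: {any_counter:.3e}")
```

Under these assumptions the bound is tiny even after summing over all m counters, which suggests that a few bits per counter are enough in practice.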
  
   

// array is now an array of m counters instead of m bits.
add<T>(T item) {
    for (int i = 0; i < k; i++)
        array[h_i(item)]++;
}

contains<T>(T item) {
    for (int i = 0; i < k; i++)
        if (!array[h_i(item)]) return false;
    return true;
}

remove<T>(T item) {
    for (int i = 0; i < k; i++)
        array[h_i(item)]--;
}
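
A runnable Python sketch of the counting variant follows. It rests on several assumptions that are not in the original: positions come from SHA-256 with double hashing, counters saturate at their maximum instead of wrapping around, and remove refuses to touch the filter when the item does not appear to be present. That last check cannot detect false positives, so it only partly addresses the safe-deletion problem mentioned above.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter sketch: m saturating counters, k positions per item."""

    def __init__(self, m: int, k: int, counter_bits: int = 4) -> None:
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1   # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, item: str) -> list:
        # Assumption: double hashing stands in for k independent hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:   # saturate, never wrap
                self.counters[pos] += 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> bool:
        if item not in self:
            return False          # definitely absent: leave the counters alone
        for pos in self._positions(item):
            # A saturated counter is never decremented; a zero counter stays zero.
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1
        return True


cbf = CountingBloomFilter(m=1024, k=4)
cbf.add("a")
cbf.add("b")
removed_a = cbf.remove("a")
print("removed a:", removed_a, "| b still present:", "b" in cbf)
```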

Application examples:
1) HTTP cache servers, web crawlers, etc.
The main task is to determine whether a URL is in an existing set of URLs (the data here can be on the order of billions of entries).
For an HTTP cache server, when a PC on the local area network issues an HTTP request, the cache server checks whether the URL already exists in the cache. If it does, there is no need to fetch the data from the origin server (for simplicity, we assume the data has not changed), which saves traffic and speeds up access, improving the user experience.
For a web crawler, deciding whether the current web page has already been processed likewise means checking whether the current URL is in the set of URLs already processed.

2) Spam filters
Suppose a mail server filters spam by the sender's mail domain or IP address. It then has to determine whether the current mail domain or IP address is on a blacklist. If the mail server handles a very large volume of mail (again, think of data on the order of billions of entries), a Bloom filter can be used here as well.
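
For data sets of this size, the formulas from the analysis can be inverted to estimate how large the bit array has to be. Solving f ≈ (1/2)^((ln 2) * m/n) for m gives m ≈ -n * ln f / (ln 2)^2. The sketch below applies this to one billion entries; the 1% target false-positive rate and the function name are illustrative assumptions.

```python
import math


def bloom_size(n: int, f: float) -> tuple:
    """Return (m, k): bits needed and hash count for n items at target rate f."""
    m = math.ceil(-n * math.log(f) / (math.log(2) ** 2))
    k = max(1, round(math.log(2) * m / n))
    return m, k


n = 1_000_000_000            # one billion URLs or addresses
m, k = bloom_size(n, 0.01)   # assumed target: 1% false positives
gib = m / 8 / 2**30

print(f"bits: {m}, hash functions: {k}, memory: {gib:.2f} GiB")
```

At roughly 10 bits per entry, the whole set fits in a little over 1 GiB, far less than storing the URLs themselves.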

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
