Counting Bloom Filter
Jomeng January 30, 2007
As the earlier articles introducing the Bloom filter explained, the standard Bloom filter is a very simple data structure that supports only two operations: insert and lookup. It works well when the set it represents is static. When the set changes frequently, its weakness shows: it does not support deletion.
The counting Bloom filter solves this problem. It extends each bit of the standard Bloom filter's bit array into a small counter. Inserting an element adds 1 to each of the k corresponding counters (k is the number of hash functions), and deleting an element subtracts 1 from each of those same k counters. The counting Bloom filter thus gains a delete operation at the cost of several times more storage space. The next question is: how many times more, that is, how many bits does each counter need?
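The insert and delete operations described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the class name, the SHA-256 double-hashing scheme used to derive the k indexes, and the policy of letting a counter saturate at its maximum value are all assumptions made for the example.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting Bloom filter: m counters, k hash functions."""

    def __init__(self, m, k, counter_bits=4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        # One Python int per counter for clarity; a real filter would pack bits.
        self.counters = [0] * m

    def _indexes(self, item):
        # Assumption: derive k indexes by double hashing a SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            # Assumption: saturate at the maximum instead of wrapping around.
            if self.counters[idx] < self.max_count:
                self.counters[idx] += 1

    def remove(self, item):
        idxs = self._indexes(item)
        # An item with any zero counter was never inserted; refuse to delete.
        if any(self.counters[idx] == 0 for idx in idxs):
            return False
        for idx in idxs:
            # A saturated counter has lost its true count, so leave it alone.
            if self.counters[idx] < self.max_count:
                self.counters[idx] -= 1
        return True

    def __contains__(self, item):
        return all(self.counters[idx] > 0 for idx in self._indexes(item))


cbf = CountingBloomFilter(m=1000, k=7)
cbf.add("apple")
print("apple" in cbf)   # True
cbf.remove("apple")
print("apple" in cbf)   # False
```

Deleting an element that was never inserted can corrupt the counters of other elements, which is why the sketch refuses to decrement when any of the k counters is already zero.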
First, we calculate the probability that the i-th counter is incremented exactly j times. Here n is the number of elements in the set, k is the number of hash functions, and m is the number of counters (the size of the original bit array):

$$P(c(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^{j} \left(1 - \frac{1}{m}\right)^{nk-j}$$
On the right-hand side of the equation above, the first factor counts the ways of choosing j of the nk hash evaluations, the middle factor is the probability that those j evaluations all select the i-th counter, and the last factor is the probability that the remaining nk - j evaluations do not select it. The probability that the i-th counter reaches a value of at least j can therefore be bounded:

$$P(c(i) \ge j) \le \binom{nk}{j} \frac{1}{m^{j}} \le \left(\frac{e\,nk}{jm}\right)^{j}$$
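The second step of the bound above uses the following estimate of the factorial (a consequence of Stirling's formula):

$$j! \ge \left(\frac{j}{e}\right)^{j}, \qquad \text{so} \qquad \binom{nk}{j} \le \frac{(nk)^{j}}{j!} \le \left(\frac{e\,nk}{j}\right)^{j}$$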
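The single-counter bound can be checked numerically. The sketch below compares the exact binomial tail P(c(i) >= j) with the two upper bounds C(nk, j) / m^j and (e·nk / (j·m))^j. The function names and the parameter values (n = 1000, k = 7, m = 10000) are chosen only for illustration, and `math.comb` assumes Python 3.8 or later.

```python
import math


def exact_tail(n, k, m, j):
    """Exact P(c(i) >= j): each of the n*k hashes hits counter i with prob. 1/m."""
    trials, p = n * k, 1.0 / m
    below = sum(
        math.comb(trials, t) * p**t * (1 - p) ** (trials - t) for t in range(j)
    )
    return 1.0 - below


def union_bound(n, k, m, j):
    """First upper bound: C(nk, j) / m^j."""
    return math.comb(n * k, j) / m**j


def closed_form_bound(n, k, m, j):
    """Second upper bound, after applying j! >= (j/e)^j."""
    return (math.e * n * k / (j * m)) ** j


n, k, m = 1000, 7, 10000  # illustrative values, roughly k = (ln 2) * m / n
for j in range(1, 9):
    print(j, exact_tail(n, k, m, j), union_bound(n, k, m, j),
          closed_form_bound(n, k, m, j))
```

For every j printed, the exact tail should sit below the first bound, which in turn sits below the closed form.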
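In the article on the concept and principle of the Bloom filter, we saw that the optimal value of k is (ln 2)·m/n. If we restrict k to k ≤ (ln 2)·m/n, then nk/m ≤ ln 2, and summing the single-counter bound over all m counters gives:

$$P\left(\max_{i} c(i) \ge j\right) \le m \left(\frac{e \ln 2}{j}\right)^{j}$$

If each counter is allotted 4 bits, a counter overflows when its value reaches 16. The probability that any counter does so is:

$$P\left(\max_{i} c(i) \ge 16\right) \le 1.37 \times 10^{-15} \times m$$

This value is minuscule for any practical m, so 4 bits per counter are sufficient for most applications.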
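The overflow figure can be reproduced with a short calculation. The sketch below evaluates the bound m·(e·ln 2 / j)^j for a given counter width, assuming k ≤ (ln 2)·m/n as in the text. The function names and the example value of m are illustrative only.

```python
import math


def per_counter_coefficient(j):
    """(e * ln 2 / j)^j: the bound on one counter reaching j when k <= (ln 2) m / n."""
    return (math.e * math.log(2) / j) ** j


def overflow_bound(m, counter_bits=4):
    """Upper bound on the probability that any of m counters overflows."""
    j = 1 << counter_bits  # a 4-bit counter overflows on reaching 16
    return m * per_counter_coefficient(j)


print(f"{per_counter_coefficient(16):.2e}")   # about 1.37e-15
print(f"{overflow_bound(m=10**9):.2e}")       # bound for one billion counters
```

Even with a billion counters the bound stays around one in a million, which is consistent with the claim that 4 bits per counter are enough in practice.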
The earliest paper on the counting Bloom filter: Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol