What is a Bloom Filter for massive data processing?

Source: Internet
Author: User

[What is a Bloom Filter]

A Bloom filter is a space-efficient probabilistic data structure. It uses a bit array to represent a set compactly and to test whether an element belongs to that set. This efficiency comes at a price: when testing membership, an element that is not in the set may be mistakenly reported as belonging to it (a false positive). A Bloom filter is therefore unsuitable for applications that require zero errors. In applications that can tolerate a low error rate, it trades a small number of errors for a huge saving in storage space.

[Scope of application]

It can be used to implement a data dictionary, to detect duplicate data, or to compute the intersection of data sets.

[Basic principles and key points]

The principle is very simple: a bit array plus k independent hash functions. To insert a key, compute its k hash values and set the bits at those positions to 1. To look up a key, check the same k positions; if all of them are 1, the key is reported as present. Obviously, this process does not guarantee that the lookup result is 100% correct. In addition, an inserted key cannot be deleted, because the bits that belong to it may also be shared by other keys. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and therefore supports deletion.
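
The insert and lookup steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `BloomFilter` and its methods are hypothetical, and the k hash functions are simulated by salting SHA-256 with an index, which is one convenient choice rather than anything prescribed by the text.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of m bits plus k hash functions."""

    def __init__(self, m, k):
        self.m = m                           # number of bits in the array
        self.k = k                           # number of hash functions
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, key):
        # Assumption: salting SHA-256 with the index i stands in for
        # k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        # Insert: set the bit at each of the k positions to 1.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # Lookup: the key is reported present only if all k bits are 1.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bf = BloomFilter(m=1024, k=7)
bf.add("http://example.com/a")
print("http://example.com/a" in bf)  # an inserted key is always found
print("http://example.com/b" in bf)  # usually False, but may be a false positive
```

An inserted key is always reported as present, so there are no false negatives; only a key that was never inserted can be misreported.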

Another important issue is how to choose the size m of the bit array and the number k of hash functions from the number n of input elements. The error rate is minimized when k = ln(2) * (m/n). For the error rate to be no greater than E, m must be at least n * lg(1/E) in order to represent any set of n elements. In practice m should be larger, because at least half of the bit array should remain 0: m >= n * lg(1/E) * lg(e), which is about 1.44 * n * lg(1/E) (lg denotes the base-2 logarithm).

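For example, with an error rate of 0.01 the formula above gives m of about 9.6 times n (1.44 * lg(100) is about 9.57), with k of about 7. This article uses the more generous figure of m about 13 times n with k about 8, which keeps the error rate well below 0.01.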
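
The sizing formulas above are easy to check numerically. The sketch below is not from the original article; the function names `bloom_parameters` and `false_positive_rate` are hypothetical, and the false positive estimate uses the standard approximation (1 - e^(-k*n/m))^k.

```python
import math


def bloom_parameters(n, error_rate):
    """Return (m, k): bits and hash functions for n elements at the given error rate."""
    # m >= n * lg(1/E) * lg(e), i.e. about 1.44 * n * lg(1/E)
    m = math.ceil(n * math.log2(1 / error_rate) * math.log2(math.e))
    # k = ln(2) * (m / n), rounded to a whole number of hash functions
    k = max(1, round(math.log(2) * m / n))
    return m, k


def false_positive_rate(m, n, k):
    """Approximate false positive rate for m bits, n elements, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(n=1000, error_rate=0.01)
print(m / 1000, k)                          # bits per element, hash functions
print(false_positive_rate(m, 1000, k))      # rate at the formula's m and k
print(false_positive_rate(13000, 1000, 8))  # rate at 13 bits per element, k = 8
```

For E = 0.01 this yields roughly 9.6 bits per element with k = 7, while the figures quoted above (13 bits per element, k = 8) correspond to an estimated rate of about 0.002.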

Note that m and n have different units: m is measured in bits, while n is a number of elements (more precisely, the number of distinct elements). A single element is usually many bits long, so a Bloom filter usually saves memory.

[Extension]

A Bloom filter maps the elements of a set into a bit array; whether the k mapped bits (k is the number of hash functions) are all 1 indicates whether an element is in the set. A counting Bloom filter (CBF) extends each bit of the array to a counter, which makes it possible to delete elements. A Spectral Bloom Filter (SBF) associates the counters with the number of occurrences of each element, using the minimum of an element's counters to represent its frequency.
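
A rough sketch of the counter-based variants follows. It is an illustration only: the class `CountingBloomFilter` and its method names are hypothetical, the hash functions are again simulated by salting SHA-256, and `estimate_count` merely mimics the minimum-counter idea behind SBF rather than implementing a full Spectral Bloom Filter.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose bits are replaced by counters, so keys can be removed."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m  # one counter per position instead of one bit

    def _positions(self, key):
        # Assumption: salted SHA-256 stands in for k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.counters[pos] += 1

    def __contains__(self, key):
        return all(self.counters[pos] > 0 for pos in self._positions(key))

    def remove(self, key):
        # Removing a key that was never added would corrupt other keys,
        # so refuse when the filter does not report the key as present.
        if key not in self:
            raise KeyError(key)
        for pos in self._positions(key):
            self.counters[pos] -= 1

    def estimate_count(self, key):
        # SBF-style estimate: the smallest counter bounds the frequency from above.
        return min(self.counters[pos] for pos in self._positions(key))


cbf = CountingBloomFilter(m=1024, k=4)
cbf.add("x")
cbf.add("x")
print(cbf.estimate_count("x"))  # at least 2
cbf.remove("x")
cbf.remove("x")
print("x" in cbf)               # False: both insertions have been undone
```

The price of deletion is memory: each position now holds a counter rather than a single bit.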

[Problem instance]

You are given two files, A and B, each containing 5 billion URLs. Each URL occupies 64 bytes, and the memory limit is 4 GB. Find the URLs common to A and B. What if there are three files, or even n files?

For this problem, first work out the memory budget. 4 GB = 2^32 bytes, about 4.3 billion bytes; multiplied by 8, that is about 34 billion bits. With n = 5 billion and an error rate of 0.01, about 65 billion bits are required (13 bits per URL). Only 34 billion are available, which is not far off, so a Bloom filter can still be used at the cost of a higher error rate. In addition, if the URLs correspond one-to-one with IP addresses, they can be converted to IP addresses, which makes the problem much simpler.
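
To get a feel for how much the error rate rises, the arithmetic can be run directly. This is a back-of-the-envelope sketch, not part of the original article: it assumes the whole 4 GB is given to a single bit array for one file and uses the standard approximation (1 - e^(-k*n/m))^k for the false positive rate.

```python
import math

available_bits = 4 * 2**30 * 8     # 4 GB of memory expressed in bits
n = 5_000_000_000                  # number of URLs in one file
bits_per_url = available_bits / n  # bits of filter available per URL

# Best whole number of hash functions for this ratio: k = ln(2) * (m/n)
k = max(1, round(math.log(2) * bits_per_url))

# Approximate false positive rate with m = available_bits
rate = (1 - math.exp(-k * n / available_bits)) ** k

print(f"{available_bits / 1e9:.1f} billion bits, {bits_per_url:.2f} bits per URL")
print(f"k = {k}, estimated false positive rate = {rate:.3f}")
```

Under these assumptions there are a little under 7 bits per URL, and the estimated error rate comes out at a few percent rather than 0.01.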
