Massive data processing: the Bloom filter in detail

Last Update:2016-04-11 Source: Internet

Author: User


A Bloom filter may report false positives, but it never reports false negatives.

First, what is a Bloom filter? A Bloom filter is a probabilistic data structure with high space efficiency. Its principle is this: when an element is added to a set, k hash functions map the element to k positions in a bit array, and those positions are set to 1. To query, we only need to check whether those positions are all 1 to know (approximately) whether the element is in the set. If any one of the positions is 0, the queried element is definitely not in the set; if all of them are 1, the queried element is probably in the set. This is the basic idea of the Bloom filter.

This efficiency comes at a cost: when deciding whether an element belongs to a set, a Bloom filter may mistake an element that is not in the set for one that is (a false positive). A Bloom filter is therefore unsuitable for "zero error" applications. In applications that can tolerate a low error rate, it trades a small number of errors for a large saving in storage space.

Some readers may want to know its Chinese name. One translation in circulation renders the name phonetically (it appears in this translation as "Bron filter"). Whether the name should be translated at all, and whether that rendering is appropriate, is left to the reader to judge. If some of the formulas below are hard to follow, that is no obstacle; a rough understanding is enough.

1.1. Set representation and element queries

Let's look at how a Bloom filter uses a bit array to represent a set. In its initial state, a Bloom filter is an array of m bits, each of which is set to 0.

To represent a set $S = \{x_1, x_2, \ldots, x_n\}$ of n elements, a Bloom filter uses k mutually independent hash functions, each of which maps every element of the set into the range $\{1, \ldots, m\}$. For any element x, the position $h_i(x)$ given by the i-th hash function is set to 1 ($1 \le i \le k$). Note that if a position is set to 1 several times, only the first time has any effect; the later ones change nothing. In the original article's figure (not reproduced here), k = 3, and two hash functions select the same position (the fifth bit from the left, i.e. the second "1").

To decide whether y belongs to the set, we apply the k hash functions to y. If every position $h_i(y)$ is 1 ($1 \le i \le k$), we consider y an element of the set; otherwise we consider y not an element of the set. In the original article's figure (not reproduced here), $y_1$ is not an element of the set, because one of its hashes points to a "0" bit. $y_2$ either belongs to the set or is a false positive.
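The insert and query steps described above can be sketched in Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the choice of simulating the k hash functions by salting SHA-256 with the index i are all assumptions made for this example, and positions run over 0..m-1 rather than 1..m.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array plus k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        # One byte per bit keeps the example simple; every bit starts at 0.
        self.bits = bytearray(m)

    def _positions(self, item):
        # Assumption: the i-th hash function is SHA-256 salted with i,
        # reduced modulo m. Any k independent hash functions would do.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1  # setting an already-set bit changes nothing

    def might_contain(self, item):
        # All k positions set: "probably in the set".
        # Any position still 0: "definitely not in the set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print(bf.might_contain("apple"))   # True: an added element is always found
print(bf.might_contain("durian"))  # usually False; True would be a false positive
```

Because `add` only ever turns bits on, a query for an element that was added can never fail, which is why false negatives are impossible.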

1.2. Error rate estimation

As mentioned earlier, a Bloom filter has a certain error rate (the false positive rate) when deciding whether an element belongs to the set it represents. Below we estimate the size of that error rate. To simplify the model before estimating, we assume that kn < m and that the hash functions are completely random. After all elements of the set $S = \{x_1, x_2, \ldots, x_n\}$ have been mapped into the m-bit array by the k hash functions, the probability that a given bit of the array is still 0 is:

$$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

where 1/m is the probability that any single hash selects this bit (given that the hash functions are completely random), and (1 - 1/m) is the probability that a single hash does not select it. Mapping S completely into the array takes kn hash operations. A bit is still 0 only if none of those kn hashes selected it, so the probability is $(1 - 1/m)^{kn}$. Writing $p = e^{-kn/m}$ simplifies the later calculations; it uses the following limit that defines e:

$$\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} = e$$

so that, for large m, $\left(1 - \frac{1}{m}\right)^{kn} = \left[\left(1 - \frac{1}{m}\right)^{-m}\right]^{-kn/m} \approx e^{-kn/m}$.
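The quality of the approximation $(1 - 1/m)^{kn} \approx e^{-kn/m}$ can be checked numerically. The sketch below is only an illustration; the function names and the sample values of m, n and k are chosen for this example and do not come from the original article.

```python
import math


def zero_prob_exact(m, n, k):
    """Exact probability that a given bit is still 0 after kn random hashes."""
    return (1 - 1 / m) ** (k * n)


def zero_prob_approx(m, n, k):
    """Approximation e^(-kn/m) of the same probability."""
    return math.exp(-k * n / m)


# Sample parameters (assumed for illustration); each satisfies kn < m.
for m, n, k in [(100, 10, 3), (10_000, 1_000, 5), (1_000_000, 100_000, 7)]:
    exact = zero_prob_exact(m, n, k)
    approx = zero_prob_approx(m, n, k)
    print(f"m={m:>9} n={n:>7} k={k}: exact={exact:.6f} "
          f"approx={approx:.6f} diff={abs(exact - approx):.2e}")
```

The difference shrinks as m grows, which is why the exponential form is normally used in place of the exact power.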

Let ρ be the proportion of 0 bits in the bit array. The mathematical expectation of ρ is then E(ρ) = p'. Given ρ, the error rate we want (the false positive rate) is:

$$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k$$

Here (1 - ρ) is the proportion of 1 bits in the bit array, and $(1 - \rho)^k$ is the probability that all k hashes land on 1 bits, which is exactly the false positive rate. The second approximation above was covered in the previous step; the first one is justified as follows. p' is only the mathematical expectation of ρ, and in practice the value of ρ may deviate from it. M. Mitzenmacher has proved [2] that the proportion of 0 bits in the array is very tightly concentrated around its expectation, so the first approximation holds. Substituting p' and p respectively into the expression above gives:

$$f' = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k = (1 - p')^k$$

$$f = \left(1 - e^{-kn/m}\right)^k = (1 - p)^k$$

In analysis, p and f are usually more convenient to work with than p' and f'.
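To see how close the two estimates are, and how well they match an idealized filter, the following sketch computes $f' = (1 - (1 - 1/m)^{kn})^k$ and $f = (1 - e^{-kn/m})^k$ and compares them with a small simulation. It assumes completely random hashing, modelled by drawing positions from Python's `random` module; the function names, the parameter values and the seed are assumptions made for this example.

```python
import math
import random


def false_positive_exact(m, n, k):
    """f' = (1 - (1 - 1/m)^(kn))^k"""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def false_positive_approx(m, n, k):
    """f = (1 - e^(-kn/m))^k"""
    return (1 - math.exp(-k * n / m)) ** k


def simulate(m, n, k, trials, rng):
    """Fill an m-bit array with kn random hashes, then probe it.

    Returns (rho, fp): the observed fraction of 0 bits and the observed
    fraction of random non-member queries that hit only 1 bits.
    """
    bits = [0] * m
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    rho = bits.count(0) / m
    hits = sum(
        all(bits[rng.randrange(m)] for _ in range(k)) for _ in range(trials)
    )
    return rho, hits / trials


m, n, k = 10_000, 1_000, 5      # sample parameters with kn < m
rng = random.Random(42)         # fixed seed so the run is repeatable
rho, fp = simulate(m, n, k, trials=20_000, rng=rng)

print(f"p  = {math.exp(-k * n / m):.4f}   observed rho = {rho:.4f}")
print(f"f' = {false_positive_exact(m, n, k):.5f}")
print(f"f  = {false_positive_approx(m, n, k):.5f}   observed rate = {fp:.5f}")
```

With these parameters f and f' agree to several decimal places, and the observed fraction of 0 bits stays close to p, which is the concentration property the derivation relies on.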


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
