Massive data processing: the Bloom filter in detail

Source: Internet
Author: User

A Bloom filter can produce false positives, but never false negatives.

1. What is a Bloom filter

A Bloom filter is a randomized data structure with high space efficiency. Its principle is this: when an element is added to a set, k hash functions map the element to k positions in a bit array, and those positions are set to 1. To look an element up, we only need to check whether those positions are all 1 to know (approximately) whether it is in the set. If any one of the positions is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set. This is the basic idea of the Bloom filter.

This efficiency comes at a cost: when testing whether an element belongs to a set, the filter may mistake an element that is not in the set for one that is (a false positive). A Bloom filter is therefore unsuitable for "zero error" applications. Where a low error rate can be tolerated, it trades a small number of errors for a large saving in storage space.

Some readers may want to know its Chinese name; there is a translation, a transliteration of "Bloom". Whether the name should be translated at all, and whether that translation is apt, is left for the reader to judge. If many of the formulas below are hard to follow, that is no obstacle; a rough understanding is enough.

1.1. Set representation and element queries

Let's take a look at how a Bloom filter uses a bit array to represent a set. In its initial state, the Bloom filter is an array of m bits, each of which is set to 0.

To represent a set S = {x1, x2, ..., xn} of n elements, a Bloom filter uses k independent hash functions, each of which maps every element of the set into the range {1, ..., m}. For any element x, the position h_i(x) given by the i-th hash function is set to 1 (1 ≤ i ≤ k). Note that if a position is set to 1 more than once, only the first time has any effect; later ones change nothing. In the original figure, k = 3 and two of the hash functions select the same position (the fifth bit from the left, which is the second "1").

To decide whether y belongs to the set, we apply the k hash functions to y. If every position h_i(y) is 1 (1 ≤ i ≤ k), we take y to be an element of the set; otherwise we conclude that y is not an element of the set. In the original figure, y1 is not an element of the set (because one of its positions points to a 0 bit), while y2 either belongs to the set or is a false positive.
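The insert and query operations described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation. The class name `BloomFilter`, the method names, and the values m = 1000 and k = 3 are hypothetical. The k independent hash functions are simulated here by salting SHA-256 with the function index, which is an assumption of this sketch and not something the text prescribes.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        # One byte per bit for readability; every position starts at 0.
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the index i. Positions are 0-based (0..m-1), whereas the text
        # uses the range {1, ..., m}.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # setting an already-set bit changes nothing

    def might_contain(self, item: str) -> bool:
        # Any 0 bit proves absence; all 1s only means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, k=3)
for word in ("x1", "x2", "x3"):
    bf.add(word)

print(bf.might_contain("x1"))  # True: an added element is never reported absent
print(bf.might_contain("y1"))  # usually False, but a false positive is possible
```

Note that `might_contain` can only answer "definitely not" or "probably yes", which matches the y1 and y2 cases above.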

1.2. Error rate estimation

As mentioned earlier, a Bloom filter has a certain error rate (the false positive rate) when deciding whether an element belongs to the set it represents. Below we estimate the size of that error rate. To simplify the model, we assume that kn < m and that the hash functions are completely random. After all elements of the set S = {x1, x2, ..., xn} have been mapped into the m-bit array by the k hash functions, the probability that a given bit of the array is still 0 is:

$$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

Here 1/m is the probability that any one hash selects this bit (given that the hash functions are completely random), and (1 - 1/m) is the probability that a single hash does not select it. Mapping S completely into the array takes kn hashes. The bit is still 0 only if none of the kn hashes selects it, so the probability is (1 - 1/m)^{kn}. Setting p = e^{-kn/m} simplifies the calculation; the approximation uses the following limit for e:

$$\lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x} = e$$
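To get a feel for how close the exponential approximation is, the exact expression (1 - 1/m)^{kn} and the approximation e^{-kn/m} can be compared numerically. The sketch below is only an illustration; the function names and the values of m, n and k are hypothetical and chosen so that kn < m, as assumed above.

```python
import math


def zero_bit_probability_exact(m: int, n: int, k: int) -> float:
    # p' = (1 - 1/m)^(kn): none of the kn hashes selects a given bit.
    return (1 - 1 / m) ** (k * n)


def zero_bit_probability_approx(m: int, n: int, k: int) -> float:
    # p = e^(-kn/m): the approximation used in the text.
    return math.exp(-k * n / m)


# Hypothetical parameters satisfying kn < m.
m, n, k = 10_000, 1_000, 3

p_exact = zero_bit_probability_exact(m, n, k)
p_approx = zero_bit_probability_approx(m, n, k)

print(f"exact  p' = {p_exact:.6f}")
print(f"approx p  = {p_approx:.6f}")
print(f"absolute difference = {abs(p_exact - p_approx):.2e}")
```

For an m of this size the two values differ only slightly, and the gap shrinks as m grows.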

Let ρ be the proportion of 0 bits in the bit array; then the mathematical expectation of ρ is E(ρ) = p'. Given ρ, the error rate we are after (the false positive rate) is:

$$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k$$

(1 - ρ) is the proportion of 1 bits in the bit array, and (1 - ρ)^k is the probability that all k hashes land on 1 bits, which is exactly a false positive. The second approximation above was already covered in the previous step; the first one needs justification now. p' is only the mathematical expectation of ρ, and in practice the value of ρ may deviate from it. M. Mitzenmacher has proved [2] that the proportion of 0 bits in the bit array is very tightly concentrated around its expectation, so the first approximation holds. Substituting p' and p into the expression above gives, respectively:

$$f' = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k = (1 - p')^k$$

$$f = \left(1 - e^{-kn/m}\right)^k = (1 - p)^k$$
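The concentration of ρ around its expectation can also be observed empirically. The following is a rough simulation sketch, not a proof: it assumes that Python's `random` module is an acceptable stand-in for completely random hash functions, and the function name `zero_fraction`, the seed, and the values of m, n, k and the number of trials are all hypothetical.

```python
import random


def zero_fraction(m: int, n: int, k: int, rng: random.Random) -> float:
    """Insert n elements with k random hashes each; return the fraction of 0 bits."""
    bits = bytearray(m)
    for _ in range(n * k):
        # Assumption: a uniformly random position models one hash evaluation.
        bits[rng.randrange(m)] = 1
    return bits.count(0) / m


# Hypothetical parameters satisfying kn < m.
m, n, k = 10_000, 1_000, 3
rng = random.Random(12345)  # fixed seed so the run is repeatable

p_prime = (1 - 1 / m) ** (k * n)  # expectation of rho
fractions = [zero_fraction(m, n, k, rng) for _ in range(20)]

print(f"expected p' = {p_prime:.4f}")
print(f"observed rho: min = {min(fractions):.4f}, max = {max(fractions):.4f}")
```

Across the trials, the observed proportions of 0 bits should stay within a narrow band around p', which is what makes replacing ρ with p' reasonable.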

Using p and f is usually more convenient for analysis than using p' and f'.
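As a quick check on why working with p and f loses little, the two false positive rates can be evaluated side by side, with f' = (1 - (1 - 1/m)^{kn})^k and f = (1 - e^{-kn/m})^k. The sketch below is illustrative only; the function names and parameter values are hypothetical.

```python
import math


def false_positive_exact(m: int, n: int, k: int) -> float:
    # f' = (1 - p')^k with p' = (1 - 1/m)^(kn)
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def false_positive_approx(m: int, n: int, k: int) -> float:
    # f = (1 - p)^k with p = e^(-kn/m)
    return (1 - math.exp(-k * n / m)) ** k


# Hypothetical parameters satisfying kn < m.
m, n, k = 10_000, 1_000, 3

f_exact = false_positive_exact(m, n, k)
f_approx = false_positive_approx(m, n, k)

print(f"f' = {f_exact:.6f}")
print(f"f  = {f_approx:.6f}")
```

With these parameters both rates come out a little under 2 percent and are nearly indistinguishable, so the simpler closed form f is a reasonable basis for analysis.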
