Bloom Filters

Source: Internet
Author: User

In real-time big data processing, additive metrics such as page views (PV) can be maintained with a simple accumulator. Non-additive metrics such as unique visitors (UV) are different: systems usually give up some accuracy in exchange for execution efficiency and settle for an estimate of the final value. One way to obtain such an estimate is the Bloom filter.

A Bloom filter (BF) is a bit-vector data structure that is highly efficient in both space and time. The idea is to store set-membership information in a bit array M of length m, and to map each element of a data set D into that array with k independent hash functions. Each element of D is thereby mapped to k positions in M, and those positions are set to 1. To check whether an element is in D, compute its k positions with the same hash functions; if all k positions are 1, the element is reported as already present. Otherwise the element is definitely not in D, and you can either add it to D and set its positions in M to 1, or do nothing and simply return the filter result, depending on the business requirement.
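The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation. The class name `BloomFilter` and its method names are hypothetical, and the k positions are derived by double hashing one SHA-256 digest, which is an assumption made for brevity rather than the k truly independent hash functions the text calls for.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of length m and k hash positions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by double hashing two
        # 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Set each of the element's k positions to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Reported present only if all k positions are 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("user-1")
print("user-1" in bf)  # True: an added element is always reported present
```

A lookup for an element that was never added usually returns False, but it can return True when other elements happen to have set all of its positions.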

The advantage of a BF is that its space efficiency and query efficiency far exceed those of general-purpose structures. Its disadvantages are a certain false positive rate and the difficulty of deleting elements. It never produces false negatives, however, so a UV count computed with a BF is never larger than the true UV, and is smaller whenever a false positive occurs. Given these characteristics, the algorithm should be used only where false positives are acceptable.

1) Why does a BF produce false positives? When D is mapped into M by the k hash functions, all k positions of a new element may already have been set to 1 by elements added earlier. These positions need not coincide with those of any single element; they can be covered jointly by several elements. Such an element is misjudged as already present.

2) Why does a BF produce no false negatives? Because a BF has no operation that changes a 1 back to 0, the positions of an element that has been added are never reset to 0.
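The two properties above can be observed in a small UV-counting sketch. The helper names `positions` and `count_uv`, the synthetic visitor IDs, and the chosen values of m and k are all assumptions made for illustration; the hash positions again come from double hashing a SHA-256 digest rather than from independent hash functions.

```python
import hashlib


def positions(item: str, m: int, k: int):
    # Hypothetical helper: k positions via double hashing of a SHA-256 digest.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


def count_uv(visitors, m: int, k: int) -> int:
    """Count a visitor only when the filter says it has not been seen."""
    bits = 0  # a Python int used as a bit array of length m
    uv = 0
    for v in visitors:
        pos = positions(v, m, k)
        if not all((bits >> p) & 1 for p in pos):
            uv += 1  # definitely new: at least one position was still 0
            for p in pos:
                bits |= 1 << p  # bits only ever change from 0 to 1
        # Otherwise the visitor is treated as "already seen", which may be
        # a false positive and is the source of the undercount.
    return uv


visitors = [f"user-{i}" for i in range(1000)] * 2  # 1000 distinct, each twice
true_uv = len(set(visitors))
print(true_uv, count_uv(visitors, m=16384, k=4), count_uv(visitors, m=2048, k=4))
```

Because repeat visitors are never counted twice (no false negatives) and some new visitors may be skipped (false positives), the estimate can only fall at or below the true UV. Shrinking m makes the undercount worse.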

That brings us to the error rate. Practical applications usually want the false positive rate kept within a certain bound; how can this be ensured? Three factors affect it: the data set size d, the number of hash functions k, and the bit array size m. The smaller d and the larger m, the lower the error rate; these two effects are easy to understand. The effect of k is more subtle. On the one hand, a larger k sets more bits of the array to 1, which raises the probability that a new element is misjudged. On the other hand, an element is reported present only when all k of its bits are 1, so a larger k should lower the error rate. Mathematical analysis gives the relationship $P_{fp} \approx (1 - e^{-kd/m})^k$. If d and m are known, the k that minimizes $P_{fp}$ is $k = \frac{m}{d} \ln 2$, and the corresponding false positive rate follows by substituting this k into the formula.
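To make the trade-off in k concrete, the approximation can be evaluated for a few values of k. The function names and the sample values of d and m below are assumptions chosen for illustration, not figures from the article.

```python
import math


def false_positive_rate(d: int, m: int, k: int) -> float:
    """Approximation P_fp ~ (1 - e^(-k*d/m))^k from the text."""
    return (1.0 - math.exp(-k * d / m)) ** k


def optimal_k(d: int, m: int) -> float:
    """Real-valued k that minimizes the approximation: (m/d) * ln 2."""
    return (m / d) * math.log(2)


d, m = 1_000_000, 10_000_000  # illustrative: 10 bits per element
k_star = optimal_k(d, m)
print(f"optimal k ~ {k_star:.2f}")
for k in (1, 3, round(k_star), 12, 20):
    print(k, f"{false_positive_rate(d, m, k):.5f}")
```

With these values the rate falls as k grows toward the optimum and rises again beyond it, which matches the two opposing effects described above. In practice k must be an integer, so the real-valued optimum is rounded.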

In practical engineering, the acceptable false positive rate $P_{fp}$ is usually fixed in advance and the data set size d is known. What has to be calculated is the hardware resource required (here, memory), that is, the size m of the bit array: $m = -\frac{d \ln P_{fp}}{(\ln 2)^2}$.
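As a rough sizing sketch, the formula can be turned into a small calculation. The function name `required_bits` and the sample inputs (ten million elements, a 1% target rate) are assumptions for illustration only.

```python
import math


def required_bits(d: int, p: float) -> int:
    """m = -(d * ln p) / (ln 2)^2, rounded up to a whole number of bits."""
    return math.ceil(-d * math.log(p) / (math.log(2) ** 2))


d, p = 10_000_000, 0.01  # illustrative inputs
m = required_bits(d, p)
k = round((m / d) * math.log(2))  # matching number of hash functions
print(f"m = {m} bits (~{m / 8 / 2**20:.1f} MiB), k = {k}")
```

Under these assumptions the filter needs a little under 10 bits per element, so the memory cost grows linearly with d and only logarithmically as the target error rate is tightened.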

That covers the basic idea of the BF. Several questions remain to be explored: how should the hash functions be designed? And if the filter is used only for counting, can the BF algorithm be improved?

To be continued...
