Bloom Filter


What is a Bloom filter?

Wikipedia gives the following explanation:

The Bloom filter was proposed by Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions. A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements.

Suppose you have 100 million distinct URLs. How do you decide whether a given web address is among them? The first idea is probably a hash table. However, suppose each of the 100 million URLs is compressed into an 8-byte short URL. Because a hash table inevitably has collisions, keeping its load factor at 0.5 means the minimum memory required is 2*8*10^8/(1024^3) ≈ 1.5 GB. What if you need to store 1 billion or 10 billion URLs?

Some would suggest a bitmap instead: map each URL through a hash function to a single bit, so that 100 million URLs need only 100 million bits, that is, 1*10^8/(1024^2*8) ≈ 11.9 MB. However, the collision probability of a single hash function is too high. To bring the collision probability down to 1%, the bitmap must be at least 100 times as long as the number of URLs, which takes more than 1 GB of memory. Clearly, neither a hash table nor a bitmap is suitable in this case. This is where the Bloom filter comes in.
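The memory estimates above can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the class and method names (MemoryEstimate, hashTableBytes, bitmapBytes) are made up for illustration, and it only counts the raw key or bit storage, ignoring any per-entry overhead a real hash table would add.

```java
final class MemoryEstimate {
    // Bytes needed for n keys of bytesPerKey each at the given load factor
    // (raw slots only; real hash tables add overhead on top of this).
    static double hashTableBytes(long n, int bytesPerKey, double loadFactor) {
        return n * bytesPerKey / loadFactor;
    }

    // Bytes needed for a bitmap that reserves bitsPerElement bits per element.
    static double bitmapBytes(long n, long bitsPerElement) {
        return n * bitsPerElement / 8.0;
    }

    static double toMiB(double bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    static double toGiB(double bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long n = 100_000_000L; // 100 million URLs
        System.out.printf("hash table: %.2f GiB%n", toGiB(hashTableBytes(n, 8, 0.5)));
        System.out.printf("bitmap, 1 bit per URL: %.1f MiB%n", toMiB(bitmapBytes(n, 1)));
        System.out.printf("bitmap, 100 bits per URL: %.2f GiB%n", toGiB(bitmapBytes(n, 100)));
    }
}
```

The three figures correspond to the hash table with load factor 0.5, the bitmap with one bit per URL, and the bitmap stretched to 100 bits per URL.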

Bloom Filter Principle

A Bloom filter requires a bit array, which makes it similar to a bitmap. It also requires k mapping functions, which makes it similar to a hash table.

1) Adding elements

First, all the bits of a bit array of length m are initialized to 0:

(Figure: a bit array of length m with every bit set to 0.)

Each cell represents one bit.

For a set S = {s1, s2, ..., sn} with n elements, each element of S is mapped by the k mapping functions {f1, f2, ..., fk} to k values {b1, b2, ..., bk}, and the corresponding bits b1, b2, ..., bk in the bit array are set to 1:

(Figure: one element mapped by the k functions to k positions in the bit array, each set to 1.)

In this way, each element is mapped to k bits.

2) Checking whether an element exists

Suppose you want to check whether an element is in the set S. Compute the k values {b1, b2, ..., bk} with the mapping functions {f1, f2, ..., fk}, then check whether the corresponding bits b1, b2, ..., bk in the bit array are all 1. If they are all 1, the element is judged to be in S; otherwise, it is definitely not in S.

But can this give a wrong answer? That is, can all the corresponding bits be 1 even though the element is not in the set? The answer is yes, a false positive is possible, but the probability is small and can be pushed below one in 10,000 by choosing m and k appropriately. Further discussion is given below.
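The asymmetry described above (no false negatives, occasional false positives) can be observed directly. The following is a minimal sketch, not the article's implementation: the class TinyBloomDemo, its methods, and the "member-"/"absent-" test strings are made up for illustration, and the i-th mapping function is an assumption that mixes String.hashCode() with the function index using a common 32-bit integer mixing step.

```java
import java.util.BitSet;

final class TinyBloomDemo {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of mapping functions

    TinyBloomDemo(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Hypothetical i-th mapping function (an assumption of this sketch):
    // combine the string's hash code with i, scramble the bits, reduce mod m.
    private int index(String value, int i) {
        int h = value.hashCode() ^ (i * 0x9e3779b9);
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, m);
    }

    // Adding an element sets its k bits to 1.
    void add(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    // False means "definitely not in the set"; true means "probably in the set".
    boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Inserts n members, then probes with strings that were never inserted
    // and returns the fraction wrongly reported as present.
    static double measureFalsePositiveRate(int m, int k, int n, int probes) {
        TinyBloomDemo filter = new TinyBloomDemo(m, k);
        for (int i = 0; i < n; i++) {
            filter.add("member-" + i);
        }
        int falsePositives = 0;
        for (int i = 0; i < probes; i++) {
            if (filter.mightContain("absent-" + i)) {
                falsePositives++;
            }
        }
        return (double) falsePositives / probes;
    }

    public static void main(String[] args) {
        System.out.println(measureFalsePositiveRate(10_000, 3, 1_000, 10_000));
    }
}
```

Every inserted string is always reported as present, because its k bits were set when it was added. Strings that were never inserted are usually rejected, but a small fraction pass the check because other elements happened to set all k of their bits.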

Calculating the false positive probability of a Bloom filter

Assume that each hash function in the filter selects each of the m bit positions with equal probability, independently of where other elements are hashed. Then the probability that a particular bit is not set to 1 by one hash function when an element is inserted is: 1 - 1/m.

Then the probability that none of the k hash functions sets that bit to 1 is: (1 - 1/m)^k.

After n elements have been inserted, the probability that the bit is still 0 is: (1 - 1/m)^(kn).

So, after n elements have been inserted, the probability that the bit has been set to 1 is: 1 - (1 - 1/m)^(kn).

In the query phase, an element is judged to be in the set if all k of its corresponding bits are 1. Therefore, the probability that an element not in the set is misjudged is:

(1 - (1 - 1/m)^(kn))^k

From calculus, (1 - 1/m)^m approaches 1/e as m grows large, so:

(1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k

It can be seen that the false positive rate decreases as m increases or n decreases, that is, as the bit array gets longer or the number of elements in the set gets smaller. So which value of k minimizes the false positive rate?

Calculation shows that the false positive rate is lowest when k = ln2 * m/n, that is, k ≈ 0.7*m/n. At that point, the false positive rate is approximately 0.6185^(m/n).
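The calculation behind the optimal k is not spelled out in the text. One standard way to carry it out, starting from the approximation (1 - e^(-kn/m))^k and introducing the shorthand p = e^(-kn/m) (a symbol not used in the original), is sketched here:

```latex
\begin{align*}
f(k) &= \left(1 - e^{-kn/m}\right)^{k}, \qquad
p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p\,\ln(1 - p) \\
\frac{d}{dp}\bigl[\ln p\,\ln(1 - p)\bigr]
  &= \frac{\ln(1 - p)}{p} - \frac{\ln p}{1 - p} = 0
  \;\Rightarrow\; p = \tfrac{1}{2} \\
k &= \frac{m}{n}\ln 2 \approx 0.693\,\frac{m}{n} \\
f_{\min} &= \left(\tfrac{1}{2}\right)^{k}
  = \left(2^{-\ln 2}\right)^{m/n} \approx 0.6185^{\,m/n}
\end{align*}
```

The product ln p * ln(1 - p) is symmetric in p and 1 - p and reaches its maximum at p = 1/2, which makes ln f smallest there. In other words, the filter performs best when about half of the bits are still 0.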

As a worked example: for n = 100 million, taking m = 1 billion gives k = 7 and a false positive rate of about 0.0082, which is below 1%, yet the bit array needs only about 119 MB of memory.
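The figures in this example can be checked numerically. The following is a minimal sketch using the formulas k = ln2 * m/n and (1 - e^(-kn/m))^k; the class and method names (BloomMath, optimalK, falsePositiveRate, bitsToMiB) are made up for illustration, and the inputs assume m = 1 billion bits and n = 100 million elements, so m/n = 10.

```java
final class BloomMath {
    // k = (m/n) * ln 2, rounded to the nearest integer and at least 1.
    static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Approximate false positive rate: (1 - e^(-kn/m))^k.
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Memory taken by a bit array of m bits, in MiB.
    static double bitsToMiB(long m) {
        return m / 8.0 / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long n = 100_000_000L;   // elements
        long m = 1_000_000_000L; // bits
        int k = optimalK(m, n);
        System.out.println("k = " + k);
        System.out.printf("false positive rate = %.4f%n", falsePositiveRate(m, n, k));
        System.out.printf("memory = %.1f MiB%n", bitsToMiB(m));

        // How the rate changes with m/n when k is chosen optimally each time.
        for (long ratio = 4; ratio <= 20; ratio += 4) {
            int kk = optimalK(ratio * n, n);
            System.out.printf("m/n = %d, k = %d, rate = %.6f%n",
                    ratio, kk, falsePositiveRate(ratio * n, n, kk));
        }
    }
}
```

The loop at the end shows how quickly the rate falls as more bits per element are allowed, which is the trade-off to weigh when sizing a filter.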

The Bloom filter false positive rate table is as follows:

(Table: Bloom filter false positive rates; not available in this copy.)


A simple Bloom filter implementation is as follows:

package com.algorithm;

import java.util.BitSet;

public class SimpleBloomFilter {
    // One seed per hash function, so k = 7.
    private static final int[] seeds = new int[] {5, 7, 11, 13, 31, 37, 61};

    private BitSet bits;
    private SimpleHash[] hashFunctions = new SimpleHash[seeds.length];

    public SimpleBloomFilter() {
        this(2 << 24); // 2^25 bits, about 4 MB
    }

    // size should be a power of two, because SimpleHash masks with (size - 1).
    public SimpleBloomFilter(int size) {
        bits = new BitSet(size);
        for (int i = 0; i < hashFunctions.length; i++) {
            hashFunctions[i] = new SimpleHash(size, seeds[i]);
        }
    }

    // Sets the k bits selected by the hash functions.
    public void add(String value) {
        for (SimpleHash hashFunction : hashFunctions) {
            bits.set(hashFunction.getHashCode(value), true);
        }
    }

    // Returns false if the value is definitely absent, true if it is probably present.
    public boolean contains(String value) {
        if (null == value) {
            return false;
        }
        boolean result = true;
        for (SimpleHash hashFunction : hashFunctions) {
            result = result && bits.get(hashFunction.getHashCode(value));
        }
        return result;
    }

    public static void main(String[] args) {
        SimpleBloomFilter bloomFilter = new SimpleBloomFilter();
        String value = "iAm333";
        System.out.println(bloomFilter.contains(value)); // false: nothing added yet
        bloomFilter.add(value);
        System.out.println(bloomFilter.contains(value)); // true
    }
}

The code that calculates the hash value is:

package com.algorithm;

public class SimpleHash {
    private int size;
    private int seed;

    SimpleHash(int size, int seed) {
        this.size = size;
        this.seed = seed;
    }

    public int getHashCode(String value) {
        int result = 0;
        int length = value.length();
        for (int i = 0; i < length; i++) {
            result = seed * result + value.charAt(i);
        }
        // Keep only the low bits so the index falls in [0, size - 1].
        return (size - 1) & result;
    }
}

When reprinting, please credit the source: http://blog.csdn.net/iAm333
