[Repost] Bloom Filter: A Tool for Large-Scale Data Processing

Original link: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html

The Bloom filter is a fast lookup structure based on multiple hash functions, proposed by Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set and can tolerate an answer that is not strictly 100% correct.

One. Example

To show why the Bloom filter is worth having, consider an example:

Suppose you are writing a web spider (crawler). Because the links between pages are so tangled, a spider crawling the web is likely to run into cycles. To avoid cycles, it needs to know which URLs it has already visited. Given a URL, how do you tell whether the spider has visited it? A little thought yields several options:

1. Save the visited URLs to a database.

2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

3. Pass each URL through a one-way hash such as MD5 or SHA-1, then save the digest to a HashSet or a database.

4. The bit-map method. Create a BitSet and map each URL to one of its bits with a single hash function.

Methods 1 to 3 store the visited URLs in full; Method 4 only marks the single bit that a URL maps to.

The methods above all work perfectly well on small data sets, but problems appear once the amount of data becomes very large.

Disadvantage of Method 1: as the data volume grows, relational database queries become very slow. And isn't issuing a database query for every single URL overkill?

Disadvantage of Method 2: it consumes too much memory, and consumption keeps growing with the number of URLs. Even with only 100 million URLs of just 50 characters each, that takes 5 GB of memory.

Method 3: an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so Method 3 uses several times less memory than Method 2.
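The memory figures for Methods 2 and 3 can be checked with a few lines of arithmetic. This is a rough sketch that counts payload bytes only and assumes one byte per URL character; it ignores JVM object headers and hash-table overhead, which would make the real numbers larger. The class and method names are illustrative, not from the original article.

```java
final class UrlMemoryEstimate {
    // Payload bytes only (assumption): no object headers, no table overhead.
    static long payloadBytes(long urlCount, int bytesPerEntry) {
        return urlCount * bytesPerEntry;
    }

    public static void main(String[] args) {
        long urls = 100_000_000L;                        // 100 million URLs
        System.out.println(payloadBytes(urls, 50));      // raw URLs, 50 one-byte characters each
        System.out.println(payloadBytes(urls, 128 / 8)); // MD5 digests, 16 bytes each
        System.out.println(payloadBytes(urls, 160 / 8)); // SHA-1 digests, 20 bytes each
    }
}
```

Under these assumptions the raw URLs take 5 GB, the MD5 digests 1.6 GB and the SHA-1 digests 2 GB.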

Method 4 consumes relatively little memory, but its drawback is that a single hash function collides far too often. Remember the various hash-table collision-resolution schemes from data structures class? To bring the collision probability down to 1%, the BitSet has to be 100 times as long as the number of URLs.
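The 1% figure can be sanity-checked numerically. The sketch below assumes an ideal hash function that spreads URLs uniformly over the bits; under that assumption, after n URLs have been marked in m bits, the chance that a new URL lands on an already-set bit is 1 - (1 - 1/m)^n, which is approximately 1 - e^(-n/m). The class and method names are illustrative.

```java
final class SingleHashCollision {
    // Probability that a fresh URL hits a bit that is already set,
    // assuming a uniformly distributing hash function (an idealisation).
    static double collisionProbability(double urlsPerBit) {
        return 1.0 - Math.exp(-urlsPerBit);
    }

    public static void main(String[] args) {
        // BitSet 100 times as long as the number of URLs: n/m = 0.01
        System.out.println(collisionProbability(1.0 / 100.0)); // just under 0.01
        // BitSet only 10 times as long: the collision rate rises to roughly 9.5%
        System.out.println(collisionProbability(1.0 / 10.0));
    }
}
```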

In essence, the methods above all ignore an important implicit condition: a small probability of error is acceptable, and 100% accuracy is not required. In other words, it costs very little if a few URLs that the spider has never actually visited are wrongly judged as visited. At worst, a few pages go uncrawled.

Two. The Bloom Filter Algorithm

Enough preamble; now for the protagonist of this article, the Bloom filter. The idea behind Method 4 above is in fact already very close to it. The fatal flaw of Method 4 is its high collision probability; to reduce that probability, the Bloom filter uses several hash functions instead of one.

The Bloom filter algorithm is as follows:

Create an m-bit BitSet and initialize all bits to 0, then choose k different hash functions. Write h(i, str) for the result of applying the i-th hash function to the string str; h(i, str) ranges from 0 to m-1.

(1) Adding a string

The following is what happens to each string. First, the process of recording a string str in the BitSet:

For the string str, compute h(1, str), h(2, str), ..., h(k, str). Then set bits h(1, str), h(2, str), ..., h(k, str) of the BitSet to 1.

Figure 1. Adding a string to a Bloom filter

Simple, isn't it? This maps the string str to k bits in the BitSet.

(2) Checking whether a string exists

The following is the process of checking whether a string str has been recorded in the BitSet:

For the string str, compute h(1, str), h(2, str), ..., h(k, str). Then check whether bits h(1, str), h(2, str), ..., h(k, str) of the BitSet are all 1. If any one of them is not 1, str has definitely not been recorded. If all of them are 1, str is considered to exist.

If the bits for a string are not all 1, the string has certainly not been recorded by the Bloom filter. (This is obvious: once a string is recorded, all of its bits are set to 1.)

If the bits for a string are all 1, however, you cannot be 100% sure that the Bloom filter recorded it, because all of its bits may happen to have been set by other strings. Misclassifying a string this way is called a false positive.
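A toy example makes the two outcomes concrete. The sketch below uses m = 16 bits and k = 2 deliberately weak, made-up hash functions (first character and last character, each modulo 16) so that the collision can be followed by hand. It is an illustration only, assumes non-empty strings, and is not a usable filter.

```java
import java.util.BitSet;

final class ToyBloomFilter {
    private static final int M = 16;              // toy bit array size
    private final BitSet bits = new BitSet(M);

    // Two hypothetical, intentionally poor hash functions (non-empty strings only).
    private static int h1(String s) { return s.charAt(0) % M; }
    private static int h2(String s) { return s.charAt(s.length() - 1) % M; }

    void add(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
    }

    boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s));
    }

    public static void main(String[] args) {
        ToyBloomFilter f = new ToyBloomFilter();
        f.add("ab"); // sets bits 1 and 2 ('a' = 97, 'b' = 98)
        f.add("cd"); // sets bits 3 and 4 ('c' = 99, 'd' = 100)
        System.out.println(f.mightContain("ab")); // true: it was added
        System.out.println(f.mightContain("xy")); // false: bit 8 ('x' = 120) was never set
        System.out.println(f.mightContain("ad")); // true, but a false positive:
                                                  // bit 1 came from "ab" and bit 4 from "cd"
    }
}
```

The string "ad" was never added, yet both of its bits were set by other strings, which is exactly the false-positive case described above.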

(3) Deleting a string

Once added, a string cannot be deleted, because clearing its bits would affect other strings. If deletion is really needed, use a Counting Bloom Filter (CBF), a variant of the basic Bloom filter that replaces each bit with a counter, which makes deleting strings possible.

The Bloom filter differs from the single-hash-function bit-map in that it uses k hash functions, so each string corresponds to k bits, which reduces the probability of collisions.
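A minimal sketch of the counting idea follows, assuming int counters and three seeded polynomial hashes; the seeds, the class name and the method names are illustrative choices, not taken from the original article. Adding increments the k counters, deleting decrements them, and a string is considered present while all of its counters are non-zero.

```java
final class CountingBloomFilter {
    private final int[] counters;                  // one counter per slot instead of one bit
    private final int[] seeds = {5, 7, 11};        // hypothetical seeds, one per hash function

    CountingBloomFilter(int size) {
        counters = new int[size];
    }

    private int index(int seed, String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = seed * h + s.charAt(i);
        }
        return Math.floorMod(h, counters.length);  // always in [0, size - 1]
    }

    void add(String s) {
        for (int seed : seeds) counters[index(seed, s)]++;
    }

    boolean mightContain(String s) {
        for (int seed : seeds) {
            if (counters[index(seed, s)] == 0) return false;
        }
        return true;
    }

    // Only meaningful for strings that were really added: removing a false
    // positive would wrongly decrement counters that belong to other strings.
    void remove(String s) {
        if (!mightContain(s)) return;
        for (int seed : seeds) counters[index(seed, s)]--;
    }
}
```

The price of deletion is memory: each slot now holds a counter rather than a single bit.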

Three. Bloom Filter Parameter Selection

(1) Choosing the hash functions

The choice of hash functions should have a considerable effect on performance; a good hash function maps strings to each bit with approximately equal probability. Choosing k different hash functions is tedious; a simple approach is to pick one hash function and feed it k different parameters.
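One way to read "one hash function with k different parameters" is a single polynomial hash whose multiplier is a seed, with one seed per hash function. The sketch below assumes that reading; the names are illustrative, and the polynomial hash is chosen for simplicity rather than quality.

```java
final class SeededHash {
    // One polynomial hash, parameterised by a seed; each seed behaves
    // like a separate hash function h(i, str).
    static int hash(int seed, String s, int m) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = seed * h + s.charAt(i);
        }
        return Math.floorMod(h, m); // index in [0, m - 1], even if h overflowed
    }

    // The k bit positions for a string, one per seed.
    static int[] indexes(String s, int[] seeds, int m) {
        int[] out = new int[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            out[i] = hash(seeds[i], s, m);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] idx = indexes("ab", new int[] {5, 7}, 1024);
        System.out.println(idx[0] + " " + idx[1]); // 583 777 (5*97+98 and 7*97+98)
    }
}
```

A weakness of this particular scheme is that a one-character string yields the same index for every seed, so real implementations usually prefer a stronger base hash.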

(2) Choosing the bit array size

For the relationship between the number of hash functions k, the bit array size m, and the number of strings added n, see reference [1]. It proves that, for given m and n, the error probability is smallest when k = ln(2) * m/n.
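The standard argument behind that result can be summarised as follows, assuming the k hash functions are independent and distribute strings uniformly over the m bits (an idealisation):

```latex
\begin{gather*}
\Pr[\text{a given bit is still 0 after adding } n \text{ strings}]
  = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \\
p \approx \left(1 - e^{-kn/m}\right)^{k} \\
\frac{d}{dk}\,\ln p = 0
  \;\Longrightarrow\; e^{-kn/m} = \frac{1}{2}
  \;\Longrightarrow\; k = \ln(2)\cdot\frac{m}{n} \\
p_{\min} = \left(\frac{1}{2}\right)^{k} \approx 0.6185^{\,m/n}
\end{gather*}
```

Here p is the false positive probability. At the optimum about half of the bits are set to 1.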

Reference [1] also gives the error probability for specific values of k, m and n. For example, with k = 10 hash functions and a bit array of size m equal to 20 times the number of strings n, the false positive probability is 0.0000889, which is basically good enough for a web crawler.

Four. Bloom Filter Implementation Code

Here is a simple Java implementation of the Bloom filter:
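The quoted figure can be reproduced with the usual approximation p = (1 - e^(-kn/m))^k, which assumes independent, uniformly distributing hash functions. The class and method names below are illustrative.

```java
final class FalsePositiveRate {
    // Approximate false positive probability for k hash functions
    // and m/n bits per stored string.
    static double rate(int k, double bitsPerString) {
        return Math.pow(1.0 - Math.exp(-k / bitsPerString), k);
    }

    // The k that minimises the rate for a given m/n: k = ln(2) * m/n.
    static double optimalK(double bitsPerString) {
        return Math.log(2) * bitsPerString;
    }

    public static void main(String[] args) {
        System.out.println(rate(10, 20.0));   // about 8.89e-5, i.e. 0.0000889
        System.out.println(optimalK(20.0));   // about 13.86
    }
}
```

Note that k = 10 is somewhat below the optimum of roughly 14 for m = 20n; it trades a slightly higher error rate for fewer hash computations per lookup.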

import java.util.BitSet;

public class BloomFilter
{
    /* The BitSet is initially allocated 2^25 bits */
    private static final int DEFAULT_SIZE = 1 << 25;
    /* Seeds for the different hash functions; these should generally be prime numbers */
    private static final int[] seeds = new int[] {5, 7, 11, 13, 31, 37, 61};
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /* Hash function objects */
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public BloomFilter()
    {
        for (int i = 0; i < seeds.length; i++)
        {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Mark a string in bits
    public void add(String value)
    {
        for (SimpleHash f : func)
        {
            bits.set(f.hash(value), true);
        }
    }

    // Determine whether a string has been marked in bits
    public boolean contains(String value)
    {
        if (value == null)
        {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func)
        {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /* Hash function class */
    public static class SimpleHash
    {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed)
        {
            this.cap = cap;
            this.seed = seed;
        }

        // Hash function: a simple weighted sum of the characters
        public int hash(String value)
        {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++)
            {
                result = seed * result + value.charAt(i);
            }
            // cap is a power of two, so this mask keeps the index in [0, cap - 1]
            return (cap - 1) & result;
        }
    }
}

References:

[1] Pei Cao. Bloom Filters - the math.

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

[2] Wikipedia. Bloom filter.

http://en.wikipedia.org/wiki/Bloom_filter
