[Repost] Bloom Filter: A Powerful Tool for Large-Scale Data Processing

Last Update:2015-08-11 Source: Internet

Author: User

Original link http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html

A Bloom filter is a fast lookup structure based on multiple hash functions, proposed by Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set and can tolerate a small error rate instead of strict 100% accuracy.

One. Example

To show why Bloom filters matter, consider an example:

Suppose you are writing a spider (web crawler). Because the links between pages are so tangled, a spider following them can easily end up going around in loops. To avoid loops, the spider needs to know which URLs it has already visited. Given a URL, how do you tell whether the spider has visited it? A few options come to mind:

1. Save the visited URLs in a database.

2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

3. Hash each URL with a one-way hash such as MD5 or SHA-1, then save the digest in a HashSet or a database.

4. Bit-map method. Create a BitSet and map each URL to one bit of it through a hash function.

Methods 1 to 3 store the visited URLs in full; method 4 only marks one mapped bit per URL.

These methods all solve the problem well when the amount of data is small, but trouble starts once the amount of data becomes very large.

Disadvantage of method 1: once the data volume becomes very large, relational database queries become very slow. And isn't launching a database query for every single URL overkill?

Disadvantage of method 2: it consumes too much memory, and memory use keeps growing with the number of URLs. Even with only 100 million URLs of 50 characters each, the characters alone take about 5 GB of memory.

Method 3: an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so method 3 uses several times less memory than method 2.
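To make the digest sizes in method 3 concrete, here is a minimal sketch using the standard java.security.MessageDigest class. The class name UrlDigestDemo and the sample URL are made up for illustration; the point is only that the digest length stays fixed however long the URL is.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class UrlDigestDemo
{
    // Returns the digest of a URL under the named algorithm ("MD5" or "SHA-1").
    static byte[] digest(String algorithm, String url)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            return md.digest(url.getBytes(StandardCharsets.UTF_8));
        }
        catch (NoSuchAlgorithmException e)
        {
            // MD5 and SHA-1 are required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        // Hypothetical URL, used only to show the sizes.
        String url = "http://www.example.com/some/fairly/long/path?page=1";
        System.out.println("MD5 bytes:   " + digest("MD5", url).length);   // 16 bytes = 128 bits
        System.out.println("SHA-1 bytes: " + digest("SHA-1", url).length); // 20 bytes = 160 bits
    }
}
```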

Method 4 uses little memory, but its weakness is that the collision probability with a single hash function is too high. Remember all those collision-resolution schemes for hash tables from your data structures course? To bring the collision probability down to 1%, the length of the BitSet has to be about 100 times the number of URLs.
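The 1% figure can be checked with a little arithmetic. The sketch below assumes an ideal hash function that spreads n URLs uniformly over m bits; under that assumption the chance that a new URL lands on an already-set bit is 1 - (1 - 1/m)^n, which is approximately 1 - e^(-n/m). The class and method names are made up for illustration.

```java
final class SingleHashCollision
{
    // Approximate probability that a new URL hits a bit already set by one of
    // n earlier URLs, given m bits and one uniform hash function: 1 - e^(-n/m).
    static double collisionProbability(long n, long m)
    {
        // expm1(x) computes e^x - 1 accurately for small x
        return -Math.expm1(-(double) n / (double) m);
    }

    public static void main(String[] args)
    {
        long n = 100_000_000L;          // number of URLs already recorded
        for (long factor : new long[] {1, 10, 100})
        {
            System.out.println("m = " + factor + " * n: "
                    + collisionProbability(n, factor * n));
        }
    }
}
```

With m = 100 * n the result comes out just under 0.01, which matches the claim in the text.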

Essentially, the methods above all ignore an important implicit condition: a small probability of error is acceptable, and 100% accuracy is not required! In other words, wrongly judging a few URLs as visited when the spider has never actually visited them costs very little. At worst, a few pages don't get crawled.

Two. The Bloom Filter Algorithm

Enough preamble. Let's introduce the protagonist of this article, the Bloom filter. In fact, the idea behind method 4 above is already very close to a Bloom filter. The fatal flaw of method 4 is its high collision probability; to reduce it, a Bloom filter uses multiple hash functions instead of one.

The Bloom filter algorithm is as follows:

Create an m-bit BitSet, initialize all bits to 0, and choose k different hash functions. Denote the result of hashing the string str with the i-th hash function as h(i, str); the range of h(i, str) is 0 to m-1.

(1) Adding a string

The following is how each string is processed. First, the process of "recording" the string str in the BitSet:

For the string str, compute h(1, str), h(2, str), ..., h(k, str). Then set bits h(1, str), h(2, str), ..., h(k, str) of the BitSet to 1.

Figure 1. Adding a string to a Bloom filter

Simple, isn't it? This maps the string str to k bits in the BitSet.
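The add step can be sketched in a few lines. This is only an illustration under assumed parameters: the bit array size M, the seed list (which fixes k = 4), and the seeded polynomial hash h are all made up here, and the index i runs from 0 rather than 1 as in the text.

```java
import java.util.BitSet;

final class AddSketch
{
    static final int M = 1 << 16;               // bit array size m (assumed; a power of two)
    static final int[] SEEDS = {5, 7, 11, 13};  // one seed per hash function, so k = 4 (assumed)

    // Hypothetical h(i, str): a seeded polynomial hash reduced to the range 0..M-1.
    static int h(int i, String str)
    {
        int result = 0;
        for (int j = 0; j < str.length(); j++)
        {
            result = SEEDS[i] * result + str.charAt(j);
        }
        return result & (M - 1);                // mask works because M is a power of two
    }

    // Records str by setting bits h(0, str) .. h(k-1, str) to 1.
    static void add(BitSet bits, String str)
    {
        for (int i = 0; i < SEEDS.length; i++)
        {
            bits.set(h(i, str));
        }
    }

    public static void main(String[] args)
    {
        BitSet bits = new BitSet(M);
        add(bits, "http://www.example.com/");
        System.out.println("bits set: " + bits.cardinality()); // at most k
    }
}
```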

(2) Checking whether a string exists

The following is the procedure for checking whether the string str has been recorded in the BitSet:

For the string str, compute h(1, str), h(2, str), ..., h(k, str). Then check whether bits h(1, str), h(2, str), ..., h(k, str) of the BitSet are all 1. If any one of them is not 1, str has definitely not been recorded. If all of them are 1, str is "considered" to exist.

If the bits corresponding to a string are not all 1, the string has certainly not been recorded by the Bloom filter. (This is obvious: once a string is recorded, all of its corresponding bits must have been set to 1.)

But if the bits corresponding to a string are all 1, you still cannot be 100% sure that the string was recorded by the Bloom filter, because all of its bits may happen to have been set by other strings. Misjudging a string in this way is called a false positive.
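The asymmetry between the two answers is easy to see in code. The sketch below reuses the same assumed parameters as before (bit array size M, four prime seeds, a seeded polynomial hash); the class and method names are hypothetical. A return value of false is definite, while true only means "probably recorded".

```java
import java.util.BitSet;

final class MembershipSketch
{
    private static final int M = 1 << 16;               // bit array size m (assumed)
    private static final int[] SEEDS = {5, 7, 11, 13};  // k = 4 seeds (assumed)
    private final BitSet bits = new BitSet(M);

    // Hypothetical seeded polynomial hash, reduced to 0..M-1.
    private static int h(int seed, String str)
    {
        int result = 0;
        for (int j = 0; j < str.length(); j++)
        {
            result = seed * result + str.charAt(j);
        }
        return result & (M - 1);
    }

    void add(String str)
    {
        for (int seed : SEEDS)
        {
            bits.set(h(seed, str));
        }
    }

    // false: str was definitely never added.
    // true:  str was probably added, but this may be a false positive.
    boolean mightContain(String str)
    {
        for (int seed : SEEDS)
        {
            if (!bits.get(h(seed, str)))
            {
                return false;   // one unset bit is enough to rule the string out
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        MembershipSketch filter = new MembershipSketch();
        filter.add("http://www.example.com/a");
        System.out.println(filter.mightContain("http://www.example.com/a"));
        System.out.println(filter.mightContain("http://www.example.com/b"));
    }
}
```

A string that was added always reports true, so the filter never produces false negatives; only the "true" answers carry uncertainty.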

(3) Deleting a string

Once a string has been added it cannot be deleted, because clearing its bits would affect other strings. If you really need deletion, you can use a counting Bloom filter (CBF), a variant of the basic Bloom filter that replaces each bit with a counter so that strings can be removed.
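A counting Bloom filter can be sketched by swapping the BitSet for an array of counters. This is a simplified illustration, not a reference implementation: it uses a full int per position for clarity (practical variants typically use much smaller counters), and the sizes, seeds, and hash function are assumptions.

```java
final class CountingBloomSketch
{
    private static final int M = 1 << 16;               // number of counters (assumed)
    private static final int[] SEEDS = {5, 7, 11, 13};  // k = 4 seeds (assumed)
    private final int[] counters = new int[M];          // one counter per position instead of one bit

    // Hypothetical seeded polynomial hash, reduced to 0..M-1.
    private static int h(int seed, String str)
    {
        int result = 0;
        for (int j = 0; j < str.length(); j++)
        {
            result = seed * result + str.charAt(j);
        }
        return result & (M - 1);
    }

    void add(String str)
    {
        for (int seed : SEEDS)
        {
            counters[h(seed, str)]++;
        }
    }

    // Only call this for strings that were actually added; removing a string
    // that was never added would corrupt the counts of other strings.
    void remove(String str)
    {
        for (int seed : SEEDS)
        {
            int idx = h(seed, str);
            if (counters[idx] > 0)
            {
                counters[idx]--;
            }
        }
    }

    boolean mightContain(String str)
    {
        for (int seed : SEEDS)
        {
            if (counters[h(seed, str)] == 0)
            {
                return false;
            }
        }
        return true;
    }
}
```

Removing one string only decrements the counters it incremented, so the positions it shares with other strings stay non-zero for them.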

The Bloom filter differs from the single-hash-function bit-map in that it uses k hash functions, so each string corresponds to k bits, which reduces the collision probability.

Three. Bloom Filter Parameter Selection

(1) Hash Function Selection

The choice of hash functions has a large effect on performance: a good hash function maps strings to each bit with approximately equal probability. Choosing k genuinely different hash functions is tedious; a simple approach is to pick one hash function and feed it k different parameters.
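The "one function, k parameters" idea might look like the sketch below. The polynomial hash, the seed values, and the names SeededHashes and positions are assumptions chosen for illustration. Math.floorMod is used so that any m works, not only powers of two, and so that an overflowed negative hash still maps into 0..m-1.

```java
final class SeededHashes
{
    // One hash function, parameterized by a seed: a polynomial hash over the characters.
    static int hash(int seed, String str)
    {
        int result = 0;
        for (int i = 0; i < str.length(); i++)
        {
            result = seed * result + str.charAt(i);
        }
        return result;
    }

    // Feeds k different seeds to the same function, yielding k bit positions in 0..m-1.
    static int[] positions(String str, int[] seeds, int m)
    {
        int[] out = new int[seeds.length];
        for (int i = 0; i < seeds.length; i++)
        {
            // floorMod keeps the result non-negative even when the int hash overflowed
            out[i] = Math.floorMod(hash(seeds[i], str), m);
        }
        return out;
    }

    public static void main(String[] args)
    {
        int[] seeds = {5, 7, 11, 13};   // assumed seeds, so k = 4
        int[] pos = positions("http://www.example.com/", seeds, 1000);
        System.out.println(java.util.Arrays.toString(pos));
    }
}
```

One caveat of this particular hash: for a one-character string every seed produces the same value, so the k positions coincide. That is one reason to prefer a stronger hash function when short keys matter.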

(2) Bit array size selection

The relationship between the number of hash functions k, the bit array size m, and the number of added strings n is covered in reference [1]. It shows that, for given m and n, the error probability is smallest when k = ln(2) * m/n.
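As a quick illustration of the formula k = ln(2) * m/n, the sketch below computes k and rounds it to the nearest whole number of hash functions. The rounding and the lower bound of 1 are choices made here for the example, not something the reference prescribes.

```java
final class OptimalK
{
    // k = ln(2) * m / n, rounded to the nearest integer and never less than 1.
    static int optimalK(long m, long n)
    {
        return Math.max(1, (int) Math.round(Math.log(2) * m / n));
    }

    public static void main(String[] args)
    {
        long n = 100_000_000L;                      // number of strings
        System.out.println(optimalK(10 * n, n));    // m/n = 10
        System.out.println(optimalK(20 * n, n));    // m/n = 20
    }
}
```

For m/n = 10 this gives 7 hash functions, and for m/n = 20 it gives 14.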

The reference also gives the error probability for specific values of k, m, and n. For example, according to reference [1], with k = 10 hash functions and a bit array size m of 20 times the number of strings n, the false positive probability is 0.0000889, which is basically good enough for a web crawler.
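The quoted figure can be reproduced with the standard approximation for the false positive probability, (1 - e^(-k*n/m))^k, which assumes ideal, independent hash functions. The class and method names below are made up for this sketch.

```java
final class FalsePositiveRate
{
    // Approximate false positive probability of a Bloom filter with k hash
    // functions, m bits, and n added strings: (1 - e^(-k*n/m))^k.
    static double falsePositiveRate(int k, long m, long n)
    {
        double fractionOfBitsSet = 1.0 - Math.exp(-(double) k * n / m);
        return Math.pow(fractionOfBitsSet, k);
    }

    public static void main(String[] args)
    {
        long n = 100_000_000L;  // number of strings
        long m = 20 * n;        // bit array 20 times the number of strings
        System.out.println(falsePositiveRate(10, m, n)); // about 0.0000889
    }
}
```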

Four. Bloom Filter Implementation Code

Here is a simple Java implementation of a Bloom filter:

import java.util.BitSet;

public class BloomFilter
{
    /* The BitSet is initially allocated 2^25 bits; the size must be a power of two (see SimpleHash.hash) */
    private static final int DEFAULT_SIZE = 1 << 25;
    /* Seeds for the different hash functions; these should generally be prime numbers */
    private static final int[] seeds = new int[] {5, 7, 11, 13, 31, 37, 61};
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /* Hash function objects */
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public BloomFilter()
    {
        for (int i = 0; i < seeds.length; i++)
        {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Marks a string in bits
    public void add(String value)
    {
        for (SimpleHash f : func)
        {
            bits.set(f.hash(value), true);
        }
    }

    // Determines whether a string has been marked in bits
    public boolean contains(String value)
    {
        if (value == null)
        {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func)
        {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /* Hash function class */
    public static class SimpleHash
    {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed)
        {
            this.cap = cap;
            this.seed = seed;
        }

        // Hash function: a simple weighted-sum hash
        public int hash(String value)
        {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++)
            {
                result = seed * result + value.charAt(i);
            }
            // Masking with cap - 1 keeps the index in 0..cap-1 (cap must be a power of two)
            return (cap - 1) & result;
        }
    }
}

References:

[1] Pei Cao. Bloom Filters - the math.

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

[2] Wikipedia. Bloom filter.

http://en.wikipedia.org/wiki/Bloom_filter
