Big Data processing algorithm two: Bloom filter algorithm

Last Update:2015-04-29 Source: Internet

Author: User



Baidu interview question: Given two files, a and b, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs common to both files.

A Bloom filter is a fast lookup structure that maps each element through multiple hash functions. It was proposed by Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set and where the answer does not have to be strictly 100% correct.

I. Examples

To show why the Bloom filter is needed, consider two examples:

(Example I) Suppose you want to write a spider (web crawler). Because the links between sites are intricate, a spider crawling the web is likely to run into cycles. To avoid a cycle, the spider needs to know which URLs it has already visited. Given a URL, how do you tell whether the spider has visited it? Think about it for a moment.

(Example II) Given two files, a and b, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs common to both files.

There are several options:

1. Save the visited URLs to a database.

2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

3. Pass each URL through a one-way hash such as MD5 or SHA-1, then save the digest to a HashSet or a database.

4. Bitmap method. Create a BitSet and map each URL to one bit of it with a hash function.

Methods 1 to 3 keep a full record of every visited URL (or of its digest), while Method 4 only marks the single bit that each URL maps to.
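As a rough illustration of Methods 2 and 3, the sketch below keeps visited URLs in a HashSet, once as full strings and once as 16-byte MD5 digests. The class name VisitedUrls and its method names are hypothetical, chosen only for this example, and the URLs are placeholders.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original article's code.
class VisitedUrls {
    // Method 2: keep every visited URL in full.
    private final Set<String> fullUrls = new HashSet<>();
    // Method 3: keep only the MD5 digest of each URL.
    // ByteBuffer compares by content, so it can serve as a set key.
    private final Set<ByteBuffer> digests = new HashSet<>();

    /** Records the URL in full; returns true if it had not been seen before. */
    boolean markVisitedFull(String url) {
        return fullUrls.add(url);
    }

    /** Records only the digest of the URL; returns true if it had not been seen before. */
    boolean markVisitedDigest(String url) {
        return digests.add(ByteBuffer.wrap(md5(url)));
    }

    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        VisitedUrls visited = new VisitedUrls();
        System.out.println(visited.markVisitedFull("http://example.com/a"));   // first visit
        System.out.println(visited.markVisitedFull("http://example.com/a"));   // repeat visit
        System.out.println(visited.markVisitedDigest("http://example.com/b")); // first visit
        System.out.println(visited.markVisitedDigest("http://example.com/b")); // repeat visit
    }
}
```

Both variants answer the "visited before?" question in close to constant time; they differ only in how many bytes each URL costs.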

These methods solve the problem perfectly well when the amount of data is small, but problems appear once the amount of data becomes very large.

Disadvantage of Method 1: as the data volume grows very large, relational database queries become very slow. And isn't launching a database query for every single URL overkill?

Disadvantage of Method 2: it consumes too much memory, and consumption keeps growing with the number of URLs. Even with only 100 million URLs of just 50 characters each, it needs about 5 GB of memory.

Method 3: an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so Method 3 uses several times less memory than Method 2.
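The memory figures for Methods 2 and 3 can be checked with a little arithmetic. The sketch below assumes one byte per URL character, counts 1 GB as 10^9 bytes, and ignores the per-entry overhead of the container, so the real numbers would be higher. The class name UrlMemoryEstimate is hypothetical.

```java
// Back-of-the-envelope estimate; container overhead is deliberately ignored.
class UrlMemoryEstimate {
    /** Bytes needed to store `count` items of `bytesEach` bytes each. */
    static long rawBytes(long count, long bytesEach) {
        return count * bytesEach;
    }

    public static void main(String[] args) {
        long urls = 100_000_000L; // 100 million URLs, as in the text

        // Method 2: 50 characters per URL, assumed to take 1 byte each.
        System.out.println("Full URLs:      " + rawBytes(urls, 50) / 1e9 + " GB");
        // Method 3 with MD5: 128 bits = 16 bytes per URL.
        System.out.println("MD5 digests:    " + rawBytes(urls, 16) / 1e9 + " GB");
        // Method 3 with SHA-1: 160 bits = 20 bytes per URL.
        System.out.println("SHA-1 digests:  " + rawBytes(urls, 20) / 1e9 + " GB");
    }
}
```

Under these assumptions the digests cost 1.6 GB or 2 GB instead of 5 GB, which is the "several times" saving mentioned above.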

Method 4 consumes relatively little memory, but its disadvantage is that the collision probability of a single hash function is too high. Remember the various hash table collision resolution schemes from your data structures course? To bring the collision probability down to 1%, the BitSet must be 100 times as long as the number of URLs.
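The 1% figure can be reproduced with a simple model. Assuming the hash function spreads URLs uniformly over the bits, the chance that an unseen URL lands on a bit already set by n inserted URLs in a bitmap of m bits is 1 - (1 - 1/m)^n, which is approximately 1 - e^(-n/m). The sketch below evaluates that approximation; the class and method names are hypothetical.

```java
// Collision estimate for a single-hash bitmap, under a uniform-hashing assumption.
class SingleHashBitmap {
    /**
     * Approximate probability that an unseen URL maps to an already-set bit,
     * where bitsPerUrl = m / n (bitmap length divided by number of inserted URLs).
     */
    static double collisionProbability(double bitsPerUrl) {
        return 1.0 - Math.exp(-1.0 / bitsPerUrl);
    }

    public static void main(String[] args) {
        double[] ratios = {1, 10, 100};
        for (double ratio : ratios) {
            System.out.printf("bits per URL = %.0f, collision probability = %.4f%n",
                    ratio, collisionProbability(ratio));
        }
    }
}
```

At 100 bits per URL the estimate comes out just under 1%, which matches the rule of thumb in the text and shows how expensive a single hash function is.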

In essence, the methods above all ignore an important implicit condition: a small probability of error is acceptable, and 100% accuracy is not required. In other words, if a few URLs that the spider has not actually visited are wrongly judged as visited, the cost is small: at worst, a few pages fewer get crawled.

For example, take a set of strings arr: "haha", "hehe", ...

String: "haha"

After hashing algorithm 1: 8

After hashing algorithm 2: 1

After hashing algorithm 3: 3

Set bits 8, 1, and 3 in the BitArray.

Then process the next string: "hehe"

After hashing algorithm 1: 2

After hashing algorithm 2: 1

After hashing algorithm 3: 9

Set bits 2, 1, and 9 in the BitArray. Any further strings are inserted in the same way.

Now determine whether arr contains the string "hee":

After hashing algorithm 1: 0

After hashing algorithm 2: 1

After hashing algorithm 3: 7

It is enough to check whether the bits at subscripts 0, 1, and 7 are all 1. Here the bits at positions 0 and 7 are not 1.

So "hee" is not in arr. If all of the bits were 1, the filter would report the string as contained.
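The walkthrough above can be replayed with java.util.BitSet. In this sketch the hash results are hard-coded from the example, because the three hashing algorithms themselves are not specified. The class name WorkedExample and the positions 1, 2, 3 used for the last check are assumptions added for illustration.

```java
import java.util.BitSet;

// Replays the "haha" / "hehe" / "hee" example with hard-coded hash positions.
class WorkedExample {
    /** Sets one bit per hash result. */
    static void insert(BitSet bits, int... positions) {
        for (int p : positions) {
            bits.set(p);
        }
    }

    /** Returns false as soon as one bit is 0; true only if every bit is 1. */
    static boolean mightContain(BitSet bits, int... positions) {
        for (int p : positions) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(10);
        insert(bits, 8, 1, 3); // "haha"
        insert(bits, 2, 1, 9); // "hehe"

        // "hee" hashes to 0, 1, 7: bits 0 and 7 are not set, so the answer is false.
        System.out.println(mightContain(bits, 0, 1, 7));

        // A hypothetical unseen string hashing to 1, 2, 3 finds all three bits set
        // and is reported as present: this is a false positive.
        System.out.println(mightContain(bits, 1, 2, 3));
    }
}
```

The second check shows why the answer is "definitely not present" or "probably present": bits set by different strings can combine to match a string that was never inserted.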

The Java implementation is as follows:

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/**
 * Bloom filter algorithm
 *
 * @author JYC506
 */
public class BloomFilter {
    /* Number of bits in the filter: 2^25 */
    private static final int BIT_COUNT = 1 << 25;

    /* Hash functions */
    private List<IHashFunction> hashFunctionList;

    /* Constructor */
    public BloomFilter() {
        this.hashFunctionList = new ArrayList<IHashFunction>();
    }

    /* Add a hash function */
    public void addHashFunction(IHashFunction hashFunction) {
        this.hashFunctionList.add(hashFunction);
    }

    /* Remove a hash function */
    public void removeHashFunction(IHashFunction hashFunction) {
        this.hashFunctionList.remove(hashFunction);
    }

    /* Map a hash code to a bit index in [0, BIT_COUNT). Negating a negative
       hash code is not enough: -Integer.MIN_VALUE is still negative, and
       BitSet rejects negative indexes. */
    private static int toIndex(int hashcode) {
        return Math.floorMod(hashcode, BIT_COUNT);
    }

    /* Determine whether the string may be contained */
    public boolean contain(BitSet bitSet, String str) {
        for (IHashFunction hash : hashFunctionList) {
            if (!bitSet.get(toIndex(hash.toHashCode(str)))) {
                return false;
            }
        }
        return true;
    }

    /* Add the string to the BitSet */
    public void toBitSet(BitSet bitSet, String str) {
        for (IHashFunction hash : hashFunctionList) {
            bitSet.set(toIndex(hash.toHashCode(str)), true);
        }
    }

    public static void main(String[] args) {
        BloomFilter bloomFilter = new BloomFilter();
        /* Add 3 hash functions */
        bloomFilter.addHashFunction(new JavaHash());
        bloomFilter.addHashFunction(new RSHash());
        bloomFilter.addHashFunction(new SDBMHash());
        /* BitSet with 1 << 25 bits */
        BitSet bitSet = new BitSet(BIT_COUNT);
        /* Find the strings in test2 that duplicate strings in test1 */
        String[] test1 = new String[] {"haha", "I", "Everyone", "tease", "rich sex", "Xiaomi", "Iphone", "HelloWorld"};
        for (String str1 : test1) {
            bloomFilter.toBitSet(bitSet, str1);
        }
        String[] test2 = new String[] {"haha", "my", "Everyone", "tease", "rich humanity", "millet", "iphone6s", "HelloWorld"};
        for (String str2 : test2) {
            if (bloomFilter.contain(bitSet, str2)) {
                System.out.println("'" + str2 + "' is duplicate");
            }
        }
        System.out.println(new RSHash().toHashCode("haha"));
        System.out.println(new RSHash().toHashCode("AA"));
    }
}

/* Hash function interface */
interface IHashFunction {
    int toHashCode(String str);
}

/* Java's built-in String hash */
class JavaHash implements IHashFunction {
    @Override
    public int toHashCode(String str) {
        return str.hashCode();
    }
}

/* RS hash */
class RSHash implements IHashFunction {
    @Override
    public int toHashCode(String str) {
        int b = 378551;
        int a = 63689;
        int hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = hash * a + str.charAt(i);
            a = a * b;
        }
        return hash;
    }
}

/* SDBM hash */
class SDBMHash implements IHashFunction {
    @Override
    public int toHashCode(String str) {
        int hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = str.charAt(i) + (hash << 6) + (hash << 16) - hash;
        }
        return hash;
    }
}
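Returning to the interview question from the start of the article, one possible approach is to insert every URL of file a into a Bloom filter and then stream file b through it, keeping the URLs the filter reports as present. The sketch below estimates how accurate that would be, using the standard approximation (1 - e^(-kn/m))^k for the false-positive rate of a filter with m bits, n inserted items, and k hash functions. It assumes the whole 4 GB is spent on the filter, which is an upper bound rather than a realistic budget. The class name InterviewEstimate is hypothetical.

```java
// Rough false-positive estimate for the 5-billion-URL interview question.
class InterviewEstimate {
    /** Standard approximation of a Bloom filter's false-positive rate. */
    static double falsePositiveRate(double bits, double items, int hashCount) {
        return Math.pow(1.0 - Math.exp(-hashCount * items / bits), hashCount);
    }

    public static void main(String[] args) {
        double bits = 4.0 * 1024 * 1024 * 1024 * 8; // 4 GB expressed in bits (2^35)
        double urls = 5e9;                          // URLs in file a

        for (int k = 1; k <= 8; k++) {
            System.out.printf("k = %d, estimated false-positive rate = %.4f%n",
                    k, falsePositiveRate(bits, urls, k));
        }
    }
}
```

With roughly 7 bits per URL, the estimate bottoms out at a few percent, so the result would contain a small share of URLs that are not really common to both files. This is the trade-off the article describes: a small error rate in exchange for fitting in memory. Note also that java.util.BitSet is indexed by int, so a filter of this size would have to be split across several bit sets or backed by a long[].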
