Big Data processing algorithm 2: the Bloom filter algorithm

Source: Internet
Author: User
Tags: bitset

Baidu interview question: given two files, a and b, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs that the two files have in common.

The Bloom filter is a fast lookup algorithm based on multiple hash functions, proposed by Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set and do not need the answer to be strictly 100% correct: it may report an element as present when it is not, but never the reverse.

I. Examples

To illustrate why the Bloom filter matters, here are two examples:

(Example 1) Suppose you are writing a spider (web crawler). Because the links between sites are intricate, a spider following them is likely to end up moving in "rings". To avoid a ring, you need to know which URLs the spider has already visited. Given a URL, how do you tell whether the spider has visited it? Think about it for a moment.

(Example 2) Given two files, a and b, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs that the two files have in common.
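The numbers in Example 2 already rule out loading the URLs directly. The following is a minimal sketch of that arithmetic; the class and method names are made up for illustration, and the 4G limit is read as 4 GiB, which is an assumption.

```java
class UrlFileSizing {
    static final long URL_COUNT = 5_000_000_000L; // URLs per file (from the question)
    static final long URL_BYTES = 64L;            // bytes per URL (from the question)
    static final long MEMORY_BYTES = 4L << 30;    // 4G limit, read as 4 GiB (assumption)

    // Size of one file if every URL occupies 64 bytes.
    static long fileBytes() {
        return URL_COUNT * URL_BYTES;
    }

    // How many raw 64-byte URLs fit in memory at once.
    static long urlsThatFitInMemory() {
        return MEMORY_BYTES / URL_BYTES;
    }

    // Bits available per URL if all of the memory were used as one bit array.
    static double bitsPerUrl() {
        return (MEMORY_BYTES * 8.0) / URL_COUNT;
    }

    public static void main(String[] args) {
        System.out.println("bytes per file:        " + fileBytes());
        System.out.println("raw URLs that fit:     " + urlsThatFitInMemory());
        System.out.println("bits per URL (bitset): " + bitsPerUrl());
    }
}
```

Each file is 320 billion bytes, far beyond the limit, and only about 67 million raw URLs fit in memory at once. A bit array, by contrast, would leave a little under 7 bits per URL, which is why bit-level structures are worth considering here.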

There are several options:

1. Save the visited URLs to a database.

2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

3. Hash each URL with a one-way hash such as MD5 or SHA-1, then save the digest to a HashSet or database.

4. Bitmap method. Create a BitSet and map each URL to one bit of it through a single hash function.

Methods 1 to 3 store each visited URL in full (Method 3 as a digest); Method 4 only marks one mapped bit per URL.
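As an illustration of Method 2, here is a minimal sketch of a crawl loop that uses a HashSet to avoid rings. The link graph is a hypothetical in-memory map standing in for real pages, and the class name and URLs are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VisitedSetCrawler {
    // Hypothetical link graph: page -> outgoing links. Page 3 links back to page 1.
    static final Map<String, List<String>> LINKS = Map.of(
            "a.example/1", List.of("a.example/2"),
            "a.example/2", List.of("a.example/3"),
            "a.example/3", List.of("a.example/1"));

    // Breadth-first crawl that skips any URL already in the visited set.
    static Set<String> crawl(String start) {
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // add() returns false when the URL was already visited
            }
            for (String next : LINKS.getOrDefault(url, List.of())) {
                queue.add(next);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a.example/1").size() + " pages visited");
    }
}
```

Without the visited check the loop would never terminate on this graph. The set stores every URL in full, which is exactly the memory cost discussed next.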


The methods above solve the problem perfectly well for small amounts of data, but problems appear once the data volume becomes very large.

Disadvantage of Method 1: once the data volume is very large, relational database queries become very inefficient. And isn't launching a database query for every single URL overkill?

Disadvantage of Method 2: it consumes too much memory, and consumption keeps growing with the number of URLs. Even with only 100 million URLs of just 50 characters each, the URL text alone needs 5 GB of memory.

Method 3: an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so Method 3 uses several times less memory than Method 2.
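The digest sizes can be checked with the standard java.security.MessageDigest class. This is a small sketch only: the class name and sample URL are made up, and the memory figures count digest bytes alone, ignoring HashSet overhead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class UrlDigestSizes {
    // Hash a URL with the named algorithm and return the raw digest bytes.
    static byte[] digest(String algorithm, String url) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            return md.digest(url.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/some/page.html"; // sample URL (assumption)
        long urlCount = 100_000_000L;                         // 100 million URLs, as in the text
        for (String algorithm : new String[] {"MD5", "SHA-1"}) {
            int bytes = digest(algorithm, url).length;
            System.out.println(algorithm + ": " + bytes + " bytes per URL, "
                    + (urlCount * bytes) + " bytes for 100 million URLs");
        }
    }
}
```

At 16 bytes (MD5) or 20 bytes (SHA-1) per URL, 100 million URLs need 1.6 GB or 2 GB of digest data, compared with 5 GB for the 50-character strings.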

Method 4 consumes relatively little memory, but its drawback is that the collision probability of a single hash function is too high. Remember the various ways of resolving hash table collisions from your data structures class? To reduce the collision probability to 1%, the length of the BitSet has to be 100 times the number of URLs.

In essence, the methods above all ignore an important implicit condition: a small probability of error is allowed, and 100% accuracy is not required! In other words, if a few URLs that the spider has not actually visited are wrongly judged as visited, the cost is small: at worst a few pages are not crawled.
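The 1% figure can be checked with the usual approximation for the false-positive rate of a bit array, (1 - e^(-k*n/m))^k for n elements, m bits and k hash functions. The sketch below assumes ideal, uniformly distributed hash functions; the class and method names are illustrative only.

```java
class FalsePositiveEstimate {
    /**
     * Approximate false-positive rate with k hash functions, where
     * bitsPerElement is m / n (bits in the array per inserted element).
     * Assumes uniformly distributed, independent hashes.
     */
    static double rate(double bitsPerElement, int k) {
        return Math.pow(1.0 - Math.exp(-k / bitsPerElement), k);
    }

    public static void main(String[] args) {
        System.out.printf("1 hash,   100 bits per URL: %.4f%n", rate(100, 1));
        System.out.printf("1 hash,    10 bits per URL: %.4f%n", rate(10, 1));
        System.out.printf("7 hashes,  10 bits per URL: %.4f%n", rate(10, 7));
    }
}
```

Under these assumptions a single hash with 100 bits per URL gives a rate just under 1%, matching the text, while seven hashes stay below 1% with only 10 bits per URL. That difference is the reason for using several hash functions instead of one.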

For example, take a set of strings arr: "haha", "hehe", ...

String: "haha"

Hashing algorithm 1 after processing: 8

Hashing algorithm 2 after processing: 1

Hashing algorithm 3 after processing: 3

After inserting it into the BitArray, bits 1, 3 and 8 are set to 1.

Next, process the string "hehe":

Hashing algorithm 1 after processing: 2

Hashing algorithm 2 after processing: 1

Hashing algorithm 3 after processing: 9

After inserting it into the BitArray, bits 1, 2, 3, 8 and 9 are set to 1. Any further strings are inserted in the same way.

Now determine whether arr contains the string "hee":

Hashing algorithm 1 after processing: 0

Hashing algorithm 2 after processing: 1

Hashing algorithm 3 after processing: 7

Simply check whether the bits at subscripts 0, 1 and 7 are all 1. Here bit 1 is 1, but bits 0 and 7 are 0.

So "hee" is not contained in arr. If all three bits were 1, the filter would report the string as contained.
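The walkthrough above can be replayed with java.util.BitSet. This sketch hard-codes the hash values from the example instead of computing them, and the class and method names are invented for illustration.

```java
import java.util.BitSet;

class BitArrayWalkthrough {
    // Set the bit at every position produced by the hash functions.
    static void insert(BitSet bits, int... positions) {
        for (int p : positions) {
            bits.set(p);
        }
    }

    // A string may be present only if all of its positions are set.
    static boolean mightContain(BitSet bits, int... positions) {
        for (int p : positions) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    // Build the bit array from the example: two strings, three hashes each.
    static BitSet build() {
        BitSet bits = new BitSet(10);
        insert(bits, 8, 1, 3); // first string
        insert(bits, 2, 1, 9); // second string
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = build();
        System.out.println(bits);                        // prints {1, 2, 3, 8, 9}
        System.out.println(mightContain(bits, 0, 1, 7)); // prints false
        System.out.println(mightContain(bits, 2, 1, 9)); // prints true
    }
}
```

Position 1 is shared by both inserted strings, which shows how bits overlap. With enough insertions, a string that was never added can find all of its bits already set, and that is the source of false positives.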


The Java code is implemented as follows

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    /**
     * Bloom filter algorithm
     *
     * @author JYC506
     */
    public class BloomFilter {
        /* Hash functions */
        private List<IHashFunction> hashFunctionList;

        /* Constructor */
        public BloomFilter() {
            this.hashFunctionList = new ArrayList<IHashFunction>();
        }

        /* Add a hash function */
        public void addHashFunction(IHashFunction hashFunction) {
            this.hashFunctionList.add(hashFunction);
        }

        /* Remove a hash function */
        public void removeHashFunction(IHashFunction hashFunction) {
            this.hashFunctionList.remove(hashFunction);
        }

        /* Map a hash code to a valid bit index. Math.floorMod never returns a
           negative value; simply negating a negative hash code fails for
           Integer.MIN_VALUE and lets the BitSet grow far beyond its initial size. */
        private static int toIndex(int hashCode, BitSet bitSet) {
            return Math.floorMod(hashCode, bitSet.size());
        }

        /* Determine whether the string may be contained */
        public boolean contain(BitSet bitSet, String str) {
            for (IHashFunction hash : hashFunctionList) {
                int index = toIndex(hash.toHashCode(str), bitSet);
                if (!bitSet.get(index)) {
                    return false;
                }
            }
            return true;
        }

        /* Add the string to the BitSet */
        public void toBitSet(BitSet bitSet, String str) {
            for (IHashFunction hash : hashFunctionList) {
                int index = toIndex(hash.toHashCode(str), bitSet);
                bitSet.set(index, true);
            }
        }

        public static void main(String[] args) {
            BloomFilter bloomFilter = new BloomFilter();
            /* Add 3 hash functions */
            bloomFilter.addHashFunction(new JavaHash());
            bloomFilter.addHashFunction(new RSHash());
            bloomFilter.addHashFunction(new SDBMHash());
            /* BitSet with a length of 2^25 bits */
            BitSet bitSet = new BitSet(1 << 25);
            /* Find the strings in test2 that duplicate strings in test1 */
            String[] test1 = new String[] {"haha", "I", "Everyone", "tease", "rich sex", "Xiaomi", "Iphone", "HelloWorld"};
            for (String str1 : test1) {
                bloomFilter.toBitSet(bitSet, str1);
            }
            String[] test2 = new String[] {"haha", "my", "Everyone", "tease", "rich humanity", "millet", "iphone6s", "HelloWorld"};
            for (String str2 : test2) {
                if (bloomFilter.contain(bitSet, str2)) {
                    System.out.println("'" + str2 + "' is duplicate");
                }
            }
            System.out.println(new RSHash().toHashCode("haha"));
            System.out.println(new RSHash().toHashCode("AA"));
        }
    }

    /* Hash function interface */
    interface IHashFunction {
        int toHashCode(String str);
    }

    /* Java's built-in String hash */
    class JavaHash implements IHashFunction {
        @Override
        public int toHashCode(String str) {
            return str.hashCode();
        }
    }

    /* RS hash */
    class RSHash implements IHashFunction {
        @Override
        public int toHashCode(String str) {
            int b = 378551;
            int a = 63689;
            int hash = 0;
            for (int i = 0; i < str.length(); i++) {
                hash = hash * a + str.charAt(i);
                a = a * b;
            }
            return hash;
        }
    }

    /* SDBM hash */
    class SDBMHash implements IHashFunction {
        @Override
        public int toHashCode(String str) {
            int hash = 0;
            for (int i = 0; i < str.length(); i++) {
                hash = str.charAt(i) + (hash << 6) + (hash << 16) - hash;
            }
            return hash;
        }
    }
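Returning to the interview question at the top, one possible approach (a sketch, not necessarily the expected answer) is to insert every URL of file a into a Bloom filter and then stream file b through it, keeping the URLs that the filter reports as present. The example below uses small in-memory lists as stand-ins for the two files, the same three hash functions as the code above, and an arbitrary toy filter size; the class name, method names, and URLs are invented for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class CommonUrlSketch {
    // Toy filter size (assumption); a real run would size it from the memory budget.
    static final int BITS = 1 << 20;

    // Three bit positions per URL, from String.hashCode, SDBM and RS hashes.
    static int[] positions(String url) {
        int sdbm = 0;
        int rs = 0;
        int a = 63689;
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            sdbm = c + (sdbm << 6) + (sdbm << 16) - sdbm;
            rs = rs * a + c;
            a *= 378551;
        }
        return new int[] {
            Math.floorMod(url.hashCode(), BITS),
            Math.floorMod(sdbm, BITS),
            Math.floorMod(rs, BITS)
        };
    }

    // Returns the URLs of fileB that the filter built from fileA reports as present.
    static List<String> probablyCommon(Iterable<String> fileA, Iterable<String> fileB) {
        BitSet bits = new BitSet(BITS);
        for (String url : fileA) {
            for (int p : positions(url)) {
                bits.set(p);
            }
        }
        List<String> candidates = new ArrayList<>();
        for (String url : fileB) {
            boolean allSet = true;
            for (int p : positions(url)) {
                if (!bits.get(p)) {
                    allSet = false;
                    break;
                }
            }
            if (allSet) {
                candidates.add(url);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> a = List.of("x.example/1", "x.example/2", "x.example/3");
        List<String> b = List.of("x.example/3", "x.example/4", "x.example/1");
        System.out.println(probablyCommon(a, b));
    }
}
```

Every URL that really appears in both files is guaranteed to be in the result, but the result may also contain a few false positives. If an exact answer is required, the candidates would need a second verification pass against file a.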


