Bloom Filter: A Tool for Large-Scale Data Processing

Source: Internet
Author: User
Tags: bitset

A Bloom filter is a fast membership-test structure, proposed by Burton Bloom in 1970, that maps each element through multiple hash functions. It is typically used where you need to decide quickly whether an element belongs to a set and can tolerate a small probability of error.

1. An Example

To see why the Bloom filter is useful, consider an example:

Suppose you are writing a web spider (web crawler). Because the links between pages are so tangled, a spider crawling the web is likely to run into "loops". To avoid them, you need to know which URLs the spider has already visited. How do you know whether the spider has visited a given URL? Consider the following solutions:

1. Save each visited URL in a database.

2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

3. Hash each URL with MD5 or SHA-1 and save the digest in a HashSet or database.

4. Bitmap method. Create a BitSet and map each URL to one bit through a hash function.

Methods 1 to 3 store every visited URL in full (or as a digest). Method 4 only marks one mapped bit per URL.

 

All of the above methods work well when the data volume is small, but each runs into trouble when the data volume becomes very large.

Disadvantage of Method 1: queries against a relational database become very slow when the data volume is very large.

Disadvantage of Method 2: it consumes too much memory. Even with only 100 million URLs of 50 characters each, the URL text alone takes about 5 GB of memory.

Method 3: an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so Method 3 uses several times less memory than Method 2.

Method 4 consumes relatively little memory, but its disadvantage is that the collision probability of a single hash function is too high. (Remember the hash table collision strategies from your data structures course?) To reduce the collision probability to 1%, the length of the BitSet must be set to 100 times the number of URLs.
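The memory figures above can be checked with some back-of-the-envelope arithmetic. The sketch below assumes one byte per URL character and ignores container overhead such as object headers and pointers, which would make Methods 2 and 3 larger in practice. The class and method names are illustrative only.

```java
// Rough memory estimates for Methods 2 to 4, assuming 1 byte per URL
// character and no per-entry container overhead (both are simplifications).
class MemoryEstimate {
    static final long URLS = 100_000_000L;  // 100 million URLs, as in the text
    static final int URL_LENGTH = 50;       // characters per URL, as in the text

    // Method 2: store every URL in full.
    static long rawUrlBytes() {
        return URLS * URL_LENGTH;
    }

    // Method 3: store a fixed-size digest (128 bits for MD5, 160 for SHA-1).
    static long digestBytes(int digestBits) {
        return URLS * (digestBits / 8);
    }

    // Method 4: one bitmap whose length is bitsPerUrl times the URL count.
    static long bitmapBytes(int bitsPerUrl) {
        return URLS * bitsPerUrl / 8;
    }

    public static void main(String[] args) {
        System.out.println("Method 2 (raw URLs), bytes:         " + rawUrlBytes());
        System.out.println("Method 3 (MD5), bytes:              " + digestBytes(128));
        System.out.println("Method 3 (SHA-1), bytes:            " + digestBytes(160));
        System.out.println("Method 4 (100 bits per URL), bytes: " + bitmapBytes(100));
    }
}
```

Under these assumptions the totals come to roughly 5 GB, 1.6 GB, 2 GB, and 1.25 GB respectively, so even the bitmap is expensive once it has to be 100 times longer than the URL count.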

In essence, all of the above approaches overlook an important implicit condition: a small probability of error is acceptable, and 100% accuracy is not required. In other words, if a few URLs that the spider has not actually visited are misjudged as visited, the cost is very small: at worst, a few web pages go uncrawled.

 

2. Bloom Filter Algorithm 


This brings us to the Bloom filter, the main subject of this article. In fact, the idea behind Method 4 above is already very close to a Bloom filter. The fatal disadvantage of Method 4 is its high collision rate. To reduce the collision probability, a Bloom filter uses multiple hash functions instead of one.

The bloom filter algorithm is as follows:

Create an m-bit BitSet, initialize all bits to 0, and then choose k different hash functions. Denote the result of the i-th hash function on string str as h(i, str), where h(i, str) ranges from 0 to m-1.

 

(1) Adding a string

The following describes how each string is processed. The first step is to "record" the string str in the BitSet:

For the string str, compute h(1, str), h(2, str), ..., h(k, str). Then set bits h(1, str), h(2, str), ..., h(k, str) of the BitSet to 1.

 

Figure 1. Adding a string to the Bloom filter

Simple, isn't it? In this way, the string str is mapped to k bits in the BitSet.
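The add step can be sketched in a few lines of Java. This is only an illustration: the bit array size, the seeds, and the seeded polynomial hash are assumptions chosen to keep the example small, and the hash functions are indexed from 0 rather than from 1 as in the text.

```java
import java.util.BitSet;

// Minimal sketch of the "add" step with a tiny bit array so the effect is visible.
// The seeded hash h(i, str) is a hypothetical stand-in, not a recommended hash.
class AddSketch {
    static final int M = 64;                 // bit array size m (assumed)
    static final int[] SEEDS = {7, 11, 13};  // k = 3 hash functions (assumed seeds)

    // h(i, str): polynomial hash with the i-th seed, reduced to the range 0..M-1.
    static int h(int i, String str) {
        int result = 0;
        for (int c = 0; c < str.length(); c++) {
            result = SEEDS[i] * result + str.charAt(c);
        }
        return Math.floorMod(result, M);
    }

    // Set bits h(0, str) .. h(k-1, str) to 1.
    static void add(BitSet bits, String str) {
        for (int i = 0; i < SEEDS.length; i++) {
            bits.set(h(i, str));
        }
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(M);
        add(bits, "http://example.com/");
        // Prints the indices of the bits that are now 1 (at most k of them).
        System.out.println(bits);
    }
}
```

Fewer than k bits may end up set if two of the hash functions happen to map the string to the same position.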

 

(2) Checking whether a string exists

The following describes how to check whether the string str has been recorded in the BitSet:

For the string str, compute h(1, str), h(2, str), ..., h(k, str). Then check whether bits h(1, str), h(2, str), ..., h(k, str) of the BitSet are all 1. If any of them is not 1, str has definitely never been recorded. If all of them are 1, str is "considered" to exist.

 

If the bits corresponding to a string are not all 1, it is certain that the string has not been recorded by the Bloom filter. (This is obvious: if the string had been recorded, all of its bits would have been set to 1.)

However, if the bits corresponding to a string are all 1, you cannot be 100% sure that the string has been recorded by the Bloom filter, because all of its bits may have been set by other strings. Misjudging a string in this way is called a false positive.
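The asymmetry between the two outcomes can be observed with a small experiment. The sketch below uses a deliberately undersized bit array (an assumed m = 256 with k = 3 hypothetical seeds) so that false positives are likely to show up; the exact count depends on the hash function and is not meaningful in itself.

```java
import java.util.BitSet;

// Sketch of the membership check. A deliberately small bit array makes
// false positives likely enough to observe.
class CheckSketch {
    static final int M = 256;                // assumed bit array size
    static final int[] SEEDS = {7, 11, 13};  // hypothetical seeds, k = 3
    final BitSet bits = new BitSet(M);

    static int h(int seed, String str) {
        int result = 0;
        for (int c = 0; c < str.length(); c++) {
            result = seed * result + str.charAt(c);
        }
        return Math.floorMod(result, M);
    }

    void add(String str) {
        for (int seed : SEEDS) {
            bits.set(h(seed, str));
        }
    }

    // Returns false as soon as one bit is 0 ("definitely not recorded");
    // returns true only if all k bits are 1 ("probably recorded").
    boolean mightContain(String str) {
        for (int seed : SEEDS) {
            if (!bits.get(h(seed, str))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CheckSketch filter = new CheckSketch();
        for (int i = 0; i < 50; i++) {
            filter.add("http://example.com/page" + i);
        }
        int falsePositives = 0;
        for (int i = 50; i < 1050; i++) {
            if (filter.mightContain("http://example.com/page" + i)) {
                falsePositives++;
            }
        }
        System.out.println("Recorded URL found: "
                + filter.mightContain("http://example.com/page0"));
        System.out.println("False positives among 1000 unseen URLs: " + falsePositives);
    }
}
```

A recorded URL is always reported as present, while an unseen URL is reported as present only when all of its bits were set by other URLs.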

 

(3) Deleting a string

Once a string has been added, it cannot be deleted, because clearing its bits would affect other strings that share them. If deletion is really needed, a Counting Bloom filter (CBF) can be used. This variant of the basic Bloom filter replaces each bit with a counter, which makes deletion possible.
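A counting variant might look like the following sketch. It uses a plain int per counter for simplicity (real implementations usually pack a few bits per counter to save memory), and the seeds and hash function are again hypothetical.

```java
// Sketch of a counting Bloom filter: each position holds a counter
// instead of a single bit, so a recorded string can be removed again.
class CountingBloomSketch {
    private final int[] counters;
    private final int[] seeds = {7, 11, 13};  // hypothetical seeds, k = 3

    CountingBloomSketch(int m) {
        counters = new int[m];
    }

    private int h(int seed, String str) {
        int result = 0;
        for (int c = 0; c < str.length(); c++) {
            result = seed * result + str.charAt(c);
        }
        return Math.floorMod(result, counters.length);
    }

    void add(String str) {
        for (int seed : seeds) {
            counters[h(seed, str)]++;
        }
    }

    boolean mightContain(String str) {
        for (int seed : seeds) {
            if (counters[h(seed, str)] == 0) {
                return false;
            }
        }
        return true;
    }

    // Only call this for strings that were actually added: removing a string
    // that merely looks present (a false positive) corrupts other entries.
    void remove(String str) {
        if (!mightContain(str)) {
            return;  // definitely not recorded, nothing to undo
        }
        for (int seed : seeds) {
            counters[h(seed, str)]--;
        }
    }
}
```

Removing a string decrements exactly the counters that adding it incremented, so the other recorded strings keep non-zero counters and are still found.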

 

A Bloom filter differs from the single-hash-function bitmap in that it uses k hash functions, so each string corresponds to k bits. This reduces the probability of collision.

 

3. Bloom filter parameter selection 

 

(1) Choosing the hash functions

The choice of hash functions has a great impact on performance. A good hash function should map strings to every bit with approximately equal probability. Choosing k different hash functions is troublesome; a simple alternative is to pick one hash function and feed it k different parameters.
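The "one function, k parameters" idea can be sketched as follows. The polynomial hash and the seed values are assumptions for illustration (the seeds match those in the implementation code later in this article); any well-distributed seeded hash could take their place.

```java
import java.util.Arrays;

// Sketch: derive k hash functions from one parameterized function by varying the seed.
class SeededHashes {
    // One base function; the seed is the parameter that distinguishes the k variants.
    static int hash(String str, int seed, int m) {
        int result = 0;
        for (int c = 0; c < str.length(); c++) {
            result = seed * result + str.charAt(c);
        }
        return Math.floorMod(result, m);  // always in 0..m-1, even after int overflow
    }

    // Returns the k bit positions for str, one per seed.
    static int[] positions(String str, int[] seeds, int m) {
        int[] out = new int[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            out[i] = hash(str, seeds[i], m);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] seeds = {5, 7, 11, 13, 31, 37, 61};
        System.out.println(Arrays.toString(positions("http://example.com/", seeds, 1 << 25)));
    }
}
```

One weakness of this particular hash is that single-character strings hash to the same position for every seed, which is one reason production code tends to use stronger hash functions.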

(2) Choosing the bit array size

The relationship between the number of hash functions k, the bit array size m, and the number of strings n is covered in reference 1. That reference proves that for given m and n, the error probability is smallest when k = ln(2) * m/n.

It also tabulates the error probability for specific values of k, m, and n. For example, according to reference 1, with k = 10 hash functions and a bit array size m of 20 times the number of strings n, the false positive probability is 0.0000889, which is good enough for a web crawler.
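These numbers can be reproduced with the standard approximation for the false positive probability, p = (1 - e^(-k*n/m))^k, which assumes independent, uniformly distributed hash functions. The helper names below are illustrative.

```java
// Sketch: the standard false-positive approximation p = (1 - e^(-k*n/m))^k
// and the optimal number of hash functions k = ln(2) * m / n.
class BloomMath {
    // bitsPerElement is the ratio m/n.
    static double falsePositiveRate(int k, double bitsPerElement) {
        return Math.pow(1.0 - Math.exp(-k / bitsPerElement), k);
    }

    static double optimalK(double bitsPerElement) {
        return Math.log(2) * bitsPerElement;
    }

    public static void main(String[] args) {
        // The example from the text: k = 10 and m = 20n.
        System.out.println(falsePositiveRate(10, 20.0));  // about 8.89e-5
        System.out.println(optimalK(20.0));               // about 13.86
    }
}
```

For m = 20n the optimum is about 14 hash functions, so k = 10 is slightly below optimal but needs fewer hash computations per operation.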

 

4. Bloom filter implementation code

Below is a simple Java implementation of a Bloom filter:

    import java.util.BitSet;

    public class BloomFilter {
        /* The BitSet is allocated 2^25 bits; the size must be a power of two */
        private static final int DEFAULT_SIZE = 1 << 25;
        /* Seeds for the different hash functions; prime numbers are generally used */
        private static final int[] seeds = new int[] { 5, 7, 11, 13, 31, 37, 61 };
        private BitSet bits = new BitSet(DEFAULT_SIZE);
        /* Hash function objects */
        private SimpleHash[] func = new SimpleHash[seeds.length];

        public BloomFilter() {
            for (int i = 0; i < seeds.length; i++) {
                func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
            }
        }

        // Mark the string in bits
        public void add(String value) {
            for (SimpleHash f : func) {
                bits.set(f.hash(value), true);
            }
        }

        // Check whether the string has been marked in bits
        public boolean contains(String value) {
            if (value == null) {
                return false;
            }
            boolean ret = true;
            for (SimpleHash f : func) {
                ret = ret && bits.get(f.hash(value));
            }
            return ret;
        }

        /* Hash function class */
        public static class SimpleHash {
            private int cap;
            private int seed;

            public SimpleHash(int cap, int seed) {
                this.cap = cap;
                this.seed = seed;
            }

            // Hash function: a simple seed-weighted sum of the characters
            public int hash(String value) {
                int result = 0;
                int len = value.length();
                for (int i = 0; i < len; i++) {
                    result = seed * result + value.charAt(i);
                }
                // Masking with (cap - 1) keeps the index in 0..cap-1 when cap is a power of two
                return (cap - 1) & result;
            }
        }
    }
