A Bloom filter (BF) is a fast lookup structure based on multiple hash-function mappings, proposed by Bloom in 1970. It is used to **quickly** test whether an element belongs to a set when 100% accuracy is not required. Bloom filters are often used by crawlers to decide whether a URL has already been crawled. For the principle, I quote another author's article, which explains it clearly, so I will not repeat it in my own words; more detail can be found in the original paper. I read several PHP implementations of the Bloom filter and found none of them very readable, so this article mainly presents my own PHP implementation.

Principle:

< references from this article >

I. Examples

To illustrate why the Bloom filter matters, consider an example:

Suppose you want to write a spider (web crawler). Because links between pages are intricate, a spider crawling the web is likely to run into "rings" (cycles). To avoid them, you need to know which URLs the spider has already visited. Given a URL, how do you know whether the spider has visited it? There are several options:

1. Save the visited URLs in a database.

2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

3. Pass each URL through a one-way hash such as MD5 or SHA-1, then save the digest in a HashSet or database.

4. Bit-map method. Create a bitset and map each URL to one bit through a single hash function.

Methods 1 to 3 store the visited URLs in full; method 4 only marks one mapped bit per URL.

The methods above solve the problem perfectly when the amount of data is small, but problems appear when the amount of data becomes very large.

Disadvantage of method 1: as the data volume grows, relational database queries become very inefficient. Besides, isn't starting a database query for every single URL overkill?

Disadvantage of method 2: it consumes too much memory, and consumption keeps growing with the number of URLs. Even with only 100 million URLs of only 50 characters each, it needs 5 GB of memory.

Method 3: an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so method 3 uses several times less memory than method 2.
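As a rough check of these figures, the sketch below compares the raw storage needed by methods 2 and 3. It assumes one byte per character and counts only the payload (a real HashSet adds per-entry overhead on top); the variable names are illustrative.

```php
<?php
// Back-of-the-envelope storage estimate; ignores per-entry container overhead.
$urls        = 100000000; // 100 million URLs
$bytesPerUrl = 50;        // assumption: 50 one-byte characters per URL
$md5Bytes    = 128 / 8;   // MD5 digest: 128 bits
$sha1Bytes   = 160 / 8;   // SHA-1 digest: 160 bits

$rawGb  = $urls * $bytesPerUrl / 1e9; // method 2: full URLs
$md5Gb  = $urls * $md5Bytes / 1e9;    // method 3 with MD5
$sha1Gb = $urls * $sha1Bytes / 1e9;   // method 3 with SHA-1

printf("raw: %.1f GB, MD5: %.1f GB, SHA-1: %.1f GB\n", $rawGb, $md5Gb, $sha1Gb);
// prints "raw: 5.0 GB, MD5: 1.6 GB, SHA-1: 2.0 GB"
```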

Method 4 consumes relatively little memory, but its disadvantage is that a single hash function has too high a probability of collisions. Remember the various hash-table collision-resolution techniques from your data structures class? To reduce the collision probability to 1%, the bitset length must be set to 100 times the number of URLs.
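The 100-times figure follows from a simple bound. Assuming the single hash function spreads URLs uniformly over a bitset of m bits, and n URLs have already been recorded (so at most n bits are set), the probability that a new URL lands on an already-set bit satisfies:

$$
P(\text{collision}) \le \frac{n}{m}, \qquad m = 100\,n \;\Rightarrow\; P(\text{collision}) \le \frac{1}{100} = 1\%.
$$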

In essence, the methods above all ignore an important implicit condition: a small probability of error is acceptable, and 100% accuracy is not required. In other words, if a few URLs that the spider has never actually visited are wrongly judged as visited, the cost is small: at worst, a few pages are not crawled.

II. Bloom Filter algorithm

Enough preamble; now for the protagonist of this chapter, the Bloom filter. In fact, the idea of method 4 above is already very close to a Bloom filter. The fatal disadvantage of method 4 is its high collision probability. To reduce that probability, the Bloom filter uses multiple hash functions instead of one.

The Bloom filter algorithm is as follows:

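`(1) Initialization`

Create an m-bit bitset, initialize all bits to 0, and choose k different hash functions. The result of hashing a string str with the i-th hash function is written h(i, str), and h(i, str) ranges from 0 to m-1. To record a string str, compute h(1, str), h(2, str), ..., h(k, str) and set those bits of the bitset to 1.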

`(2) Checking whether a string exists`

The following is the procedure for checking whether a string str has been recorded in the bitset:

For a string str, compute h(1, str), h(2, str), ..., h(k, str). Then check whether bits h(1, str), h(2, str), ..., h(k, str) of the bitset are all 1. If any one of them is not 1, str has definitely not been recorded. If all of them are 1, str is "considered" to exist.

If the bits corresponding to a string are not all 1, the string has certainly never been recorded by the Bloom filter. (This is obvious: once a string is recorded, all of its corresponding bits must have been set to 1.)

But if the bits corresponding to a string are all 1, it is still not 100% certain that the string was recorded by the Bloom filter, because those bits may all have been set by other strings. This is called a false positive: the string is wrongly classified as recorded.
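To make the false-positive case concrete, here is a small sketch with an 8-bit bitset and k = 2. The bit positions are hard-coded in a hypothetical lookup table (`$positions`) instead of being computed by real hash functions, purely so that the outcome can be followed by hand.

```php
<?php
// Hypothetical, hand-picked bit positions standing in for h(1,str) and h(2,str).
$positions = [
    'a' => [1, 3],
    'b' => [5, 6],
    'c' => [3, 5], // shares one bit with 'a' and one bit with 'b'
    'd' => [0, 3],
];

$bitset = array_fill(0, 8, 0); // m = 8, all bits start at 0

// Record a string: set each of its k bits to 1.
function record(array &$bitset, array $bits): void
{
    foreach ($bits as $bit) {
        $bitset[$bit] = 1;
    }
}

// Check a string: "possibly recorded" only if every one of its k bits is 1.
function mightContain(array $bitset, array $bits): bool
{
    foreach ($bits as $bit) {
        if ($bitset[$bit] !== 1) {
            return false; // definitely never recorded
        }
    }
    return true;
}

record($bitset, $positions['a']);
record($bitset, $positions['b']);

var_dump(mightContain($bitset, $positions['a'])); // bool(true): recorded
var_dump(mightContain($bitset, $positions['c'])); // bool(true): false positive, 'c' was never recorded
var_dump(mightContain($bitset, $positions['d'])); // bool(false): bit 0 is still 0
```

Only bits 1, 3, 5 and 6 are set after recording 'a' and 'b', yet 'c' passes the check because both of its bits were set by other strings.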

`(3) Deleting a string`

Once a string has been added it cannot be deleted, because clearing its bits would affect other strings. If deletion is really needed, use a counting Bloom filter (CBF), a variant of the basic Bloom filter that replaces each bit with a counter, which makes deleting strings possible.

The Bloom filter differs from the single-hash-function bit-map in that it uses k hash functions, so each string corresponds to k bits, which reduces the probability of collisions.
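A minimal sketch of the counting idea follows. It assumes k salted `crc32` hashes as a stand-in hash family and a 64-bit PHP build, where `crc32` returns a non-negative integer. The class and method names are illustrative and not taken from any library.

```php
<?php
// Counting Bloom filter sketch: each slot holds a counter instead of a single bit.
class CountingBloomFilter
{
    private array $counters;
    private int $m;
    private int $k;

    public function __construct(int $m, int $k)
    {
        $this->m = $m;
        $this->k = $k;
        $this->counters = array_fill(0, $m, 0);
    }

    // Assumed hash family: crc32 of the string prefixed with the index i.
    private function slots(string $str): array
    {
        $slots = [];
        for ($i = 0; $i < $this->k; $i++) {
            $slots[] = crc32($i . ':' . $str) % $this->m;
        }
        return $slots;
    }

    public function add(string $str): void
    {
        foreach ($this->slots($str) as $slot) {
            $this->counters[$slot]++;
        }
    }

    // Only remove strings that were really added; otherwise counters get corrupted.
    public function remove(string $str): void
    {
        foreach ($this->slots($str) as $slot) {
            if ($this->counters[$slot] > 0) {
                $this->counters[$slot]--;
            }
        }
    }

    public function mightContain(string $str): bool
    {
        foreach ($this->slots($str) as $slot) {
            if ($this->counters[$slot] === 0) {
                return false;
            }
        }
        return true;
    }
}

$cbf = new CountingBloomFilter(64, 3);
$cbf->add('http://example.com/a');
$cbf->add('http://example.com/b');
$cbf->remove('http://example.com/a');
var_dump($cbf->mightContain('http://example.com/b')); // bool(true): removing 'a' does not erase 'b'
```

The price of deletion is memory: every slot now needs several bits for its counter instead of one.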

III. Bloom Filter Parameter Selection

(1) Hash function selection

The choice of hash functions has a considerable effect on performance; a good hash function maps strings to each bit with approximately equal probability. Choosing k different hash functions is troublesome; a simple approach is to pick one hash function and feed it k different parameters.
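One way to sketch the "one hash function, k parameters" idea is to prefix the string with the index i before hashing. The helper name `hashPositions` and the choice of `crc32` are assumptions made for illustration (`crc32` is assumed to return a non-negative integer, as it does on 64-bit PHP builds); any well-distributed hash could be substituted.

```php
<?php
// Derive k bit positions in [0, m-1] from one hash function by varying a salt.
function hashPositions(string $str, int $k, int $m): array
{
    $positions = [];
    for ($i = 1; $i <= $k; $i++) {
        // h(i, str): hash the salt i together with the string, then reduce modulo m.
        $positions[] = crc32($i . ':' . $str) % $m;
    }
    return $positions;
}

$bits = hashPositions('http://example.com/', 4, 1000);
print_r($bits); // 4 positions, each between 0 and 999, identical on every run
```

Compared with seeding `mt_rand` from the string, this approach leaves the global random number generator untouched.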

(2) Bit array size selection

For the relationship between the number of hash functions k, the bit array size m, and the number of strings added n, see reference 1. That reference proves that, for given m and n, the error probability is minimal when k = ln(2) * m/n.
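For reference, the standard approximation behind this result, assuming the k hash functions are independent and uniform, is as follows, where p is the false positive probability after n strings have been added to an m-bit array:

$$
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad
k_{\text{opt}} = \frac{m}{n}\ln 2, \qquad
p_{\min} \approx \left(\tfrac{1}{2}\right)^{k_{\text{opt}}} \approx 0.6185^{\,m/n}.
$$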

The reference also gives the error probability for specific values of k, m, and n. For example, with k = 10 hash functions and a bit array size m set to 20 times the number of strings n, the false positive probability is 0.0000889, which is generally good enough for a web crawler.
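The quoted figure can be reproduced with the usual approximation p ≈ (1 - e^(-kn/m))^k, which assumes independent, uniform hash functions. The function name below is illustrative.

```php
<?php
// Approximate false positive rate of a Bloom filter with k hash functions
// and m/n bits per stored string (assumes independent, uniform hashes).
function falsePositiveRate(int $k, float $bitsPerString): float
{
    return pow(1 - exp(-$k / $bitsPerString), $k);
}

$p = falsePositiveRate(10, 20.0);
printf("k=10, m/n=20: p = %.7f\n", $p); // prints p = 0.0000889

// Number of hash functions suggested by k = ln(2) * m/n for the same m/n.
$kOpt = (int) round(log(2) * 20);
printf("k=%d, m/n=20: p = %.7f\n", $kOpt, falsePositiveRate($kOpt, 20.0));
```

With m/n = 20 the formula suggests k = 14, which gives a somewhat lower rate than k = 10 at the cost of four more hash evaluations per lookup.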

Implementation:

```php
<?php
/***************************************************************************
 *
 * Copyright (c) Baidu.com, Inc. All Rights Reserved
 *
 **************************************************************************/

/**
 * @file bloomfilter.php
 * @author Rachel Zhang ([email protected])
 * @date 2015/07/24 18:48:57
 * @version $Revision $
 * @brief
 **/
class bloomfilter
{
    public $m;      // bitset size in bits
    public $n;      // expected number of strings to hash
    public $k;      // number of hash functions
    public $bitset; // bit array of size m

    // __construct replaces the PHP 4 style constructor, which PHP 8 no longer recognizes
    public function __construct($mInit, $nInit)
    {
        $this->m = $mInit;
        $this->n = $nInit;
        $this->k = (int) ceil(($this->m / $this->n) * log(2));
        echo "Number of functions: $this->k\n";
        $this->bitset = array_fill(0, $this->m, false);
    }

    // Return the k bit positions for $str
    public function hashcode($str)
    {
        $res = array();
        $seed = crc32($str);
        // Seed the generator so the same string always yields the same k positions
        mt_srand($seed);
        for ($i = 0; $i < $this->k; $i++) {
            $res[] = mt_rand(0, $this->m - 1);
        }
        return $res;
    }

    // Record $key by setting its k bits
    public function addkey($key)
    {
        foreach ($this->hashcode($key) as $codebit) {
            $this->bitset[$codebit] = true;
        }
    }

    // Return false if $key was definitely never added, true if it may have been
    public function existkey($key)
    {
        $code = $this->hashcode($key);
        foreach ($code as $codebit) {
            if ($this->bitset[$codebit] == false) {
                return false;
            }
        }
        return true;
    }
}

$bf = new bloomfilter(10, 2);
$str_add1 = "Test1";
$str_add2 = "Test2";
$str_notadd3 = "Test3";
// var_dump($bf->hashcode($str_add1));
$bf->addkey($str_add1);
$bf->addkey($str_add2);
var_dump($bf->existkey($str_add1));
var_dump($bf->existkey($str_add2));
var_dump($bf->existkey($str_notadd3));
?>
```

Copyright notice: This is the blogger's original article and may not be reproduced without the blogger's permission.
