<?php/*bloom filter algorithm to filter. This paper introduces the basic treatment of Bloom filter: The application of a batch of space to save 0 1 information, and then according to a number of hash function to determine the location of the element, if each hash function corresponding to the position of the value of all 1, indicating that this element exists. Conversely, if 0, the value of the corresponding position is set to 1.
Because different elements may have the same hash value, that is, the same location may have saved more than one element of information, resulting in a certain rate of miscalculation. If the application space is too small, with the increase in the number of elements, 1 will be more and more opportunities for various elements conflict, resulting in more and more misjudgment rate.
In addition, the selection and number of hash functions should be balanced, multiple hash functions can provide the accuracy of judgment, but will reduce the processing speed of the program, and the increase in the hash function requires more space to store location information.
The application of Bloom-filter. Bloom-filter is generally used to determine whether an element exists in a set of large data volumes. For example, a spam filter in a mail server. In the Search engine field, Bloom-filter is most commonly used for web spiders (Spider) URL filter, Web Spiders usually have a list of URLs, save will download and have downloaded the URL of the Web page, Web spider downloaded a Web page, from the page to the new URL, You need to determine if the URL already exists in the list.
At this point, the Bloom-filter algorithm is the best choice. For example, a public email provider like Yahoo,hotmail and Gmai always needs to filter spam from people who send spam (Spamer). One way to do that is to keep a record of e-mail addresses that send spam.
As those senders are constantly registering new addresses, the world says there are billions of spam addresses, and it requires a large number of Web servers to save them. The Prum filter was presented by Barton Prum in 1970. It is actually a very long binary vector and a series of random mapping functions.
We use the above example to explain how the work works. Assuming we store 100 million e-mail addresses, we first set up a 1,600,000,002-bit, or 200 million-byte vector, and then set all 1.6 billion bits to zero. For each email address X, we use eight different random number generator (F1,F2, ..., F8) to produce eight information fingerprints (F1, F2, ..., F8). Then use a random number generator G to map these eight information fingerprints to eight natural numbers in 1 to 1.6 billion G1, G2, ..., G8. Now we set the bits of these eight locations all to one. When we were on this 100 millionAn email address is processed in this way. A cloth-lung filter For these email addresses was built. (see below) Now, let's see how to detect a suspicious email address Y in the blacklist with a profiler filter. We use the same eight random number generator (F1, F2, ..., F8) to generate eight information fingerprints on this address s1,s2,..., S8, then the eight fingerprints to the Prum filter eight bits, respectively T1,t2,..., T8. If Y is in the blacklist, obviously, the T1,t2,.., T8 corresponding Eight binary must be one.
In this way, we can find any email address in the blacklist. The Prum filter never misses any suspicious address in the blacklist. However, it has one shortcoming. That is, it has a very small possibility to determine an e-mail address that is not in the blacklist as a blacklist, because it is possible that a good email address would have been set to match the eight bits. Fortunately, this possibility is very small. We call it a false probability.
In the above example, the probability of false recognition is below one out of 10,000. The advantage of the Prum filter is that it is fast and saves space. But there is a certain rate of false recognition.
A common remedy is to create a small whitelist that stores e-mail addresses that may not be misjudged.
*///Use PHP program to describe the above algorithm $set = array (1,2,3,4,5,6);
Determine whether 5 is in the $set $bloomFiter = Array (0,0,0,0,0,0,0,0,0,0); By using an algorithm to change the $bloomfiter array representation set, we use a simple algorithm that maps the corresponding value in the set to the position in bloom to 1//The algorithm follows foreach ($set as $key) {$bloomFite
r[$key] = 1;
} var_dump ($bloomFiter);
At this point $bloomFiter = Array (1,1,1,1,1,1);
Determines if ($bloomFiter [9] ==1) {echo ' in Set ') in the set;
}else{Echo ' is not in set '; //The above is just a simple example, in fact hashing algorithms require several, but on the other hand, if the number of hash functions is small, then 0 of the bit array is more class Bloom_filter {function __construct ($hasH_func_num=1, $space _group_num=1) {$max _length = POW (2, 25);
$binary = Pack (' C ', 0);
1 bytes occupy 8-bit $this->one_num = 8;
Default 32m*1 $this->space_group_num = $space _group_num;
$this->hash_space_assoc = Array (); Allocate space for ($i =0 $i < $this->space_group_num; $i + +) {$this->hash_space_assoc[$i] = str_repeat ($binary, $max _le
Ngth);
$this->pow_array = Array (0 => 1, 1 => 2, 2 => 4, 3 => 8, 4 =>, 5 => 32,
6 =>, 7 => 128,);
$this->chr_array = Array ();
$this->ord_array = Array ();
For ($i =0 $i <256; $i + +) {$CHR = Chr ($i);
$this->chr_array[$i] = $CHR;
$this->ord_array[$CHR] = $i;
$this->hash_func_pos = Array (0 => array (0, 7, 1), 1 => Array (7, 7, 1), 2 => Array (14, 7, 1),
3 => Array (7, 1), 4 => Array (7, 1), 5 => Array (, 7, 1), 6 => Array (17, 7, 1);
$this->write_num = 0;
$this->ext_num = 0; IF (! $hash _func_num) {$this->hash_func_num = count ($this->hash_func_pos);
else{$this->hash_func_num = $hash _func_num; } function Add ($key) {$hash _bit_set_num = 0;//discrete key $hash _basic = SHA1 ($key);//intercept the first 4 digits, then hexadecimal converts to decimal $hash _s
Pace = Hexdec (substr ($hash _basic, 0, 4));
modulo $hash _space = $hash _space% $this->space_group_num; For ($hash _i=0 $hash _i< $this->hash_func_num; $hash _i++) {$hash = Hexdec (substr ($hash _basic, $this->hash_
func_pos[$hash _i][0], $this->hash_func_pos[$hash _i][1]));
$bit _pos = $hash >> 3;
$max = $this->ord_array[$this->hash_space_assoc[$hash _space][$bit _pos]];
$num = $hash-$bit _pos * $this->one_num;
$bit _pos_value = ($max >> $num) & 0x01;
if (! $bit _pos_value) {$max = $max | $this->pow_array[$num];
$this->hash_space_assoc[$hash _space][$bit _pos] = $this->chr_array[$max];
$this->write_num++;
} else{$hash _bit_set_num++; } if ($hash_bit_set_num = = $this->hash_func_num) {$this->ext_num++;
return true;
return false;
function Get_stat () {return array (' Ext_num ' => $this->ext_num, ' write_num ' => $this->write_num,
);
}//test//Fetch 6 hash values, currently is up to 7 $hash _func_num = 6;
Allocate 1 storage space, each space is 32M, the more space is theoretically the greater the misjudgment rate is lower, note the memory limit that can be used in php.ini $space _group_num = 1;
$BF = new Bloom_filter ($hash _func_num, $space _group_num);
$list = Array (' HTTP://TEST/1 ', ' http://test/2 ', ' http://test/3 ', ' http://test/4 ', ' http://test/5 ', ' HTTP://TEST/6 ',
' Http://test/1 ', ' http://test/2 ',);
foreach ($list as $k => $v) {if ($BF->add ($v)) {echo $v, "\ n";
} print_r ($BF->get_stat ());