Implementing the Bloom filter algorithm in PHP

Source: Internet
Author: User


<?php
/*
Bloom filter algorithm for deduplication filtering.

The basic idea of a Bloom filter: allocate a block of space that stores only 0/1 information, then use a set of hash functions to determine the positions that belong to an element. If the bit at every position given by the hash functions is 1, the element is considered to exist. Otherwise, if any of them is 0, set the bits at those positions to 1.

Because different elements can produce the same hash value, a single position may carry information about more than one element, which leads to a certain false-positive rate. If the allocated space is too small, more and more bits become 1 as elements are added, each new element is more likely to collide, and the false-positive rate grows. The choice and number of hash functions must also be balanced: more hash functions can improve accuracy, but they slow the program down, and each additional hash function needs more space to store position information.

Applications of the Bloom filter

A Bloom filter is typically used to decide whether an element exists in a very large data set, for example in the spam filter of a mail server. In search engines, Bloom filters are most often used for URL filtering in web spiders. A spider usually keeps a list of the URLs it is going to download and has already downloaded. After downloading a page, it extracts new URLs from that page and has to decide whether each URL is already in the list. A Bloom filter is a good fit for this.

As another example, a public e-mail provider such as Yahoo, Hotmail or Gmail always needs to filter out mail from spammers. One way to do this is to keep a record of the e-mail addresses that have sent spam.
Since those senders keep registering new addresses, there are billions of spam addresses worldwide, and storing them all takes a large number of servers.

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions. We use the example above to illustrate how it works.

Suppose we want to store 100 million e-mail addresses. We first allocate a vector of 1.6 billion bits, that is, 200 million bytes, and clear all 1.6 billion bits to zero. For each e-mail address X, we use eight different random number generators (F1, F2, ..., F8) to produce eight information fingerprints (f1, f2, ..., f8). A random number generator G then maps these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion. We set the bits at all eight of these positions to one. Once all 100 million e-mail addresses have been processed this way, the filter for these addresses is built.

Now let's look at how to use the filter to check whether a suspicious e-mail address Y is in the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to produce eight fingerprints s1, s2, ..., s8 for this address, and map these eight fingerprints to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is in the blacklist, the eight bits t1, t2, ..., t8 must all be one. In this way, any e-mail address that is in the blacklist is detected reliably.

The Bloom filter never misses a suspicious address that is in the blacklist. It does have one shortcoming: there is a very small chance that an address that is not in the blacklist is judged to be in it, because the eight bits that a good address maps to may all happen to have been set to one by other addresses. Fortunately, this is unlikely. We call it the false-positive probability.
In the example above, the false-positive probability is quoted as below one in ten thousand. (The standard estimate (1 - e^(-kn/m))^k with n = 100 million, m = 1.6 billion and k = 8 gives roughly 0.06%, so that figure is on the optimistic side.)

The advantages of a Bloom filter are that it is fast and saves space, at the cost of a certain false-positive rate. A common remedy is to keep a small whitelist of e-mail addresses that are likely to be misjudged.
*/

// A PHP program that illustrates the algorithm above.
$set = array(1, 2, 3, 4, 5, 6);
// We want to test whether a value is in $set.
$bloomFiter = array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
// Some algorithm makes the $bloomFiter array represent the set. Here we use a
// trivial one: each value in the set turns the slot at the same index into 1.
foreach ($set as $key) {
    $bloomFiter[$key] = 1;
}
var_dump($bloomFiter);
// At this point $bloomFiter is array(0, 1, 1, 1, 1, 1, 1, 0, 0, 0).
// Check membership: 9 was never added, so this prints 'not in set'.
if ($bloomFiter[9] == 1) {
    echo 'in set';
} else {
    echo 'not in set';
}

// The above is only a toy example. In practice several hash functions are
// needed; on the other hand, with fewer hash functions more bits of the
// array remain 0.
class Bloom_filter {
    public $one_num;
    public $space_group_num;
    public $hash_space_assoc;
    public $pow_array;
    public $chr_array;
    public $ord_array;
    public $hash_func_pos;
    public $hash_func_num;
    public $write_num;
    public $ext_num;

    function __construct($hash_func_num = 1, $space_group_num = 1) {
        $max_length = pow(2, 25);
        $binary = pack('C', 0);
        // 1 byte holds 8 bits
        $this->one_num = 8;
        // default: 32 MB * 1
        $this->space_group_num = $space_group_num;
        $this->hash_space_assoc = array();
        // allocate space
        for ($i = 0; $i < $this->space_group_num; $i++) {
            $this->hash_space_assoc[$i] = str_repeat($binary, $max_length);
        }
        // bit masks for bit 0 .. bit 7 of a byte
        $this->pow_array = array(
            0 => 1, 1 => 2, 2 => 4, 3 => 8, 4 => 16, 5 => 32, 6 => 64, 7 => 128,
        );
        // lookup tables between byte values and characters
        $this->chr_array = array();
        $this->ord_array = array();
        for ($i = 0; $i < 256; $i++) {
            $chr = chr($i);
            $this->chr_array[$i] = $chr;
            $this->ord_array[$chr] = $i;
        }
        // Each entry is (offset, length, 1) into the 40-character SHA-1 hex digest.
        // NOTE: the offsets of entries 2 to 5 were lost in the source text. The
        // values 14, 21, 28 and 33 are an assumption that continues the 7-character
        // stride; any offset from 0 to 33 yields a valid 7-character slice.
        $this->hash_func_pos = array(
            0 => array(0, 7, 1),
            1 => array(7, 7, 1),
            2 => array(14, 7, 1),
            3 => array(21, 7, 1),
            4 => array(28, 7, 1),
            5 => array(33, 7, 1),
            6 => array(17, 7, 1),
        );
        $this->write_num = 0;
        $this->ext_num = 0;
        if (!$hash_func_num) {
            $this->hash_func_num = count($this->hash_func_pos);
        } else {
            $this->hash_func_num = $hash_func_num;
        }
    }

    // Adds $key; returns true if every bit was already set (probable duplicate).
    function add($key) {
        $hash_bit_set_num = 0;
        // hash the key
        $hash_basic = sha1($key);
        // take the first 4 hex characters and convert them to decimal
        $hash_space = hexdec(substr($hash_basic, 0, 4));
        // modulo: pick the storage group
        $hash_space = $hash_space % $this->space_group_num;
        for ($hash_i = 0; $hash_i < $this->hash_func_num; $hash_i++) {
            $hash = hexdec(substr($hash_basic, $this->hash_func_pos[$hash_i][0], $this->hash_func_pos[$hash_i][1]));
            // byte index, current byte value, and bit index within that byte
            $bit_pos = $hash >> 3;
            $max = $this->ord_array[$this->hash_space_assoc[$hash_space][$bit_pos]];
            $num = $hash - $bit_pos * $this->one_num;
            $bit_pos_value = ($max >> $num) & 0x01;
            if (!$bit_pos_value) {
                $max = $max | $this->pow_array[$num];
                $this->hash_space_assoc[$hash_space][$bit_pos] = $this->chr_array[$max];
                $this->write_num++;
            } else {
                $hash_bit_set_num++;
            }
        }
        if ($hash_bit_set_num == $this->hash_func_num) {
            $this->ext_num++;
            return true;
        }
        return false;
    }

    function get_stat() {
        return array(
            'ext_num' => $this->ext_num,
            'write_num' => $this->write_num,
        );
    }
}

// Test
// Use 6 hash functions; at most 7 are currently defined.
$hash_func_num = 6;
// Allocate 1 storage group of 32 MB. In theory, the larger the space, the lower
// the false-positive rate. Mind the memory_limit setting in php.ini.
$space_group_num = 1;
$bf = new Bloom_filter($hash_func_num, $space_group_num);
$list = array(
    'http://test/1',
    'http://test/2',
    'http://test/3',
    'http://test/4',
    'http://test/5',
    'http://test/6',
    'http://test/1',
    'http://test/2',
);
foreach ($list as $k => $v) {
    // add() returns true when the URL was (probably) seen before
    if ($bf->add($v)) {
        echo $v, "\n";
    }
}
print_r($bf->get_stat());
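
The class above only exposes an add operation that doubles as a duplicate check. The lookup step described in the e-mail example (test whether all positions for Y are already one, without changing the filter) can be sketched separately. The following is a minimal sketch, not the article's implementation: the class name `SimpleBloomFilter`, the function `bloom_false_positive_rate`, and the choice of deriving each position from a salted SHA-1 digest are all assumptions made for illustration, and the code assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper class for illustration; not part of the article's code.
class SimpleBloomFilter
{
    private string $bits; // bit array packed into a byte string
    private int $m;       // number of bits
    private int $k;       // number of hash functions

    public function __construct(int $m, int $k)
    {
        $this->m = $m;
        $this->k = $k;
        $this->bits = str_repeat("\0", intdiv($m + 7, 8));
    }

    // Derive the k bit positions for a key by salting SHA-1 with the index
    // (an assumed scheme, chosen only because it is short).
    private function positions(string $key): array
    {
        $positions = [];
        for ($i = 0; $i < $this->k; $i++) {
            // 7 hex characters = 28 bits, which always fits in a PHP int.
            $positions[] = hexdec(substr(sha1($i . ':' . $key), 0, 7)) % $this->m;
        }
        return $positions;
    }

    public function add(string $key): void
    {
        foreach ($this->positions($key) as $pos) {
            $byte = $pos >> 3;
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($pos & 7)));
        }
    }

    // false means "definitely not added"; true means "probably added".
    public function contains(string $key): bool
    {
        foreach ($this->positions($key) as $pos) {
            if ((ord($this->bits[$pos >> 3]) & (1 << ($pos & 7))) === 0) {
                return false;
            }
        }
        return true;
    }
}

// Standard estimate of the false-positive rate for n keys, m bits, k hashes:
// (1 - e^(-k*n/m))^k
function bloom_false_positive_rate(int $n, int $m, int $k): float
{
    return (1 - exp(-$k * $n / $m)) ** $k;
}

$filter = new SimpleBloomFilter(1 << 16, 6);
foreach (['http://test/1', 'http://test/2', 'http://test/3'] as $url) {
    $filter->add($url);
}
var_dump($filter->contains('http://test/2')); // bool(true): added keys are never missed

// Parameters of the e-mail example: 100 million keys, 1.6 billion bits, 8 hashes.
printf("%.6f\n", bloom_false_positive_rate(100000000, 1600000000, 8));
```

Keeping `contains` read-only separates "have I seen this?" from "remember this", which is convenient when the filter is queried far more often than it is updated. The estimate function is useful for sizing: raising `$m` relative to `$n` lowers the rate, which matches the remark above that more space means fewer false positives.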

