/*
Bloom filter algorithm for deduplication.

Basic idea of the Bloom filter: allocate a block of space that stores 0/1 flags (a bit array), then use a set of hash functions to map each element to positions in that array. If the bit at every position an element maps to is 1, the element is considered present. Otherwise the element is new, and the bits at its positions are set to 1.

Because different elements can have the same hash value, one position may hold information about more than one element, which produces a certain false-positive rate. If the allocated space is too small, more and more bits become 1 as elements are added, each new element is more likely to collide, and the false-positive rate grows. The choice and number of hash functions also has to be balanced: more hash functions can improve the accuracy of the test, but they slow the program down and need more space to store the position information.

Applications of the Bloom filter.

A Bloom filter is typically used to test whether an element exists in a very large data set, for example in the spam filter of a mail server. In search engines it is most commonly used for URL deduplication in web crawlers (spiders). A crawler usually keeps a list of the URLs that are queued for download or already downloaded. After downloading a page and extracting new URLs from it, the crawler has to check whether each URL is already in the list. A Bloom filter is a good fit for this check.

As another example, public email providers such as Yahoo, Hotmail, and Gmail always need to filter out mail from people who send spam (spammers). One way to do this is to keep a record of the email addresses that send spam.
Since those senders constantly register new addresses, there are billions of spam addresses worldwide, and storing them all would take a large number of servers.

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector plus a series of random mapping functions. We use the example above to illustrate how it works.

Suppose we store 100 million email addresses. We first set up a vector of 1.6 billion bits, that is, 200 million bytes, and clear all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to produce eight information fingerprints (f1, f2, ..., f8). A random number generator G then maps these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion. We set the bits at all eight positions to one. Once this has been done for all 100 million addresses, the Bloom filter for these addresses is built.

Now consider how to use the filter to check whether a suspicious email address Y is on the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to produce eight fingerprints s1, s2, ..., s8 for this address, and map them to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is on the blacklist, the bits at t1, t2, ..., t8 must all be one. This way, every address that is on the blacklist is detected.

The Bloom filter never misses a suspicious address that is on the blacklist. It has one shortcoming, however: with small probability, an address that is not on the blacklist is judged to be on it, because a legitimate address may happen to map to eight bits that have all been set to one. Fortunately, this is unlikely. We call this the false-positive probability.
In the example above the false-positive probability is small: the usual estimate (1 - e^(-kn/m))^k with n = 100 million addresses, m = 1.6 billion bits, and k = 8 hash functions gives roughly 0.06%, or about 6 in 10,000.

The advantage of the Bloom filter is that it is fast and saves space, at the cost of a certain false-positive rate. A common remedy is to keep a small whitelist of the email addresses that might be misjudged.
*/

// Use a PHP program to describe the algorithm above
$set = array(1, 2, 3, 4, 5, 6);
// Goal: test whether a value (such as 5) is in $set
$bloomFiter = array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
// Turn the $bloomFiter array into a representation of the set. Here we use
// a very simple rule: each value in the set sets the position with the
// same index in the bloom array to 1.
foreach ($set as $key) {
    $bloomFiter[$key] = 1;
}
var_dump($bloomFiter);
// At this point $bloomFiter = array(0, 1, 1, 1, 1, 1, 1, 0, 0, 0)
// Test membership: 9 is not in the set, so this prints 'is not in set'
if ($bloomFiter[9] == 1) {
    echo 'in set';
} else {
    echo 'is not in set';
}

// The above is only a simple example. In practice several hash functions
// are needed; on the other hand, with fewer hash functions more bits of
// the array stay 0.
class Bloom_filter {
    // Declared explicitly so that no dynamic properties are created
    public $one_num, $space_group_num, $hash_space_assoc, $pow_array;
    public $chr_array, $ord_array, $hash_func_pos, $hash_func_num;
    public $write_num, $ext_num;

    function __construct($hash_func_num = 1, $space_group_num = 1) {
        // Bytes per storage space: 2^25 bytes = 32M = 2^28 bits
        $max_length = pow(2, 25);
        $binary = pack('C', 0);
        // 1 byte holds 8 bits
        $this->one_num = 8;
        // Default: 32M * 1 space
        $this->space_group_num = $space_group_num;
        $this->hash_space_assoc = array();
        // Allocate space
        for ($i = 0; $i < $this->space_group_num; $i++) {
            $this->hash_space_assoc[$i] = str_repeat($binary, $max_length);
        }
        // Bit masks for bit positions 0..7 within a byte
        $this->pow_array = array(
            0 => 1,
            1 => 2,
            2 => 4,
            3 => 8,
            4 => 16,
            5 => 32,
            6 => 64,
            7 => 128,
        );
        // Lookup tables: byte value <-> character
        $this->chr_array = array();
        $this->ord_array = array();
        for ($i = 0; $i < 256; $i++) {
            $chr = chr($i);
            $this->chr_array[$i] = $chr;
            $this->ord_array[$chr] = $i;
        }
        // Each "hash function" is a slice of the 40-character SHA-1 hex
        // digest: array(offset, length, ...). 7 hex characters = 28 bits,
        // which matches the 2^28 bits of one storage space.
        $this->hash_func_pos = array(
            0 => array(0, 7, 1),
            1 => array(7, 7, 1),
            2 => array(14, 7, 1),
            3 => array(21, 7, 1),
            4 => array(28, 7, 1),
            5 => array(33, 7, 1),
            6 => array(17, 7, 1),
        );
        $this->write_num = 0;
        $this->ext_num = 0;
        if (!$hash_func_num) {
            $this->hash_func_num = count($this->hash_func_pos);
        } else {
            $this->hash_func_num = $hash_func_num;
        }
    }

    // Adds $key to the filter. Returns true if all of its bits were
    // already set (the key probably exists), false otherwise.
    function add($key) {
        $hash_bit_set_num = 0;
        // Hash the key
        $hash_basic = sha1($key);
        // Take the first 4 hex characters and convert them to decimal
        $hash_space = hexdec(substr($hash_basic, 0, 4));
        // Take the modulus to pick a storage space
        $hash_space = $hash_space % $this->space_group_num;
        for ($hash_i = 0; $hash_i < $this->hash_func_num; $hash_i++) {
            $hash = hexdec(substr($hash_basic, $this->hash_func_pos[$hash_i][0], $this->hash_func_pos[$hash_i][1]));
            // Byte index, current byte value, and bit index within the byte
            $bit_pos = $hash >> 3;
            $max = $this->ord_array[$this->hash_space_assoc[$hash_space][$bit_pos]];
            $num = $hash - $bit_pos * $this->one_num;
            $bit_pos_value = ($max >> $num) & 0x01;
            if (!$bit_pos_value) {
                $max = $max | $this->pow_array[$num];
                $this->hash_space_assoc[$hash_space][$bit_pos] = $this->chr_array[$max];
                $this->write_num++;
            } else {
                $hash_bit_set_num++;
            }
        }
        if ($hash_bit_set_num == $this->hash_func_num) {
            $this->ext_num++;
            return true;
        }
        return false;
    }

    function get_stat() {
        return array(
            'ext_num' => $this->ext_num,
            'write_num' => $this->write_num,
        );
    }
}

// Test
// Use 6 hashes; at most 7 are currently available
$hash_func_num = 6;
// Allocate 1 storage space of 32M. In theory, more space means a lower
// false-positive rate; mind the memory limit configured in php.ini.
$space_group_num = 1;
$bf = new Bloom_filter($hash_func_num, $space_group_num);
$list = array(
    'http://test/1',
    'http://test/2',
    'http://test/3',
    'http://test/4',
    'http://test/5',
    'http://test/6',
    'http://test/1',
    'http://test/2',
);
// Print every URL that the filter reports as already seen
foreach ($list as $k => $v) {
    if ($bf->add($v)) {
        echo $v, "\n";
    }
}
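
The trade-off described in the opening comment (space, number of elements, number of hash functions) can be checked numerically. The following is a minimal sketch, not part of the original listing: the function names `bloom_false_positive_rate` and `bloom_optimal_hashes` are hypothetical, and the formulas are the standard approximations p = (1 - e^(-kn/m))^k and k = (m/n) ln 2, which assume independent, uniformly distributed hash functions. The figure of 10 million URLs for the class above is also only an assumed workload.

```php
<?php
// Hypothetical helpers for illustration; they are not part of the
// Bloom_filter class in the listing above.

// Approximate false-positive probability for n stored elements,
// m bits, and k hash functions: (1 - e^(-k*n/m))^k.
function bloom_false_positive_rate(float $n, float $m, int $k): float
{
    return pow(1.0 - exp(-$k * $n / $m), $k);
}

// Number of hash functions that minimizes the estimate above:
// k = (m / n) * ln 2, rounded to the nearest integer (at least 1).
function bloom_optimal_hashes(float $n, float $m): int
{
    return max(1, (int) round(($m / $n) * log(2)));
}

// Mail example from the text: 100 million addresses, 1.6 billion bits,
// 8 hash functions.
$mail_rate = bloom_false_positive_rate(1e8, 1.6e9, 8);
printf("mail example, k = 8: %.6f\n", $mail_rate);

// Same space and element count with the estimated optimal k.
$mail_k = bloom_optimal_hashes(1e8, 1.6e9);
printf("mail example, optimal k = %d: %.6f\n",
    $mail_k, bloom_false_positive_rate(1e8, 1.6e9, $mail_k));

// Listing above: one 32M space = 2^28 bits, 6 hash slices.
// 10 million URLs is an assumed workload, not a figure from the text.
$url_rate = bloom_false_positive_rate(1e7, pow(2, 28), 6);
printf("URL example, k = 6: %.8f\n", $url_rate);
```

For the mail example the estimate comes out at roughly 0.06%. Raising the number of hash functions toward the estimated optimum lowers it further, at the cost of more hashing work per lookup, which is the balance the opening comment describes.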