<?php
/*
The Bloom filter algorithm is used for deduplication. This section describes its
basic processing logic: allocate a block of space to store 0/1 bits, then use a
set of hash functions to determine the positions that correspond to an element.
If the bits at all of those positions are 1, the element is considered to exist.
Otherwise, if any of them is 0, the element is new, and the bits at its positions
are set to 1.

Different elements may have the same hash value, that is, the information of
several elements may be stored in the same position, which results in a certain
false positive rate. If the allocated space is too small, the chance of
collisions grows as the number of elements increases, leading to a higher false
positive rate. The choice and number of hash functions must also be balanced:
more hash functions can make the judgment more accurate, but they slow the
program down, and each additional hash function needs more space to store
position information.

Applications of the Bloom filter

A Bloom filter is generally used to determine whether an element exists in a
very large collection, for example in the spam filter of a mail server. In the
search engine field, it is most commonly used for URL filtering by spiders. A
web spider usually keeps a list of the URLs of pages that are waiting to be
downloaded or have already been downloaded. After a page is downloaded and a new
URL is extracted from it, the spider needs to determine whether that URL already
exists in the list. A Bloom filter is well suited to this case.

For example, a public e-mail provider such as Yahoo, Hotmail, or Gmail always
needs to filter junk mail from spammers. One way is to record the e-mail
addresses that send spam.
Since senders keep registering new addresses, there are at least several billion
spam addresses worldwide, so storing them all would require a large number of
network servers.

The Bloom filter was proposed by Burton Bloom in 1970. It is actually a very
long binary vector together with a series of random mapping functions. We use
the example above to show how it works.

Assume we store 100 million e-mail addresses. We first create a vector of 1.6
billion bits, that is, 200 million bytes, and set all 1.6 billion bits to zero.
For each e-mail address X, we use eight different random number generators
(F1, F2, ..., F8) to generate eight information fingerprints (f1, f2, ..., f8).
A random number generator G then maps these eight fingerprints to eight natural
numbers g1, g2, ..., g8 in the range 1 to 1.6 billion. We now set the bits at
these eight positions to one. After all 100 million addresses have been
processed in this way, a Bloom filter for these addresses is built.

Now let's see how to use the Bloom filter to check whether a suspicious e-mail
address Y is in the blacklist. We use the same eight random number generators
(F1, F2, ..., F8) to generate eight fingerprints for this address, s1, s2, ...,
s8, and then map them to eight bits of the Bloom filter, t1, t2, ..., t8. If Y
is in the blacklist, the eight bits at t1, t2, ..., t8 must all be one. In this
way, any e-mail address in the blacklist is always found.

The Bloom filter never misses an address that is in the blacklist. However, it
has one drawback: there is a very small chance that it judges an address that is
not in the blacklist to be in it, because a good address may happen to
correspond to eight bits that have all been set to one. Fortunately, this
possibility is very small. We call it the false recognition probability.
In the example above, the false recognition probability is less than one
thousandth.

The advantage of the Bloom filter is that it is fast and saves space; the cost
is a certain false recognition rate. A common remedy is to keep a small
whitelist of the mail addresses that may be misjudged.
*/

// Describe the above algorithm with a PHP program
$set = array(1, 2, 3, 4, 5, 6);

$bloomFiter = array(0, 0, 0, 0, 0, 0, 0, 0);

// Use an algorithm to make $bloomFiter represent the set. Here we use a trivial
// one: each value in the set is used directly as the position to set to 1.
foreach ($set as $key) {
    $bloomFiter[$key] = 1;
}
var_dump($bloomFiter);
// $bloomFiter is now array(0, 1, 1, 1, 1, 1, 1, 0)

// Judge whether a value is in the set. 9 was never added and lies outside the
// 8-slot array, so the isset() guard avoids reading an undefined offset.
if (isset($bloomFiter[9]) && $bloomFiter[9] == 1) {
    echo 'in set';
} else {
    echo 'not in set';
}

// The above is just a simple example. In practice several hash functions are
// used; on the other hand, the fewer hash functions there are, the more 0
// values remain in the array.
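
// A minimal sketch, not part of the original listing, of the same toy idea with
// several hash functions instead of using the value itself as the position.
// The function names (toy_bloom_positions, toy_bloom_add,
// toy_bloom_maybe_contains) and the salted-md5 scheme are assumptions chosen
// for illustration; any k reasonably independent hash functions would do.
function toy_bloom_positions($item, $k, $m)
{
    $positions = array();
    for ($i = 0; $i < $k; $i++) {
        // Prefixing the input with $i stands in for k different hash functions.
        // Seven hex digits (28 bits) always fit in a PHP integer.
        $positions[] = hexdec(substr(md5($i . ':' . $item), 0, 7)) % $m;
    }
    return $positions;
}

function toy_bloom_add(array &$bits, $item, $k)
{
    foreach (toy_bloom_positions($item, $k, count($bits)) as $pos) {
        $bits[$pos] = 1;
    }
}

function toy_bloom_maybe_contains(array $bits, $item, $k)
{
    foreach (toy_bloom_positions($item, $k, count($bits)) as $pos) {
        if ($bits[$pos] == 0) {
            return false; // at least one bit is 0: the item was never added
        }
    }
    return true; // all bits are 1: probably added, possibly a false positive
}

$toyBits = array_fill(0, 64, 0);
toy_bloom_add($toyBits, 'http://test/1', 3);
var_dump(toy_bloom_maybe_contains($toyBits, 'http://test/1', 3)); // bool(true)
// An item that was never added usually yields false, but may yield true when
// its positions collide with bits set by other items.
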
class bloom_filter
{
    public $one_num;          // bits per byte
    public $space_group_num;  // number of 32 MB buckets
    public $hash_space_assoc; // bucket index => packed bit string
    public $pow_array;        // bit index => bit mask
    public $chr_array;        // byte value => character
    public $ord_array;        // character => byte value
    public $hash_func_pos;    // per hash function: array(offset, length, ...) into the SHA-1 hex digest
    public $hash_func_num;    // number of hash functions in use
    public $write_num;        // bits flipped from 0 to 1
    public $ext_num;          // keys reported as already present

    function __construct($hash_func_num = 1, $space_group_num = 1)
    {
        // 2^25 bytes = 32 MB = 2^28 bits per bucket
        $max_length = pow(2, 25);
        $binary = pack('C', 0);

        // 1 byte holds 8 bits
        $this->one_num = 8;

        // 32 MB * 1 by default
        $this->space_group_num = $space_group_num;
        $this->hash_space_assoc = array();

        // Allocate space
        for ($i = 0; $i < $this->space_group_num; $i++) {
            $this->hash_space_assoc[$i] = str_repeat($binary, $max_length);
        }

        $this->pow_array = array(
            0 => 1,
            1 => 2,
            2 => 4,
            3 => 8,
            4 => 16,
            5 => 32,
            6 => 64,
            7 => 128,
        );

        // Lookup tables between byte values and characters
        $this->chr_array = array();
        $this->ord_array = array();
        for ($i = 0; $i < 256; $i++) {
            $chr = chr($i);
            $this->chr_array[$i] = $chr;
            $this->ord_array[$chr] = $i;
        }

        // Each entry selects 7 hex digits (28 bits) of the 40-digit SHA-1 digest.
        // The third element is not used in this listing.
        $this->hash_func_pos = array(
            0 => array(0, 7, 1),
            1 => array(7, 7, 1),
            2 => array(14, 7, 1),
            3 => array(21, 7, 1),
            4 => array(28, 7, 1),
            5 => array(33, 7, 1),
            6 => array(17, 7, 1),
        );

        $this->write_num = 0;
        $this->ext_num = 0;

        if (!$hash_func_num) {
            $this->hash_func_num = count($this->hash_func_pos);
        } else {
            $this->hash_func_num = $hash_func_num;
        }
    }

    // Adds $key. Returns true if every bit was already set, meaning the key
    // was (probably) added before; returns false if the key is new.
    function add($key)
    {
        $hash_bit_set_num = 0;

        // Hash the key
        $hash_basic = sha1($key);

        // Take the first 4 hex digits and convert them to decimal
        $hash_space = hexdec(substr($hash_basic, 0, 4));

        // Modulo selects the bucket
        $hash_space = $hash_space % $this->space_group_num;

        for ($hash_i = 0; $hash_i < $this->hash_func_num; $hash_i++) {
            $hash = hexdec(substr($hash_basic, $this->hash_func_pos[$hash_i][0], $this->hash_func_pos[$hash_i][1]));

            // Byte index within the bucket, then bit index within that byte
            $bit_pos = $hash >> 3;
            $max = $this->ord_array[$this->hash_space_assoc[$hash_space][$bit_pos]];
            $num = $hash - $bit_pos * $this->one_num;
            $bit_pos_value = ($max >> $num) & 0x01;

            if (!$bit_pos_value) {
                $max = $max | $this->pow_array[$num];
                $this->hash_space_assoc[$hash_space][$bit_pos] = $this->chr_array[$max];
                $this->write_num++;
            } else {
                $hash_bit_set_num++;
            }
        }

        if ($hash_bit_set_num == $this->hash_func_num) {
            $this->ext_num++;
            return true;
        }
        return false;
    }

    function get_stat()
    {
        return array(
            'ext_num' => $this->ext_num,
            'write_num' => $this->write_num,
        );
    }
}

// Test

// Use six hash values. Currently at most seven can be obtained.
$hash_func_num = 6;

// Allocate 1 bucket of 32 MB. In theory, the larger the space, the lower the
// false positive rate. Mind the memory limit configured in php.ini.
$space_group_num = 1;

$bf = new bloom_filter($hash_func_num, $space_group_num);

$list = array(
    'http://test/1',
    'http://test/2',
    'http://test/3',
    'http://test/4',
    'http://test/5',
    'http://test/6',
    'http://test/1',
    'http://test/2',
);

// Print every URL that the filter reports as already seen
foreach ($list as $k => $v) {
    if ($bf->add($v)) {
        echo $v, "\n";
    }
}
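
// A rough check, not part of the original listing, of the false recognition
// probability quoted in the header comment. It uses the standard approximation
// p = (1 - e^(-k*n/m))^k for n stored elements, m bits and k hash functions,
// which assumes the hash functions are independent and uniform. The function
// name bloom_false_positive_estimate is hypothetical.
function bloom_false_positive_estimate($n, $m, $k)
{
    // Probability that all k probed bits are already 1 for an element
    // that was never added.
    return pow(1 - exp(-$k * $n / $m), $k);
}

// The e-mail example: 100 million addresses, 1.6 billion bits, 8 fingerprints.
printf("%.4f\n", bloom_false_positive_estimate(1e8, 1.6e9, 8)); // prints 0.0006

// The configuration used in the test above: one 32 MB bucket (2^28 bits) and
// 6 hash values, here with an assumed load of 10 million URLs.
printf("%.6f\n", bloom_false_positive_estimate(1e7, pow(2, 28), 6));
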