How to Implement the Bloom Filter Algorithm in PHP

Source: Internet
Author: User
This article shows how to implement the Bloom filter algorithm in PHP. It gives the implementation code directly, with detailed comments in the code and a description of the algorithm itself.

<?php
/* The Bloom filter algorithm is used for deduplication.

The basic processing logic of a Bloom filter is as follows. Allocate a block of space to store 0/1 bits, then use a set of hash functions to determine the positions that correspond to an element. If the bit at every position given by the hash functions is 1, the element is considered to exist. Otherwise, if any of those bits is 0, the element is new, and the bits at its positions are set to 1. Different elements may produce the same hash values, so one position may hold information from several elements, which leads to a certain false positive rate.

If the allocated space is too small, then as the number of elements grows, collisions become more likely and the false positive rate rises. The choice and number of hash functions must also be well balanced. More hash functions can make the judgment more accurate, but they slow the program down, and each additional hash function needs more space to store position information.

Applications of the Bloom filter.

A Bloom filter is generally used to decide whether an element exists in a very large collection, for example in the spam filter of a mail server. In the search engine field, it is most commonly used for URL filtering by web spiders. A spider usually keeps a list of the URLs of pages that are waiting to be downloaded and pages that have already been downloaded. After it downloads a page and extracts a new URL from it, it needs to determine whether that URL is already in the list. A Bloom filter is a good fit for this check.

For example, a public email provider such as Yahoo, Hotmail, or Gmail always needs to filter out junk mail from spammers. One approach is to record the addresses that send spam. Because senders keep registering new addresses, there are at least several billion spamming addresses worldwide, and storing them all would require a large number of network servers.

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions. The example above is used below to illustrate how it works.

Suppose we want to store 100 million email addresses. We first create a vector of 1.6 billion binary digits (bits), that is, 200 million bytes, and set all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to produce eight information fingerprints (f1, f2, ..., f8). We then use a random number generator G to map these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion, and set the bits at these eight positions to one. Once all 100 million email addresses have been processed in this way, the Bloom filter for these addresses is built.

Now let's see how to use the Bloom filter to check whether a suspicious email address Y is on the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to produce eight fingerprints s1, s2, ..., s8 for this address, and then map them to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is on the blacklist, the eight bits at t1, t2, ..., t8 must all be one. In this way, any email address that is on the blacklist is reliably detected.

The Bloom filter never misses a suspicious address that is on the blacklist. It does have one drawback: there is a very small chance that it judges an address that is not on the blacklist to be on it, because the eight bits corresponding to a good address may all happen to have been set to one by other addresses. Fortunately, this is very unlikely. We call this the false recognition (false positive) probability. In the example above, it is less than one in a thousand.

The advantages of the Bloom filter are speed and space efficiency, at the cost of a certain false recognition rate. A common remedy is to keep a small whitelist of mail addresses that might be misjudged.

*/

// Describe the idea above with a simple PHP program

$set = array(1, 2, 3, 4, 5, 6);

// Bit array that will represent $set
$bloomFilter = array(0, 0, 0, 0, 0, 0, 0, 0);

// Use an algorithm to change the bits in $bloomFilter so that they represent the set.
// Here we use the simplest possible one: each value in the set is used as a position,
// and the bit at that position is set to 1.
foreach ($set as $key) {
    $bloomFilter[$key] = 1;
}

var_dump($bloomFilter);
// $bloomFilter is now array(0, 1, 1, 1, 1, 1, 1, 0)

// Check whether 5 (a member) and 9 (not a member) are in the set.
// isset() guards against positions outside the bit array.
foreach (array(5, 9) as $candidate) {
    if (isset($bloomFilter[$candidate]) && $bloomFilter[$candidate] == 1) {
        echo $candidate, ' in set', "\n";
    } else {
        echo $candidate, ' not in set', "\n";
    }
}

// The above is only a simple example. A real Bloom filter applies several hash functions
// to each element. On the other hand, the fewer hash functions are used, the more bits
// in the array stay 0.

class bloom_filter {
    // 1 byte holds 8 bits
    private $one_num = 8;
    private $space_group_num;
    private $hash_space_assoc = array();
    private $pow_array;
    private $chr_array = array();
    private $ord_array = array();
    private $hash_func_pos;
    private $hash_func_num;
    private $write_num = 0;
    private $ext_num = 0;

    function __construct($hash_func_num = 1, $space_group_num = 1) {
        // 2^25 bytes = 32 MB = 2^28 bits per bucket
        $max_length = pow(2, 25);
        $binary = pack('C', 0);
        // 32 MB * 1 by default
        $this->space_group_num = $space_group_num;
        // Allocate space: one zero-filled string per bucket
        for ($i = 0; $i < $this->space_group_num; $i++) {
            $this->hash_space_assoc[$i] = str_repeat($binary, $max_length);
        }
        // Bit masks for bit 0 to bit 7 of a byte
        $this->pow_array = array(
            0 => 1,
            1 => 2,
            2 => 4,
            3 => 8,
            4 => 16,
            5 => 32,
            6 => 64,
            7 => 128,
        );
        // Lookup tables: byte value => character, and character => byte value
        for ($i = 0; $i < 256; $i++) {
            $chr = chr($i);
            $this->chr_array[$i] = $chr;
            $this->ord_array[$chr] = $i;
        }
        // (offset, length, ...) of the slice of the 40-digit sha1 hex string used by each hash function.
        // 7 hex digits = 28 bits, which addresses exactly one 2^28-bit bucket.
        $this->hash_func_pos = array(
            0 => array(0, 7, 1),
            1 => array(7, 7, 1),
            2 => array(14, 7, 1),
            3 => array(21, 7, 1),
            4 => array(28, 7, 1),
            5 => array(33, 7, 1),
            6 => array(17, 7, 1),
        );
        if (!$hash_func_num) {
            $this->hash_func_num = count($this->hash_func_pos);
        } else {
            $this->hash_func_num = $hash_func_num;
        }
    }

    // Adds $key. Returns true if every bit was already set (the key probably exists), false otherwise.
    function add($key) {
        $hash_bit_set_num = 0;
        // Hash the key
        $hash_basic = sha1($key);
        // Take the first 4 hex digits and convert them to decimal
        $hash_space = hexdec(substr($hash_basic, 0, 4));
        // Modulo: choose the bucket
        $hash_space = $hash_space % $this->space_group_num;
        for ($hash_i = 0; $hash_i < $this->hash_func_num; $hash_i++) {
            $hash = hexdec(substr($hash_basic, $this->hash_func_pos[$hash_i][0], $this->hash_func_pos[$hash_i][1]));
            // Index of the byte that holds this bit
            $bit_pos = $hash >> 3;
            $max = $this->ord_array[$this->hash_space_assoc[$hash_space][$bit_pos]];
            // Index of the bit inside that byte (0 to 7)
            $num = $hash - $bit_pos * $this->one_num;
            $bit_pos_value = ($max >> $num) & 0x01;
            if (!$bit_pos_value) {
                $max = $max | $this->pow_array[$num];
                $this->hash_space_assoc[$hash_space][$bit_pos] = $this->chr_array[$max];
                $this->write_num++;
            } else {
                $hash_bit_set_num++;
            }
        }
        if ($hash_bit_set_num == $this->hash_func_num) {
            $this->ext_num++;
            return true;
        }
        return false;
    }

    function get_stat() {
        return array(
            'ext_num' => $this->ext_num,
            'write_num' => $this->write_num,
        );
    }
}

// Test

// Use six hash values. Currently, at most seven hash values are available.
$hash_func_num = 6;

// Allocate 1 bucket of 32 MB. In theory, the larger the space, the lower the false positive rate.
// Pay attention to the memory_limit setting in php.ini.
$space_group_num = 1;

$bf = new bloom_filter($hash_func_num, $space_group_num);

$list = array(
    'http://test/1',
    'http://test/2',
    'http://test/3',
    'http://test/4',
    'http://test/5',
    'http://test/6',
    'http://test/1',
    'http://test/2',
);

foreach ($list as $k => $v) {
    // add() returns true when the value was (probably) added before, so only duplicates are printed
    if ($bf->add($v)) {
        echo $v, "\n";
    }
}
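The add() method of the class above both tests and sets bits in a single call, so it cannot answer "is this key present?" without also inserting the key. As a minimal sketch of the same bit addressing with a separate read-only lookup, consider the hypothetical class below. The name TinyBloom, its methods, and its default sizes are assumptions made for illustration and are not part of the original code. It reuses the idea of slicing a sha1() hex digest into 7-digit pieces, which limits it to five hash functions.

```php
<?php
// Hypothetical sketch, not the original implementation:
// a small Bloom filter with separate add() and contains().
class TinyBloom
{
    private $bits;       // packed bit array, 8 bits per byte
    private $numBits;
    private $numHashes;  // at most 5, because sha1() yields 40 hex digits

    public function __construct($numBits = 1024, $numHashes = 4)
    {
        $this->numBits = $numBits;
        $this->numHashes = $numHashes;
        $this->bits = str_repeat("\0", (int) ceil($numBits / 8));
    }

    // Derive the bit positions for a key from 7-hex-digit slices of its sha1() digest.
    private function positions($key)
    {
        $digest = sha1($key);
        $positions = array();
        for ($i = 0; $i < $this->numHashes; $i++) {
            $positions[] = hexdec(substr($digest, $i * 7, 7)) % $this->numBits;
        }
        return $positions;
    }

    public function add($key)
    {
        foreach ($this->positions($key) as $pos) {
            $byte = $pos >> 3;        // index of the byte holding the bit
            $mask = 1 << ($pos & 7);  // mask for the bit inside that byte
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | $mask);
        }
    }

    public function contains($key)
    {
        foreach ($this->positions($key) as $pos) {
            if ((ord($this->bits[$pos >> 3]) & (1 << ($pos & 7))) === 0) {
                return false; // a zero bit proves the key was never added
            }
        }
        return true; // all bits set: the key is present, or this is a false positive
    }
}

$filter = new TinyBloom();
$filter->add('http://test/1');
var_dump($filter->contains('http://test/1')); // bool(true)
```

A lookup that returns false is always correct, while a lookup that returns true may be a false positive. This is the same asymmetry described for the blacklist example.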

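The claim that the e-mail blacklist example has a false recognition probability below one in a thousand can be checked with the standard approximation p ≈ (1 − e^(−kn/m))^k, where n is the number of stored elements, m the number of bits, and k the number of hash functions. The sketch below is only an estimate: the function name is hypothetical, and the formula assumes that the hash functions are independent and uniformly distributed, which the article does not verify for its sha1-slice scheme.

```php
<?php
// Hypothetical helper: approximate false positive rate of a Bloom filter.
// Assumes $k independent, uniformly distributed hash functions.
function bloom_false_positive_rate($n, $m, $k)
{
    return pow(1 - exp(-$k * $n / $m), $k);
}

// The e-mail example: 100 million addresses, 1.6 billion bits, 8 hash functions.
$p = bloom_false_positive_rate(100000000, 1600000000, 8);
printf("%.6f\n", $p); // about 0.000574, below one in a thousand
```

The same function can be used to compare settings before allocating memory, for example to see how the estimate changes when the bit array is doubled or the number of hash functions is reduced.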