Implementing the Bloom Filter Algorithm in PHP

Source: Internet
Author: User


This article presents an implementation of the Bloom filter algorithm in PHP. It gives the implementation code directly, with detailed comments that also introduce the algorithm itself. Readers who need it can use it as a reference.


/* Deduplication filtering with the Bloom filter algorithm.

The basic idea of a Bloom filter is as follows. Allocate a block of space that holds 0/1 bits, then use a set of hash functions to determine the positions that correspond to an element. If the bit at every one of those positions is 1, the element is considered present. Otherwise the element is new, and the bits at its positions are set to 1. Because different elements may produce the same hash value, a single position can carry information about more than one element, which results in a certain false positive rate.
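
The idea above can be sketched in a few lines. This is a minimal illustration, not the class presented later in this article: the name `TinyBloom`, the default sizes, and the use of salted `md5` digests as the "set of hash functions" are all assumptions made for the example.

```php
<?php
// Minimal Bloom filter sketch (illustrative only; every name here is hypothetical).
final class TinyBloom
{
    private array $bits; // one array slot per bit, for readability rather than compactness
    private int $m;      // number of bits
    private int $k;      // number of hash functions

    public function __construct(int $m = 1024, int $k = 3)
    {
        $this->m = $m;
        $this->k = $k;
        $this->bits = array_fill(0, $m, 0);
    }

    // Derive the i-th position by salting the key with the hash index.
    private function position(string $key, int $i): int
    {
        return (int) hexdec(substr(md5($i . ':' . $key), 0, 8)) % $this->m;
    }

    public function add(string $key): void
    {
        for ($i = 0; $i < $this->k; $i++) {
            $this->bits[$this->position($key, $i)] = 1;
        }
    }

    public function mightContain(string $key): bool
    {
        for ($i = 0; $i < $this->k; $i++) {
            if ($this->bits[$this->position($key, $i)] === 0) {
                return false; // a clear bit proves the key was never added
            }
        }
        return true; // all bits set: either present or a false positive
    }
}

$filter = new TinyBloom();
$filter->add('alice@example.com');
var_dump($filter->mightContain('alice@example.com')); // bool(true): added keys are always found
var_dump($filter->mightContain('bob@example.com'));   // normally false; true would be a false positive
```

A key that was added is always reported as present, so the only possible error is a false positive.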

If too little space is allocated, more and more bits become 1 as the number of elements grows, each new element is increasingly likely to collide with earlier ones, and the false positive rate rises. The choice and number of hash functions also has to be balanced: more hash functions can improve the accuracy of the test, but they slow the program down, and more hash functions require more space to hold the position information.
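
The trade-off can be made concrete with the standard textbook approximation for the false positive rate, (1 - e^(-kn/m))^k, where m is the number of bits, n the number of stored elements and k the number of hash functions. The formula and the function names below are not part of the original article. They assume ideal, independent hash functions, and the bit-array size of 8000 is purely hypothetical.

```php
<?php
// Standard approximation for a Bloom filter's false positive rate (an assumption of this sketch).
// $m = number of bits, $n = number of stored elements, $k = number of hash functions.
function falsePositiveRate(int $m, int $n, int $k): float
{
    return pow(1.0 - exp(-$k * $n / $m), $k);
}

// Number of hash functions that minimises the rate for given $m and $n: (m / n) * ln 2.
function optimalHashCount(int $m, int $n): int
{
    return max(1, (int) round($m / $n * log(2)));
}

$m = 8000; // hypothetical bit-array size

// With the space fixed, the rate climbs as more elements are stored.
foreach ([500, 1000, 2000, 4000] as $n) {
    $k = optimalHashCount($m, $n);
    printf("n=%d  k=%d  rate=%.4f\n", $n, $k, falsePositiveRate($m, $n, $k));
}

// With space and element count fixed, both too few and too many hash functions hurt.
foreach ([1, 6, 20] as $k) {
    printf("k=%d  rate=%.4f\n", $k, falsePositiveRate($m, 1000, $k));
}
```

Under this model, the second loop shows the rate is lowest near the optimal hash count and worse on either side of it.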

Applications of the Bloom filter.

A Bloom filter is typically used to decide whether an element exists in a very large data set, for example in the spam filter of a mail server. In the search engine field, it is most often used for URL deduplication in web crawlers (spiders). A crawler usually keeps a list of the URLs of pages that are waiting to be downloaded or have already been downloaded. After it downloads a page and extracts new URLs from it, it needs to determine whether each URL is already in the list. A Bloom filter is well suited to this task.

For example, a public email provider such as Yahoo, Hotmail, or Gmail always needs to filter out junk mail from spammers. One approach is to record the email addresses that send spam. Because those senders keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all would require a large number of network servers.

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions. The example above illustrates how it works.

Assume we store 100 million email addresses. We first allocate a vector of 1.6 billion bits, that is, 200 million bytes, and clear all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to produce eight information fingerprints (f1, f2, ..., f8). A random number generator G then maps these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion, and we set the bits at all eight positions to one. Once all 100 million addresses have been processed this way, the Bloom filter for these addresses is built.

Now consider how to use the filter to check whether a suspicious email address Y is on the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to produce eight fingerprints s1, s2, ..., s8 for this address, and map them to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is on the blacklist, the bits at t1, t2, ..., t8 must all be one. In this way, any email address that is on the blacklist is reliably detected.
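
The sizing arithmetic in this example, and the way a bit position maps into a packed byte vector, can be checked with a short sketch. The helper `locateBit` is hypothetical, and it uses 0-based positions rather than the 1-based range in the text.

```php
<?php
// Sizing from the example: 100 million addresses at 16 bits each.
$addresses = 100000000;
$bitsPerAddress = 16;                      // 1.6 billion bits / 100 million addresses
$totalBits = $addresses * $bitsPerAddress; // 1,600,000,000
$totalBytes = intdiv($totalBits, 8);       // 200,000,000

// Locate a 0-based bit position inside a packed byte string (hypothetical helper).
function locateBit(int $position): array
{
    return [$position >> 3, $position & 7]; // [byte index, bit offset within that byte]
}

echo $totalBits, " bits = ", $totalBytes, " bytes\n"; // 1600000000 bits = 200000000 bytes
[$byteIndex, $bitOffset] = locateBit($totalBits - 1);
echo "last bit: byte $byteIndex, offset $bitOffset\n"; // last bit: byte 199999999, offset 7
```

The shift-by-3 and mask-by-7 steps are the same byte and bit arithmetic that a packed implementation needs for every position it sets or tests.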

The Bloom filter never misses a suspicious address that is on the blacklist. It does have one shortcoming: with a very small probability, it may judge an address that is not on the blacklist to be on it, because a legitimate address may happen to map to eight bits that have all been set to one. Fortunately, this is unlikely. We call it the false positive probability. In the example above, the false positive probability is put at below one in ten thousand.
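
As a rough cross-check, which the original text does not derive, the widely used approximation for a Bloom filter's false positive probability can be applied to the numbers in this example. Here n is the number of stored addresses, m the number of bits and k the number of hash functions. The formula assumes ideal, independent hash functions.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad n = 10^{8},\quad m = 1.6 \times 10^{9},\quad k = 8
\;\Longrightarrow\;
p \approx \left(1 - e^{-0.5}\right)^{8} \approx 0.3935^{8} \approx 5.7 \times 10^{-4}
```

Under these assumptions the estimate is roughly 6 in 10,000. That is the same order of magnitude as the figure quoted above, though somewhat higher, so the quoted figure is best read as an approximate one.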

The advantage of the Bloom filter is that it is fast and compact, at the cost of a certain false positive rate. A common remedy is to keep a small whitelist of the email addresses that might be misjudged.

*/

// Describe the algorithm above with a PHP program
$set = array(1, 2, 3, 4, 5, 6);

// Determine whether 5 is in $set
$bloomFilter = array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

// Use an algorithm to turn $bloomFilter into a representation of the set. Here we use
// a trivial one: each value in the set switches the slot at the same index to 1.
foreach ($set as $key) {
    $bloomFilter[$key] = 1;
}
var_dump($bloomFilter);
// At this point $bloomFilter = array(0, 1, 1, 1, 1, 1, 1, 0, 0, 0)

// Determine whether the value is in the set
if ($bloomFilter[5] == 1) {
    echo 'in set';
} else {
    echo 'not in set';
}

// The above is only a simple example. In practice several hash functions are needed;
// on the other hand, the fewer hash functions are used, the more bits of the array remain 0.

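class Bloom_filter {
    // Declared explicitly because dynamic properties are deprecated as of PHP 8.2
    public $one_num, $space_group_num, $hash_space_assoc, $pow_array;
    public $chr_array, $ord_array, $hash_func_pos, $hash_func_num;
    public $write_num, $ext_num;

    function __construct($hash_func_num = 1, $space_group_num = 1) {
        // 2^25 bytes = 32 MB = 2^28 bits per storage group
        $max_length = pow(2, 25);
        $binary = pack('C', 0);
        // 1 byte occupies 8 bits
        $this->one_num = 8;
        // Defaults to 1 group of 32 MB
        $this->space_group_num = $space_group_num;
        $this->hash_space_assoc = array();
        // Allocate space
        for ($i = 0; $i < $this->space_group_num; $i++) {
            $this->hash_space_assoc[$i] = str_repeat($binary, $max_length);
        }
        // Bit masks: bit index => value
        $this->pow_array = array(
            0 => 1,
            1 => 2,
            2 => 4,
            3 => 8,
            4 => 16,
            5 => 32,
            6 => 64,
            7 => 128,
        );
        // Lookup tables between byte values (0..255) and single characters
        $this->chr_array = array();
        $this->ord_array = array();
        for ($i = 0; $i < 256; $i++) {
            $chr = chr($i);
            $this->chr_array[$i] = $chr;
            $this->ord_array[$chr] = $i;
        }
        // Each entry is array(offset, length, ...): a slice of the 40-digit SHA-1 hex digest.
        // 7 hex digits = 28 bits, which addresses one bit in a 32 MB group.
        // The third element is not used by the code in this class.
        $this->hash_func_pos = array(
            0 => array(0, 7, 1),
            1 => array(7, 7, 1),
            2 => array(14, 7, 1),
            3 => array(21, 7, 1),
            4 => array(28, 7, 1),
            5 => array(33, 7, 1),
            6 => array(17, 7, 1),
        );
        $this->write_num = 0;
        $this->ext_num = 0;
        // Passing 0 selects all available hash slices
        if (!$hash_func_num) {
            $this->hash_func_num = count($this->hash_func_pos);
        } else {
            $this->hash_func_num = $hash_func_num;
        }
    }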

    // Adds $key to the filter. Returns true when every bit was already set,
    // i.e. the key has (probably) been seen before, and false otherwise.
    function add($key) {
        $hash_bit_set_num = 0;
        // Hash the key
        $hash_basic = sha1($key);
        // Take the first 4 hex digits and convert them from hexadecimal to decimal
        $hash_space = hexdec(substr($hash_basic, 0, 4));
        // Take the modulus to choose a storage group
        $hash_space = $hash_space % $this->space_group_num;
        for ($hash_i = 0; $hash_i < $this->hash_func_num; $hash_i++) {
            $hash = hexdec(substr($hash_basic, $this->hash_func_pos[$hash_i][0], $this->hash_func_pos[$hash_i][1]));
            // Index of the byte that holds the bit
            $bit_pos = $hash >> 3;
            // Current value of that byte
            $max = $this->ord_array[$this->hash_space_assoc[$hash_space][$bit_pos]];
            // Offset of the bit within the byte (0..7)
            $num = $hash - $bit_pos * $this->one_num;
            $bit_pos_value = ($max >> $num) & 0x01;
            if (!$bit_pos_value) {
                // Bit not set yet: set it and write the byte back
                $max = $max | $this->pow_array[$num];
                $this->hash_space_assoc[$hash_space][$bit_pos] = $this->chr_array[$max];
                $this->write_num++;
            } else {
                $hash_bit_set_num++;
            }
        }
        if ($hash_bit_set_num == $this->hash_func_num) {
            $this->ext_num++;
            return true;
        }
        return false;
    }

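    // Returns how many keys were reported as already present (ext_num)
    // and how many bits were newly written (write_num).
    function get_stat() {
        return array(
            'ext_num' => $this->ext_num,
            'write_num' => $this->write_num,
        );
    }
}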

// Test
// Use 6 hash functions; at most 7 are currently available
$hash_func_num = 6;
// Allocate 1 storage group of 32 MB. In theory, the more space, the lower the
// false positive rate; mind the memory limit configured in php.ini.
$space_group_num = 1;
$bf = new Bloom_filter($hash_func_num, $space_group_num);
$list = array(
    'http://test/1',
    'http://test/2',
    'http://test/3',
    'http://test/4',
    'http://test/5',
    'http://test/6',
    'http://test/1',
    'http://test/2',
);
// add() returns true only when every bit was already set, so this prints
// the URLs that were seen before (here the repeated http://test/1 and http://test/2).
foreach ($list as $k => $v) {
    if ($bf->add($v)) {
        echo $v, "\n";
    }
}
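
In the class above, `add()` tests and sets bits in one pass, so checking a key also inserts it. Where a read-only membership test is wanted, something along the following lines could be used. This is a standalone sketch, not part of the original class: the `bf_*` function names, the 8 KB bit string, and the four 4-digit SHA-1 slices are assumptions chosen to keep the example small.

```php
<?php
// Read-only membership check on a packed bit string (a sketch; all names are hypothetical).
const BF_BYTES = 8192;            // 2^13 bytes = 2^16 bits, so 4 hex digits address one bit
const BF_OFFSETS = [0, 4, 8, 12]; // where each 4-digit slice of the SHA-1 hex digest starts

// Turn a key into one bit position per slice of its SHA-1 digest.
function bf_positions(string $key): array
{
    $digest = sha1($key);
    $positions = [];
    foreach (BF_OFFSETS as $offset) {
        $positions[] = (int) hexdec(substr($digest, $offset, 4)); // 0 .. 65535
    }
    return $positions;
}

// Set the bits for $key; this modifies $bits in place.
function bf_add(string &$bits, string $key): void
{
    foreach (bf_positions($key) as $pos) {
        $byte = ord($bits[$pos >> 3]);
        $bits[$pos >> 3] = chr($byte | (1 << ($pos & 7)));
    }
}

// Test the bits for $key without changing anything.
function bf_contains(string $bits, string $key): bool
{
    foreach (bf_positions($key) as $pos) {
        if (((ord($bits[$pos >> 3]) >> ($pos & 7)) & 1) === 0) {
            return false; // one clear bit proves the key was never added
        }
    }
    return true; // all bits set: present, or a false positive
}

$bits = str_repeat("\0", BF_BYTES);
bf_add($bits, 'http://test/1');
var_dump(bf_contains($bits, 'http://test/1')); // bool(true)
var_dump(bf_contains($bits, 'http://test/2')); // false unless all four bits happen to collide
```

Keeping the query separate from the insert lets a crawler ask whether a URL has been seen without marking it as seen.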
