Implementing a Bloom Filter in PHP



The Bloom Filter (BF) is a fast lookup algorithm based on mapping with multiple hash functions, proposed by Bloom in 1970. It quickly checks whether an element belongs to a set, and suits applications that do not require 100% accuracy. A Bloom Filter is commonly used to deduplicate crawler URLs, that is, to determine whether a URL has already been crawled. For the principle, I quote an article by someone else rather than restating it myself. Having read several PHP implementations of BF, I found them not very readable, so this article mainly presents my own PHP implementation of the Bloom Filter.

Principle:

<Reference from this article>

1. Example

To illustrate why the Bloom Filter is worth having, let's take an example:

Suppose you want to write a web crawler (spider). Because the links between pages are complex, a spider crawling the web is likely to end up in a "loop". To avoid loops, you need to know which URLs the spider has already visited. How do you know whether a URL has been visited? Consider the following options:

1. Save the accessed URL to the database.

2. Use a HashSet to save the visited URLs. Checking whether a URL has been visited then costs close to O(1).

3. Hash each URL with MD5 or SHA-1 and save the digest to the HashSet or database.

4. Bit-Map method. Create a BitSet to map each URL to a bit through a hash function.

Methods 1 to 3 store every visited URL in full. Method 4 only marks one mapped bit per URL.

All of these methods work well when the data volume is small, but problems arise when it becomes very large.

Disadvantage of Method 1: relational database queries become very inefficient when the data volume is very large. Issuing a database query for every URL encountered is also rather heavy-handed.

Disadvantage of Method 2: it consumes too much memory. Memory usage grows with the number of URLs. Even with only 100 million URLs of 50 characters each, about 5 GB of memory is required.

Method 3: an MD5 digest is only 128 bits and a SHA-1 digest only 160 bits, so Method 3 uses several times less memory than Method 2.
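The memory figures for Methods 2 and 3 can be checked with a few lines of arithmetic. The sketch below is only an estimate: it counts raw payload bytes, assumes one byte per URL character and a 64-bit PHP build, and ignores the per-entry overhead of a real hash set or database. The function name rawStorageBytes is made up for this illustration.

```php
<?php
// Raw payload size only: real containers add per-entry overhead on top.
function rawStorageBytes(int $count, int $bytesPerItem): int
{
    return $count * $bytesPerItem;
}

$urls = 100000000; // 100 million URLs, as in the example

$fullUrls = rawStorageBytes($urls, 50); // 50 one-byte characters per URL
$md5      = rawStorageBytes($urls, 16); // MD5 digest: 128 bits = 16 bytes
$sha1     = rawStorageBytes($urls, 20); // SHA-1 digest: 160 bits = 20 bytes

printf("full URLs: %.1f GB\n", $fullUrls / 1e9); // 5.0 GB
printf("MD5:       %.1f GB\n", $md5 / 1e9);      // 1.6 GB
printf("SHA-1:     %.1f GB\n", $sha1 / 1e9);     // 2.0 GB
```

Under these assumptions the digests take roughly a third to two fifths of the space of the full URLs.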

Method 4 consumes relatively little memory, but its drawback is that the probability of conflict with a single hash function is too high. Remember the various techniques for resolving hash table conflicts from your data structures course? To reduce the conflict probability to 1%, the BitSet length must be set to 100 times the number of URLs.
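The "100 times" figure can be reproduced with a small calculation. This is a sketch under the assumption of an ideal hash function that spreads URLs uniformly over the m bits; in that case the chance that a new URL lands on a bit already set by one of n earlier URLs is 1 - (1 - 1/m)^n, which is close to n/m when m is much larger than n. The function name is hypothetical.

```php
<?php
// Probability that a new URL hits a bit already set by one of $n earlier URLs,
// assuming a single, ideally uniform hash function over $m bits.
function singleHashConflictProbability(int $m, int $n): float
{
    return 1.0 - pow(1.0 - 1.0 / $m, $n);
}

$n = 1000000; // one million URLs already recorded

printf("m = 100n: %.4f\n", singleHashConflictProbability(100 * $n, $n)); // about 0.01
printf("m = 10n:  %.4f\n", singleHashConflictProbability(10 * $n, $n));  // about 0.095
```

With m = 100n the conflict probability is about 1%, and with m = 10n it is already close to 10%.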

In essence, the methods above overlook an important implicit condition: a small probability of error is tolerable, and 100% accuracy is not required! That is, if a small number of URLs that the crawler has not actually visited are misjudged as visited, the cost is very small: at worst, a few web pages are not crawled.




2. Bloom Filter Algorithm

This brings us to the Bloom Filter, the main subject of this article. In fact, the idea of Method 4 above is already very close to the Bloom Filter. The fatal disadvantage of Method 4 is its high probability of conflict. To reduce the probability of conflict, the Bloom Filter uses multiple hash functions instead of one.

The Bloom Filter algorithm is as follows:

(1) Initialization

Create an m-bit BitSet, initialize all bits to 0, and then choose k different hash functions. Denote the result of the i-th hash function on string str as h(i, str); the range of h(i, str) is 0 to m-1.

(2) Check whether a string exists

The following is the process of checking whether string str has been recorded in the BitSet:

For string str, compute h(1, str), h(2, str), ..., h(k, str). Then check whether bits h(1, str), h(2, str), ..., h(k, str) of the BitSet are all 1. If any of them is not 1, str has definitely never been recorded. If all of them are 1, string str is "considered" to exist.

If the bits corresponding to a string are not all 1, it is certain that the string has not been recorded by the Bloom Filter. (This is obvious: recording a string sets all of its corresponding bits to 1.)

However, if the bits corresponding to a string are all 1, it is not 100% certain that the string has been recorded by the Bloom Filter, because all of its bits may have been set by other strings. Misjudging a string in this way is called a false positive.
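A tiny worked example makes the false positive case concrete. The sketch below is a toy, not a usable filter: it assumes m = 8 bits and k = 2 deliberately weak hash functions (the character codes of the first and last characters modulo m), chosen only so the bit positions are easy to follow by hand. All function names are hypothetical.

```php
<?php
// Toy setup: k = 2 deliberately weak, hypothetical hash functions over $m bits.
function toyHashes(string $str, int $m): array
{
    return [
        ord($str[0]) % $m,                // h(1, str): code of the first character
        ord($str[strlen($str) - 1]) % $m, // h(2, str): code of the last character
    ];
}

function record(array &$bits, string $str): void
{
    foreach (toyHashes($str, count($bits)) as $pos) {
        $bits[$pos] = 1;
    }
}

function mightContain(array $bits, string $str): bool
{
    foreach (toyHashes($str, count($bits)) as $pos) {
        if ($bits[$pos] === 0) {
            return false; // one zero bit proves str was never recorded
        }
    }
    return true; // all bits are 1: recorded, or a false positive
}

$bits = array_fill(0, 8, 0);
record($bits, "ab"); // sets bits 1 and 2
record($bits, "cd"); // sets bits 3 and 4

var_dump(mightContain($bits, "ab")); // bool(true): it was recorded
var_dump(mightContain($bits, "ae")); // bool(false): bit 5 is still 0
var_dump(mightContain($bits, "ad")); // bool(true): false positive, "ad" was never recorded
```

The string "ad" maps to bits 1 and 4, which were set by "ab" and "cd" respectively, so the filter wrongly reports it as present.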

(3) Delete a string

Once a string has been added, it cannot be deleted, because deleting it would affect other strings. If deletion is really needed, a Counting Bloom Filter (CBF) can be used. This is a variant of the basic Bloom Filter: the CBF replaces each bit of the basic Bloom Filter with a counter, which makes deleting strings possible.
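A minimal sketch of the counter idea follows. It is not taken from the referenced article: the class name, the method names, and the way positions are derived (CRC32 of the key prefixed with the hash index, on a 64-bit PHP 7.4+ build) are all assumptions made for illustration.

```php
<?php
// Counting Bloom Filter sketch: each slot is a counter instead of a single bit.
class CountingBloomFilter
{
    private array $counters;
    private int $m;
    private int $k;

    public function __construct(int $m, int $k)
    {
        $this->m = $m;
        $this->k = $k;
        $this->counters = array_fill(0, $m, 0);
    }

    // Assumed hashing scheme: CRC32 of "i:key" for i = 0..k-1, reduced modulo m.
    private function positions(string $key): array
    {
        $positions = [];
        for ($i = 0; $i < $this->k; $i++) {
            $positions[] = crc32($i . ':' . $key) % $this->m;
        }
        return $positions;
    }

    public function add(string $key): void
    {
        foreach ($this->positions($key) as $pos) {
            $this->counters[$pos]++;
        }
    }

    public function mightContain(string $key): bool
    {
        foreach ($this->positions($key) as $pos) {
            if ($this->counters[$pos] === 0) {
                return false;
            }
        }
        return true;
    }

    // Only safe for keys that were really added; see the note after the code.
    public function remove(string $key): void
    {
        if (!$this->mightContain($key)) {
            return; // definitely not present, nothing to undo
        }
        foreach ($this->positions($key) as $pos) {
            $this->counters[$pos]--;
        }
    }
}

$cbf = new CountingBloomFilter(64, 3);
$cbf->add("http://example.com/a");
$cbf->add("http://example.com/b");
$cbf->remove("http://example.com/a");
var_dump($cbf->mightContain("http://example.com/b")); // bool(true): b's counters are untouched by a's removal
```

Removing a key that was never added but happens to test positive would decrement counters that belong to other keys, so callers should only remove keys they know were added.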

The Bloom Filter differs from the single-hash-function Bit-Map in that it uses k hash functions, so each string corresponds to k bits. This reduces the probability of conflict.




3. Bloom Filter parameter selection

(1) Selection of Hash Functions

The choice of hash functions has a great impact on performance. A good hash function maps strings to each bit with approximately equal probability. Choosing k different hash functions is troublesome. A simple approach is to pick one hash function and feed it k different parameters.
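One way to read "one hash function with k different parameters" is to prefix the input with the index of the hash function before hashing. The sketch below does this with PHP's built-in crc32(); the function name hashPositions and the "i:str" prefix format are assumptions for illustration, and a 64-bit PHP build is assumed so that crc32() returns a non-negative integer.

```php
<?php
// Derive k bit positions in [0, m-1] from one hash function (CRC32)
// by varying a parameter: the index i is prepended to the input.
function hashPositions(string $str, int $k, int $m): array
{
    $positions = [];
    for ($i = 0; $i < $k; $i++) {
        $positions[] = crc32($i . ':' . $str) % $m;
    }
    return $positions;
}

// The same string always yields the same k positions.
print_r(hashPositions("http://example.com/", 4, 1000));
```

CRC32 is fast but not a particularly uniform hash, so a stronger function may be preferable when the false positive rate matters.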

(2) Bit Array Size Selection

The relationship among the number of hash functions k, the bit array size m, and the number of strings n is discussed in reference 1, which proves that for given m and n, the error probability is smallest when k = ln(2) * m/n.

The same reference also gives the error probability for specific k, m, and n. For example, with k = 10 hash functions and a bit array size m of 20 times the number of strings n, the false positive probability is 0.0000889, which is good enough for most web crawlers.
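These numbers can be reproduced with the standard approximation for the false positive probability of a Bloom Filter, (1 - e^(-k*n/m))^k. The formula is the commonly used approximation rather than something quoted from the reference, and the function names below are hypothetical.

```php
<?php
// Standard approximation of the false positive probability
// for m bits, n recorded strings and k hash functions.
function falsePositiveRate(int $m, int $n, int $k): float
{
    return pow(1.0 - exp(-$k * $n / $m), $k);
}

// Number of hash functions that minimizes the error for given m and n.
function optimalK(int $m, int $n): float
{
    return log(2) * $m / $n;
}

$n = 1000000; // number of strings
$m = 20 * $n; // bit array 20 times the number of strings

printf("k = 10: %.7f\n", falsePositiveRate($m, $n, 10)); // 0.0000889
printf("optimal k: %.2f\n", optimalK($m, $n));           // 13.86
```

For m = 20n the optimal k is about 14, so k = 10 is slightly below the optimum but already gives a very low error rate.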

Implementation:
<?php
/***************************************************************************
 *
 * Copyright (c) 2015 Baidu.com, Inc. All Rights Reserved
 *
 **************************************************************************/

/**
 * @file bloomfilter.php
 * @author Rachel Zhang(zrqsophia@sina.com)
 * @date 2015/07/24 18:48:57
 * @version $Revision$
 * @brief
 *
 **/
class BloomFilter
{
    public $m;      // size of the bit array
    public $n;      // number of strings to hash
    public $k;      // number of hash functions
    public $bitset; // bit array of size m

    // __construct replaces the PHP 4 style constructor, which PHP 8 no longer calls.
    public function __construct($mInit, $nInit)
    {
        $this->m = $mInit;
        $this->n = $nInit;
        // k = ln(2) * m / n, rounded up to a whole number of hash functions
        $this->k = (int) ceil(($this->m / $this->n) * log(2));
        echo "number of functions: $this->k\n";
        $this->bitset = array_fill(0, $this->m, false);
    }

    // Return the k bit positions for $str.
    public function hashcode($str)
    {
        $res = array();
        $seed = crc32($str);
        // Seed the generator with the string's CRC32 so that the same string
        // always yields the same k positions. Note that this reseeds the
        // global mt_rand generator.
        mt_srand($seed);
        for ($i = 0; $i < $this->k; $i++) {
            $res[] = mt_rand(0, $this->m - 1);
        }
        return $res;
    }

    // Record $key by setting its k bits.
    public function addKey($key)
    {
        foreach ($this->hashcode($key) as $codebit) {
            $this->bitset[$codebit] = true;
        }
    }

    // Return false if $key was definitely never added, true if it probably was.
    public function existKey($key)
    {
        foreach ($this->hashcode($key) as $codebit) {
            if (!$this->bitset[$codebit]) {
                return false;
            }
        }
        return true;
    }
}

$bf = new BloomFilter(10, 2);
$str_add1 = "test1";
$str_add2 = "test2";
$str_notadd3 = "test3";
//var_dump($bf->hashcode($str_add1));
$bf->addKey($str_add1);
$bf->addKey($str_add2);
var_dump($bf->existKey($str_add1));    // true: it was added
var_dump($bf->existKey($str_add2));    // true: it was added
var_dump($bf->existKey($str_notadd3)); // false, unless a false positive occurs
?>

Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
