Bloom Filter: Principle and Python Support Library

Source: Internet
Author: User

Welcome to my blog

A Bloom filter is a fast, probabilistic membership test built on multiple hash functions. It suits applications that must quickly determine whether an element belongs to a set but do not strictly require the answer to be 100% correct.
A Bloom filter can misjudge elements that are not in the set: it may report an element as present when it was never added (a false positive), but it never reports an element that was added as absent (no false negatives).

Scenario

I first used a Bloom filter to de-duplicate crawler links. With the naive approach of storing every URL that has been crawled, the duplicate check gets slower and memory use keeps growing as the data grows. Even with a digest algorithm and hash storage, this trend is only slowed down.
I needed a method that stays fast and uses little memory even when there are many URLs, so I chose a Bloom filter. Its cost, occasionally misjudging a new URL as already crawled, is completely acceptable for my scenario, since it only means a few pages are not captured.

Principle

A Bloom filter maintains only an m-bit bit array (BitArray). Initially, all m bits are 0. Recording elements (such as URLs that have been crawled) is simply the process of setting some positions in the bit array from 0 to 1.
In addition, a Bloom filter requires K different hash functions. The result of each hash function is a value in the range 0 to m-1, so each result maps to one bit position in the bit array.
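As a rough illustration of how K hash functions can each yield a position in the range 0 to m-1, here is a minimal sketch. It assumes the K "different" hash functions are simulated by prefixing the input with the function index before hashing with SHA-256. The name bit_positions and this salting scheme are illustrative choices, not something the original article specifies.

```python
import hashlib


def bit_positions(item, m, k):
    """Return k bit positions, each in the range 0..m-1, for a string item.

    Assumption: hash function number i is simulated by hashing the
    string "<i>:<item>" with SHA-256 and reducing the digest modulo m.
    """
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest, "big") % m)
    return positions


# Example: 4 positions inside a 96-bit array for one URL.
print(bit_positions("http://example.com/page", m=96, k=4))
```

The same input always produces the same K positions, which is what lets a later lookup re-derive the bits that were set when the element was recorded.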

Record Element

Now let's look at the specific process of inserting a string into the Bloom filter. The string 'str' is passed through the K different hash functions to obtain the results h1, h2, ..., hK. Then the bits at positions h1, h2, ..., hK of the BitArray are set to 1.
  

Judgment Element

How do we determine whether a string 'str' has been recorded? You may be able to work out this process yourself.
Compute the string through the K hash functions to obtain h1, h2, ..., hK. Then check one by one whether the bits at positions h1, h2, ..., hK of the BitArray are 1:

1. If any one of these bits is not 1, the string has definitely never been recorded by the Bloom filter.
2. If all of these bits are 1, the string may have been recorded by the Bloom filter. (Why can't this be 100% certain? Because those bits may each have been set by other recorded strings.) This is the source of the Bloom filter's false positives.
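To make the source of a false positive concrete, here is a small hand-checkable sketch. It uses a toy 7-bit array and two deliberately weak hash functions on integers (x % 7 and x % 5). These hash functions, and the names record and maybe_recorded, are assumptions made purely for illustration; a real filter would use far better hashes and a much larger bit array.

```python
M = 7  # number of bits in the toy bit array
bits = [0] * M


def toy_hashes(x):
    """Two weak, illustrative hash functions; both results lie in 0..M-1."""
    return [x % 7, x % 5]


def record(x):
    """Set the bit at every hashed position to 1."""
    for pos in toy_hashes(x):
        bits[pos] = 1


def maybe_recorded(x):
    """True only if every hashed position is already 1."""
    return all(bits[pos] == 1 for pos in toy_hashes(x))


record(10)  # sets bits 3 and 0
record(22)  # sets bits 1 and 2

print(maybe_recorded(10))  # True: 10 really was recorded
print(maybe_recorded(4))   # False: bit 4 is still 0, so 4 was never recorded
print(maybe_recorded(15))  # True, although 15 was never recorded
```

The value 15 hashes to bits 1 and 0. Bit 1 was set by 22 and bit 0 by 10, so the check passes even though 15 itself was never inserted.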

The principle of the Bloom filter is simple enough that you can implement one yourself. The only real problem is how to reduce the false positive rate.
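The recording and checking steps described above might be put together roughly as follows. This is a minimal sketch under stated assumptions, not a production implementation: the class name SimpleBloomFilter is hypothetical, the bit array is a plain bytearray, and the K hash functions are simulated by salting SHA-256 with the function index.

```python
import hashlib


class SimpleBloomFilter:
    """Minimal Bloom filter: an m-bit array plus k salted SHA-256 hashes."""

    def __init__(self, m, k):
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # all bits start at 0

    def _positions(self, item):
        # Assumption: hash function i is SHA-256 over "<i>:<item>".
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        """Record an element by setting its k bits to 1."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """False means definitely not recorded; True means possibly recorded."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = SimpleBloomFilter(m=1024, k=4)
url = "http://blog.csdn.net/TENLIU2099/article/details/78288912"
print(url in seen)  # False: the filter is still empty
seen.add(url)
print(url in seen)  # True: every bit for this URL is now set
```

Storing bits in a bytearray keeps memory use at roughly m / 8 bytes no matter how many URLs are recorded, which is the property that makes this structure attractive for crawler de-duplication.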

Factors Affecting the False Positive Rate

To make a Bloom filter a useful tool, its false positive rate must be reduced to an acceptable level. What factors affect it?

1. The number of bits M in the BitArray
2. The number of hash functions K
3. The quality of each hash function

For a given number N of elements to be recorded, M and K can be chosen to minimize the false positive rate.
With the above, we can implement our own BloomFilter.
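The article does not spell out the relationship between M, K, and N. The sketch below uses the standard textbook approximations for a classic Bloom filter with ideal hash functions: p ≈ (1 - e^(-KN/M))^K, M = -N ln p / (ln 2)^2, and K = (M/N) ln 2. The function names are illustrative, and a particular library may size its filter differently.

```python
import math


def optimal_size(n, p):
    """Textbook choice of bit count m and hash count k for n elements
    and a target false positive rate p (assumes ideal hash functions)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def false_positive_rate(m, k, n):
    """Approximate false positive rate after recording n elements."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = optimal_size(n=100000, p=0.001)
print(m, k)
print(false_positive_rate(m, k, n=100000))  # close to the 0.001 target

# Recording more elements than planned raises the error rate sharply.
print(false_positive_rate(m, k, n=200000))
```

For 10 elements and an error rate of 0.1, these formulas give 48 bits, whereas the pybloom session later in this article reports num_bits of 96, so the library evidently applies its own sizing rule.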

Python BloomFilter Library

Of course, Python already has a Bloom filter library, which can be installed with pip.

>>> from pybloom import BloomFilter
>>> dir(BloomFilter)
['FILE_FMT', '__and__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__or__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_setup', 'add', 'copy', 'fromfile', 'intersection', 'tofile', 'union']

Let's take a look at the documentation of __init__:

>>> print(BloomFilter.__init__.__doc__)
Implements a space-efficient probabilistic data structure

    capacity
        this BloomFilter must be able to store at least *capacity* elements
        while maintaining no more than *error_rate* chance of false
        positives
    error_rate
        the error_rate of the filter returning false positives. This
        determines the filters capacity. Inserting more than capacity
        elements greatly increases the chance of false positives.

    >>> b = BloomFilter(capacity=100000, error_rate=0.001)
    >>> b.add("test")
    False
    >>> "test" in b
    True

There are two parameters: capacity and error_rate.
capacity is the capacity of the Bloom filter, that is, the number of elements it must be able to record while staying within error_rate.
error_rate is the acceptable false positive rate.
Given these two parameters, you can initialize the filter. The docstring also gives an example.
Here is a simple demo with small values so that the bit array is easy to observe:

>>> b = BloomFilter(capacity=10, error_rate=0.1)
>>> b.bitarray
bitarray('000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000')
>>> b.num_bits
96
>>> s = "http://blog.csdn.net/TENLIU2099/article/details/78288912"
>>> b.add(s)
False
>>> b.bitarray
bitarray('010000000000000000000000000100000000000000000000000010000000000000000000000010000000000000000000')
>>> s in b
True

That's all for now.
