Anatomy of a Bloom filter

Source: Internet
Author: User

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a long binary vector (a bit array) and a series of random mapping (hash) functions, and it is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms. Its disadvantages are a certain false positive rate (the filter reports that an element is in the set when in fact it is not) and the difficulty of deleting elements. However, it never produces false negatives: if an element has been added to the set, the Bloom filter will never report it as absent, so nothing is missed.

In short: an answer of "present" may be wrong (because of hash collisions); an answer of "not present" is always correct.

Improvement: the more bits in the array, the more space is used and the lower the false positive rate.

Deletion can be supported by keeping counters instead of single bits.

Underlying structure of a Bloom filter: a bitmap.
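The trade-off between space and false positives can be estimated numerically. The sketch below uses the standard approximation (1 - e^(-k*n/m))^k for m bits, k hash functions and n inserted elements; it assumes the hash functions behave independently and uniformly, which real hash functions only approximate. The function name EstimateFalsePositiveRate is made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Estimated false positive rate of a Bloom filter with m bits,
// k hash functions and n inserted elements: (1 - e^(-k*n/m))^k.
// This is an approximation that assumes ideal, independent hashes.
double EstimateFalsePositiveRate(std::size_t m, std::size_t k, std::size_t n) {
    assert(m > 0 && k > 0);
    // Expected fraction of bits set to 1 after n insertions.
    double fill = 1.0 - std::exp(-static_cast<double>(k) * n / m);
    // A false positive needs all k probed bits to be 1.
    return std::pow(fill, static_cast<double>(k));
}
```

For a fixed n and k, calling this with a larger m returns a smaller estimate, which is the improvement described above.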

Principle

To determine whether an element is in a set, the usual approach is to store all the elements of the set and then find the answer by comparison. Linked lists, trees, hash tables, and similar data structures all work this way. But as the number of elements in the set grows, more storage space is needed, and lookups also become slower.

A Bloom filter is a space-efficient probabilistic data structure. It can be regarded as an extension of the bitmap. Its principle is:

When an element is added to the set, K hash functions map it to K points in a bit array, and those points are set to 1. When querying, we only need to check whether these points are all 1 to know (approximately) whether the element is in the set:

    • If any of these points is 0, the queried element is definitely not in the set;

    • If all of them are 1, the queried element is probably in the set.
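The two rules above can be sketched in a few lines. This is a minimal illustration, not the implementation given later in this article: the class name TinyBloom is invented here, and the k positions are derived from two std::hash values (double hashing) purely as an assumption to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter sketch: k positions per key in a bit array.
class TinyBloom {
public:
    TinyBloom(std::size_t bits, std::size_t k) : bits_(bits, false), k_(k) {
        assert(bits > 0 && k > 0);
    }

    // Set the k points for this key to 1.
    void Add(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[Position(key, i)] = true;
    }

    // false: definitely not in the set. true: probably in the set.
    bool MightContain(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[Position(key, i)]) return false;
        return true;
    }

private:
    // i-th position for a key, derived from two base hashes (an assumption
    // of this sketch; any k reasonably independent hashes would do).
    std::size_t Position(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#") | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

A key that was added always gets true from MightContain. A key that was never added usually gets false, but may get true if other keys happened to set all of its points.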

Advantages

Its advantage is that space efficiency and query time are far better than those of general-purpose algorithms. The bit array has a fixed size no matter how many elements are inserted, and both insertion and query take O(k) time, where k is the number of hash functions. In addition, the hash functions are independent of one another, which makes a parallel hardware implementation convenient. A Bloom filter does not need to store the elements themselves, which is an advantage in situations where confidentiality requirements are very strict.

Disadvantages

But the disadvantages of the Bloom filter are just as obvious as its advantages. The false positive rate is one of them: as the number of stored elements increases, the false positive rate grows. And if the number of elements is small, a plain hash table is sufficient.

(A common remedy for false positives is to keep a small whitelist that stores the items known to be misjudged.)

In addition, elements generally cannot be deleted from a Bloom filter. An obvious idea is to replace the bit array with an array of integer counters: inserting an element increments each corresponding counter by 1, and deleting an element decrements them. However, removing elements safely is not that simple. First, we must be sure that the element being deleted really is in the Bloom filter, which the filter alone cannot guarantee. In addition, counter wraparound can also cause problems.
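One way to picture the whitelist remedy is a thin wrapper around any probabilistic membership test. The sketch below is an assumption about how such a list could be wired in, not something the original article specifies. The names CheckedFilter and MarkFalsePositive are hypothetical, and the underlying filter is passed in as a plain callable so the example stays self-contained.

```cpp
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Wraps a "might contain" test with an exact list of known false positives.
class CheckedFilter {
public:
    explicit CheckedFilter(std::function<bool(const std::string&)> might_contain)
        : might_contain_(std::move(might_contain)) {}

    // Record a key that the filter wrongly reports as present.
    void MarkFalsePositive(const std::string& key) {
        known_false_positives_.insert(key);
    }

    // Present only if the filter says so and the key is not a known misjudgment.
    bool Contains(const std::string& key) const {
        return might_contain_(key) && known_false_positives_.count(key) == 0;
    }

private:
    std::function<bool(const std::string&)> might_contain_;
    std::unordered_set<std::string> known_false_positives_;
};
```

The list only helps for misjudged keys that have already been discovered by some other means, so it suits workloads where the same few keys are queried repeatedly.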
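The counter idea can be sketched as follows. This is a minimal illustration under stated assumptions: the class name CountingBloom is invented here, positions come from two std::hash values, and counters are 8-bit and saturate at 255 instead of wrapping around. It does not solve the first problem described above: Remove can only check that the key might be present, so removing a key that was never added (a false positive) can still damage other keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Counting Bloom filter sketch: one small counter per slot instead of one bit.
class CountingBloom {
public:
    CountingBloom(std::size_t slots, std::size_t k) : counters_(slots, 0), k_(k) {
        assert(slots > 0 && k > 0);
    }

    void Add(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[Position(key, i)];
            if (c < kMax) ++c;  // saturate rather than wrap around to 0
        }
    }

    bool MightContain(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (counters_[Position(key, i)] == 0) return false;
        return true;
    }

    // Returns false and changes nothing if the key is definitely absent.
    bool Remove(const std::string& key) {
        if (!MightContain(key)) return false;
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[Position(key, i)];
            // A saturated counter has lost its true count, so leave it alone.
            if (c > 0 && c < kMax) --c;
        }
        return true;
    }

private:
    static constexpr std::uint8_t kMax = std::numeric_limits<std::uint8_t>::max();

    std::size_t Position(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#") | 1;
        return (h1 + i * h2) % counters_.size();
    }

    std::vector<std::uint8_t> counters_;
    std::size_t k_;
};
```

The price of deletion is space: each slot now takes 8 bits instead of 1 in this sketch.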

A sample implementation follows:

#pragma once
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Bitmap: records each value in the bit at the corresponding position.
class BitMap
{
public:
    BitMap(size_t len)
    {
        size_t size = len >> 5;              // 32 bits per word
        if (len % 32)
            _array.resize(size + 1);         // round up for the remainder
        else
            _array.resize(size);
    }

    // For values in [minlen, maxlen]; with this constructor the caller
    // must index with (num - minlen).
    BitMap(size_t minlen, size_t maxlen)
    {
        size_t size = (maxlen - minlen + 1) >> 5;
        if ((maxlen - minlen + 1) % 32)
            _array.resize(size + 1);
        else
            _array.resize(size);
    }

    void Set(size_t num)
    {
        size_t index = num >> 5;
        size_t count = num % 32;
        _array[index] |= (1u << count);      // set bit `count` of word `index` to 1
    }

    void Reset(size_t num)
    {
        size_t index = num >> 5;
        size_t count = num % 32;
        _array[index] &= ~(1u << count);     // clear bit `count` of word `index`
    }

    bool Test(size_t num)
    {
        size_t index = num >> 5;
        size_t count = num % 32;
        return (_array[index] & (1u << count)) != 0;
    }

private:
    // Each bit is only 0 or 1, so duplicate values cannot be counted.
    vector<uint32_t> _array;
};

class HashFunc1
{
    size_t BKDRHash(const char* str)
    {
        size_t hash = 0;
        while (size_t ch = (size_t)*str++)
        {
            hash = hash * 131 + ch;
        }
        return hash;
    }
public:
    size_t operator()(const string& key)
    {
        return BKDRHash(key.c_str());
    }
};

class HashFunc2
{
    size_t SDBMHash(const char* str)
    {
        size_t hash = 0;
        while (size_t ch = (size_t)*str++)
        {
            hash = 65599 * hash + ch;
        }
        return hash;
    }
public:
    size_t operator()(const string& key)
    {
        return SDBMHash(key.c_str());
    }
};

class HashFunc3
{
    size_t RSHash(const char* str)
    {
        size_t hash = 0;
        size_t magic = 63689;
        while (size_t ch = (size_t)*str++)
        {
            hash = hash * magic + ch;
            magic *= 378551;
        }
        return hash;
    }
public:
    size_t operator()(const string& key)
    {
        return RSHash(key.c_str());
    }
};

class HashFunc4
{
    size_t APHash(const char* str)
    {
        size_t hash = 0;
        size_t ch;
        for (long i = 0; (ch = (size_t)*str++) != 0; i++)
        {
            if ((i & 1) == 0)
            {
                hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
            }
            else
            {
                hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
            }
        }
        return hash;
    }
public:
    size_t operator()(const string& key)
    {
        return APHash(key.c_str());
    }
};

class HashFunc5
{
    size_t JSHash(const char* str)
    {
        if (!*str)          // added by the author so that the empty string hashes to 0
            return 0;
        size_t hash = 1315423911;
        while (size_t ch = (size_t)*str++)
        {
            hash ^= ((hash << 5) + ch + (hash >> 2));
        }
        return hash;
    }
public:
    size_t operator()(const string& key)
    {
        return JSHash(key.c_str());
    }
};

template<class K, class Func1 = HashFunc1, class Func2 = HashFunc2,
         class Func3 = HashFunc3, class Func4 = HashFunc4, class Func5 = HashFunc5>
class BloomFilter
{
public:
    BloomFilter(size_t cap = 100)
        : _bitmap(cap), _capacity(cap)
    {}

    // Set the five bits selected by the five hash functions.
    void Set(const K& key)
    {
        size_t index1 = Func1()(key);
        _bitmap.Set(index1 % _capacity);
        size_t index2 = Func2()(key);
        _bitmap.Set(index2 % _capacity);
        size_t index3 = Func3()(key);
        _bitmap.Set(index3 % _capacity);
        size_t index4 = Func4()(key);
        _bitmap.Set(index4 % _capacity);
        size_t index5 = Func5()(key);
        _bitmap.Set(index5 % _capacity);
        // Print the raw hash values for inspection.
        cout << index1 << " " << index2 << " " << index3 << " " << index4 << " " << index5 << endl;
    }

    // false: definitely absent. true: probably present.
    bool Test(const K& key)
    {
        if (!_bitmap.Test(Func1()(key) % _capacity))
            return false;
        if (!_bitmap.Test(Func2()(key) % _capacity))
            return false;
        if (!_bitmap.Test(Func3()(key) % _capacity))
            return false;
        if (!_bitmap.Test(Func4()(key) % _capacity))
            return false;
        if (!_bitmap.Test(Func5()(key) % _capacity))
            return false;
        return true;
    }

protected:
    BitMap _bitmap;
    size_t _capacity;
};

void Test1()
{
    BloomFilter<string> b;
    b.Set("http://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html");
    b.Set("http://www.cnblogs.com/-clq/archive/2012/05/31/2528154.html");
    b.Set("http://www.cnblogs.com/-clq/archive/2012/05/31/2528155.html");
    b.Set("http://www.cnblogs.com/-clq/archive/2012/05/31/2528156.html");
    b.Set("http://www.cnblogs.com/-clq/archive/2012/05/31/2528157.html");
    cout << b.Test("http://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html") << endl;
    cout << b.Test("http://www.cnblogs.com/-clq/archive/2012/05/31/2528154.html") << endl;
    cout << b.Test("http://www.cnblogs.com/-clq/archive/2012/05/31/2528155.html") << endl;
    cout << b.Test("http://www.cnblogs.com/-clq/archive/2012/05/31/2528156.html") << endl;
    cout << b.Test("http://www.cnblogs.com/-clq/archive/2012/05/31/2528157.html") << endl;
    // This URL was never inserted.
    cout << b.Test("http://www.cnblogs.com/-clq/archive/2012/05/31/2528158.html") << endl;
}

Example

A Bloom filter can quickly and space-efficiently determine whether an element belongs to a set. It can be used to implement a data dictionary or to compute the intersection of sets.

For example, Google Chrome has used a Bloom filter to identify malicious links (it can represent a large data set in little storage space; put simply, each URL is mapped to bits), with a false positive rate below one in 10,000.
Another example: detection of junk e-mail.


This article is from the "Small Stop" blog, please be sure to keep this source http://10541556.blog.51cto.com/10531556/1829654
