The Bloom filter was proposed by Burton Bloom in 1970 and was initially used widely in spelling checkers and database systems. In recent years, with the development of computing and Internet technologies and the continuous growth of data sets, the Bloom filter has attracted renewed attention, and many new applications and variants have emerged. A Bloom filter is a highly space-efficient data structure consisting of a bit array and a group of hash functions. It can be used to test whether an element belongs to a set. Its advantage is that its space efficiency and query time far exceed those of ordinary algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements.
Basic Principles
Finding out whether an element exists in a given set is a common problem in computer science. Generally, we store all elements in a linear structure (array or linked list) or a tree (binary tree, heap, red-black tree, B+/B-/B* tree), and then sort and search on it. The search time complexity is usually O(n) or O(log n). If the set is very large, not only is the query slow, but the memory consumption is also very high. Assuming there are 1 billion elements and each element node occupies n bytes, storing this set requires roughly n GB of memory. You may quickly think of a hash table, whose search time complexity is O(1): it maps and indexes elements, but it does not reduce the memory requirement. A further problem with hash functions is that two different elements may produce the same hash value, so in some cases an exact comparison is still required to resolve the collision.
In fact, to determine whether an element exists in a given set, you do not need to keep the original information of every set element; you only need to record its "existence state", which usually requires just a few bits. A hash function maps an element to a position in a bit array. To reduce the collision rate, multiple hash functions can be used to map each element to multiple positions. In this way, you only need to check whether those few bits are 0 or 1 to decide whether an element is in the set. This is the basic idea of the Bloom filter: it not only greatly reduces the memory footprint, but also supports fast lookups.
A Bloom filter uses a bit array to record the existence state of elements and a group of hash functions (h1, h2, ..., hk) to map each element to bit positions. When an element is inserted, the k hash values of the element are computed and the corresponding bits in the array are set to 1. When an element is looked up, if any of the mapped bits is 0, the element definitely does not exist in the set; if all of the mapped bits are 1, the element may exist in the set. In other words, if the Bloom filter judges that an element is not in the set, it certainly is not; if it judges that an element is in the set, the element may in fact be absent, although the probability of this is low. This behavior is caused by hash collisions and produces the Bloom filter's error rate, which can be adjusted by changing the size of the bit array or the number of hash functions. So the Bloom filter is not perfect, and its efficiency comes at a price: it tolerates a certain error rate in exchange for a huge saving in storage space. In addition, the Bloom filter does not support deleting elements, because deletion would affect the existence state of other elements. Therefore, the Bloom filter is not suitable for "zero error" applications. However, the error is one-sided: only false positives occur, never false negatives, so a judgment that an element is not in the set is always correct. By accepting a controllable error rate, Bloom filters save a great deal of space and provide extremely fast lookups, which is why they are so widely used.
Math Basics
(1) Error Rate Estimation
Assume that kn < m and that the hash functions are completely random, where k is the number of hash functions, n is the number of items, and m is the number of bits in the array. After all n items have been mapped into the m-bit array by the k hash functions, the probability that a particular bit is still 0 is (1 - 1/m)^(kn) ≈ e^(-kn/m), so the false positive rate is approximately
f = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k
Let p = e^(-kn/m). Given m and n, we then have
f = (1 - p)^k = e^(k * ln(1 - p))
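As a quick sanity check of this formula, the short C snippet below evaluates the approximate false positive rate for example values of m, n and k (the concrete numbers are illustrative assumptions, not taken from the text):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double m = 9586.0;   /* bits in the array (example value) */
    double n = 1000.0;   /* number of inserted items (example value) */
    double k = 7.0;      /* number of hash functions (example value) */

    /* probability that a given bit is still 0 after all insertions */
    double p = exp(-k * n / m);

    /* approximate false positive rate */
    double f = pow(1.0 - p, k);

    printf("p = %f, false positive rate f = %f\n", p, f);
    return 0;
}

With roughly 9.6 bits per item and k = 7, the computed rate comes out close to 1%, which matches the sizing example given below.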
(2) Optimal Number of Hash Functions
Let g = k * ln(1 - p); f takes its minimum value when g is minimized. Since ln(p) = ln(e^(-kn/m)) = -kn/m, we have k = -(m/n) * ln(p), and therefore
g = -(m/n) * ln(p) * ln(1 - p)
By symmetry, g reaches its minimum when p = 1/2. Therefore, setting
p = e^(-kn/m) = 1/2, we obtain
k = ln2 * (m/n). At this point the error rate is smallest, namely f = (1/2)^k ≈ (0.6185)^(m/n)
(3) Bit Array Size
Given the allowed error rate ε, reference [4] shows that f does not exceed ε when m >= n * log2(1/ε). We showed above that f is smallest when k = ln2 * (m/n), so with the optimal number of hash functions, keeping the error rate no greater than ε requires
m >= log2(e) * n * log2(1/ε),
which is about 1.44 times the minimum value of m in the general case.
Based on the formulas above, for an error rate of 0.01 the optimal configuration requires m >= 9.567n and k = 7.
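The sizing rule translates directly into code. The minimal sketch below (variable names are mine) computes the required bits per item and the optimal number of hash functions from n and the allowed error rate, reproducing the roughly 9.6 bits per item and k = 7 figures for an error rate of 0.01:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1000.0;     /* expected number of items (example value) */
    double eps = 0.01;     /* allowed false positive rate */

    /* m >= log2(e) * n * log2(1/eps) = n * ln(1/eps) / (ln 2)^2 */
    double m = n * log(1.0 / eps) / (log(2.0) * log(2.0));

    /* optimal number of hash functions: k = ln2 * (m/n) */
    double k = log(2.0) * m / n;

    /* prints about 9.6 bits per item; k rounds to 7 */
    printf("bits per item m/n = %.3f, m = %.0f, k = %.0f\n",
           m / n, ceil(m), ceil(k));
    return 0;
}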
Basic Features
From the analysis of the basic principles and the mathematical background above, we can derive the following basic features of the Bloom filter to guide practical application.
(1) There is a certain error rate, which occurs only in positive judgments (reporting an element as present); negative judgments (reporting an element as absent) are never wrong;
(2) The error rate is controllable; it can be adjusted by changing the bit array size, the number of hash functions, or by choosing hash functions with a lower collision rate;
(3) To keep the error rate low, at least half of the bit array should remain unset (this corresponds to the optimal p = 1/2 derived above);
(4) Given m and n, the optimal number of hash functions can be determined, namely k = ln2 * (m/n), which gives the minimum error rate;
(5) Given the allowed error rate ε, the appropriate bit array size can be determined, namely m >= log2(e) * n * log2(1/ε), and then the number of hash functions k;
(6) The false positive rate cannot be completely eliminated; even without restricting the bit array size or the number of hash functions, a zero error rate cannot be achieved;
(7) Space efficiency is high, but only the "existence state" is saved; the complete element information cannot be stored, so other data structures are required as secondary storage if the elements themselves are needed;
(8) Deletion of elements is not supported, because deletion cannot be performed safely: clearing shared bits may affect other elements.
Advantages and Disadvantages
Compared with other data structures, the biggest advantage of the Bloom filter lies in space efficiency and query time: its insertion and query time are constant, O(k), and each element occupies only a few bits of storage. The hash functions are independent of each other, so they can conveniently be computed in parallel in hardware. A Bloom filter does not need to store the elements themselves, which is an advantage in scenarios with strict confidentiality requirements. In addition, a Bloom filter can represent the complete set of a very large data set, which is difficult to achieve with any other data structure in comparable space.
The disadvantages of the Bloom filter are as obvious as its advantages. The first is the error rate: as more elements are inserted, the error rate increases. Although the error rate can be reduced by enlarging the bit array or increasing the number of hash functions, this also affects space efficiency and lookup performance, and the error rate can never be eliminated entirely, which rules out the Bloom filter wherever "no error" is required. The second is that, in general, elements cannot be deleted from a Bloom filter. On the one hand, we cannot be sure that the element to be deleted actually exists in the filter; on the other hand, deletion cannot be done safely, since it may affect other elements, the underlying reason being hash collisions. The counting Bloom filter supports deletion to a certain extent, but safe deletion is still not easy, it cannot fundamentally solve the problem, and counter overflow is an additional issue. These two aspects are currently the focus of Bloom filter research; a great deal of work has been done, which is why so many Bloom filter variants exist.
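To make the deletion problem concrete, here is a minimal sketch of the counting idea mentioned above: each bit is replaced by a small counter that is incremented on insertion and decremented on deletion. This is a simplified illustration with byte-sized counters and no overflow protection, not any particular counting Bloom filter design from the literature:

#include <stddef.h>

#define CBF_SIZE 1024                       /* number of counters (example value) */

typedef unsigned int (*hashfn_t)(const char *);

typedef struct {
    unsigned char count[CBF_SIZE];          /* one small counter per position */
    hashfn_t funcs[3];                      /* k = 3 hash functions (example) */
} CBF;

void cbf_add(CBF *f, const char *s)
{
    for (size_t i = 0; i < 3; ++i)
        f->count[f->funcs[i](s) % CBF_SIZE]++;      /* may overflow in practice */
}

void cbf_remove(CBF *f, const char *s)
{
    /* only safe if s was really inserted; otherwise other elements are damaged */
    for (size_t i = 0; i < 3; ++i)
        f->count[f->funcs[i](s) % CBF_SIZE]--;
}

int cbf_check(const CBF *f, const char *s)
{
    for (size_t i = 0; i < 3; ++i)
        if (f->count[f->funcs[i](s) % CBF_SIZE] == 0)
            return 0;                               /* definitely absent */
    return 1;                                       /* possibly present */
}

Note that cbf_remove is only safe for elements that were actually inserted; decrementing counters for an element that was never added silently corrupts the state of other elements, which is exactly the difficulty described above.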
Application Principles and Cases
Wherever a list or set would otherwise be used, and space efficiency matters, a Bloom filter is worth considering. In applications, the false positive rate of the Bloom filter must be taken into account: for "zero error" applications, a corresponding auxiliary mechanism (such as a subsequent exact lookup) is needed to eliminate the errors; otherwise the Bloom filter cannot be used for critical operations.
Bloom filters are widely used in many fields, such as spelling checking, string matching algorithms, network packet analysis tools, web caches, file systems, and storage systems. Here we introduce the application of the Bloom filter in data deduplication. The basic principle of mainstream deduplication technology is to split files into fixed-length or variable-length data blocks and compute a fingerprint for each block with a hash function. If two data blocks have the same fingerprint, they are considered duplicates (there is also a collision problem here), so only one copy of the block needs to be stored, and the other identical blocks are represented by an index referring to that copy. This reduces storage space and improves storage efficiency. To decide whether a data block already exists, its fingerprint must be computed and looked up, which requires keeping the fingerprints of all unique data blocks. For example, for 32 TB of data with an average block size of 8 KB, if each block is fingerprinted with both MD5 and SHA-1 and its unique block number is stored as a 64-bit integer, each record occupies 44 bytes ((128 + 160 + 64)/8), and storing all block records requires 176 GB (32 TB / 8 KB * 44 bytes).
The capacity of a deduplication system is usually tens to hundreds of TB, so keeping all block records in memory would require a huge amount of RAM, which is unrealistic for commercial products for cost reasons. Therefore, to balance cost and performance, the usual practice is to keep the block records on disk or SSD and use a limited amount of memory as a fingerprint cache, exploiting temporal and spatial locality to improve lookup performance. A key problem with this approach is that if a new data block is not a duplicate, the lookup will miss in the cache and trigger a large number of disk reads; since disk and SSD are far slower than memory, this hurts lookup performance badly. The Bloom filter effectively solves this problem: the summary vector in Data Domain is implemented with a Bloom filter. In the example above, if the Bloom filter uses three hash functions and allocates 3 bits per data block, it needs only about 1.5 GB of memory (32 TB / 8 KB * 3 bits / 8), which is not a problem even for an ordinary PC. With the Bloom filter in place, a new data block is first checked against the filter. If the filter misses, the block is a new unique block, so it can be stored directly and its record cached. If the filter hits, the block may be a duplicate and must be confirmed by a further hash or tree lookup, which involves the cache and the disk. Thanks to the Bloom filter and the cache, the Data Domain system reduces disk accesses by 99%, greatly improving block lookup performance with a small amount of memory.
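The memory figures in this example can be reproduced with a few lines of arithmetic; the snippet below simply restates the calculation from the text (32 TB of data, 8 KB blocks, 44-byte fingerprint records versus 3 bits per block in the Bloom filter):

#include <stdio.h>

int main(void)
{
    double capacity   = 32.0 * 1024 * 1024 * 1024 * 1024;  /* 32 TB in bytes */
    double block_size = 8.0 * 1024;                         /* 8 KB per block */
    double nblocks    = capacity / block_size;              /* number of blocks */

    /* MD5 (128 bits) + SHA-1 (160 bits) + 64-bit block number = 44 bytes */
    double record_bytes = (128 + 160 + 64) / 8.0;

    double full_index_gb = nblocks * record_bytes / (1024.0 * 1024 * 1024);
    double bloom_gb      = nblocks * 3.0 / 8.0 / (1024.0 * 1024 * 1024);

    printf("full fingerprint index: %.0f GB\n", full_index_gb);  /* ~176 GB */
    printf("Bloom filter (3 bits/block): %.1f GB\n", bloom_gb);  /* ~1.5 GB */
    return 0;
}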
C Language Implementation
The Bloom filter is simple in principle, remarkably useful, and easy to implement. Rather than reinventing the wheel, we reference the C implementation from [5], which amounts to only about a hundred lines of code and also comes with test routines. The complete C program is available at http://en.literateprograms.org/Bloom_filter_%28C%29?oldid=16893
/* bloom.h */
#ifndef __BLOOM_H__
#define __BLOOM_H__

#include <stdlib.h>

typedef unsigned int (*hashfunc_t)(const char *);

typedef struct {
    size_t asize;        /* number of bits in the array */
    unsigned char *a;    /* the bit array */
    size_t nfuncs;       /* number of hash functions */
    hashfunc_t *funcs;   /* the hash functions */
} BLOOM;

BLOOM *bloom_create(size_t size, size_t nfuncs, ...);
int bloom_destroy(BLOOM *bloom);
int bloom_add(BLOOM *bloom, const char *s);
int bloom_check(BLOOM *bloom, const char *s);

#endif

/* bloom.c */
#include <limits.h>
#include <stdarg.h>
#include "bloom.h"

#define SETBIT(a, n) (a[n/CHAR_BIT] |= (1 << (n % CHAR_BIT)))
#define GETBIT(a, n) (a[n/CHAR_BIT] & (1 << (n % CHAR_BIT)))

BLOOM *bloom_create(size_t size, size_t nfuncs, ...)
{
    BLOOM *bloom;
    va_list l;
    size_t n;

    if (!(bloom = malloc(sizeof(BLOOM)))) return NULL;
    if (!(bloom->a = calloc((size + CHAR_BIT - 1) / CHAR_BIT, sizeof(char)))) {
        free(bloom);
        return NULL;
    }
    if (!(bloom->funcs = (hashfunc_t *)malloc(nfuncs * sizeof(hashfunc_t)))) {
        free(bloom->a);
        free(bloom);
        return NULL;
    }

    /* collect the hash functions passed as variadic arguments */
    va_start(l, nfuncs);
    for (n = 0; n < nfuncs; ++n) {
        bloom->funcs[n] = va_arg(l, hashfunc_t);
    }
    va_end(l);

    bloom->nfuncs = nfuncs;
    bloom->asize = size;

    return bloom;
}

int bloom_destroy(BLOOM *bloom)
{
    free(bloom->a);
    free(bloom->funcs);
    free(bloom);
    return 0;
}

int bloom_add(BLOOM *bloom, const char *s)
{
    size_t n;

    /* set the k bits selected by the hash functions */
    for (n = 0; n < bloom->nfuncs; ++n) {
        SETBIT(bloom->a, bloom->funcs[n](s) % bloom->asize);
    }
    return 0;
}

int bloom_check(BLOOM *bloom, const char *s)
{
    size_t n;

    /* if any of the k bits is 0, the element is definitely absent */
    for (n = 0; n < bloom->nfuncs; ++n) {
        if (!(GETBIT(bloom->a, bloom->funcs[n](s) % bloom->asize))) return 0;
    }
    return 1;   /* all bits set: the element is probably present */
}
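The following test routine sketches how the interface above might be used. The two hash functions are simple, commonly known string hashes (djb2 and sdbm) chosen here for illustration; they are not necessarily the ones used in the test code of [5]:

/* test.c -- illustrative usage of the Bloom filter above */
#include <stdio.h>
#include "bloom.h"

/* djb2 string hash */
static unsigned int hash1(const char *s)
{
    unsigned int h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* sdbm string hash */
static unsigned int hash2(const char *s)
{
    unsigned int h = 0;
    while (*s) h = (unsigned char)*s++ + (h << 6) + (h << 16) - h;
    return h;
}

int main(void)
{
    BLOOM *bloom = bloom_create(2500000, 2, hash1, hash2);
    if (!bloom) return 1;

    bloom_add(bloom, "hello");
    bloom_add(bloom, "world");

    /* inserted elements are always reported as (possibly) present */
    printf("hello: %d\n", bloom_check(bloom, "hello"));   /* 1 */
    printf("world: %d\n", bloom_check(bloom, "world"));   /* 1 */

    /* a non-inserted element is usually reported absent,
       but may occasionally yield a false positive */
    printf("bloom: %d\n", bloom_check(bloom, "bloom"));

    bloom_destroy(bloom);
    return 0;
}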
References
[1] http://en.wikipedia.org/wiki/Bloom_filter
[2] http://www.cs.jhu.edu/~fabian/courses/cs600.624/slides/bloomslides.pdf
[3] http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
[4] http://www.partow.net/programming/hashfunctions/#BloomFilters
[5] http://en.literateprograms.org/Bloom_filter_%28C%29?oldid=16893
[6] http://www.datadomain.com/pdf/DataDomain-Avoiding-the-Bottleneck-with-Dedupe.pdf