1. LevelDB Bloom filter storage format
Support for Bloom filters was added in the LevelDB 1.4 release, so that the filter block can be consulted directly during a call to DB::Get(). This reduces the number of random sstable reads issued for keys that do not exist.
The filter block in LevelDB is stored in the meta block section. In the current version the meta block section contains only the Bloom filter; later versions may add more content.
Within the meta block, the Bloom filter data is laid out as follows:
[Filter 0]
[Filter 1]
[Filter 2]
...
[Filter N-1]
[Offset of filter 0]: 4 bytes
[Offset of filter 1]: 4 bytes
[Offset of filter 2]: 4 bytes
...
[Offset of filter N-1]: 4 bytes
[offset of beginning of offset array]: 4 bytes
lg(base): 1 byte
The block ends with a base, stored as its base-2 logarithm lg(base); the default base is 2 KB. Data blocks whose file offset falls in the range [i*base, (i+1)*base) are mapped to filter i, so i can be computed directly from a block's offset. Reading "offset of beginning of offset array" locates the offset array, which gives the offsets of filter i and filter i+1; the bytes between those two offsets are the Bloom filter for that range of data. Table::InternalGet first uses the filter to check whether the key may match. If it does not match, the call returns directly without reading the corresponding data block. The code is in table/table.cc.
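The lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LevelDB's own reader: the names FilterRange, LocateFilter, AppendFixed32LE and DecodeFixed32LE are hypothetical, and the 4-byte fields are assumed to be little-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append a 4-byte little-endian integer (assumed encoding of the 4-byte fields).
void AppendFixed32LE(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Decode a 4-byte little-endian integer.
uint32_t DecodeFixed32LE(const unsigned char* p) {
  return static_cast<uint32_t>(p[0]) |
         (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Hypothetical result type: byte range [start, limit) of one filter.
struct FilterRange {
  bool found;
  uint32_t start;
  uint32_t limit;
};

// Find the filter that covers the data block starting at block_offset.
FilterRange LocateFilter(const std::string& contents, uint64_t block_offset) {
  const FilterRange none = {false, 0, 0};
  const size_t n = contents.size();
  if (n < 5) return none;  // need the array offset (4 bytes) + lg(base) (1 byte)
  const unsigned char* data =
      reinterpret_cast<const unsigned char*>(contents.data());
  const unsigned base_lg = data[n - 1];
  if (base_lg >= 64) return none;  // shift below would be undefined
  const uint32_t array_offset = DecodeFixed32LE(data + n - 5);
  if (array_offset > n - 5) return none;
  const size_t num_filters = (n - 5 - array_offset) / 4;
  const uint64_t index = block_offset >> base_lg;  // i = offset / base
  if (index >= num_filters) return none;
  const unsigned char* entry = data + array_offset + index * 4;
  const uint32_t start = DecodeFixed32LE(entry);
  // For the last filter, the "next" entry is the array-offset field itself,
  // which is exactly where the last filter ends.
  const uint32_t limit = DecodeFixed32LE(entry + 4);
  if (start > limit || limit > array_offset) return none;
  const FilterRange r = {true, start, limit};
  return r;
}
```

With lg(base) = 11, a data block at file offset 2048 maps to filter 1, and a block at offset 2047 still maps to filter 0.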
2. Bloom filter construction algorithm
The concrete construction algorithm of the Bloom filter is in util/bloom.cc.
The filter-creation code in bloom.cc shows that the memory occupied by one Bloom filter is determined by n (the number of keys) and the bits_per_key_ parameter. Across the whole of LevelDB, the memory occupied by Bloom filters is the sum over all open sstables; the number of open sstable files is set by max_open_files, which defaults to 1000. The total Bloom filter memory in LevelDB is therefore determined by the number of keys in all open sstables and by bits_per_key_. For example, if your DB has 1 million keys and you pass the suggested value of 10 bits per key to NewBloomFilterPolicy, the memory usage would be approximately 10 million bits =~ 1.25 MB.
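The arithmetic behind that estimate can be checked with a short sketch. The function name BloomFilterBytes is hypothetical; the 64-bit floor for very small key counts reflects my reading of util/bloom.cc and should be treated as an assumption.

```cpp
#include <cstddef>

// Size in bytes of the bit array for one filter covering n keys.
// Hypothetical helper; the real filter also stores one extra byte for k_.
size_t BloomFilterBytes(size_t n, size_t bits_per_key) {
  size_t bits = n * bits_per_key;
  if (bits < 64) bits = 64;  // assumed floor so tiny filters are not degenerate
  return (bits + 7) / 8;     // round up to whole bytes
}
```

For n = 1,000,000 and bits_per_key = 10 this gives 1,250,000 bytes, which matches the 1.25 MB figure.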
3. Bloom filter hash algorithm
The Bloom filter uses k_ hash functions, where k_ is between 1 and 30 and is computed as bits_per_key_ * ln(2). Rather than evaluating k_ independent functions, the code computes a single BloomHash value and derives the remaining probe positions from it by rotating that value and adding the result repeatedly.
BloomHash is computed in a way similar to MurmurHash; it is implemented by the Hash function in util/hash.cc.
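The probing scheme can be sketched as below. This is an illustration under assumptions rather than a copy of util/bloom.cc: SketchHash, SketchBloom, BuildFilter and KeyMayMatch are hypothetical names, the hash constants and seed are illustrative, and ln(2) is approximated as 0.69.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Murmur-like mixing hash (illustrative constants, not a reference hash).
uint32_t SketchHash(const std::string& key, uint32_t seed) {
  const uint32_t m = 0xc6a4a793;
  const unsigned char* p = reinterpret_cast<const unsigned char*>(key.data());
  size_t n = key.size();
  uint32_t h = seed ^ (static_cast<uint32_t>(n) * m);
  while (n >= 4) {  // consume four bytes at a time
    const uint32_t w = static_cast<uint32_t>(p[0]) |
                       (static_cast<uint32_t>(p[1]) << 8) |
                       (static_cast<uint32_t>(p[2]) << 16) |
                       (static_cast<uint32_t>(p[3]) << 24);
    h += w;
    h *= m;
    h ^= (h >> 16);
    p += 4;
    n -= 4;
  }
  if (n >= 3) h += static_cast<uint32_t>(p[2]) << 16;
  if (n >= 2) h += static_cast<uint32_t>(p[1]) << 8;
  if (n >= 1) {
    h += p[0];
    h *= m;
    h ^= (h >> 24);
  }
  return h;
}

struct SketchBloom {
  std::vector<uint8_t> bits;  // bit array
  size_t k;                   // number of probes per key
};

SketchBloom BuildFilter(const std::vector<std::string>& keys,
                        size_t bits_per_key) {
  SketchBloom f;
  f.k = static_cast<size_t>(bits_per_key * 0.69);  // ~ bits_per_key * ln(2)
  if (f.k < 1) f.k = 1;
  if (f.k > 30) f.k = 30;
  size_t nbits = keys.size() * bits_per_key;
  if (nbits < 64) nbits = 64;
  f.bits.assign((nbits + 7) / 8, 0);
  nbits = f.bits.size() * 8;
  for (const std::string& key : keys) {
    uint32_t h = SketchHash(key, 0xbc9f1d34);
    const uint32_t delta = (h >> 17) | (h << 15);  // rotate right by 17 bits
    for (size_t j = 0; j < f.k; j++) {
      const size_t bitpos = h % nbits;
      f.bits[bitpos / 8] |= static_cast<uint8_t>(1u << (bitpos % 8));
      h += delta;  // next probe position
    }
  }
  return f;
}

bool KeyMayMatch(const SketchBloom& f, const std::string& key) {
  const size_t nbits = f.bits.size() * 8;
  if (nbits == 0) return false;
  uint32_t h = SketchHash(key, 0xbc9f1d34);
  const uint32_t delta = (h >> 17) | (h << 15);
  for (size_t j = 0; j < f.k; j++) {
    const size_t bitpos = h % nbits;
    if ((f.bits[bitpos / 8] & (1u << (bitpos % 8))) == 0) return false;
    h += delta;
  }
  return true;  // possibly a false positive
}
```

A key that was added always reports a possible match; a key that was never added usually reports no match, but false positives are expected at a rate that depends on bits_per_key.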
4. MurmurHash algorithm
MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. It was invented by Austin Appleby in 2008.
MurmurHash2 can produce a 32-bit or 64-bit hash value. MurmurHash is used in several open source projects, including libstdc++, libmemcached, nginx, Hadoop, and more.
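For comparison with BloomHash, here is a sketch of the 32-bit MurmurHash2 variant as I understand the published reference code. The function name MurmurHash2Sketch is hypothetical, input words are read in little-endian order, and the output should be checked against MurmurHash2.cpp (listed in the references) before being relied on.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of 32-bit MurmurHash2; verify against the reference implementation.
uint32_t MurmurHash2Sketch(const void* key, size_t len, uint32_t seed) {
  const uint32_t m = 0x5bd1e995;  // mixing constant
  const int r = 24;               // mixing shift
  const unsigned char* data = static_cast<const unsigned char*>(key);
  uint32_t h = seed ^ static_cast<uint32_t>(len);

  // Mix four bytes at a time into the hash.
  while (len >= 4) {
    uint32_t k = static_cast<uint32_t>(data[0]) |
                 (static_cast<uint32_t>(data[1]) << 8) |
                 (static_cast<uint32_t>(data[2]) << 16) |
                 (static_cast<uint32_t>(data[3]) << 24);
    k *= m;
    k ^= k >> r;
    k *= m;
    h *= m;
    h ^= k;
    data += 4;
    len -= 4;
  }

  // Handle the last one to three bytes.
  if (len >= 3) h ^= static_cast<uint32_t>(data[2]) << 16;
  if (len >= 2) h ^= static_cast<uint32_t>(data[1]) << 8;
  if (len >= 1) {
    h ^= data[0];
    h *= m;
  }

  // Final avalanche so the last bytes affect all output bits.
  h ^= h >> 13;
  h *= m;
  h ^= h >> 15;
  return h;
}
```

The structure is the same as in the Bloom filter hash: consume the input in 4-byte words, fold in the trailing bytes, then mix. The main differences are the constants and the extra per-word mixing step.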
5. References
http://leveldb.googlecode.com/svn/trunk/doc/table_format.txt
https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash2.cpp
http://duanple.blog.163.com/blog/static/7097176720123227403134/
http://zh.wikipedia.org/wiki/Murmur%E5%93%88%E5%B8%8C
https://code.google.com/p/leveldb/source/detail?r=85584d497e7b354853b72f450683d59fcf6b9c5c