Big Data Filtering and Membership-Test Algorithms: Bitmap / Bloom Filter

Source: Internet
Author: User

Today a student asked me some questions about big data, and one kind is representative: deciding whether a URL is in a very large set of URLs, or finding which N numbers were randomly removed from a shuffled run of consecutive integers (say, up to 1 million), and so on. Problems of this type involve a large data set in which each element is small, such as an int. To a large extent they can be solved with a Bitmap or a Bloom filter. The basic idea is to allocate a large block of memory and use the eight bits of each byte to flag elements bit by bit. Because the address space is contiguous, a lookup is O(1). One caveat: when a Bloom filter decides whether an element belongs to a set, it can in theory misjudge. It may report an element as present when it was never added (a false positive), although it never misses an element that was added. If you need 100% correct results, do not use a Bloom filter.
As the name suggests, a Bitmap is a map of bits: a contiguous block of memory in which each bit, 0 or 1, indicates whether an element is absent or present. For example, the number N maps to byte N / 8 of the bitmap and, within that byte, to bit N % 8. Checking that bit tells you whether N is in the set, and the answer is always exact. Here is the code:
[Cpp]
#include <cstdlib>
#include <cstddef>
#include <cstring>
#include <ctime>
#include <iostream>
#include <algorithm>
#include <random>
#include <vector>

#define BYTES 12500  // 100000 bits / 8 bits per byte

int main()
{
    srand((unsigned int)time(NULL));

    size_t total_numbers = 100000;

    typedef std::vector<int> SetContainer;
    typedef std::vector<int>::iterator SetIterator;

    SetContainer numbers;
    numbers.reserve(total_numbers);

    // Pick the two numbers to leave out; keep r2 inside [0, total_numbers)
    int r1 = rand() % (int)(total_numbers - 1000);
    int r2 = r1 + 1000;

    // Generate total_numbers - 2 numbers
    for (int i = 0; i != (int)total_numbers; ++i) {
        if (i != r1 && i != r2)
            numbers.push_back(i);
    }

    std::cout << "[" << numbers.size() << "] insert OK";
    std::cin.get();

    // Shuffle (std::random_shuffle was removed in C++17)
    std::mt19937 rng(rand());
    std::shuffle(numbers.begin(), numbers.end(), rng);

    unsigned char *bitmap = (unsigned char *)malloc(BYTES);
    if (bitmap == NULL)
        return 1;
    memset(bitmap, 0, BYTES);
    for (SetIterator itr = numbers.begin(); itr != numbers.end(); ++itr) {
        ptrdiff_t forward = (*itr) / 8;  // byte index
        size_t offset = (*itr) % 8;      // bit index inside that byte
        bitmap[forward] |= (0x80UL >> offset);
    }

    std::cout << "Bitmap build OK";
    std::cin.get();

    // Any byte that is not 0xFF holds at least one missing number
    for (int j = 0; j != BYTES; ++j) {
        if (bitmap[j] != 0xFF) {
            std::cout << "FIND ";
            unsigned long num = j * 8;
            unsigned char check = bitmap[j];
            unsigned char bit = 0;
            while (bit != 8) {
                if (0 == (check & (0x80UL >> bit)))
                    std::cout << "[" << (num + bit) << "]";
                bit++;
            }
            std::cout << std::endl;
        }
    }

    std::cout << "DONE";

    std::cin.get();

    free(bitmap);

    return 0;
}
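
The byte and bit arithmetic used above can be pulled out into small helpers. The following is only a minimal sketch, not part of the original program: the type name `BitSet8` and its members `set` and `test` are hypothetical, and it swaps `malloc` for a `std::vector<unsigned char>` to keep the example short. It keeps the same layout as the code above (byte N / 8, mask 0x80 >> (N % 8)).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a fixed-size bitmap over the integers [0, bits).
struct BitSet8 {
    std::size_t bits;
    std::vector<unsigned char> bytes;

    // Round up so that every one of the n bits has a byte to live in.
    explicit BitSet8(std::size_t n) : bits(n), bytes((n + 7) / 8, 0) {}

    // Mark n as present: byte n / 8, bit n % 8 counted from the high bit.
    void set(std::size_t n) {
        assert(n < bits);
        bytes[n / 8] |= static_cast<unsigned char>(0x80u >> (n % 8));
    }

    // Return true if n was marked. The answer is exact.
    bool test(std::size_t n) const {
        assert(n < bits);
        return (bytes[n / 8] & (0x80u >> (n % 8))) != 0;
    }
};
```

For example, after `BitSet8 b(100000); b.set(42);`, `b.test(42)` is true and `b.test(43)` is false, and the bitmap occupies 12500 bytes, the same as BYTES above.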

The Bloom filter is a bit-vector data structure proposed by Burton Howard Bloom in 1970. It handles a wider range of data than a Bitmap, such as strings that cannot serve directly as a bit index:
1. Initialize a large block of memory to hold the 0/1 flags.
2. Hash each value with N hash functions (N = 3 here) and set the resulting bits in the Bloom filter, just as with a Bitmap.
3. To test a value, apply the same N hash functions to find its mapped bits; the value is reported as contained only if every mapped bit is 1.

As mentioned above, a Bloom filter can misjudge. The probability of a misjudgment depends on the number of hash functions, the collision probability of those hash functions, and the amount of memory allocated to the Bloom filter. The number of hash functions needs a suitable value: too many cost time on every operation and fill the bit array faster, while too few leave each element with too few bits to tell it apart from the others, so either extreme raises the false-positive rate. In theory the value lies between 5 and 10; real projects commonly use 3 to 5, depending on your needs.
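
The trade-off described above is usually quantified with the standard approximation p ≈ (1 - e^(-k·n/m))^k, where m is the number of bits, n the number of inserted elements, and k the number of hash functions. It assumes the hash functions behave like independent, uniformly random functions, which real hashes only approximate. The sketch below (the function names are hypothetical) evaluates that formula and the k that minimizes it, k = (m/n)·ln 2.

```cpp
#include <cassert>
#include <cmath>

// Approximate false-positive probability of a Bloom filter with
// m bits, n inserted elements and k hash functions (idealized hashes).
double bloom_false_positive(double m, double n, double k) {
    assert(m > 0 && n >= 0 && k > 0);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Number of hash functions that minimizes the approximation above.
double bloom_optimal_k(double m, double n) {
    assert(m > 0 && n > 0);
    return m / n * std::log(2.0);
}
```

At 10 bits per element, `bloom_optimal_k(10, 1)` is close to 7, which falls inside the theoretical range of 5 to 10 quoted above. With k = 7 the estimated false-positive rate is about 0.8%, and with k = 3 it is about 1.7%.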

The Bloom filter code:

[Cpp]
#include <cstdlib>
#include <cstdio>
#include <cstddef>
#include <new>

#define BLOOM (1024UL * 1024UL * 1024UL)  // number of bits: 1G bits = 128 MB
#define HASH_RESULT 3

typedef unsigned char BloomFilter;

/* BKDR hash, named after Brian Kernighan & Dennis Ritchie;
   Java's String.hashCode uses the same scheme with multiplier 31 */
size_t BKDR_hash(const char *str)
{
    size_t hash = 0;
    while (size_t ch = (unsigned char)*str++) {
        hash = hash * 131 + ch;
    }
    return hash;
}

/* FNV (Fowler-Noll-Vo) hash, also used in Microsoft's hash_map */
size_t FNV_hash(const char *str)
{
    if (!*str)
        return 0;
    size_t hash = 2166136261UL;
    while (size_t ch = (unsigned char)*str++) {
        hash *= 16777619;
        hash ^= ch;
    }
    return hash;
}

/* Donald Knuth hash function, presented in The Art of Computer Programming */
size_t DEK_hash(const char *str)
{
    if (!*str)
        return 0;
    size_t hash = 1315423911;
    while (size_t ch = (unsigned char)*str++) {
        hash = ((hash << 5) ^ (hash >> 27)) ^ ch;
    }
    return hash;
}

typedef size_t (*HASH_FUNC)(const char *);

HASH_FUNC HASH[] = {
    BKDR_hash, FNV_hash, DEK_hash
};

void bloom_filter_mark(BloomFilter *bf, const char *v)
{
    for (int i = 0; i != HASH_RESULT; ++i) {
        size_t pos = HASH[i](v) % BLOOM;  // bit position from the i-th hash
        // Set the binary bit to 1
        bf[pos / 8] |= 0x80UL >> (pos % 8);
    }
}

bool bloom_filter_check(BloomFilter *bf, const char *v)
{
    size_t in = HASH_RESULT;
    for (int i = 0; i != HASH_RESULT; ++i) {
        size_t pos = HASH[i](v) % BLOOM;
        // Check whether this bit is 1
        if (bf[pos / 8] & (0x80UL >> (pos % 8)))
            in--;
    }
    // Report the value as present only if every mapped bit was set
    return in == 0;
}

int main()
{
    // BLOOM bits need BLOOM / 8 bytes; the trailing () zero-initializes them
    BloomFilter *bloom = new (std::nothrow) BloomFilter[BLOOM / 8]();
    if (NULL == bloom) {
        printf("No Space to build BloomFilter\n");
        exit(1);
    }

    printf("BloomFilter Calloc Memory OK\n");

    for (int i = 0; i != 1000000; i++) {
        char buf[16] = {0};
        snprintf(buf, sizeof(buf), "%d", i);
        bloom_filter_mark(bloom, buf);
    }
    printf("BloomFilter Build OK\n");

    // 999995..999999 were added; 1000000..1000009 were not
    for (int i = 999995; i != 1000010; i++) {
        char buf[16] = {0};
        snprintf(buf, sizeof(buf), "%d", i);
        if (bloom_filter_check(bloom, buf))
            printf("[FOUND] %d\n", i);
    }

    delete[] bloom;

    return 0;
}
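
To watch the misjudgment behaviour on a small scale, the same mark and check logic can be wrapped in a class and pointed at a much smaller bit array. This is a sketch under stated assumptions, not the article's implementation: `SmallBloom`, `maybe_contains`, and `count_false_positives` are hypothetical names, and instead of the three hash functions above it uses BKDR-style hashes that differ only in their multiplier (31, 131, 1313, 13131).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical compact Bloom filter with k BKDR-style hashes (k <= 4).
class SmallBloom {
public:
    SmallBloom(std::size_t bits, std::size_t k)
        : bits_(bits), k_(k), bytes_((bits + 7) / 8, 0) {
        assert(bits > 0 && k > 0 && k <= 4);
    }

    // Set the k bits that the key maps to.
    void add(const std::string &s) {
        for (std::size_t i = 0; i < k_; ++i) {
            std::size_t pos = hash(s, i) % bits_;
            bytes_[pos / 8] |= static_cast<unsigned char>(0x80u >> (pos % 8));
        }
    }

    // false: the key was never added; true: probably added (may be a false positive).
    bool maybe_contains(const std::string &s) const {
        for (std::size_t i = 0; i < k_; ++i) {
            std::size_t pos = hash(s, i) % bits_;
            if ((bytes_[pos / 8] & (0x80u >> (pos % 8))) == 0)
                return false;
        }
        return true;
    }

private:
    // BKDR-style polynomial hash; the multiplier is chosen by index i.
    static std::uint64_t hash(const std::string &s, std::size_t i) {
        static const std::uint64_t seeds[4] = {31, 131, 1313, 13131};
        std::uint64_t h = 0;
        for (unsigned char ch : s)
            h = h * seeds[i] + ch;
        return h;
    }

    std::size_t bits_, k_;
    std::vector<unsigned char> bytes_;
};

// Count how many of the decimal keys in [from, to) are reported as present.
std::size_t count_false_positives(const SmallBloom &bf, int from, int to) {
    std::size_t n = 0;
    for (int i = from; i < to; ++i)
        if (bf.maybe_contains(std::to_string(i)))
            ++n;
    return n;
}
```

After adding the strings "0" through "999" to a `SmallBloom(1 << 16, 3)`, every added key is still found, while `count_false_positives(bf, 1000, 2000)` reports how many keys that were never added are misjudged as present. Shrinking the bit count tends to make that number grow.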
