Hash Function
Hash functions commonly either take the key modulo a prime number or use a multiplicative scheme that extracts certain bits of the product. The choice directly affects how the hash table is resized. With the modulo-prime method, the table size can only grow from one prime to another; with the multiplicative method, it can only grow to a power of two, 2^p.
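The two approaches above can be sketched as follows. This is a minimal illustration, not code from any particular library: the function names, the multiplier constant, and the "roughly double" growth rule are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Division method: the bucket is the hash modulo a prime table size.
inline std::size_t bucket_by_prime(std::uint64_t hash, std::size_t prime_size) {
  return static_cast<std::size_t>(hash % prime_size);
}

// Multiplicative method: multiply by an odd constant and keep the top p bits,
// so the table has exactly 2^p buckets. Requires 1 <= p <= 63.
inline std::size_t bucket_by_multiply(std::uint64_t hash, unsigned p) {
  // Assumed constant (2^64 divided by the golden ratio); any odd 64-bit
  // multiplier with well-mixed bits would serve for this sketch.
  const std::uint64_t kOddMultiplier = 0x9E3779B97F4A7C15ULL;
  return static_cast<std::size_t>((hash * kOddMultiplier) >> (64 - p));
}

// Trial-division primality test, sufficient for table sizes.
inline bool is_prime(std::size_t n) {
  if (n < 2) return false;
  for (std::size_t d = 2; d * d <= n; ++d) {
    if (n % d == 0) return false;
  }
  return true;
}

// Growth for the division method: the next prime above twice the old size.
inline std::size_t grow_prime(std::size_t prime_size) {
  std::size_t n = 2 * prime_size + 1;
  while (!is_prime(n)) ++n;
  return n;
}

// Growth for the multiplicative method: one more bit, i.e. twice the buckets.
inline unsigned grow_power_of_two(unsigned p) { return p + 1; }
```

With a table of 97 buckets, grow_prime(97) yields 197, while a multiplicative table with p = 4 (16 buckets) grows to p = 5 (32 buckets).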
References [1] and [2] list many common string hash functions, but the following two are the most valuable:
MurmurHash [3] [4]. Reference [3] may not be directly accessible, but the example code in this article contains a concrete implementation. For more details on MurmurHash, see [8].
CityHash. For more information, see [5].
Create a hash table
Hash tables usually resolve collisions by chaining the colliding entries in a linked list. This is the case in C++ tr1. The tr1 implementation uses a policy-based design; for more information, see reference [6]. This reference was found in the header comment while browsing the source code (/usr/include/c++/4.4/tr1_impl/hashtable).
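Collision resolution by chaining can be sketched as below. This is a simplified illustration and not the tr1 source: ChainedMap and its members are hypothetical names, the bucket count is fixed and assumed to be greater than zero, and std::hash stands in for whatever hash function the table is configured with.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical string -> int map using one linked list per bucket.
class ChainedMap {
 public:
  typedef std::pair<std::string, int> Entry;

  explicit ChainedMap(std::size_t bucket_count = 8) : buckets_(bucket_count) {}

  // Insert a new key, or overwrite the value of an existing key.
  void insert(const std::string& key, int value) {
    std::list<Entry>& chain = buckets_[index_of(key)];
    for (Entry& e : chain) {
      if (e.first == key) {
        e.second = value;
        return;
      }
    }
    chain.push_back(Entry(key, value));  // collision: append to the chain
  }

  // Returns a pointer to the stored value, or nullptr if the key is absent.
  const int* find(const std::string& key) const {
    const std::list<Entry>& chain = buckets_[index_of(key)];
    for (const Entry& e : chain) {
      if (e.first == key) return &e.second;
    }
    return nullptr;
  }

 private:
  std::size_t index_of(const std::string& key) const {
    return std::hash<std::string>()(key) % buckets_.size();
  }

  std::vector<std::list<Entry> > buckets_;
};
```

Even with only two buckets, several keys can be stored and retrieved correctly; the keys that land in the same bucket simply share one list, and lookups walk that list.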
Resize Problems
Resize policies include:
Policy 1:
1) Copy all entries into the new table in a single pass.
Policy 2:
1) During the resize, two hash tables are kept. Inserts go only into the new hash table, and on each insert r entries from the old hash table are also moved into the new table. Queries check both hash tables.
2) Once all data has been moved out of the old hash table, delete the old hash table.
For more information about the two policies above, see reference [7].
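Policy 2 can be sketched as follows. To keep the example short, two std::unordered_map objects stand in for the old and new tables; that substitution, the class name IncrementalMap, and the begin_resize trigger are assumptions of this sketch, so it only illustrates the migration protocol, not a real table layout.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical map that migrates r entries from the old table per insert.
class IncrementalMap {
 public:
  typedef std::unordered_map<std::string, int> Table;

  explicit IncrementalMap(std::size_t r = 2) : r_(r) {}

  // Start a resize: the current table becomes the old table.
  void begin_resize(std::size_t expected_size) {
    migrate(old_.size());  // finish any resize still in progress
    old_.swap(new_);       // new_ is now empty, old_ holds the data
    new_.reserve(expected_size);
  }

  // Inserts go only into the new table, then r old entries are moved over.
  void insert(const std::string& key, int value) {
    old_.erase(key);  // drop a stale copy so it is never migrated later
    new_[key] = value;
    migrate(r_);
  }

  // Queries check both tables while a resize is in progress.
  const int* find(const std::string& key) const {
    Table::const_iterator it = new_.find(key);
    if (it != new_.end()) return &it->second;
    it = old_.find(key);
    return it != old_.end() ? &it->second : nullptr;
  }

  bool resizing() const { return !old_.empty(); }

 private:
  // Move up to `count` entries from the old table into the new one.
  void migrate(std::size_t count) {
    if (old_.empty()) return;
    while (count-- > 0 && !old_.empty()) {
      Table::iterator it = old_.begin();
      new_.insert(*it);
      old_.erase(it);
    }
    if (old_.empty()) Table().swap(old_);  // release the old table's storage
  }

  std::size_t r_;
  Table old_;
  Table new_;
};
```

With r = 1 and three entries in the old table, the resize completes after three further inserts, and every key stays findable throughout.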
Bloom Filter
Any discussion of hashing should mention the Bloom filter, which is built on hash functions; a good hash function gives a Bloom filter good performance. One common use is to place a Bloom filter in front of a database. If a key is not in the Bloom filter, the database does not need to be queried at all, because a Bloom filter never produces a false negative: a "not present" answer is always correct, while a "present" answer may occasionally be wrong. CityHash and MurmurHash should both be good choices for implementing a Bloom filter.
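A Bloom filter along these lines might look like the sketch below. The class and method names are hypothetical, and std::hash together with a small FNV-1a routine stand in for CityHash or MurmurHash so that the example needs no external library. The k bit positions are derived from two base hashes by double hashing, which is one common choice, not the only one.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical Bloom filter over strings; bit_count and hash_count must be > 0.
class BloomFilter {
 public:
  BloomFilter(std::size_t bit_count, unsigned hash_count)
      : bits_(bit_count, false), hash_count_(hash_count) {}

  void add(const std::string& key) {
    for (unsigned i = 0; i < hash_count_; ++i) bits_[position(key, i)] = true;
  }

  // false: the key was definitely never added.
  // true:  the key was probably added (false positives are possible).
  bool might_contain(const std::string& key) const {
    for (unsigned i = 0; i < hash_count_; ++i) {
      if (!bits_[position(key, i)]) return false;
    }
    return true;
  }

 private:
  // 64-bit FNV-1a, used here only as a second, independent hash.
  static std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = 0; i < s.size(); ++i) {
      h ^= static_cast<unsigned char>(s[i]);
      h *= 1099511628211ULL;
    }
    return h;
  }

  // Double hashing: the i-th position is (h1 + i * h2) mod bit_count.
  std::size_t position(const std::string& key, unsigned i) const {
    const std::uint64_t h1 = std::hash<std::string>()(key);
    const std::uint64_t h2 = fnv1a(key) | 1;  // keep the step odd
    return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
  }

  std::vector<bool> bits_;
  unsigned hash_count_;
};
```

In the database scenario, the lookup path would call might_contain first and go to the database only when it returns true.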
Consistent Hash
Consistent hashing is mainly used in distributed systems. When a host is added or removed, it does not cause serious jitter, because only the keys belonging to the adjacent host have to be rehashed, so the impact is relatively small. Reference [9] explains this clearly and is worth reading.
Hashtable in C++
The hash table implementation in C++ tr1 is unordered_map. The example in this article and reference [10] give a simple demonstration. I have also read parts of the unordered_map implementation; internally it is built on a hash table. Reference [10] provides further useful material. The performance comparison code for this article is as follows:
#include "basictypes.h"  // project-local header, assumed to define uint8, uint32 and uint64
#include <string>
#include <vector>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>  // strlen
#include <sys/time.h>
#include "cityhash/include/city.h"
#include <tr1/unordered_map>
#include <map>

const uint32 kFingerPrintSeed = 19820125;

// 64-bit hash for 64-bit platforms
uint64 MurmurHash64A(const void* key, int len, uint32 seed) {
  const uint64 m = 0xc6a4a7935bd1e995ULL;
  const int r = 47;

  uint64 h = seed ^ (len * m);

  const uint64* data = (const uint64*)key;
  const uint64* end = data + (len / 8);

  while (data != end) {
    uint64 k = *data++;
    k *= m;
    k ^= k >> r;
    k *= m;
    h ^= k;
    h *= m;
  }

  // Mix in the remaining 0-7 bytes; the cases fall through on purpose.
  const uint8* data2 = (const uint8*)data;
  switch (len & 7) {
    case 7: h ^= static_cast<uint64>(data2[6]) << 48;
    case 6: h ^= static_cast<uint64>(data2[5]) << 40;
    case 5: h ^= static_cast<uint64>(data2[4]) << 32;
    case 4: h ^= static_cast<uint64>(data2[3]) << 24;
    case 3: h ^= static_cast<uint64>(data2[2]) << 16;
    case 2: h ^= static_cast<uint64>(data2[1]) << 8;
    case 1: h ^= static_cast<uint64>(data2[0]);
            h *= m;
  }

  h ^= h >> r;
  h *= m;
  h ^= h >> r;
  return h;
}

// 32-bit hash
uint32 MurmurHash32A(const void* key, int len, uint32 seed) {
  const uint32 m = 0x5bd1e995;
  const int r = 24;

  uint32 h = seed ^ (len * m);

  const uint32* data = (const uint32*)key;
  while (len >= 4) {
    uint32 k = *data;
    k *= m;
    k ^= k >> r;
    k *= m;
    h *= m;
    h ^= k;
    data += 1;
    len -= 4;
  }

  // Handle the last few bytes of the input array; the cases fall through on purpose.
  const uint8* data2 = (const uint8*)data;
  switch (len) {
    case 3: h ^= static_cast<uint32>(data2[2]) << 16;
    case 2: h ^= static_cast<uint32>(data2[1]) << 8;
    case 1: h ^= static_cast<uint32>(data2[0]);
            h *= m;
  }

  // Do a few final mixes of the hash to ensure the last few
  // bytes are well-incorporated.
  h ^= h >> 13;
  h *= m;
  h ^= h >> 15;
  return h;
}

/* A Simple Hash Function */
unsigned int simple_hash(char *str) {
  unsigned int hash;
  unsigned char *p;
  for (hash = 0, p = (unsigned char *)str; *p; p++)
    hash = 31 * hash + *p;
  return (hash & 0x7FFFFFFF);
}

/* RS Hash Function */
unsigned int RS_hash(char *str) {
  unsigned int b = 378551;
  unsigned int a = 63689;
  unsigned int hash = 0;
  while (*str) {
    hash = hash * a + (*str++);
    a *= b;
  }
  return (hash & 0x7FFFFFFF);
}

/* JS Hash Function */
unsigned int JS_hash(char *str) {
  unsigned int hash = 1315423911;
  while (*str) {
    hash ^= ((hash << 5) + (*str++) + (hash >> 2));
  }
  return (hash & 0x7FFFFFFF);
}

/* P. J. Weinberger Hash Function */
unsigned int PJW_hash(char *str) {
  unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
  unsigned int ThreeQuarters = (unsigned int)((BitsInUnsignedInt * 3) / 4);
  unsigned int OneEighth = (unsigned int)(BitsInUnsignedInt / 8);
  unsigned int HighBits = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
  unsigned int hash = 0;
  unsigned int test = 0;
  while (*str) {
    hash = (hash << OneEighth) + (*str++);
    if ((test = hash & HighBits) != 0) {
      hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
    }
  }
  return (hash & 0x7FFFFFFF);
}

/* ELF Hash Function */
unsigned int ELF_hash(char *str) {
  unsigned int hash = 0;
  unsigned int x = 0;
  while (*str) {
    hash = (hash << 4) + (*str++);
    if ((x = hash & 0xF0000000L) != 0) {
      hash ^= (x >> 24);
      hash &= ~x;
    }
  }
  return (hash & 0x7FFFFFFF);
}

/* BKDR Hash Function */
unsigned int BKDR_hash(char *str) {
  unsigned int seed = 131;  // 31 131 1313 13131 131313 etc..
  unsigned int hash = 0;
  while (*str) {
    hash = hash * seed + (*str++);
  }
  return (hash & 0x7FFFFFFF);
}

/* SDBM Hash Function */
unsigned int SDBM_hash(char *str) {
  unsigned int hash = 0;
  while (*str) {
    hash = (*str++) + (hash << 6) + (hash << 16) - hash;
  }
  return (hash & 0x7FFFFFFF);
}

/* DJB Hash Function */
unsigned int DJB_hash(char *str) {
  unsigned int hash = 5381;
  while (*str) {
    hash += (hash << 5) + (*str++);
  }
  return (hash & 0x7FFFFFFF);
}

/* AP Hash Function */
unsigned int AP_hash(char *str) {
  unsigned int hash = 0;
  int i;
  for (i = 0; *str; i++) {
    if ((i & 1) == 0) {
      hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
    } else {
      hash ^= (~((hash << 11) ^ (*str++) ^ (hash >> 5)));
    }
  }
  return (hash & 0x7FFFFFFF);
}

/* CRC Hash Function */
unsigned int CRC_hash(char *str) {
  unsigned int nleft = strlen(str);
  unsigned long long sum = 0;
  unsigned short int *w = (unsigned short int *)str;
  unsigned short int answer = 0;

  /*
   * Our algorithm is simple, using a wide accumulator (sum), we add
   * sequential 16 bit words to it, and at the end, fold back all the
   * carry bits from the top 16 bits into the lower 16 bits.
   */
  while (nleft > 1) {
    sum += *w++;
    nleft -= 2;
  }

  /* mop up an odd byte, if necessary */
  if (1 == nleft) {
    *(unsigned char *)(&answer) = *(unsigned char *)w;
    sum += answer;
  }

  /* add back carry outs from top 16 bits to low 16 bits */
  sum = (sum >> 16) + (sum & 0xFFFF);  /* add hi 16 to low 16 */
  sum += (sum >> 16);                  /* add carry */
  answer = ~sum;                       /* truncate to 16 bits */
  return (answer & 0xFFFFFFFF);
}

// Converts a number to base-36 digits (least significant digit first),
// which is sufficient for generating random test strings.
std::string Itoa(int value) {
  if (value < 0) {
    value *= -1;
  }
  char character[] = "0123456789abcdefghijklmnopqrstuvwxyz";
  // Exclude the terminating NUL so that only the 36 digit characters are used.
  const int kBase = sizeof(character) - 1;
  std::string res = "";
  do {
    res += character[value % kBase];
  } while ((value /= kBase) > 0);
  return res;
}

// Wall-clock time in microseconds. The result is a long long because
// tv_sec * 1000000 does not fit in a 32-bit int.
long long GetTime() {
  timeval tv;
  gettimeofday(&tv, NULL);
  return static_cast<long long>(tv.tv_sec) * 1000000 + tv.tv_usec;
}

class StringHash {
 public:
  uint64 operator()(const std::string& s) const {
    return CityHash64(s.c_str(), s.size());
    // return MurmurHash64A(s.c_str(), s.size(), kFingerPrintSeed) % (unsigned int) 0xFFFFFFFF;
  }
};

class StringEqual {
 public:
  bool operator()(const std::string& left, const std::string& right) const {
    return left == right;
  }
};

int main(int argc, char** argv) {
  const int kDataSize = 1000000;
  std::string content = "";
  std::vector<std::string> data;
  for (int i = 0; i < kDataSize; ++i) {
    content = "";
    for (int j = 0; j < 10; ++j) {
      content += Itoa(rand());
    }
    data.push_back(content);
  }

  // murmur test
  long long start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    MurmurHash64A(data[i].c_str(), data[i].size(), kFingerPrintSeed);
  }
  printf("murmur64: %lld\n", GetTime() - start);

  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    MurmurHash32A(data[i].c_str(), data[i].size(), kFingerPrintSeed);
  }
  printf("murmur32:%lld\n", GetTime() - start);

  // simple hash
  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    simple_hash(const_cast<char*>(data[i].c_str()));
  }
  printf("simple hash:%lld\n", GetTime() - start);

  // bkdr hash
  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    BKDR_hash(const_cast<char*>(data[i].c_str()));
  }
  printf("bkdr hash:%lld\n", GetTime() - start);

  // AP hash
  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    AP_hash(const_cast<char*>(data[i].c_str()));
  }
  printf("AP hash:%lld\n", GetTime() - start);

  // City hash
  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    CityHash64(data[i].c_str(), data[i].size());
  }
  printf("city hash:%lld\n", GetTime() - start);

  // City hash insert
  std::tr1::unordered_map<std::string, int, StringHash, StringEqual> my_map_city;
  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    my_map_city[data[i]] = i;
  }
  printf("city hash insert:%lld\n", GetTime() - start);

  // map insert
  std::map<std::string, int> my_map_tree;
  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    my_map_tree[data[i]] = i;
  }
  printf("tree map insert:%lld\n", GetTime() - start);

  // City hash search
  start = GetTime();
  int value = 0;
  for (int i = 0; i < kDataSize; ++i) {
    value = my_map_city[data[i]];
  }
  printf("city hash search:%lld\n", GetTime() - start);

  // map search
  start = GetTime();
  for (int i = 0; i < kDataSize; ++i) {
    value = my_map_tree[data[i]];
  }
  printf("tree map search:%lld\n", GetTime() - start);
  return 0;
}
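The consistent hashing scheme described earlier in this article can also be sketched briefly. The HashRing class below is a hypothetical illustration: each host gets a single point on the ring (real deployments often use several virtual points per host), and std::hash stands in for CityHash or MurmurHash.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical consistent-hash ring: hosts are placed on a ring by hash,
// and a key belongs to the first host at or after the key's own hash.
class HashRing {
 public:
  void add_host(const std::string& host) { ring_[hash_(host)] = host; }

  void remove_host(const std::string& host) { ring_.erase(hash_(host)); }

  // Returns the owning host, or an empty string if the ring has no hosts.
  std::string host_for(const std::string& key) const {
    if (ring_.empty()) return std::string();
    std::map<std::size_t, std::string>::const_iterator it =
        ring_.lower_bound(hash_(key));
    if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
    return it->second;
  }

 private:
  std::hash<std::string> hash_;
  std::map<std::size_t, std::string> ring_;
};
```

Removing a host changes the owner only for the keys that host owned; they move to the next host on the ring, and every other key keeps its owner. That is the "small impact" property mentioned above.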
References
[1] http://blog.csdn.net/liuben/article/details/5050697
[2] http://www.cnblogs.com/atlantis13579/archive/2010/02/06/1664792.html
[3] http://sites.google.com/site/murmurhash/
[4] http://blog.csdn.net/wisage/article/details/7104866
[5] http://code.google.com/p/cityhash/
[6] http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/index.html
[7] http://en.wikipedia.org/wiki/Hash_table
[8] http://en.wikipedia.org/wiki/MurmurHash
[9] http://hi.baidu.com/fdwm_lx/blog/item/f670e73582c8411d90ef3950.html
[10] http://www.cnblogs.com/Frandy/archive/2011/07/26/Hash_map_Unordered_map.html