I. The problem the algorithm solves
Consistent hashing, proposed at MIT in 1997, was designed to address hot-spot problems on the Internet, with an intent similar to CARP (Cache Array Routing Protocol). Consistent hashing corrects the remapping problem caused by the simple hashing scheme used by CARP, so that distributed hash tables (DHTs) can genuinely be applied in peer-to-peer environments.
The consistent hashing proposal defines four properties for judging how well a hash algorithm behaves in a dynamically changing cache environment:
1. Balance: the results of the hash should be distributed across all buffers as evenly as possible, so that all buffer space is used. Many hash algorithms satisfy this condition.
2. Monotonicity: if some content has already been mapped to a buffer by the hash, and new buffers are then added to the system, the hash should guarantee that the content is mapped either to its original buffer or to one of the new buffers, never to a different buffer from the old buffer set.
3. Spread: in a distributed environment, a terminal may not see all of the buffers, only part of them. When a terminal maps content to a buffer via hashing, different terminals may see different buffer ranges, so the same content can be mapped to different buffers by different terminals. This should be avoided, because it causes the same content to be stored in several buffers and reduces the storage efficiency of the system. Spread is defined as the severity of this situation; a good hash algorithm should keep such inconsistency, and therefore the spread, as low as possible.
4. Load: the load problem is the spread problem viewed from the other side. Since different terminals may map the same content to different buffers, a particular buffer may likewise be mapped to different content by different users. As with spread, this situation should be avoided, so a good hash algorithm should keep the load on each buffer as low as possible.
Adding and removing machines, or having a machine leave the cluster automatically after a failure, is the most basic function of distributed cluster management. If the commonly used hash(object) % n algorithm is applied, then after a machine is added or removed, much of the original data can no longer be found, which seriously violates the monotonicity principle. The rest of this article explains how the consistent hashing algorithm is designed to avoid this.
II. Design method
The data is mapped into a large space, a ring, using a hash function (such as MD5 or MurmurHash). When a piece of data is stored, its hash value determines a position on the ring; for example, K1 corresponds to the position shown in the figure. The first machine node found clockwise from that position, node B here, is where K1 is stored.
If node B goes down, the data on B falls to node C, as shown in the figure:
In this way only node C is affected; the data on nodes A and D is untouched. However, this can create an "avalanche": because node C now also bears node B's data, C's load rises, C in turn becomes likely to go down, and so on until the entire cluster is down.
To address this, the concept of a "virtual node" is introduced: the ring holds many virtual nodes, data is stored at the first virtual node found clockwise on the ring, and each virtual node is associated with a real node, as shown in the figure:
In the figure, A1, A2, B1, B2, C1, C2, D1, D2 are virtual nodes: machine A stores the data of A1 and A2, machine B stores B1 and B2, and machine C stores C1 and C2. Because the virtual nodes are numerous and evenly distributed over the ring, a failed machine's keys are spread across the remaining machines rather than dumped onto a single successor, so the "avalanche" phenomenon does not occur.
III. Code implementation
MurmurHash is a non-cryptographic hash algorithm with very high performance. It is much faster than traditional CRC32, MD5, or SHA-1 (the latter two are cryptographic hash algorithms whose inherent complexity inevitably costs performance), and its collision rate is very low. For these reasons, the MurmurHash algorithm is used here.
Header file consistent_hash.h:
    #pragma once

    #include <map>
    #include <stddef.h>
    #include <stdint.h>

    class ConsistentHash
    {
    public:
        ConsistentHash(int node_num, int virtual_node_num);
        ~ConsistentHash();

        void Initialize();
        size_t GetServerIndex(const char* key);
        void DeleteNode(const int index);
        void AddNewNode(const int index);

    private:
        // Virtual nodes: key is a hash value, value is the machine's index
        std::map<uint32_t, size_t> server_nodes_;
        int node_num_;          // number of real machine nodes
        int virtual_node_num_;  // number of virtual nodes associated with each machine node
    };
Implementation file consistent_hash.cpp:
    #include <map>
    #include <string.h>
    #include <sstream>
    #include "consistent_hash.h"
    #include "murmurhash3.h"
    using namespace std;

    ConsistentHash::ConsistentHash(int node_num, int virtual_node_num)
    {
        node_num_ = node_num;
        virtual_node_num_ = virtual_node_num;
    }

    ConsistentHash::~ConsistentHash()
    {
        server_nodes_.clear();
    }

    void ConsistentHash::Initialize()
    {
        for (int i = 0; i < node_num_; ++i)
        {
            for (int j = 0; j < virtual_node_num_; ++j)
            {
                stringstream node_key;
                node_key << "shard-" << i << "-node-" << j;
                uint32_t partition = murmur3_32(node_key.str().c_str(), strlen(node_key.str().c_str()));
                server_nodes_.insert(pair<uint32_t, size_t>(partition, i));
            }
        }
    }

    size_t ConsistentHash::GetServerIndex(const char* key)
    {
        uint32_t partition = murmur3_32(key, strlen(key));
        // The first virtual node with hash >= key is found clockwise along the ring
        map<uint32_t, size_t>::iterator it = server_nodes_.lower_bound(partition);
        if (it == server_nodes_.end())  // not found: wrap around the ring
        {
            return server_nodes_.begin()->second;
        }
        return it->second;
    }

    void ConsistentHash::DeleteNode(const int index)
    {
        for (int j = 0; j < virtual_node_num_; ++j)
        {
            stringstream node_key;
            node_key << "shard-" << index << "-node-" << j;
            uint32_t partition = murmur3_32(node_key.str().c_str(), strlen(node_key.str().c_str()));
            map<uint32_t, size_t>::iterator it = server_nodes_.find(partition);
            if (it != server_nodes_.end())
            {
                server_nodes_.erase(it);
            }
        }
    }

    void ConsistentHash::AddNewNode(const int index)
    {
        for (int j = 0; j < virtual_node_num_; ++j)
        {
            stringstream node_key;
            node_key << "shard-" << index << "-node-" << j;
            uint32_t partition = murmur3_32(node_key.str().c_str(), strlen(node_key.str().c_str()));
            server_nodes_.insert(pair<uint32_t, size_t>(partition, index));
        }
    }
Full source code: https://github.com/zhihaibang/benson-style/tree/master/C++/consistent_hash
IV. Test results
Suppose there are 10,000 samples with 10 distinct key values (0-9).
The consistent hash has 5 real nodes (indices 0-4), and each node is associated with 100 virtual nodes.
The test results are as follows:
    consistent hash initialize success, node_num=5, virtual_num=100
    key=4, index=3
    key=7, index=3
    key=0, index=3
    key=1, index=3
    key=6, index=3
    key=3, index=3
    key=5, index=0
    key=2, index=4
    key=8, index=2
    key=9, index=0
    node error, index=3
    key=3, index=4
    key=7, index=2
    key=4, index=1
    key=0, index=2
    key=1, index=1
    key=6, index=4
    node recover, index=3
    key=4, index=3
    key=3, index=3
    key=0, index=3
    key=1, index=3
    key=7, index=3
    key=6, index=3
    index=0, data_count=5985
    index=1, data_count=1991
    index=2, data_count=5041
    index=3, data_count=12014
    index=4, data_count=4969
The test ran 3 times, with 10,000 samples each time.
In the second test, node 3 failed, so the data originally stored on node 3 (keys 0, 1, 3, 4, 6, 7) was redistributed to the other nodes, but not all to a single node, which prevents an avalanche.
In the third test, node 3 recovered, and the data originally on node 3 was mapped back to node 3.
Consistent hashing algorithm and C++ implementation