Foreword: Consistent hashing is not an obscure concept. This article is only a collation of the existing theory together with my own understanding. I hope it is of some help to newcomers; if it also benefits me, that is enough.
1. Concept
Consistent hashing is a special kind of hashing algorithm. With consistent hashing, changing the number of slots (the size) of the hash table requires remapping only K/n keys on average, where K is the number of keys and n is the number of slots. In a traditional hash table, by contrast, adding or removing a slot requires remapping almost all keys.
Consistent hashing maps each object to a point on the edge of a ring, and the system also maps the available node machines to positions on the same ring. To find the machine for an object, compute the object's position on the ring with the consistent hashing algorithm, then walk along the ring until a node machine is encountered; that machine is where the object should be stored. When a node machine is removed, all objects stored on it move to the next machine on the ring. When a machine is added at some point on the ring, the next machine after that point must hand over to the new machine the objects that now map to it. The distribution of objects across node machines can be changed by adjusting the positions of the node machines.
2. Application Scenarios
When using N cache servers, a common load-balancing approach is to use hash(object) % N to map a request for a resource object to a cache server. This has two disadvantages:
1. When a server (server A) goes down, the cached resources on server A are lost and the mapping formula becomes hash(object) % (N-1). Most cached entries are then invalidated, which causes the cache servers to send a concentrated burst of refresh requests to the origin content server.
2. When a cache server is added or removed, the mapping formula becomes hash(object) % (N+1) or hash(object) % (N-1), which may change the server that almost every resource maps to. In effect nearly all caches are invalidated, which again causes the cache servers to send a concentrated burst of refresh requests to the origin content server.
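The effect of changing N in hash(object) % N can be estimated with a short sketch. The helper names `key_hash` and `moved_fraction`, the object names, and the choice of MD5 truncated to 32 bits are assumptions made for illustration only.

```python
import hashlib

def key_hash(name):
    """Hypothetical helper: first 4 bytes of MD5 as an unsigned 32-bit integer."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def moved_fraction(keys, n_before, n_after):
    """Fraction of keys whose hash % N target changes when N changes."""
    moved = sum(1 for k in keys if key_hash(k) % n_before != key_hash(k) % n_after)
    return moved / len(keys)

# Illustrative object names; any large set of distinct keys would do.
keys = [f"object{i}" for i in range(10000)]
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys map to a different server after going from 4 to 5 servers")
```

With a uniform hash, a key keeps its server only when its hash leaves the same remainder modulo 4 and modulo 5, which is roughly one key in five, so about 80% of the keys should move.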
For this reason, a consistent hashing algorithm is needed to avoid such problems. Consistent hashing maps the same resource to the same cache server as far as possible. When a cache server is added, the new server takes over a share of the cached resources from the other servers; when a cache server is removed, the remaining servers take over the resources it held. The main idea is to associate each cache server with one or more hash intervals, where the interval boundaries are determined by computing a hash value for each cache server. (The hash function that defines the intervals need not be the same as the one that computes the cache server hash values, but the ranges of the two functions' return values must match.) If a cache server is removed, its interval is merged into a neighboring interval, and no other cache server needs to change.
3. Brief Analysis of the principle
3.1 Ring Hash Space
Consider a typical hash algorithm that maps a value to a 32-bit key, that is, a value space of 0 to 2^32-1. We can picture this space as a ring whose head (0) is joined to its tail (2^32-1), as shown in Figure 1:
Figure 1 Ring Hash space
3.2 Mapping objects to ring hash space
Consider the following 4 resource objects: Object1, Object2, Object3 and Object4. The hash values (keys) computed by the hash function are distributed on the ring as shown in Figure 2:
Figure 2 The key value distribution of the resource object
So:
Key1 = Hash(Object1);
......
Key4 = Hash(Object4);
3.3 Mapping the resource server to the ring hash space
The basic idea of consistent hashing is to map both the resource objects and the cache servers into the same hash value space, using the same hash algorithm. To hash a cache server, a common approach is to use the machine's IP address or machine name as the hash input.
Assume there are currently 3 cache servers, A, B and C. The mapping results are shown in Figure 3; the servers are arranged in the hash space according to their hash values.
Figure 3 Key value distributions for server and resource objects
So:
KeyA = Hash(CacheA);
......
KeyC = Hash(CacheC);
3.4 Mapping resource objects to a cache server
Now that the servers (caches) and the resource objects have been mapped into the same hash value space by the same hash algorithm, the next thing to consider is how to map objects to servers.
In this ring space, start from the object's key and move clockwise until you meet a server (cache); the object is stored on that cache server. Because the hash values of the objects and servers are fixed, the server found this way is unique and deterministic. This is the mapping method between objects and servers.
Continuing the example above (see Figure 3), this method stores Object1 on cache A, Object2 and Object3 on cache C, and Object4 on cache B.
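The clockwise lookup can be sketched as a binary search over the sorted server positions. The ring positions below are hypothetical small integers, chosen only so that they reproduce the mapping just described; they are not real hash values, and `find_server` is an illustrative name.

```python
import bisect

# Hypothetical ring positions (not real hash values), chosen to match the text:
# Object1 -> A, Object2 and Object3 -> C, Object4 -> B.
SERVERS = {20: "A", 40: "B", 70: "C"}
OBJECTS = {"Object1": 10, "Object4": 30, "Object2": 50, "Object3": 60}

def find_server(key, servers):
    """Return the first server at or after `key`, moving clockwise on the ring."""
    positions = sorted(servers)
    i = bisect.bisect_left(positions, key)
    if i == len(positions):
        i = 0  # walked past the last server: wrap around to the start of the ring
    return servers[positions[i]]

for name in ("Object1", "Object2", "Object3", "Object4"):
    print(name, "->", "cache", find_server(OBJECTS[name], SERVERS))
```

A key that lies beyond the last server position, for example 80 in this sketch, wraps around and lands on cache A, which is what closes the ring.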
3.5 Adding or removing servers
Based on the mapping method above (searching clockwise), it is easy to see how the object mapping changes when a server is added or removed, as shown in Figures 4 and 5:
Figure 4 Mapping changes after cache server B is removed
Figure 5 Mapping changes after adding cache server D
As stated earlier, a change in the nodes of a consistent hash requires remapping about K/n keys, where n is the number of cache servers and K is the number of resource objects. Here 4/3 and 4/4 both round to 1, which matches the results above.
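The claim that only one object moves in each case can be checked with a small sketch. The ring positions are hypothetical integers rather than real hash values, and the position chosen for the new server D (between Object2 and Object3) is an assumption made for illustration.

```python
import bisect

def assign(objects, servers):
    """Map each object to the first server clockwise from its ring position."""
    positions = sorted(servers)
    result = {}
    for name, key in objects.items():
        # The modulo wraps an index past the last server back to the first one.
        i = bisect.bisect_left(positions, key) % len(positions)
        result[name] = servers[positions[i]]
    return result

def moved(before, after):
    """Names of the objects whose server differs between two assignments."""
    return sorted(name for name in before if before[name] != after[name])

# Hypothetical positions, not real hash values.
objects = {"Object1": 10, "Object4": 30, "Object2": 50, "Object3": 60}
servers = {20: "A", 40: "B", 70: "C"}

base = assign(objects, servers)
without_b = assign(objects, {20: "A", 70: "C"})   # cache B removed
with_d = assign(objects, {**servers, 55: "D"})    # cache D added (assumed position)

print(moved(base, without_b))  # ['Object4']
print(moved(base, with_d))     # ['Object2']
```

In both cases the other three objects keep their server, whereas with hash(object) % N nearly every object would have been remapped.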
3.6 The use of virtual nodes to ensure the balance
In a distributed cache, we want resource objects to be spread across the cache servers as evenly as possible; this is the balance requirement. However, consistent hashing does not guarantee absolute balance. For example, when there are only a few servers (caches), resource objects may not be mapped to them uniformly. In the example above, if only cache A and cache C are deployed, then of the 4 resource objects cache A stores only Object1, while cache C stores Object2, Object3 and Object4. The distribution is clearly very unbalanced.
To address this, consistent hashing introduces the concept of virtual nodes.
A virtual node is a replica of an actual server node in the hash space. One real node corresponds to several "virtual nodes"; that number is called the "replica count", and the virtual nodes are arranged in the hash space by their hash values.
Take the case of deploying only cache A and cache C: we saw in Figure 4 that the objects are unevenly distributed across the cache nodes. Now introduce virtual nodes and set the replica count to 2, which gives 4 virtual nodes: cache A1 and cache A2 represent cache A; cache C1 and cache C2 represent cache C. Suppose a fairly ideal case, as shown in Figure 6:
Figure 6 Mapping relationships after the introduction of virtual nodes
At this point, the mapping from resource objects to virtual nodes is: Object1->cache A2; Object2->cache A1; Object3->cache C1; Object4->cache C2. Object1 and Object2 are therefore mapped to cache A, while Object3 and Object4 are mapped to cache C, and the balance is greatly improved.
After virtual nodes are introduced, the mapping changes from {object -> node} to {object -> virtual node}. The mapping used when looking up the cache for an object is shown in Figure 7:
Figure 7 Querying the calculation process for a resource object
The hash value of a virtual node can be computed from the IP address of the corresponding real node plus a numeric suffix. For example, assume that the IP address of cache A is 202.168.14.241.
Before introducing virtual nodes, the hash value of cache A is computed as: Hash("202.168.14.241");
After introducing virtual nodes, the hash value of virtual node cache A1 is computed as: Hash("202.168.14.241#1");
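A minimal sketch of virtual nodes, assuming MD5 truncated to 32 bits as the ring hash and the "IP#suffix" naming shown above. Only the first IP address comes from the text; the other two addresses, the replica count of 200, the object names and the function names are assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(text):
    """First 4 bytes of the MD5 digest as a ring position in 0 .. 2^32-1."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")

def build_ring(servers, replicas):
    """Place `replicas` virtual nodes per server; return sorted positions and their owners."""
    nodes = sorted(
        (ring_hash(f"{ip}#{i}"), ip)          # e.g. Hash("202.168.14.241#1")
        for ip in servers
        for i in range(1, replicas + 1)
    )
    positions = [pos for pos, _ in nodes]
    owners = [ip for _, ip in nodes]
    return positions, owners

def lookup(positions, owners, key):
    """Object -> virtual node (clockwise successor) -> real server."""
    i = bisect.bisect_left(positions, ring_hash(key)) % len(positions)
    return owners[i]

# The second and third addresses are hypothetical.
servers = ["202.168.14.241", "202.168.14.242", "202.168.14.243"]
positions, owners = build_ring(servers, replicas=200)
load = Counter(lookup(positions, owners, f"object{i}") for i in range(10000))
for ip in servers:
    print(ip, load[ip])
```

With many virtual nodes per server, each real server should receive roughly a third of the objects; rerunning with `replicas=1` typically shows a much more uneven split.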
The following URL gives the details: http://www.codeproject.com/Articles/56138/Consistent-hashing
4. C++ Implementation
A consistent hash implementation must also settle two questions: the choice of data structure for node storage and lookup, and the choice of hash algorithm.
The nodes are spread around the ring, so their hash values can be stored in an ordered sequence. The data structure should also support frequent insertion and deletion of nodes efficiently and offer good query performance, which naturally suggests a red-black tree. A red-black tree is an approximately balanced binary search tree. Operations such as insert, delete and find take worst-case time proportional to the height of the tree, and the theoretical upper bound on that height keeps the red-black tree efficient in the worst case, unlike an ordinary binary search tree. We therefore choose a red-black tree for node storage and lookup, implement its insert, delete and find functions, and add a lookup function that finds the first node whose key is not smaller than a given key (the clockwise successor on the ring).
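The operations required of the node store can be sketched as follows. This is not a red-black tree: Python has no built-in one, so the sketch substitutes a sorted list with binary search, which gives O(log n) lookup but O(n) insertion and deletion. The class name `SortedRing` and its method names are hypothetical.

```python
import bisect

class SortedRing:
    """Sorted list of node hash values; a stand-in for the red-black tree."""

    def __init__(self):
        self._keys = []    # node hash values, kept sorted
        self._nodes = {}   # hash value -> node name

    def insert(self, key, node):
        if key not in self._nodes:
            bisect.insort(self._keys, key)
        self._nodes[key] = node

    def delete(self, key):
        if key in self._nodes:
            del self._nodes[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def successor(self, key):
        """Node with the smallest hash >= key, wrapping around to the first node."""
        if not self._keys:
            return None
        i = bisect.bisect_left(self._keys, key) % len(self._keys)
        return self._nodes[self._keys[i]]

ring = SortedRing()
for position, name in ((20, "A"), (40, "B"), (70, "C")):  # hypothetical positions
    ring.insert(position, name)
print(ring.successor(30))  # B
ring.delete(40)
print(ring.successor(30))  # C
```

In C++, an ordered associative container such as `std::map` with `lower_bound` offers the same three operations in logarithmic time, which is the role the red-black tree plays here.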
As for the hash algorithm: the main value of consistent hashing lies in load balancing, and virtual nodes are the key to that balance, so we want the virtual nodes to be spread evenly around the ring. Here we choose MD5. The MD5 algorithm transforms an identifier string (used to label a virtual node) into a 16-byte array, which is then processed to obtain an integer hash value. Because MD5 output is highly dispersed, the resulting hash values are well spread out and fall roughly uniformly on the "ring".
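The conversion from an identifier string to an integer on the ring might look like the sketch below. Which bytes of the 16-byte digest are used, and in which byte order, is an assumption here; the linked C++ code may process the array differently. The function name `md5_to_ring` is hypothetical.

```python
import hashlib
from collections import Counter

RING_SIZE = 2 ** 32

def md5_to_ring(identifier):
    """Hash an identifier string to an integer in 0 .. 2^32-1."""
    digest = hashlib.md5(identifier.encode("utf-8")).digest()  # 16-byte array
    # Assumption: take the first 4 bytes, big-endian, as the 32-bit hash value.
    return int.from_bytes(digest[:4], "big")

# Rough dispersion check: count virtual-node identifiers per quarter of the ring.
ids = [f"202.168.14.241#{i}" for i in range(1, 4001)]
quadrants = Counter(md5_to_ring(s) * 4 // RING_SIZE for s in ids)
print(sorted(quadrants.items()))
```

Each quarter of the ring should hold close to a quarter of the 4000 identifiers, which is the dispersion property the text relies on.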
Seeing is believing; the actual code is available here:
http://files.cnblogs.com/coser/ConsistentHashAlgorithm.rar
5. PHP Implementation
The following is a PHP implementation taken from http://blog.csdn.net/mayongzhan/article/details/4298834:
Consistent hash (consistent Hashing)