I. Background of the problem
When you use distributed data storage, you often need to add new nodes to keep up with rapidly growing business demand. However, adding a new node, if handled poorly, can be catastrophic for some systems: it may force all of the data to be re-sharded.
Is there a viable way to migrate only the data associated with the affected nodes when re-sharding, rather than migrating all of the data? This is exactly the problem that consistent hashing addresses.
II. Background of the consistent hashing algorithm
The consistent hashing algorithm was proposed in 1997 by Karger et al. at MIT as a solution for distributed caching. It was designed to address hot-spot problems on the Internet, with goals very similar to those of CARP. Consistent hashing fixes the problems caused by the simple hashing used by CARP, so that DHTs can actually be applied in peer-to-peer environments.
Today, consistent hashing is widely used in distributed systems. Anyone who has studied the memcached cache knows that the memcached server itself does not distribute the cache across nodes; distribution is implemented entirely by the client. The client computes the consistent hash using the following steps:
1. First, compute the hash value of each memcached server (node) and place it on the 0 ~ 2^32 circle (the continuum).
2. Then, using the same hash function, compute the hash value of the key of the data to be stored and map it onto the same circle.
3. Starting from the position where the data is mapped, search clockwise and save the data to the first server found. If no server is found after passing 2^32, the data is saved to the first memcached server (a code sketch of these steps follows below).
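The three lookup steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual memcached client implementation: the MD5-based hash function, the made-up server addresses, and the `ring_hash`/`locate` helper names are all assumptions made for the example.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string onto the 0 ~ 2^32 circle (continuum)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

# Step 1: place each memcached server (node) on the circle.
servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
ring = sorted((ring_hash(s), s) for s in servers)
points = [p for p, _ in ring]

def locate(key: str) -> str:
    """Steps 2-3: hash the key, walk clockwise, take the first server found;
    wrap around to the first server if we pass 2^32."""
    h = ring_hash(key)
    i = bisect.bisect_right(points, h)   # first server position clockwise of the key
    if i == len(points):                 # passed 2^32: wrap to the first server
        i = 0
    return ring[i][1]

print(locate("user:42"), locate("session:abc"))
```

Keeping the server positions in a sorted list and using `bisect` is one common way to perform the clockwise search.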
Now consider adding a memcached server to this state. With the remainder (modulo) distribution algorithm, the server that holds a given key changes drastically, which hurts the cache hit ratio. With consistent hashing, only the keys on the circle (continuum) between the newly added server and the first server counter-clockwise from it are affected, as shown in the figure:
Properties of consistent hashing
Every node in a distributed system may fail, and new nodes may be added dynamically, so it is worth considering how to keep providing good service when the number of nodes changes, especially when designing a distributed cache system. If a server fails and the system has no suitable algorithm to guarantee consistency, all of the data cached in the system may effectively be invalidated: because the system now has fewer nodes, the client has to recompute an object's hash value when requesting it (the hash is usually tied to the number of nodes in the system), and because the hash value has changed, the server node that actually holds the object may no longer be found. This is why consistent hashing is so important. A good consistent hashing algorithm for a distributed cache system should satisfy the following properties:
- Balance
Balance means that the results of the hash should be distributed across all buffers as evenly as possible, so that all of the buffer space is used. Many hashing algorithms can satisfy this condition.
- Monotonicity
Monotonicity means that if some content has already been assigned to buffers by the hash, and a new buffer is then added to the system, the hash should guarantee that the previously assigned content is mapped either to its original buffer or to the new buffer, but never to a different buffer in the old buffer set. Simple hashing algorithms often fail to meet this requirement, for example the simplest linear hash f(x) = (ax + b) mod P, where P is the number of buffers. It is not hard to see that when the number of buffers changes (from P1 to P2), all of the original hash results change, violating monotonicity. A change in the hash result means that when the buffer space changes, all mappings inside the system need to be updated. In a peer-to-peer system, a change in buffering corresponds to a peer joining or leaving the system, which happens frequently and would cause enormous computation and transfer load. Monotonicity is the requirement that the hashing algorithm be able to cope with this situation (a small numeric sketch of this effect follows after this list of properties).
- Dispersion
In a distributed environment, a terminal may not see all of the buffers, but only a subset of them. When a terminal maps content to a buffer through the hashing process, different terminals may see different buffer ranges, so the same content may be mapped to different buffers by different terminals. This should clearly be avoided, because it causes the same content to be stored in more than one buffer and reduces the storage efficiency of the system. Dispersion is defined as the severity of this situation. A good hashing algorithm should avoid such inconsistency as far as possible, that is, minimize dispersion.
- Load
The load problem is really the dispersion problem viewed from another angle. Since different terminals may map the same content to different buffers, a particular buffer may also be mapped to different content by different users. Like dispersion, this situation should be avoided, so a good hashing algorithm should keep the load on each buffer as low as possible.
- Smoothness
Smoothness means that when the number of cache servers changes smoothly, the set of cached objects should also change smoothly.
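The monotonicity problem of simple remainder-based hashing described above is easy to quantify. The following sketch is a hypothetical experiment (MD5 hash, 10,000 made-up keys, not taken from the article) showing how many keys change buffers when the number of buffers P grows from 4 to 5 under `hash(key) mod P` placement, i.e. the a = 1, b = 0 case of the linear hash:

```python
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"object-{i}" for i in range(10_000)]

# Placement by simple remainder: buffer = hash(key) mod P
before = {k: h(k) % 4 for k in keys}   # P1 = 4 buffers
after  = {k: h(k) % 5 for k in keys}   # P2 = 5 buffers

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys map to a different buffer")  # roughly 80%
```

On a run like this, roughly 80% of the keys move, whereas consistent hashing would relocate only about 1/5 of them on average when going from 4 to 5 nodes.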
III. The principle
In simple terms, consistent hashing organizes the entire hash value space into a virtual ring. For example, assume the value space of a hash function H is 0 to 2^32 - 1 (that is, the hash value is a 32-bit unsigned integer). The whole hash space ring looks like this:
The entire space is organized in a clockwise direction, with 0 and 2^32 - 1 coinciding at the zero point.
The next step is to hash each server, using the server's IP address or hostname as the key, so that each machine gets a fixed position on the hash ring. Assume there are four servers; hashed by their IP addresses, they land at the following positions on the ring:
Data is then mapped to the appropriate server with the following rule: compute the hash of the data's key with the same hash function to determine its position on the ring, then "walk" clockwise along the ring from that position; the first server encountered is the server the data should be placed on.
For example, suppose we have four data objects: Object A, Object B, Object C, and Object D. After hashing, their positions on the ring are as follows:
According to the consistent hashing algorithm, Object A is placed on Node A, Object B on Node B, Object C on Node C, and Object D on Node D.
Now consider the fault tolerance and scalability of consistent hashing. Suppose Node C unfortunately goes down: Objects A, B, and D are unaffected, and only Object C is relocated to Node D. In general, with consistent hashing, if a server becomes unavailable, the only affected data is the data between that server and the previous server on the ring (that is, the first server encountered walking counter-clockwise); everything else is unaffected.
Consider the other case: a new server, Node X, is added to the system, as shown in the figure:
Objects A, B, and D are unaffected, and only Object C needs to be relocated to the new Node X. In general, with consistent hashing, if a server is added, the only affected data is the data between the new server and the previous server on the ring (that is, the first server encountered walking counter-clockwise); no other data is affected.
To sum up, when nodes are added or removed, consistent hashing only needs to relocate a small subset of the data on the ring, which gives it good fault tolerance and scalability.
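This claim is easy to check with a small experiment. The sketch below reuses the same illustrative ring code as before (MD5 hash, made-up node names and synthetic keys, no virtual nodes yet): it adds a Node X to a four-node ring and verifies that every relocated key ends up on the new node.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(nodes):
    return sorted((ring_hash(n), n) for n in nodes)

def locate(ring, key):
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, ring_hash(key)) % len(ring)  # clockwise, with wrap-around
    return ring[i][1]

nodes = ["Node A", "Node B", "Node C", "Node D"]
keys = [f"object-{i}" for i in range(10_000)]

before = build_ring(nodes)
after = build_ring(nodes + ["Node X"])        # add one server to the ring

moved = [k for k in keys if locate(before, k) != locate(after, k)]
print(f"{len(moved) / len(keys):.0%} of keys are relocated")
# Every relocated key now lives on the new node: only the arc claimed by Node X is affected.
assert all(locate(after, k) == "Node X" for k in moved)
```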
However, when there are too few service nodes, consistent hashing is prone to data skew because the nodes divide the ring unevenly. For example, if the system has only two servers, the ring might look like this:
This inevitably results in a large amount of data being concentrated on Node A, with only a very small amount landing on Node B. To solve this data skew problem, consistent hashing introduces the virtual node mechanism: multiple hashes are computed for each service node, and a replica of the service node, called a virtual node, is placed at each resulting position. This can be implemented by appending a number to the server's IP address or hostname. For example, three virtual nodes can be created for each server by computing the hashes of Node A#1, Node A#2, Node A#3, Node B#1, Node B#2, and Node B#3, which yields six virtual nodes:
At the same time, the data-location algorithm does not change; there is simply one extra step of mapping a virtual node back to its actual node. For example, data located at the virtual nodes "Node A#1", "Node A#2", and "Node A#3" all ends up on Node A. This solves the data skew problem when there are few service nodes. In practice, the number of virtual nodes is usually set to 32 or more, so that even a small number of service nodes can achieve a relatively uniform data distribution.
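A virtual node layer adds only one level of indirection to the same lookup. The sketch below is again an assumption-laden illustration (MD5 hash, two made-up nodes, 10,000 synthetic keys): it hashes "node#i" labels onto the ring, remembers which physical node each virtual point belongs to, and shows how the key distribution between Node A and Node B evens out as the number of virtual nodes grows.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(nodes, vnodes):
    # Each physical node contributes `vnodes` points labelled "Node A#1", "Node A#2", ...
    return sorted((ring_hash(f"{n}#{i}"), n) for n in nodes for i in range(1, vnodes + 1))

def locate(ring, key):
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, ring_hash(key)) % len(ring)
    return ring[i][1]          # the virtual point already remembers its physical node

nodes = ["Node A", "Node B"]
keys = [f"object-{i}" for i in range(10_000)]

for vnodes in (1, 3, 32):
    ring = build_ring(nodes, vnodes)
    counts = Counter(locate(ring, k) for k in keys)
    print(vnodes, dict(counts))  # the split between Node A and Node B evens out as vnodes grows
```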
Original article: http://www.cnblogs.com/haippy/archive/2011/12/10/2282943.html