Consistent hashing is a frequently used algorithm in distributed systems. Consider a distributed storage system that must store each piece of data on a specific node. With ordinary hashing, data is mapped to a node by a rule such as key % n, where key is the key of the data and n is the number of machine nodes. If a machine joins or leaves the cluster, n changes and almost the entire data mapping becomes invalid. For persistent storage this means data migration; for a distributed cache it means that most cached entries are invalidated.
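The cost of the key % n rule can be illustrated with a small sketch. It assumes MD5 as a stable hash and uses made-up key names (key-0, key-1, ...); the cluster sizes 4 and 5 are also chosen only for illustration.

```python
import hashlib


def stable_hash(key):
    """Hash a string to an integer that is stable across runs (unlike built-in hash())."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def node_for(key, n):
    """Ordinary hashing: map a key to one of n nodes with key % n."""
    return stable_hash(key) % n


keys = [f"key-{i}" for i in range(10_000)]    # hypothetical keys
before = {k: node_for(k, 4) for k in keys}    # cluster of 4 nodes
after = {k: node_for(k, 5) for k in keys}     # one node joins the cluster

moved = sum(1 for k in keys if before[k] != after[k])
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys changed node")
```

For a uniformly distributed hash, a key keeps its node only when hash % 4 equals hash % 5, which happens for about one key in five, so roughly 80% of the keys move when a single node is added.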
Therefore, a consistent hashing algorithm is introduced:
The data is mapped into a very large space, arranged as a ring, using a hash function (such as MD5), as shown in the figure. When a piece of data is stored, its hash value is computed first. That value corresponds to a position on the ring, such as the position of K1 in the figure. Walking clockwise from that position, the first machine node found is B, so K1 is stored on node B.
Suppose node B goes down. The data on B then falls to node C, as shown in the figure.
In this way only node C is affected; the data on nodes A and D is not. However, this can create an "avalanche": node C takes over the data of node B, so the load on C rises and C can easily go down as well, and so on in turn until the entire cluster is down.
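The load shift described above can be sketched as follows. This is a minimal illustration, not a production ring: the node names, the keys and the use of the full MD5 digest as the ring position are assumptions. Because positions come from hashing the node names, the node that inherits node-B's keys is whichever node follows it clockwise, which is not necessarily the one named node-C.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(value):
    """Position on the ring: the MD5 digest read as a big integer."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Hash ring with one position per node (no virtual nodes)."""

    def __init__(self, nodes):
        self._ring = sorted((ring_hash(n), n) for n in nodes)
        self._positions = [p for p, _ in self._ring]

    def owner(self, key):
        # First node at or after the key's position; wrap to index 0 past the end.
        i = bisect.bisect_left(self._positions, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]


nodes = ["node-A", "node-B", "node-C", "node-D"]    # hypothetical names
keys = [f"key-{i}" for i in range(10_000)]           # hypothetical keys

full = Ring(nodes)
degraded = Ring([n for n in nodes if n != "node-B"])  # node-B has gone down

load_before = Counter(full.owner(k) for k in keys)
load_after = Counter(degraded.owner(k) for k in keys)

# Nodes that inherited node-B's keys: at most one, B's clockwise successor.
heirs = {degraded.owner(k) for k in keys if full.owner(k) == "node-B"}

print("before:", dict(load_before))
print("after: ", dict(load_after))
print("node-B's keys went to:", heirs)
```

All of the failed node's keys land on a single neighbor, whose load becomes the sum of both, while the other nodes keep exactly the keys they had.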
To this end, the concept of the "virtual node" was introduced: the ring holds a very large number of virtual nodes, data is stored by finding the next virtual node clockwise on the ring, and each virtual node is associated with a real node, as shown in the figure.
In the figure, A1, A2, B1, B2, C1, C2, D1 and D2 are virtual nodes. Machine A stores the data of A1 and A2, machine B stores the data of B1 and B2, machine C stores the data of C1 and C2, and machine D stores the data of D1 and D2. Because the virtual nodes are numerous and evenly distributed, the load of a failed machine is spread over the remaining machines instead of falling on a single neighbor, so no "avalanche" occurs.
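A rough sketch of how virtual nodes spread the load of a failed machine. The label scheme "machine#i", the choice of 100 virtual nodes per machine and the key names are assumptions made for this example.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(value):
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class VirtualRing:
    """Ring on which every real machine appears `replicas` times."""

    def __init__(self, machines, replicas=100):
        # Each ring entry is (position of the virtual node, real machine behind it).
        self._ring = sorted(
            (ring_hash(f"{machine}#{i}"), machine)
            for machine in machines
            for i in range(replicas)
        )
        self._positions = [p for p, _ in self._ring]

    def owner(self, key):
        i = bisect.bisect_left(self._positions, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]


machines = ["A", "B", "C", "D"]
keys = [f"key-{i}" for i in range(10_000)]   # hypothetical keys

full = VirtualRing(machines)
degraded = VirtualRing([m for m in machines if m != "B"])  # machine B is down

# Which machines take over the keys that used to live on B?
takers = Counter(degraded.owner(k) for k in keys if full.owner(k) == "B")
print(takers)
```

With many virtual nodes, the keys of machine B are divided among the surviving machines rather than handed to one neighbor.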
Consistent hashing was proposed in 1997 by Karger et al. at MIT as a solution for distributed caching. It was designed to address hot spot problems on the Internet, and its original intent is very similar to that of CARP. Consistent hashing corrects the problems caused by the simple hashing algorithm used by CARP, allowing DHTs (distributed hash tables) to be truly applied in peer-to-peer (P2P) environments.
Today consistent hashing is widely used in distributed systems. Anyone who has studied the memcached caching system knows that the memcached server itself does not provide distributed-cache consistency; it is provided by the client. In detail, the client computes the consistent hash in the following steps:
- First, the hash value of each memcached server (node) is calculated and placed on a circle (continuum) of 0 to 2^32.
- Then the same method is used to compute the hash value of the key under which the data is stored, and the key is mapped onto the same circle.
- Then, starting from the position the data is mapped to, the client searches clockwise and saves the data on the first server it finds. If no server is found before passing 2^32, the data is saved on the first memcached server.
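The three steps above can be sketched as follows. This is not the code of any actual memcached client; it assumes the first four bytes of an MD5 digest as the 32-bit position, and the server addresses are made up.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # positions run from 0 to 2^32 - 1


def point(value):
    """Map a string to a position on the continuum (assumed: first 4 bytes of MD5)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_continuum(servers):
    """Step 1: place every server on the circle, sorted by position."""
    return sorted((point(s), s) for s in servers)


def locate(continuum, key):
    """Steps 2 and 3: hash the key, then take the first server clockwise."""
    points = [p for p, _ in continuum]
    i = bisect.bisect_left(points, point(key))
    if i == len(continuum):   # passed 2^32 without finding a server
        i = 0                 # fall back to the first server on the circle
    return continuum[i][1]


servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]  # hypothetical
continuum = build_continuum(servers)
print(locate(continuum, "user:42"))
```

Keeping the continuum sorted lets the lookup use a binary search instead of scanning every server.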
Now add a memcached server to the state shown in the figure. With the remainder distribution algorithm, the cache hit ratio suffers because the server that holds each key changes dramatically. With consistent hashing, only the keys between the position where the server is added on the circle (continuum) and the first server in the counter-clockwise direction are affected, as shown in the figure.
Properties of consistent hashing
Every node in a distributed system may fail, and new nodes may be added dynamically. How to keep providing good service when the number of nodes changes is worth careful thought, especially when designing a distributed cache system. If a server fails and the system does not use a suitable algorithm to ensure consistency, all the data cached in the system may be invalidated: because the number of nodes has decreased, a client requesting an object must recompute its hash value (which usually depends on the number of nodes in the system), and because the hash value has changed, the client will very likely fail to find the server node that holds the object. Consistent hashing is therefore very important. The consistent hashing algorithm in a good distributed cache system should have the following properties:
- Balance
Balance means that the results of the hash are distributed across all buffers as evenly as possible, so that all of the buffer space is used. Many hashing algorithms satisfy this condition.
- Monotonicity
Monotonicity means that if some content has already been allocated to buffers by hashing and a new buffer is added to the system, the hash result must ensure that previously allocated content either stays where it is or is mapped to the new buffer. It must not be mapped to a different buffer in the old buffer set.
Simple hashing algorithms often fail to meet the requirement of monotonicity, for example the simplest linear hash: h(x) = (ax + b) mod P. In this formula, P is the total number of buffers. It is easy to see that when the number of buffers changes (from P1 to P2), almost all of the original hash results change, so monotonicity is not satisfied. A change in the hash results means that all mappings in the system must be updated whenever the buffer space changes. In a P2P system, a change in buffers corresponds to peers joining or leaving the system, which happens frequently and would cause a heavy computation and transmission load. Monotonicity is the requirement that a hashing algorithm be able to cope with such situations.
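The failure of the linear hash can be checked with a few lines of code. The constants a = 3 and b = 7 and the buffer counts 10 and 11 are arbitrary values picked for this illustration.

```python
def linear_hash(x, p, a=3, b=7):
    """Simplest linear hash: h(x) = (a*x + b) mod p, with p buffers."""
    return (a * x + b) % p


P1, P2 = 10, 11          # one buffer (index 10) is added
items = range(1000)

changed = 0
violations = 0           # items that moved to a different OLD buffer
for x in items:
    old, new = linear_hash(x, P1), linear_hash(x, P2)
    if old != new:
        changed += 1
        if new < P1:     # not the new buffer, so monotonicity is violated
            violations += 1

print(f"{changed} of {len(items)} items changed buffer")
print(f"{violations} of them moved between old buffers")
```

A monotone scheme would only ever move items into buffer 10; here most items are reshuffled among buffers 0 to 9.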
- Dispersion (spread)
In a distributed environment, a terminal may not see all of the buffers, only a part of them. When terminals map content to buffers through hashing, different terminals may see different sets of buffers, so their hash results are inconsistent and the same content ends up mapped to different buffers by different terminals.
Such a situation should clearly be avoided, because it causes the same content to be stored in several buffers and reduces the storage efficiency of the system.
Dispersion is defined as the severity of this situation. A good hashing algorithm should avoid such inconsistencies as far as possible, that is, it should minimize dispersion.
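Dispersion can be made concrete by comparing two terminals with different views of the buffers. The sketch below uses made-up server names and assumes that one terminal cannot see cache-C; it counts the keys on which the two terminals disagree, once with the remainder method and once with a hash ring.

```python
import bisect
import hashlib


def h(value):
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


def modulo_owner(key, view):
    """Remainder method over the servers this terminal can see."""
    return view[h(key) % len(view)]


def ring_owner(key, view):
    """First server clockwise on a ring built from this terminal's view."""
    ring = sorted((h(s), s) for s in view)
    i = bisect.bisect_left([p for p, _ in ring], h(key)) % len(ring)
    return ring[i][1]


full_view = ["cache-A", "cache-B", "cache-C", "cache-D"]   # hypothetical
partial_view = ["cache-A", "cache-B", "cache-D"]           # cache-C not visible
keys = [f"key-{i}" for i in range(2000)]

modulo_conflicts = sum(
    modulo_owner(k, full_view) != modulo_owner(k, partial_view) for k in keys
)
ring_conflicts = sum(
    ring_owner(k, full_view) != ring_owner(k, partial_view) for k in keys
)
print("modulo conflicts:", modulo_conflicts)
print("ring conflicts:  ", ring_conflicts)
```

On the ring, the two terminals disagree only about keys that belong to the server one of them cannot see; every other key is mapped to the same buffer by both.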
- Load
The load problem is the dispersion problem seen from another angle.
Since different terminals may map the same content to different buffers, a particular buffer may also be assigned different content by different users. As with dispersion, this situation should be avoided, so a good hashing algorithm should keep the load on each buffer as low as possible.
- Smoothness
Smoothness means that a smooth change in the number of cache servers results in an equally smooth change in the cached objects.
Basic principle
Consistent hashing was first published in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". In simple terms, consistent hashing organizes the entire hash value space into a virtual ring. For example, if the value space of a hash function H is 0 to 2^32-1 (that is, the hash value is a 32-bit unsigned integer), the entire hash space ring looks as shown in the figure.
The entire space is organized in a clockwise direction.
0 and 2^32-1 coincide at the top of the ring, in the 12 o'clock direction.
The next step is to hash each server with the hash function H. Specifically, the server's IP address or hostname can be used as the key to hash. This gives each machine a position on the hash ring. Suppose four servers (Node A, Node B, Node C and Node D), hashed by IP address, are placed on the ring as shown in the figure.
Data is then located on a server with the following algorithm: hash the data key with the same function H to determine the position of the data on the ring, then walk clockwise along the ring from that position. The first server encountered is the server on which the data should be placed.
For example, suppose there are four data objects: Object A, Object B, Object C and Object D. After hashing, their positions on the ring are as shown in the figure.
According to the consistent hashing algorithm, Object A is placed on Node A, Object B on Node B, Object C on Node C and Object D on Node D.
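The clockwise rule can be traced by hand with a tiny worked example. The positions below are not real hash values; they are hand-picked numbers on a 0 to 2^32-1 ring, chosen so that the result matches the placement described above.

```python
import bisect

# Hypothetical positions on the ring, picked by hand for illustration.
NODE_POSITIONS = {
    "Node A": 500_000_000,
    "Node B": 1_500_000_000,
    "Node C": 2_500_000_000,
    "Node D": 3_500_000_000,
}
OBJECT_POSITIONS = {
    "Object A": 200_000_000,
    "Object B": 1_200_000_000,
    "Object C": 2_200_000_000,
    "Object D": 3_200_000_000,
}


def locate(position, node_positions):
    """Walk clockwise from `position` and return the first node encountered."""
    ring = sorted((pos, name) for name, pos in node_positions.items())
    points = [pos for pos, _ in ring]
    i = bisect.bisect_left(points, position) % len(ring)  # wrap past the last node
    return ring[i][1]


placement = {obj: locate(pos, NODE_POSITIONS) for obj, pos in OBJECT_POSITIONS.items()}
print(placement)
# {'Object A': 'Node A', 'Object B': 'Node B', 'Object C': 'Node C', 'Object D': 'Node D'}

# A position beyond the last node wraps around to the first node on the ring.
print(locate(4_000_000_000, NODE_POSITIONS))  # Node A
```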
The following analyzes the fault tolerance and scalability of the consistent hashing algorithm.
Suppose Node C unfortunately goes down. Objects A, B and D are not affected; only Object C is relocated to Node D. In general, if a server becomes unavailable in consistent hashing, the affected data is only the data between that server and the previous server on the ring (the first server encountered walking counterclockwise). All other data is unaffected.
Now consider the second scenario: a server, Node X, is added to the system, as shown in the figure.
Objects A, B and D are not affected; only Object C needs to be relocated to the new Node X.
In general, if a server is added in consistent hashing, the affected data is only the data between the new server and the previous server on the ring (the first server encountered walking counterclockwise). All other data is unaffected.
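The general claim about adding a server can be checked on a hashed ring. In this sketch the positions come from MD5 and the object names are made up, so which existing node gives up data depends on where Node X happens to land.

```python
import bisect
import hashlib


def ring_hash(value):
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


def owner(key, nodes):
    """First node clockwise from the key on a ring built from `nodes`."""
    ring = sorted((ring_hash(n), n) for n in nodes)
    i = bisect.bisect_left([p for p, _ in ring], ring_hash(key)) % len(ring)
    return ring[i][1]


old_nodes = ["Node A", "Node B", "Node C", "Node D"]
new_nodes = old_nodes + ["Node X"]                     # Node X joins
keys = [f"object-{i}" for i in range(5000)]            # hypothetical objects

moved = [k for k in keys if owner(k, old_nodes) != owner(k, new_nodes)]
destinations = {owner(k, new_nodes) for k in moved}    # where moved objects went
donors = {owner(k, old_nodes) for k in moved}          # where they came from

print(f"{len(moved)} of {len(keys)} objects moved")
print("moved to:", destinations)
print("moved from:", donors)
```

Every relocated object ends up on Node X, and all of them come from a single existing node, the one that follows Node X clockwise.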
In summary, when nodes are added or removed, consistent hashing only needs to relocate a small portion of the data on the ring, so it has good fault tolerance and scalability.
In addition, when there are too few service nodes, uneven placement of the nodes on the ring easily causes data skew. For example, suppose there are only two servers in the system, distributed on the ring as shown in the figure.
This inevitably concentrates a large amount of data on Node A, while only a very small amount is placed on Node B.
To address this data skew, consistent hashing introduces the virtual node mechanism: several hashes are computed for each service node, and a copy of the service node, called a virtual node, is placed at each resulting position.
In practice this can be done by appending a number to the server IP address or hostname. In the situation above, three virtual nodes can be computed for each server by hashing "Node A#1", "Node A#2", "Node A#3", "Node B#1", "Node B#2" and "Node B#3", which yields six virtual nodes.
The data location algorithm does not change; there is only one extra step that maps a virtual node to its real node. For example, data located on the three virtual nodes "Node A#1", "Node A#2" and "Node A#3" is placed on Node A. This solves the data skew problem when there are few service nodes. In practice, the number of virtual nodes is usually set to 32 or more, so that even very few service nodes achieve a relatively uniform data distribution.
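A minimal sketch of the virtual node mechanism with the "Node A#1" labelling scheme, assuming MD5 as the hash and made-up object names. It measures each server's share of the objects for 1, 3 and 200 virtual nodes per server.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(value):
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


def build_virtual_ring(nodes, replicas):
    """Return sorted positions, their virtual labels, and the virtual -> real mapping."""
    virtual_to_real = {
        f"{node}#{i}": node for node in nodes for i in range(1, replicas + 1)
    }
    ring = sorted((ring_hash(label), label) for label in virtual_to_real)
    points = [p for p, _ in ring]
    labels = [label for _, label in ring]
    return points, labels, virtual_to_real


def locate(key, points, labels, virtual_to_real):
    i = bisect.bisect_left(points, ring_hash(key)) % len(points)
    return virtual_to_real[labels[i]]   # extra step: virtual node -> real node


def shares(nodes, replicas, keys):
    """Fraction of the keys that lands on each real node."""
    points, labels, virtual_to_real = build_virtual_ring(nodes, replicas)
    counts = Counter(locate(k, points, labels, virtual_to_real) for k in keys)
    return {node: counts[node] / len(keys) for node in nodes}


nodes = ["Node A", "Node B"]
keys = [f"object-{i}" for i in range(10_000)]   # hypothetical objects
for replicas in (1, 3, 200):
    print(replicas, shares(nodes, replicas, keys))
```

The lookup itself is unchanged from the plain ring; only the final dictionary access that maps a virtual label back to its server is new. With a handful of virtual nodes the split can still be lopsided, while a few hundred bring the two shares close to one half each.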
Article sources: http://www.cnblogs.com/haippy/archive/2011/12/10/2282943.html and http://www.blogjava.net/hello-yun/archive/2012/10/10/389289.html