Consistent hashing algorithm

Source: Internet
Author: User
When reprinting, please credit the source: http://blog.csdn.net/cywosp/article/details/23397179

Consistent hashing is a distributed hash table (DHT) implementation algorithm proposed at MIT in 1997. Its design goal was to solve the hot spot problem on the Internet, an aim very similar to that of CARP (Cache Array Routing Protocol). Consistent hashing fixes the problems caused by the simple hashing algorithm that CARP uses, which makes DHTs truly practical in peer-to-peer (P2P) environments.

The consistent hashing algorithm proposes four criteria for judging a hash algorithm in a dynamically changing cache environment:

1. Balance: the results of the hash should be spread across all caches as evenly as possible, so that all cache space is put to use. Many hash algorithms satisfy this condition.

2. Monotonicity: if some content has already been assigned to caches by hashing and new caches are then added to the system, the hash result should guarantee that previously assigned content maps either to its original cache or to a new cache, never to a different cache in the old cache set.

3. Spread: in a distributed environment, a client may not see all the caches, only a subset of them. When clients map content to caches by hashing, different clients may see different sets of caches, so their hash results can disagree and the same content ends up mapped to different caches by different clients. This should clearly be avoided, because storing the same content in several caches lowers the storage efficiency of the system. Spread is defined as the severity of this situation. A good hash algorithm should avoid such inconsistencies as far as possible, that is, keep spread as low as possible.

4. Load: load looks at the spread problem from the other side. Since different clients may map the same content to different caches, a particular cache may also be assigned different content by different clients. Like spread, this should be avoided, so a good hash algorithm should keep the load on each cache as low as possible.
In a distributed cluster, adding and removing machines, and automatically dropping a machine from the cluster after it fails, are the most basic functions of cluster management. With the commonly used hash(object) % N algorithm, where N is the number of machines, much of the existing data can no longer be found after a machine is added or removed, which seriously violates monotonicity. The following sections explain how the consistent hashing algorithm is designed.
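The effect of changing N in hash(object) % N can be checked with a small experiment. The following is a minimal sketch, not part of the original article: the function names, the choice of SHA-256 as the hash, and the synthetic keys are all assumptions made for illustration.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash a string to a 32-bit integer (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def modulo_node(key: str, n: int) -> int:
    """The naive scheme from the text: hash(object) % N."""
    return stable_hash(key) % n


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose machine index changes when N goes from old_n to new_n."""
    moved = sum(1 for k in keys if modulo_node(k, old_n) != modulo_node(k, new_n))
    return moved / len(keys)


# Hypothetical workload: 10,000 synthetic object names.
keys = [f"object{i}" for i in range(10000)]
fraction = moved_fraction(keys, 3, 4)
print(f"keys remapped when growing from 3 to 4 machines: {fraction:.1%}")
```

For a uniform hash, a key keeps its index only when its hash leaves the same remainder modulo 3 and modulo 4, which holds for 3 of every 12 consecutive values, so roughly three quarters of the keys are expected to move. Many of them move between machines that were already in the cluster, which is exactly what monotonicity forbids.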
Ring hash space

A commonly used hash algorithm hashes each key into a space of 2^32 buckets, that is, the integer range 0 to 2^32 - 1. Now join the two ends of this range and picture the numbers as a closed ring.

Data is mapped onto the ring after being processed by a hash algorithm. Take four objects, object1, object2, object3, and object4, compute the key of each with a specific hash function, and place the keys on the hash ring:

    hash(object1) = Key1
    hash(object2) = Key2
    hash(object3) = Key3
    hash(object4) = Key4
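As a rough illustration of placing keys on the ring, the sketch below computes a position in the range 0 to 2^32 - 1 for each of the four objects. The article does not name a hash function, so SHA-256 truncated to 32 bits is an assumption, as is the helper name ring_position.

```python
import hashlib

RING_SIZE = 2 ** 32  # positions run from 0 to 2^32 - 1, as in the text


def ring_position(key: str) -> int:
    """Return the position of `key` on the ring.

    SHA-256 is an assumption for this sketch; only the first 4 bytes
    (32 bits) of the digest are used, so the result fits the ring.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


positions = {name: ring_position(name)
             for name in ("object1", "object2", "object3", "object4")}

# Print the objects in ring order (ascending position).
for name, pos in sorted(positions.items(), key=lambda item: item[1]):
    print(f"{name}: {pos}")
```

The order printed here depends on the hash function chosen and will generally differ from the order drawn in the article's figures.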
Mapping machines onto the ring through the hash algorithm

When a new machine joins a distributed cluster that uses consistent hashing, it is mapped onto the ring with the same hash algorithm used for objects (in general, the input to the machine's hash is its IP address or a unique alias). Each object is then stored on the nearest machine found by moving clockwise from the object's position. Suppose there are three machines, NODE1, NODE2, and NODE3. Hashing them gives the corresponding keys, which are mapped onto the ring:

    hash(NODE1) = KEY1
    hash(NODE2) = KEY2
    hash(NODE3) = KEY3
Objects and machines lie in the same hash space, so moving clockwise, object1 is stored on NODE1, object3 on NODE2, and object2 and object4 on NODE3. As long as the hash ring does not change in such a deployment, computing an object's hash value is enough to locate the corresponding machine quickly and so find where the object is actually stored.
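The clockwise lookup described above can be sketched with a sorted list of machine positions and a binary search. This is one possible implementation under stated assumptions, not the article's own code: the HashRing class, its method names, and the use of SHA-256 are all hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Position on the 0 .. 2^32 - 1 ring (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    """Minimal ring: each machine occupies exactly one point."""

    def __init__(self, nodes):
        points = sorted((ring_position(node), node) for node in nodes)
        self._positions = [pos for pos, _ in points]
        self._nodes = [node for _, node in points]

    def lookup(self, key: str) -> str:
        """Return the first machine at or after the key's position, moving clockwise."""
        if not self._nodes:
            raise LookupError("the ring has no machines")
        index = bisect.bisect_left(self._positions, ring_position(key))
        if index == len(self._positions):
            index = 0  # past the last machine: wrap around to the start of the ring
        return self._nodes[index]


ring = HashRing(["NODE1", "NODE2", "NODE3"])
for obj in ("object1", "object2", "object3", "object4"):
    print(obj, "->", ring.lookup(obj))
```

Because the positions come from a real hash function, the assignments printed here need not match the ones in the article's example. Keeping the positions sorted makes each lookup a binary search over the machines.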
Removing and adding machines

The biggest weakness of an ordinary hash algorithm is that after a machine is added or removed, the storage locations of a large number of objects become invalid, so monotonicity is far from satisfied. Let us analyze how consistent hashing handles this.

1. Node (machine) removal

Taking the distribution above as an example, if NODE2 fails and is removed, then following the clockwise rule object3 migrates to NODE3. Only the mapping of object3 changes; no other object is affected.

2. Node (machine) addition

If a new node NODE4 is added to the cluster, its key KEY4 is obtained with the same hash algorithm and mapped onto the ring.

Following the clockwise rule, object2 migrates to NODE4 and all other objects keep their original storage locations. The analysis of node addition and removal shows that consistent hashing preserves monotonicity while keeping data migration to a minimum. This makes the algorithm well suited to distributed clusters: it avoids large-scale data migration and reduces the load on the servers.
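The claim that only the affected objects move can be checked directly. The sketch below is an illustration under assumptions (SHA-256 as the hash, a hypothetical assign helper, synthetic object names): it compares assignments before and after removing NODE2 and before and after adding NODE4.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Position on the 0 .. 2^32 - 1 ring (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def assign(keys, nodes):
    """Map every key to the first node found clockwise from the key's position."""
    points = sorted((ring_position(node), node) for node in nodes)
    positions = [pos for pos, _ in points]
    result = {}
    for key in keys:
        # The modulo wraps keys beyond the last node back to the first node.
        index = bisect.bisect_left(positions, ring_position(key)) % len(points)
        result[key] = points[index][1]
    return result


# Hypothetical workload: 10,000 synthetic object names.
keys = [f"object{i}" for i in range(10000)]
before = assign(keys, ["NODE1", "NODE2", "NODE3"])
after_removal = assign(keys, ["NODE1", "NODE3"])
after_addition = assign(keys, ["NODE1", "NODE2", "NODE3", "NODE4"])

moved_on_removal = [k for k in keys if before[k] != after_removal[k]]
moved_on_addition = [k for k in keys if before[k] != after_addition[k]]

# Every key that moved on removal used to live on NODE2.
print(all(before[k] == "NODE2" for k in moved_on_removal))
# Every key that moved on addition now lives on NODE4.
print(all(after_addition[k] == "NODE4" for k in moved_on_addition))
```

Which surviving node inherits NODE2's objects depends on where the hash function happens to place the nodes, so it may not be NODE3 as in the article's figure. The property that holds regardless is the one checked here: keys that were not on the removed node stay put, and keys only ever move onto the newly added node.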
Balance

The analysis above shows that consistent hashing satisfies monotonicity and load, as well as the spread expected of a general hash algorithm. That alone does not explain its wide adoption, because balance is still missing. The following analyzes how consistent hashing achieves balance.

A hash algorithm does not guarantee balance. Take the case above where only NODE1 and NODE3 are deployed (the situation after NODE2 was removed): object1 is stored on NODE1, while object2, object3, and object4 are all stored on NODE3, which is a very unbalanced state. To satisfy balance as far as possible, consistent hashing introduces virtual nodes.

A "virtual node" is a replica of an actual node (machine) in the hash space. One actual node corresponds to several virtual nodes, and this number is called the "replica count". Virtual nodes are placed in the hash space according to their hash values.

Consider again the case where only NODE1 and NODE3 are deployed (NODE2 removed), in which objects were distributed very unevenly across the machines. With a replica count of 2, the hash ring holds 4 virtual nodes in total, and the resulting object mapping is shown in the figure.
According to the figure, the object mapping is: object1 -> NODE1-1, object2 -> NODE1-2, object3 -> NODE3-2, object4 -> NODE3-1. With virtual nodes, the distribution of objects is more balanced. How, then, does an actual object lookup work in practice? An object is resolved in three steps: its hash is computed, the hash is mapped to a virtual node, and the virtual node is mapped to its actual node.
The hash of a virtual node can be computed from the IP address of its actual node plus a numeric suffix. For example, suppose the IP address of NODE1 is 192.168.1.100. Before virtual nodes are introduced, the hash value of NODE1 is computed as:

    hash("192.168.1.100")

After virtual nodes are introduced, the hash values of the virtual nodes NODE1-1 and NODE1-2 are computed as:

    hash("192.168.1.100#1")    (NODE1-1)
    hash("192.168.1.100#2")    (NODE1-2)
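A ring with virtual nodes can be sketched by hashing the "IP#suffix" strings described above and remembering which actual node each virtual point belongs to. The code below is a minimal illustration, not the article's implementation: the helper names, the use of SHA-256, the synthetic keys, and the second IP address 192.168.1.101 are assumptions.

```python
import bisect
import hashlib
from collections import Counter


def ring_position(key: str) -> int:
    """Position on the 0 .. 2^32 - 1 ring (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ring(node_ips, replicas: int):
    """Place `replicas` virtual nodes per actual node, named "<ip>#<i>"."""
    points = sorted(
        (ring_position(f"{ip}#{i}"), ip)
        for ip in node_ips
        for i in range(1, replicas + 1)
    )
    positions = [pos for pos, _ in points]
    owners = [ip for _, ip in points]  # actual node behind each virtual node
    return positions, owners


def lookup(ring, key: str) -> str:
    """Object hash -> nearest virtual node clockwise -> actual node."""
    positions, owners = ring
    index = bisect.bisect_left(positions, ring_position(key)) % len(positions)
    return owners[index]


def load(ring, keys) -> Counter:
    """Count how many keys each actual node receives."""
    return Counter(lookup(ring, key) for key in keys)


# 192.168.1.100 comes from the text; 192.168.1.101 is a made-up second node.
ips = ["192.168.1.100", "192.168.1.101"]
keys = [f"object{i}" for i in range(10000)]

few = load(build_ring(ips, replicas=2), keys)
many = load(build_ring(ips, replicas=200), keys)
print("replica count 2:  ", dict(few))
print("replica count 200:", dict(many))
```

With only a handful of virtual nodes the split between machines can still be lopsided; raising the replica count tends to even it out, at the cost of a larger ring to store and search.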
