Before talking about consistent hashing, let's first review HashMap.
When using a HashMap, keys are mapped evenly across the internal Entry array. The mapping shifts and mixes the key's hash value and then ANDs it with (length - 1), where length is the size of the Entry array (unlike Hashtable, which computes (key.hashCode() & 0x7FFFFFFF) % table.length); this keeps the data evenly distributed.
During a put operation, once the amount of data in the underlying array exceeds loadFactor (default 0.75) * length, HashMap doubles the size of the underlying array. New data introduced by put is mapped by the method described above, but what about the previously mapped data? The HashMap source shows that on every expansion, resize() remaps all previously mapped keys.
So to get good performance out of a HashMap, you should estimate the size of the data set in advance and choose an appropriate initial capacity and load factor.
But not every scenario is as simple as a HashMap. Consider a large peer-to-peer network with hundreds or thousands of servers, where the relationship between resources and servers is maintained as a key-to-server mapping — essentially one large HashMap recording which server holds each resource. If a new node joins or an existing node leaves the cluster, then, just as with HashMap, the mapping relationships change. Remapping every key-to-server relationship is clearly impossible, and it would cause a huge number of misses, directly making the service unavailable. What is needed is a way for a new server to join, or an existing server to go down, without adjusting all nodes, so that the hash mapping can still be maintained. Consistent hashing is therefore defined as:
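To see why a plain modulo mapping breaks down when the cluster size changes, here is a small illustrative sketch (the key names and server counts are made up for the demonstration):

```python
import hashlib

def bucket(key: str, n: int) -> int:
    """Map a key to one of n servers with a plain modulo hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"resource-{i}" for i in range(10000)]
before = {k: bucket(k, 100) for k in keys}  # 100 servers
after = {k: bucket(k, 101) for k in keys}   # one server added

# With modulo mapping, adding a single server moves almost every key.
moved = sum(1 for k in keys if before[k] != after[k])
```

In this run roughly 99% of the keys change servers, even though only one server was added — exactly the mass invalidation consistent hashing is designed to avoid.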
"Consistent hashing is a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots." (wiki)
That is, consistent hashing provides a hash table whose key-to-node mapping does not change significantly when nodes join or leave.
The definition of consistent hashing describes an idea rather than an implementation; all of the refinement is left to the developer. The general implementation idea, however, is as follows:
1: Distribute all resources evenly over a ring by applying a one-way hash (SHA-1, SHA-2, MD5) to each key;
2: Distribute all nodes (servers) evenly over the same ring by applying the same one-way hash to each server's IP and port;
3: Make each node (server) responsible for only part of the key space, so that when a node joins or leaves, only the joining or leaving node and its neighbors are affected, and only a small number of keys move.
In general, a consistent hash needs to satisfy the following conditions to be useful in a real system:
1: Balance: the results of the hash should be distributed across all nodes as evenly as possible, so that the entire node space is utilized. Many hash algorithms satisfy this condition.
2: Monotonicity: if some content has already been assigned to nodes by hashing and a new node is then added to the system, the hash should guarantee that previously allocated content maps either to its original node or to the new node, never to a different node in the old node set.
3: Spread: in a distributed environment, a terminal may not see all of the caches, only a subset of them. When a terminal maps content to caches through the hashing process, different terminals may see different cache ranges and therefore produce inconsistent hash results, so the same content gets mapped to different caches by different terminals. This should clearly be avoided, because storing the same content in multiple caches reduces the storage efficiency of the system. Spread is defined as the severity of this situation: a good hash algorithm should avoid such inconsistency as far as possible, that is, minimize spread.
4: Load: the load problem is the spread problem viewed from another angle. Since different terminals may map the same content to different caches, a particular cache may likewise be mapped to different content by different users. Like spread, this situation should be avoided, so a good hash algorithm should minimize the load on each cache.
Ring Hash Space:
Using a common hash algorithm, hash each key into a space of 2^32 buckets, i.e. the numeric space 0 to 2^32 - 1. Now imagine these numbers joined end to end to form a closed ring, as follows:
After being processed by the chosen hash algorithm, the data is mapped onto the ring:
Now four objects — object1, object2, object3, and object4 — are hashed by a specific hash function to their corresponding key values and placed on the hash ring:
Hash(object1) = KEY1;
Hash(object2) = KEY2;
Hash(object3) = KEY3;
Hash(object4) = KEY4;
As for the specific hash algorithm: MD5 produces 128 bits and SHA-1 produces 160 bits; take the first 32 bits to form an integer and map it onto the ring.
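Taking the first 32 bits of an MD5 digest to place a key on the 0 to 2^32 - 1 ring can be sketched as follows (the helper name `ring_hash` is my own):

```python
import hashlib

RING_SIZE = 2 ** 32

def ring_hash(key: str) -> int:
    """Place a key on the [0, 2^32) ring using the first 32 bits of its MD5 digest."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big")

key1 = ring_hash("object1")
key2 = ring_hash("object2")
```

The same function is applied to objects and (later) to servers, so both land in the same ring space.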
Machines are mapped onto the ring via the hash algorithm:
When a new machine is added to a distributed cluster using the consistent hashing algorithm, the principle is that the machine is mapped onto the ring using the same hash as the object store (in general, the machine's hash is computed using the machine's IP, or the machine's unique alias and service port, as the input). Then, moving in a clockwise direction, every object is stored on the machine closest to its own position.
Suppose there are now three machines, NODE1, NODE2, and NODE3. Their corresponding key values are obtained by the hash algorithm and mapped onto the ring as follows:
Hash (NODE1) = KEY1;
Hash (NODE2) = KEY2;
Hash (NODE3) = KEY3;
You can see that the objects and machines are in the same hash space, so that, going clockwise, object1 is stored on NODE1, object3 on NODE2, and object2 and object4 on NODE3. In such a deployment, as long as the hash ring does not change, computing an object's hash value quickly locates the corresponding machine, and thus the object's real storage location.
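The clockwise lookup can be sketched with a sorted list and binary search. This is an illustrative implementation, not the code of any particular system:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Place a key on the [0, 2^32) ring using the first 32 bits of its MD5 digest."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, nodes):
        self._points = sorted(ring_hash(n) for n in nodes)
        self._owner = {ring_hash(n): n for n in nodes}

    def locate(self, key: str) -> str:
        """Walk clockwise: the first node at or after the key's position, wrapping around."""
        h = ring_hash(key)
        idx = bisect.bisect_left(self._points, h)
        if idx == len(self._points):
            idx = 0  # past the largest position, wrap back to the smallest
        return self._owner[self._points[idx]]

ring = ConsistentHashRing(["NODE1", "NODE2", "NODE3"])
owner = ring.locate("object1")
```

Because the node positions are fixed as long as membership does not change, each lookup is a single binary search over the sorted positions.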
Removal and addition of machines:
The most unsuitable aspect of a common hash algorithm is that adding or removing machines invalidates the storage locations of a large number of objects, which violates monotonicity. The following analyzes how the consistent hashing algorithm handles these two cases.
1: Deletion of a node (machine)
Taking the distribution above as an example, if NODE2 fails and is removed, then by the clockwise migration rule object3 is migrated to NODE3. Only object3's mapping location changes; no other object's mapping is affected:
2: Addition of a node (machine)
If a new node NODE4 is added to the cluster, its KEY4 is obtained by the same hash algorithm and mapped onto the ring:
By the clockwise rule, object2 is migrated to NODE4 while the other objects retain their original mappings (storage locations). This analysis of node addition and deletion shows that the consistent hashing algorithm maintains monotonicity while also keeping data migration to a minimum. Such an algorithm suits distributed clusters well: it avoids massive data migration and reduces pressure on the servers.
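Both claims can be checked mechanically. The sketch below (node and object names are arbitrary) verifies that deleting a node only moves the keys that node owned, and that adding a node only moves keys onto the new node:

```python
import hashlib

def ring_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def locate(nodes, key):
    """Clockwise rule: first node at or after the key's position, wrapping around."""
    ring = sorted((ring_hash(n), n) for n in nodes)
    h = ring_hash(key)
    for pos, node in ring:
        if h <= pos:
            return node
    return ring[0][1]  # wrap around past the largest position

keys = [f"object{i}" for i in range(1000)]
base = ["NODE1", "NODE2", "NODE3"]
owners = {k: locate(base, k) for k in keys}

# Deleting NODE2: only keys that lived on NODE2 change owner.
after_delete = {k: locate(["NODE1", "NODE3"], k) for k in keys}
moved_on_delete = [k for k in keys if owners[k] != after_delete[k]]

# Adding NODE4: the only change is keys migrating *to* NODE4.
after_add = {k: locate(base + ["NODE4"], k) for k in keys}
moved_on_add = [k for k in keys if owners[k] != after_add[k]]
```

Every key displaced by the deletion previously belonged to NODE2, and every key displaced by the addition lands on NODE4 — the monotonicity property from earlier in the article.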
Balance (Uniform data distribution):
From the analysis above, the consistent hashing algorithm satisfies monotonicity and load, as well as the spread property expected of a general hash algorithm, but that alone is not why it is widely used, because it still lacks balance. The following analyzes how the consistent hashing algorithm achieves balance. The hash algorithm does not guarantee balance by itself: for example, with only NODE1 and NODE3 deployed (NODE2 deleted), object1 is stored on NODE1 while object2, object3, and object4 are all stored on NODE3 — a very unbalanced state. To satisfy balance as far as possible, the consistent hashing algorithm introduces virtual nodes:
A "virtual node" is a replica of an actual node (machine) in the hash space; one real node (machine) corresponds to several "virtual nodes" (the count is called the "replication number"), and the virtual nodes are arranged in the hash space by their hash values.
Taking the example above with only NODE1 and NODE3 deployed, the objects were previously distributed unevenly across the machines. Now take 2 replicas per node as an example, so there are 4 virtual nodes on the hash ring, and the resulting object mapping is as follows:
The point of introducing "virtual nodes" is that when the actual nodes are few, large stretches of the ring are left unmapped, which results in uneven data distribution. Suppose each server is mapped to n virtual nodes (n is typically best around 100 to 200); a key's hash then lands on one of those n virtual nodes but is actually served by the real server hosting it.
Based on the known object mapping: object1 -> NODE1-1, object2 -> NODE1-2, object3 -> NODE3-2, object4 -> NODE3-1. With virtual nodes introduced, the distribution of objects is more balanced. So how do object queries actually work in practice? An object is resolved from its hash to a virtual node, and then to the actual node:
The hash of a "virtual node" can be computed by appending a numeric suffix to the corresponding actual node's IP address. Suppose NODE1's IP address is 192.168.1.100. Before introducing "virtual nodes", the hash of the node is computed from the address alone:
Hash("192.168.1.100");
After introducing "virtual nodes", the hash values of the virtual nodes NODE1-1 and NODE1-2 are computed as:
Hash("192.168.1.100-1");
Hash("192.168.1.100-2");
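The virtual-node construction can be sketched as follows. The replica count of 2 follows the example above; the second IP address is an assumption for illustration, and real deployments would use far more replicas:

```python
import hashlib

def ring_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def build_ring(nodes, replicas=2):
    """Each physical node appears `replicas` times on the ring, hashed as "<node>-<i>"."""
    ring = []
    for node in nodes:
        for i in range(1, replicas + 1):
            ring.append((ring_hash(f"{node}-{i}"), node))  # numeric-suffix scheme
    ring.sort()
    return ring

def locate(ring, key):
    """Clockwise rule over virtual nodes; returns the owning physical node."""
    h = ring_hash(key)
    for pos, node in ring:
        if h <= pos:
            return node
    return ring[0][1]

# Two physical nodes, two replicas each -> four virtual nodes on the ring.
ring = build_ring(["192.168.1.100", "192.168.1.102"])
owner = locate(ring, "object2")
```

Each ring entry remembers its physical node, so the virtual-to-actual conversion is just reading the second element of the tuple the lookup lands on.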
Given n original servers and m newly added servers, the hit rate after the increase can be calculated as follows:
hit rate = n / (n + m) * 100%
That is, the fraction of keys that must migrate is 1 - n/(n + m) = m/(n + m).
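This formula can be checked empirically against a consistent hash with virtual nodes. The sketch below uses assumed values (n = 10 servers, m = 2 added, 100 replicas each, 10,000 keys) and should measure a hit rate close to n/(n + m) ≈ 83%:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def build_ring(nodes, replicas=100):
    """Virtual-node ring: each server appears `replicas` times, sorted by position."""
    return sorted((ring_hash(f"{n}-{i}"), n) for n in nodes for i in range(replicas))

def locate(ring, key):
    """Clockwise rule via binary search, wrapping past the largest position."""
    points = [p for p, _ in ring]
    idx = bisect.bisect_left(points, ring_hash(key))
    return ring[idx % len(ring)][1]

n, m = 10, 2
old_nodes = [f"server-{i}" for i in range(n)]
new_nodes = old_nodes + [f"server-{n + i}" for i in range(m)]

keys = [f"key-{i}" for i in range(10000)]
old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
hits = sum(1 for k in keys if locate(old_ring, k) == locate(new_ring, k))
hit_rate = hits / len(keys)  # expected to be close to n / (n + m)
```

The measured hit rate should land within a few percent of the theoretical n/(n + m); the more replicas per server, the tighter the agreement.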