Java Consistent Hash algorithm learning notes (sample code)
This article focuses on the Consistent Hashing algorithm and its sample code.
Consistent Hash
The consistent hash algorithm was proposed by the Massachusetts Institute of Technology in 1997. It was designed to solve the hot spot problem on the Internet, and its original intent was similar to CARP. Consistent hashing corrected the problems caused by the simple hash algorithm used by CARP, so that DHT could be truly applied in P2P environments.
Hash Algorithm
Consistent hashing proposes four properties that a hash algorithm must satisfy in a dynamically changing cache environment:
Balance
Balance means that the hash results should be distributed across all caches as evenly as possible, so that every cache's space is used. Many hash algorithms can satisfy this condition.
Monotonicity
Monotonicity means that if some content has already been hashed to its cache and new caches are then added to the system, the hash result should guarantee that previously allocated content is remapped only to its original cache or to the new caches, never to other caches in the old cache set.
Simple hash algorithms often cannot meet the monotonicity requirement, for example the simplest linear hash:
x → (ax + b) mod P
In the formula above, P is the size of the cache set. It is easy to see that when the cache size changes (from P1 to P2), all the original hash results change, so the monotonicity requirement is not met.
Changing hash results means that when the cache space changes, all mappings in the system must be updated. In a P2P system, a cache change corresponds to a peer joining or leaving the system, which happens frequently and would therefore cause enormous computation and transfer load. Monotonicity requires the hash algorithm to avoid this situation.
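To make the failure concrete, the following sketch (with the hypothetical parameter choice a = 1, b = 0, so h(x) = x mod P) counts how many of 10,000 keys land on a different cache when the cache count grows from 4 to 5; roughly 80% of them move, which is exactly the mass remapping that monotonicity forbids:

public class LinearHashDemo {
    public static void main(String[] args) {
        int p1 = 4, p2 = 5;  // hypothetical cache counts before and after a cache joins
        int moved = 0;
        for (int x = 0; x < 10000; x++) {
            // linear hash with a = 1, b = 0: h(x) = x mod P
            if (x % p1 != x % p2) {
                moved++;
            }
        }
        System.out.println(moved + " of 10000 keys were remapped");  // prints 8000
    }
}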
Spread
In a distributed environment, a terminal may not see all of the caches, but only some of them. When a terminal maps content to the caches through hashing, different terminals may see different ranges of caches and therefore produce inconsistent hash results, so the same content ends up mapped to different caches by different terminals. This should be avoided, because it causes the same content to be stored in multiple caches and reduces the system's storage efficiency. Spread is defined as the severity of this situation. A good hash algorithm should avoid such inconsistency as far as possible, that is, minimize spread.
Load
The load problem is really the spread problem viewed from another angle. Just as different terminals may map the same content to different caches, different terminals may also map different content onto one particular cache. Like spread, this situation should be avoided, so a good hash algorithm should keep the load on each cache as low as possible.
On the surface, consistent hashing targets distributed caching. But if we regard each cache as a peer in a P2P system and the mapped content as the various shared resources (data, files, media streams), we find that the two are actually describing the same problem.
Routing Algorithm
In the consistent hash algorithm, each node (corresponding to a peer in the P2P system) is assigned a random ID. When mapping content to a node, the content's keyword is run through the consistent hash to obtain a key value. Consistent hashing requires that key values and node IDs lie in the same value range. The simplest key values and IDs are one-dimensional, such as the integer set from 0000 to 9999.
When content is stored by key value, it is stored on the node whose ID is closest to the key value. For example, if the key value is 1001 and the system has nodes with IDs 1000 and 1100, the content is mapped to the node with ID 1000.
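A minimal sketch of this closest-ID rule, using the example IDs from the text and Java's TreeMap (note that the sample source code later in this article uses the simpler clockwise-successor rule rather than true closest-ID matching):

import java.util.TreeMap;

public class ClosestNode {
    // Return the node ID nearest to the key; assumes a non-empty map
    // and ignores ring wraparound for simplicity.
    static int closest(TreeMap<Integer, String> nodes, int key) {
        Integer lo = nodes.floorKey(key);    // largest ID <= key
        Integer hi = nodes.ceilingKey(key);  // smallest ID >= key
        if (lo == null) return hi;
        if (hi == null) return lo;
        return (key - lo <= hi - key) ? lo : hi;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> nodes = new TreeMap<Integer, String>();
        nodes.put(1000, "nodeA");
        nodes.put(1100, "nodeB");
        System.out.println(closest(nodes, 1001));  // prints 1000
    }
}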
To build the routing needed for queries, consistent hashing requires each node to store the location information (IP address) of its upstream node (the node with the smallest ID among those larger than its own) and its downstream node (the node with the largest ID among those smaller than its own). When a node needs to find some content, it initiates a query request toward its upstream or downstream node according to the content's key value. If the node receiving the query finds that it holds the requested target, it returns a confirmation directly to the node that initiated the query; if the key is not within its own range, it forwards the request to its own upstream or downstream node.
To maintain this routing information, adjacent nodes must update their routes promptly when a node joins or leaves the system. This requires each node to store not only the location of its directly connected downstream node, but also information about indirect downstream nodes to a certain depth (n hops), maintaining this node list dynamically. When a node leaves the system, its upstream node attempts to connect directly to the nearest downstream node; once the connection succeeds, it obtains the downstream node list from its new downstream node and updates its own list. Similarly, when a new node joins the system, it first locates its downstream node by its ID and obtains the downstream node list, then asks the upstream node to modify its downstream node list, thereby restoring the routing relationships.
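The routing state described above can be sketched schematically as follows; all class and field names here are illustrative assumptions, not code from the article:

import java.util.List;

// Illustrative routing state kept by one peer on the ring.
abstract class PeerNode {
    int id;                       // this node's ID on the ring
    PeerNode upstream;            // neighbor with the next-larger ID
    PeerNode downstream;          // neighbor with the next-smaller ID
    List<Integer> downstreamIds;  // indirect downstream nodes kept to a depth of n hops

    // Does this key fall within the range this node is responsible for?
    abstract boolean owns(int key);

    // Forward a query hop by hop until the responsible node is reached.
    PeerNode route(int key) {
        if (owns(key)) {
            return this;  // target found locally, answer directly
        }
        // otherwise hand the request to the upstream or downstream neighbor
        return key > id ? upstream.route(key) : downstream.route(key);
    }
}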
Discussion
Consistent hashing basically solves the most critical problem in a P2P environment: how to distribute storage and routing over a dynamic network topology. Each node only needs to maintain information about a small number of adjacent nodes, and when a node joins or leaves the system, only a small number of nodes are involved in topology maintenance. All of this made consistent hashing the first practical DHT algorithm.
However, the consistent hash routing algorithm still has shortcomings. During a query, the query message must travel O(N) hops (O(N) means the cost grows in direct proportion to N, where N is the total number of nodes in the system) to reach the queried node. It is easy to imagine that when the system is very large, with perhaps more than a million nodes, this query efficiency cannot meet real needs. Looked at from another angle, even if users could tolerate the long delay, the large volume of messages generated during queries would put unnecessary load on the network.
Source code:
package heritrix;

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {

    // Hash algorithm
    private final HashFunction hashFunction;
    // Number of virtual nodes per real node
    private final int numberOfReplicas;
    // The hash ring: maps each virtual node's hash value to its real node
    private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

    public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
            Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;
        for (T node : nodes) {
            add(node);
        }
    }

    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunction.hash(node.toString() + i), node);
        }
    }

    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunction.hash(node.toString() + i));
        }
    }

    // Key algorithm: find the node responsible for the given key
    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        // Calculate the hash value of the key
        int hash = hashFunction.hash(key);
        // If no virtual node sits exactly at this hash value,
        // walk clockwise to the next virtual node on the ring
        if (!circle.containsKey(hash)) {
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }
}
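The listing above references a HashFunction interface that the article does not include. The hash(Object) signature is implied by the calls above; everything else in the following sketch, including the MD5-based implementation and the usage example, is an assumption for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Assumed interface; only this signature is implied by the article's code.
interface HashFunction {
    int hash(Object key);
}

// Hypothetical MD5-based implementation, for illustration only.
class Md5HashFunction implements HashFunction {
    public int hash(Object key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.toString().getBytes(StandardCharsets.UTF_8));
            // Fold the first four bytes of the digest into an int
            return ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
                    | ((digest[2] & 0xFF) << 8) | (digest[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}

With this in place, the class could be exercised like so (node names are hypothetical):

ConsistentHash<String> ring = new ConsistentHash<String>(
        new Md5HashFunction(), 100,
        java.util.Arrays.asList("cacheA", "cacheB", "cacheC"));
String node = ring.get("some-key");  // node responsible for this key
ring.remove("cacheB");               // only cacheB's keys get remapped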
Summary
That is all the content of these Java Consistent Hash algorithm learning notes (sample code); I hope it helps you. If you are interested, you can continue to browse other related topics on this site. If anything is lacking, please leave a comment. Thank you for your support!