Tencent 2012 written test, bonus question
Problem description: Suppose a mobile social network has n servers. To speed up access, user data is cached on the servers, so each user's requests should ideally keep going to the same server.
The existing practice is to pick the server with serveripindex[qqnum % n], which conveniently spreads users across the servers. But if one server dies, n becomes n-1, and serveripindex[qqnum % n] generally differs from serveripindex[qqnum % (n-1)], so most users' requests go to a different server, which can cause a large number of access errors.
Q: How can the method be improved or changed so that:
(1) when a server dies, it does not cause widespread access errors;
(2) existing users mostly stay on the same server;
(3) load balancing is taken into account as far as possible. (Hint: consider the distributed consistent hashing algorithm.)
- The crudest approach still uses the modulo method. The procedure is simple: suppose there are N servers, of which M (M <= N) are currently healthy. First take the value modulo N; if the result does not land on a healthy machine, take it modulo N-1, and so on down to M. This method is fairly stable when machines fail.
- A consistent hashing algorithm.
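The first approach can be sketched as follows. This is only one possible reading of the description: the function name pick_server, the 0-based server numbering, and the final fallback to the first healthy server are assumptions made for illustration, not part of the original question.

```python
def pick_server(qqnum, alive):
    """alive[i] is True when server i (0-based) is healthy. Returns a server index."""
    n = len(alive)
    # Try modulus n first, then n-1, n-2, ... until the result is a healthy server.
    for modulus in range(n, 0, -1):
        index = qqnum % modulus
        if alive[index]:
            return index
    # Assumption: if every attempt hit a dead server, use the first healthy one.
    for index, ok in enumerate(alive):
        if ok:
            return index
    raise RuntimeError("no healthy server available")


all_up = [True, True, True, True, True]
one_down = [True, True, False, True, True]
print(pick_server(12, all_up))    # 12 % 5 = 2
print(pick_server(12, one_down))  # server 2 is down, 12 % 4 = 0
print(pick_server(13, one_down))  # 13 % 5 = 3, unaffected by the failure
```

Users whose first modulo result lands on a healthy machine are not moved at all, which is where the stability of this method comes from.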
The remainder of this article focuses on the consistent hashing algorithm.
Application Scenarios
Many algorithms are available for server load balancing, including round robin, hashing, least connection, response time, and weighted methods. Hashing is one of the most commonly used.
The typical scenario: N servers provide a caching service, and a load balancer must distribute requests evenly so that each machine handles 1/N of the load.
The commonly used algorithm takes the hash result modulo N (hash() mod N): machines are numbered 0 to N-1, the hash() value of each request is taken modulo N to get a remainder i, and the request is dispatched to machine i. This algorithm has a fatal flaw, however. If a machine goes down, requests that should land on it cannot be handled, so the machine must be removed from the algorithm, and at that point about (N-1)/N of the cached data has to be recomputed. If a machine is added, about N/(N+1) of the cached data has to be recomputed. For most systems this is an unacceptable disruption, because it means massive cache invalidation or data migration. So how can a load balancing strategy be designed so that as few requests as possible are affected?
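The (N-1)/N and N/(N+1) figures can be checked with a few lines of Python. This is a minimal sketch in which plain integers stand in for hash() values; the helper name moved_fraction is made up for this illustration.

```python
def moved_fraction(keys, old_n, new_n):
    """Fraction of keys whose machine number changes when the machine count changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_n != k % new_n)
    return moved / len(keys)


# Integer keys stand in for hash() values here.
print(moved_fraction(range(90000), 10, 9))    # one of 10 machines removed: 0.9
print(moved_fraction(range(110000), 10, 11))  # one machine added: 10/11, about 0.909
```

With 10 machines, removing one relocates 9/10 of the keys and adding one relocates 10/11 of them, which matches the formulas above.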
Consistent hashing is used in memcached, key-value stores, the BitTorrent DHT, and LVS, which makes it a preferred algorithm for load balancing in distributed systems.
Consistent hashing Algorithm Description
The following describes the consistent hashing algorithm, using memcached as the example.
Consistent hashing was proposed in the 1997 paper "Consistent Hashing and Random Trees" and is widely used in cache systems.
1 Basic Scenarios
Suppose you have N cache servers (hereafter simply "caches"). How do you map an object onto the N caches? A common method is to compute the object's hash value and spread the objects evenly across the N caches:
hash(object) % N
Everything runs normally until one of the following two cases occurs:
- Cache server m goes down (which must be expected in practice), so all objects mapped to cache m become invalid. Cache m has to be removed, leaving N-1 caches, and the mapping formula becomes hash(object) % (N-1).
- Traffic grows and a cache has to be added, giving N+1 caches, and the mapping formula becomes hash(object) % (N+1).
What do these two cases mean? Suddenly almost all cached entries are invalid. For the backend servers this is a disaster: a flood of requests hits them directly. There is also a third issue: as hardware grows more powerful, you may want newly added nodes to take on more of the work, and the hash algorithm above clearly cannot do that.
Consistent hashing is a way to change this situation.
2 Hash algorithms and monotonicity
One measure of a hash algorithm is monotonicity, defined as follows:
Monotonicity means that if some content has already been assigned to buffers by hashing and a new buffer is then added to the system, the hash result should ensure that previously assigned content either stays where it is or moves to the new buffer, and is never remapped to a different buffer in the old buffer set.
It is easy to see that the simple algorithm above, hash(object) % N, cannot satisfy the monotonicity requirement.
3 Principle of the consistent hashing algorithm
Consistent hashing is a hash algorithm that, in short, changes the existing key mappings as little as possible when a cache is removed or added, satisfying the monotonicity requirement as far as possible.
The basic principles of the consistent hashing algorithm are explained below in 5 steps.
3.1 Ring Hash Space
A typical hash algorithm maps a value to a 32-bit key, that is, into the numeric space 0 to 2^32-1. We can picture this space as a ring whose head (0) joins its tail (2^32-1), as shown in Figure 1.
Figure 1 Ring Hash space
3.2 Mapping objects to the hash space
Next consider 4 objects, object1 to object4. The key values computed for them by the hash function are distributed on the ring as shown in Figure 2.
hash(object1) = key1;
... ...
hash(object4) = key4;
Figure 2 Key value distributions for 4 objects
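As a rough illustration of this step, the sketch below places the four objects on the 0 to 2^32-1 ring. The article does not say which hash function is used; taking the first 4 bytes of an MD5 digest is an assumption made here, and ring_hash is a made-up name.

```python
import hashlib

RING_SIZE = 2 ** 32  # ring positions run from 0 to 2^32 - 1


def ring_hash(name):
    """Map a string to a ring position using the first 4 bytes of its MD5 digest.

    MD5 is an assumption for this sketch; any well-distributed hash would do.
    """
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


for obj in ["object1", "object2", "object3", "object4"]:
    print(obj, ring_hash(obj))
```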
3.3 Mapping the cache to the hash space
The basic idea of consistent hashing is to map both the objects and the caches into the same hash value space, using the same hash algorithm.
Suppose there are currently 3 caches, A, B, and C. Their mapping results are shown in Figure 3, where they are arranged in the hash space by their hash values.
hash(cache A) = key A;
... ...
hash(cache C) = key C;
Figure 3 Key value distributions for cache and objects
Incidentally, for the cache hash calculation, a common approach is to use the cache machine's IP address or machine name as the hash input.
3.4 Mapping objects to the cache
Now that both caches and objects have been mapped into the hash value space with the same hash algorithm, the next question is how to map objects to caches.
In this ring space, start from the object's key and move clockwise until you meet a cache; the object is stored on that cache. Because the hash values of the object and the caches are fixed, this cache is unique and deterministic. This gives the mapping from objects to caches.
Continuing the example above (see Figure 3), this method stores object1 on cache A, object2 and object3 on cache C, and object4 on cache B.
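The clockwise search amounts to a binary search over the sorted cache positions. In the sketch below the ring positions are hand-picked small numbers (hypothetical, chosen only so that the result reproduces the example), and the function name locate is likewise made up.

```python
import bisect


def locate(key_pos, caches):
    """Return the first cache at or after key_pos going clockwise, wrapping around."""
    ring = sorted((pos, name) for name, pos in caches.items())
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_left(positions, key_pos)
    return ring[i % len(ring)][1]  # i == len(ring) wraps back to the first cache


# Hypothetical positions, chosen to reproduce the example.
caches = {"cache A": 20, "cache B": 40, "cache C": 80}
objects = {"object1": 10, "object2": 50, "object3": 70, "object4": 30}

for name, pos in objects.items():
    print(name, "->", locate(pos, caches))
# object1 -> cache A
# object2 -> cache C
# object3 -> cache C
# object4 -> cache B
```

A key located past the last cache (for example position 90) wraps around the ring and lands on cache A.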
3.5 Examining cache changes
As noted earlier, the biggest problem with the hash-then-modulo method is that it does not satisfy monotonicity: when the set of caches changes, cached entries become invalid, which puts a heavy load on the backend servers. Now let us analyze the consistent hashing algorithm.
3.5.1 Removing the cache
Suppose cache B goes down. Under the mapping method described above, the only objects affected are those that were mapped to cache B, that is, the objects lying between cache B and the preceding cache going counterclockwise. They are remapped to the next cache clockwise from cache B, which is cache C.
So here only object4 needs to change, being remapped to cache C; see Figure 4.
Figure 4 Cache Map after cache B has been removed
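The effect of removing cache B can be traced with a small sketch. The ring positions are hypothetical values picked to reproduce the example, and the helper names locate and assignment are made up for this illustration.

```python
import bisect


def locate(key_pos, caches):
    """First cache at or after key_pos going clockwise, wrapping around the ring."""
    ring = sorted((pos, name) for name, pos in caches.items())
    i = bisect.bisect_left([pos for pos, _ in ring], key_pos)
    return ring[i % len(ring)][1]


def assignment(objects, caches):
    """Map every object name to the cache that currently holds it."""
    return {name: locate(pos, caches) for name, pos in objects.items()}


# Hypothetical positions, chosen to reproduce the example.
caches = {"cache A": 20, "cache B": 40, "cache C": 80}
objects = {"object1": 10, "object2": 50, "object3": 70, "object4": 30}

before = assignment(objects, caches)
del caches["cache B"]  # cache B goes down
after = assignment(objects, caches)

changed = {o: (before[o], after[o]) for o in objects if before[o] != after[o]}
print(changed)  # {'object4': ('cache B', 'cache C')}
```

The other three objects keep their original cache, so only the entries that lived on cache B have to be refetched.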
3.5.2 Add Cache
Now consider adding a new cache D, and assume that in this ring hash space cache D is mapped between object2 and object3. The only objects affected are those lying between cache D and the preceding cache going counterclockwise (cache B). They are a portion of the objects previously mapped to cache C, and they are remapped to cache D.
So here only object2 needs to change, being remapped to cache D; see Figure 5.
Figure 5 Mapping relationships after adding cache D
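The same behaviour holds for any number of objects: when a cache is added, every object that moves goes to the new cache and nothing else is reshuffled. The sketch below checks this with real hash values rather than the positions in the figures. The MD5-based ring_hash, the helper owner, and the generated key names are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_hash(name):
    """Ring position from the first 4 bytes of MD5 (an assumption for this sketch)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")


def owner(key, nodes):
    """Cache responsible for key: the first node clockwise from the key's position."""
    ring = sorted((ring_hash(n), n) for n in nodes)
    i = bisect.bisect_left(ring, (ring_hash(key), ""))
    return ring[i % len(ring)][1]


keys = ["object%d" % i for i in range(2000)]
old_nodes = ["cache A", "cache B", "cache C"]
new_nodes = old_nodes + ["cache D"]

before = {k: owner(k, old_nodes) for k in keys}
after = {k: owner(k, new_nodes) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Monotonicity: every key that moved went to the new cache, all others stayed put.
assert all(after[k] == "cache D" for k in moved)
print(len(moved), "of", len(keys), "keys moved, all to cache D")
```

How many keys move depends on where cache D happens to land on the ring, so the count is printed rather than assumed.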
4 Virtual nodes
Another criterion for a hash algorithm is balance, defined as follows:
Balance means that the hash results are spread across all buffers as evenly as possible, so that all buffer space is used.
A hash algorithm cannot guarantee absolute balance. With few caches, objects cannot be mapped evenly onto them. In the example above, if only cache A and cache C are deployed, then of the 4 objects cache A stores only object1 while cache C stores object2, object3, and object4; the distribution is very uneven.
To address this, consistent hashing introduces the concept of the "virtual node", defined as follows:
A virtual node is a replica of an actual node in the hash space. One actual node corresponds to several virtual nodes; this number is called the "replica count". Virtual nodes are arranged in the hash space by their hash values.
Take the case of deploying only cache A and cache C, where Figure 4 already showed that the distribution across caches is uneven. Now introduce virtual nodes and set the replica count to 2, which gives 4 virtual nodes in total: cache A1 and cache A2 represent cache A, and cache C1 and cache C2 represent cache C. Assume a fairly ideal arrangement, as in Figure 6.
Figure 6 Mapping relationship after the introduction of "Virtual Node"
At this point, the mapping of objects to virtual nodes is:
object1->cache A2; object2->cache A1; object3->cache C1; object4->cache C2;
So object1 and object2 are mapped to cache A, and object3 and object4 to cache C; the balance is much improved.
With virtual nodes introduced, the mapping changes from {object -> node} to {object -> virtual node}. The lookup of the cache that holds an object is shown in Figure 7.
Figure 7 Looking up the cache that holds an object
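The two-step lookup (object to virtual node, then virtual node to actual cache) can be sketched as follows. The ring positions are hypothetical numbers picked so that the result reproduces the mapping listed above, and the names virtual_nodes, physical, and locate are made up for this illustration.

```python
import bisect

# Hypothetical positions, chosen to reproduce the mapping in the example.
virtual_nodes = {"cache A1": 60, "cache A2": 20, "cache C1": 80, "cache C2": 40}
physical = {"cache A1": "cache A", "cache A2": "cache A",
            "cache C1": "cache C", "cache C2": "cache C"}
objects = {"object1": 10, "object2": 50, "object3": 70, "object4": 30}

ring = sorted((pos, name) for name, pos in virtual_nodes.items())
positions = [pos for pos, _ in ring]


def locate(key_pos):
    """Two-step lookup: object -> virtual node -> actual cache."""
    i = bisect.bisect_left(positions, key_pos) % len(ring)
    vnode = ring[i][1]
    return vnode, physical[vnode]


for name, pos in objects.items():
    print(name, "->", locate(pos))
# object1 -> ('cache A2', 'cache A')
# object2 -> ('cache A1', 'cache A')
# object3 -> ('cache C1', 'cache C')
# object4 -> ('cache C2', 'cache C')
```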
The hash of a virtual node can be computed from the IP address of its node plus a numeric suffix. For example, assume the IP address of cache A is 202.168.14.241.
Before introducing virtual nodes, the hash value of cache A is computed as:
hash("202.168.14.241");
After introducing virtual nodes, the hash values of the virtual nodes cache A1 and cache A2 are computed as:
hash("202.168.14.241#1"); // cache A1
hash("202.168.14.241#2"); // cache A2
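Putting the pieces together, a ring with virtual nodes might look like the sketch below. It is an illustration under stated assumptions, not the memcached implementation: the class name HashRing, the MD5-based ring_hash, the replica count of 100, and the second address 202.168.14.242 are all made up for this example. Only the "address#number" labelling follows the text.

```python
import bisect
import hashlib


def ring_hash(name):
    """Ring position from the first 4 bytes of MD5 (an assumption for this sketch)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")


class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = []  # sorted list of (position, actual node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(1, self.replicas + 1):
            # Virtual node label: node address plus a numeric suffix,
            # e.g. "202.168.14.241#1".
            bisect.insort(self.ring, (ring_hash("%s#%d" % (node, i)), node))

    def remove(self, node):
        # Drop every virtual node that belongs to this actual node.
        self.ring = [entry for entry in self.ring if entry[1] != node]

    def get(self, key):
        if not self.ring:
            raise LookupError("ring is empty")
        i = bisect.bisect_left(self.ring, (ring_hash(key), ""))
        return self.ring[i % len(self.ring)][1]


# "202.168.14.242" is a hypothetical second cache address.
ring = HashRing(["202.168.14.241", "202.168.14.242"], replicas=100)
print(ring.get("object1"))
```

A larger replica count spreads each actual node over more points on the ring, which tends to even out the share of keys each node receives, at the cost of a larger sorted list to search.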
======================================================================================================
Ref:http://blog.csdn.net/v_july_v/article/details/6879101
Consistent hashing algorithm