This post was prompted by an interview question I was asked:
What happens to the data on the other machines when a memcached cluster adds or removes a machine?
I later learned that a consistent hashing algorithm solves this problem, so let's study it.
Statement and acknowledgements:
This article is reproduced from the article "Vernacular parsing: consistent hashing algorithm consistent hashing" in Zhu Dian Blogger's personal log.
One. Introduction
Before looking at the consistent hashing algorithm itself, it helps to understand a scenario in which it is used. Once the scenario is clear, the algorithm is much easier to follow and its advantages are easier to see. So let's first describe a classic distributed caching scenario.
1. Scenario description
Suppose we have three cache servers for caching images, numbered 0, 1, and 2, and 30,000 images that need to be cached. We want the images spread evenly across the three servers so that they share the caching load; in other words, each server should cache about 10,000 images. How should we do this?
Could we simply scatter the 30,000 images across the 3 servers with no rule at all? That would satisfy the requirement, but then every access to a cache entry would have to search all 3 cache servers and find the entry among 30,000 cached items. That is far too slow: by the time the entry is found, the delay may be unacceptable, and the cache has lost its point. A cache exists to improve speed, improve the user experience, and reduce pressure on the back-end servers, and none of that is achieved if every lookup has to traverse every cache server.
So what should we do? The original approach is to hash the key of the cache entry and take the hash result modulo the number of cache servers; the result of the modulo operation determines which server the entry is cached on. An example makes this clearer. Continuing with the scenario above, suppose we use the image name as the key for accessing the image, and that image names are unique. Then the following formula tells us which server a picture should be stored on, where N is the number of cache servers:
hash(picture name) % N
Because image names are unique, hashing the same image name always yields the same result. With 3 servers, taking the hash result modulo 3 gives a remainder of 0, 1, or 2, which matches our server numbers exactly. If the remainder is 0, the picture is cached on server 0; if it is 1, on server 1; and if it is 2, on server 2.
When we later access any picture, we apply the same operation to its name to find out which cache server should hold it, and we only need to look on that server. If the picture is not there, it has not been cached, and there is no need to search the other cache servers. In this way the 30,000 pictures are spread pseudo-randomly over the 3 cache servers, and on every access we can determine directly which server a picture should be on, which meets our needs. For now, we will call this approach the hash algorithm or the modulo algorithm.
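The modulo scheme described above can be sketched in a few lines of Python. This is only an illustration, not the original author's code: the helper names `stable_hash` and `pick_server`, the generated picture names, and the choice of MD5 as the hash function are all assumptions made for the example. Python's built-in `hash()` is avoided because it is randomized per process for strings.

```python
import hashlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic 32-bit hash taken from MD5 (an arbitrary choice for this sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def pick_server(picture_name: str, n_servers: int) -> int:
    """hash(picture name) % N: the number of the server that should hold the picture."""
    return stable_hash(picture_name) % n_servers


# Hypothetical picture names standing in for the 30,000 images.
names = [f"picture-{i}.jpg" for i in range(30_000)]
counts = Counter(pick_server(name, 3) for name in names)

# Each of the three servers should end up with roughly 10,000 pictures.
print(sorted(counts.items()))
```

Because the same name always hashes to the same value, a lookup only has to repeat the calculation to know which single server to ask.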
However, caching with this hash algorithm has a drawback. Suppose 3 cache servers are no longer enough. The obvious answer is to add more cache servers. Say we add one, so the number of cache servers goes from 3 to 4. If we now apply the same method to the same picture, the server number it yields will most likely differ from the one computed with 3 servers: the dividend (the hash) is unchanged, but the divisor has gone from 3 to 4, so the remainder usually changes. As a result, when the number of servers changes, the locations of most cached items change. In other words, most of the cache is effectively invalidated for a period of time, and when the application cannot get data from the cache it requests it from the back-end servers.
Similarly, suppose one of the 3 cache servers suddenly fails and has to be removed, so the number of cache servers goes from 3 to 2. The computed location of most pictures changes again, and the previously cached images lose their value. Because a large number of cache entries become unusable at the same time, the result is a cache avalanche: the front-end cache can no longer absorb its share of the load, the back-end servers come under heavy pressure, and the whole system may be overwhelmed. We should try to avoid this, but with the modulo approach it is unavoidable. The consistent hashing algorithm was created to solve this problem.
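To get a feel for how many cache locations change under the modulo scheme, the following sketch counts the keys whose server number differs before and after the server count changes. The function names, the picture names, and the MD5-based hash are assumptions for illustration. For a uniformly distributed hash, a key keeps its server when going from 3 to 4 servers only if its hash gives the same remainder for both divisors, which holds for about one key in four, so roughly three quarters of the keys move.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 32-bit hash taken from MD5 (an arbitrary choice for this sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose server number changes when N goes from old_n to new_n."""
    moved = sum(1 for k in keys if stable_hash(k) % old_n != stable_hash(k) % new_n)
    return moved / len(keys)


# Hypothetical picture names standing in for the cached images.
keys = [f"picture-{i}.jpg" for i in range(30_000)]

grow = moved_fraction(keys, 3, 4)    # one server added
shrink = moved_fraction(keys, 3, 2)  # one server removed

print(f"3 -> 4 servers: {grow:.1%} of keys change location")
print(f"3 -> 2 servers: {shrink:.1%} of keys change location")
```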
Let's review the problems with the algorithm above:
Issue 1: When the number of cache servers changes, a large number of cache entries become invalid at the same time, which can cause a cache avalanche and may put enough pressure on the overall system to crash it.
Issue 2: When the number of cache servers changes, almost all cache locations change. How can we minimize the number of affected cache entries?
These two issues are really the same problem. Can the consistent hashing algorithm solve it? Let's take a look.
Two. Basic concepts of the consistent hashing algorithm
The consistent hashing algorithm also uses a modulo operation, but whereas the method just described takes the hash modulo the number of servers, consistent hashing takes it modulo 2^32. What does that mean? Let's go through it step by step.
First, picture the 2^32 values as a circle, like a clock face. A clock face can be seen as a circle made up of 60 points; here we imagine a circle made up of 2^32 points.
The point at the top of the ring represents 0, the first point to the right of 0 represents 1, and so on: 2, 3, 4, 5, 6, ... up to 2^32-1. In other words, the first point to the left of 0 represents 2^32-1.
We call this circle of 2^32 points the hash ring.
So what does the consistent hashing algorithm have to do with this ring? Let's continue with the earlier scenario. Suppose we have 3 cache servers: server A, server B, and server C. In a production environment each server has its own IP address, so we hash each server's IP address and take the result modulo 2^32, as in the following formula:
hash(IP address of server A) % 2^32
The result of this formula is always an integer between 0 and 2^32-1, and we use that integer to represent server A. Since the integer lies between 0 and 2^32-1, there is exactly one point on the hash ring that corresponds to it, so server A is mapped to that point on the ring.
Server B and server C are mapped to the hash ring in the same way:
hash(IP address of server B) % 2^32
hash(IP address of server C) % 2^32
With these formulas, server B and server C each get a position on the hash ring as well.
For now, suppose the 3 servers end up evenly spaced around the hash ring (this is of course the ideal case; we will come back to it later).
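As a rough sketch, mapping the servers onto the ring could look like the following. The IP addresses are made up for the example, `ring_position` is a hypothetical helper name, and MD5 is again just one possible hash function; the article does not prescribe a specific one.

```python
import hashlib

RING_SIZE = 2 ** 32  # the ring has points 0 .. 2^32-1


def ring_position(key: str) -> int:
    """hash(key) % 2^32, using MD5 as an assumed hash function."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


# Hypothetical IP addresses for servers A, B, and C.
servers = {"A": "192.168.0.1", "B": "192.168.0.2", "C": "192.168.0.3"}

# Sort by position so the list reads clockwise around the ring, starting from 0.
ring = sorted((ring_position(ip), name) for name, ip in servers.items())

for position, name in ring:
    print(f"server {name} sits at point {position}")
```

With real hash values the three points are unlikely to be evenly spaced, which is exactly the non-ideal case the article returns to later.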
So far we have tied the cache servers to the hash ring by mapping each server to a point on it. Using the same method, we can also map the objects to be cached onto the ring.
Suppose we need to cache an image on the cache servers and still use the image name as the key for finding it. The following formula maps the image to the hash ring:
hash(picture name) % 2^32
After the mapping, the picture also occupies a point on the ring (shown as an orange circle in the original figure).
Now both the servers and the picture are mapped to the hash ring. Which server should the picture be cached on? In this example it is cached on server A. Why? Because server A is the first server encountered when moving clockwise from the picture's position on the ring.
This is how the consistent hashing algorithm decides which server an object is cached on: the cache servers and the cached objects are mapped to the hash ring, and, starting from the object's position, the first server encountered in the clockwise direction is the server that caches the object. Because the hash values of the object and of the servers are fixed, a given picture is always cached on the same server as long as the set of servers does not change. The next time we want to access the picture, we run the same calculation to find out which server it is cached on and go directly to that server.
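The clockwise lookup rule can be sketched with a sorted list of server positions and a binary search. This is one common way to implement it, offered as an assumption rather than as the article's own implementation; the `HashRing` class, the server names, and the MD5-based `ring_position` helper are all hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32


def ring_position(key: str) -> int:
    """hash(key) % 2^32, using MD5 as an assumed hash function."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class HashRing:
    def __init__(self, servers):
        # Keep the servers sorted by their position on the ring.
        self._points = sorted((ring_position(s), s) for s in servers)
        self._positions = [p for p, _ in self._points]

    def lookup(self, key: str) -> str:
        """Return the first server at or after the key's position, moving clockwise."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        if index == len(self._positions):
            index = 0  # walked past 2^32-1: wrap around to the first server on the ring
        return self._points[index][1]


ring = HashRing(["server-A", "server-B", "server-C"])
for name in ["picture-1.jpg", "picture-2.jpg", "picture-3.jpg", "picture-4.jpg"]:
    print(name, "->", ring.lookup(name))
```

The wrap-around branch is what turns the sorted list into a ring: a key that hashes past the last server belongs to the first one.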
The example above used only one picture. Now suppose there are four pictures to cache.
Pictures 1 and 2 are cached on server A, picture 3 is cached on server B, and picture 4 is cached on server C.
Three. Advantages of the consistent hashing algorithm
By now the principle of the consistent hashing algorithm should be clear, but can it solve the earlier problem? As we said, if we simply take the hash modulo the number of servers, a change in the number of servers produces a cache avalanche, which can bring the system down. Does consistent hashing avoid this? Let's simulate it and find out.
Suppose server B fails and has to be removed, so we take server B off the hash ring.
Before server B was removed, picture 3 was cached on server B. After the removal, by the rule described above, picture 3 should be cached on server C, because server C is now the first cache server encountered clockwise from picture 3's position. In other words, removing the failed server B changes the cache location of picture 3.
However, picture 4 is still cached on server C, and pictures 1 and 2 are still cached on server A, exactly as before server B was removed. This is the advantage of the consistent hashing algorithm. With the earlier modulo algorithm, a change in the number of servers invalidates most of the cache on every server at once. With consistent hashing, a change in the number of servers invalidates only part of the cache, so the front-end cache can still share the load of the whole system instead of all the pressure landing on the back-end servers at once.
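The removal of server B can be simulated with a small sketch. Everything here is assumed for illustration: the server and picture names, the `build_ring` and `lookup` helpers, and MD5 as the hash function. The point it demonstrates is that only keys previously held by the removed server change location.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32


def ring_position(key: str) -> int:
    """hash(key) % 2^32, using MD5 as an assumed hash function."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(servers):
    """Return (positions, owners), both sorted by position on the ring."""
    points = sorted((ring_position(s), s) for s in servers)
    return [p for p, _ in points], [s for _, s in points]


def lookup(ring, key: str) -> str:
    """First server clockwise from the key; the modulo wraps past the end to index 0."""
    positions, owners = ring
    index = bisect.bisect_left(positions, ring_position(key)) % len(positions)
    return owners[index]


servers = ["server-A", "server-B", "server-C"]
keys = [f"picture-{i}.jpg" for i in range(10_000)]

full_ring = build_ring(servers)
reduced_ring = build_ring([s for s in servers if s != "server-B"])

before = {k: lookup(full_ring, k) for k in keys}
after = {k: lookup(reduced_ring, k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

print(f"{len(moved)} of {len(keys)} keys changed servers")
print("every moved key was on server-B:", all(before[k] == "server-B" for k in moved))
```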
That is the advantage of the consistent hashing algorithm.
Four. Skew of the hash ring
When we introduced the concept of consistent hashing, we assumed the ideal case in which the 3 servers are mapped evenly around the hash ring.
Reality, however, often differs from the ideal.
In an actual mapping, the servers may end up unevenly spaced, for example bunched together on one part of the ring.
As you may have guessed, if the servers are mapped like this, the cached objects are likely to be concentrated on a single server.
In the example from the original figure, pictures 1, 2, 3, 4, and 6 are all cached on server A, only picture 5 is cached on server B, and server C caches no pictures at all. The three servers A, B, and C are not used evenly, and the cache distribution is extremely unbalanced. Moreover, if server A fails at this point, the number of invalidated cache entries is at its maximum, and in the extreme case this can still crash the system. This situation is called hash ring skew. How can we prevent it? The consistent hashing algorithm solves this problem with "virtual nodes", which we discuss next.
Five. Virtual nodes
Because we have only 3 servers, mapping them to the hash ring may produce hash ring skew, and when the ring is skewed the cache tends to be distributed very unevenly across the servers. As you may have guessed, to distribute the cache evenly over the 3 servers, the servers should be spread as evenly as possible around the hash ring. But we have only 3 real servers; how can we conjure up more? The answer is to create more server nodes virtually. Since there are no spare physical servers, we replicate the existing physical nodes virtually, and the nodes replicated from an actual node are called "virtual nodes".
A "virtual node" is a copy of an "actual node" (an actual physical server) on the hash ring, and one actual node can correspond to multiple virtual nodes.
In the original example, servers A, B, and C each get one virtual node; you can create more virtual nodes if needed. With virtual nodes the cache is distributed much more evenly: pictures 1 and 3 are cached on server A, pictures 5 and 4 on server B, and pictures 6 and 2 on server C. If that is still not balanced enough, create more virtual nodes to reduce the effect of hash ring skew: the more virtual nodes there are on the hash ring, the more likely it is that the cache is evenly distributed.
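The effect of virtual nodes can be explored with a sketch like the one below, which places several points per server on the ring and counts how many keys each real server receives. The naming scheme `server#i` for virtual nodes, the replica counts (1 and 200), the helper names, and the MD5 hash are all assumptions chosen for the example, not values taken from the article.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2 ** 32


def ring_position(key: str) -> int:
    """hash(key) % 2^32, using MD5 as an assumed hash function."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(servers, replicas: int):
    """Place `replicas` points per real server on the ring; return (positions, owners)."""
    points = sorted(
        (ring_position(f"{server}#{i}"), server)  # "server#i" is a hypothetical naming scheme
        for server in servers
        for i in range(replicas)
    )
    return [p for p, _ in points], [s for _, s in points]


def lookup(ring, key: str) -> str:
    """Real server that owns the first point clockwise from the key."""
    positions, owners = ring
    index = bisect.bisect_left(positions, ring_position(key)) % len(positions)
    return owners[index]


servers = ["server-A", "server-B", "server-C"]
keys = [f"picture-{i}.jpg" for i in range(10_000)]

for replicas in (1, 200):
    ring = build_ring(servers, replicas)
    counts = Counter(lookup(ring, key) for key in keys)
    print(f"{replicas:>3} point(s) per server:", {s: counts[s] for s in servers})
```

With a single point per server the split depends entirely on where the three hashes happen to land; with many points per server each server's share tends toward one third of the keys.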