Redis Distributed Algorithm: Consistent Hashing


The traditional distributed algorithm

Before diving into Redis's distributed algorithm, it helps to first understand the caching scenario it was designed for. With that scenario in mind, the consistent hashing algorithm is much easier to follow, and its advantages become clear. So let's start by describing a classic distributed-cache scenario.

Scenario Description:

Suppose we have three cache servers for caching images, numbered 0, 1, and 2, and 30,000 images that need to be cached. We want the images spread evenly across the 3 servers so that they share the load, i.e. each server caches roughly 10,000 images. How should we do this? If we simply scatter the 30,000 images across the 3 servers with no rule at all, that does satisfy the even-distribution requirement. But then, whenever we need to access a cached image, we would have to search all 3 servers, scanning up to 30,000 entries to find the one we want. That process is far too slow, and by the time the entry is found the response may no longer be useful, which defeats the point of a cache. The purpose of a cache is to improve speed, improve user experience, and reduce load on the back-end servers; if every lookup has to traverse every cache server, we gain nothing.

So what should we do? The classic answer is to hash the cache entry's key and take the result modulo the number of cache servers; the remainder determines which server stores which entry. To make this concrete, stay with the scenario above and suppose we use the image name as the key for accessing the image, assuming image names are unique. Then we can compute which server an image should be stored on with the following formula:

hash(picture name) % N

Because image names are unique, hashing the same name always produces the same result. With 3 servers, taking the hash modulo 3 yields a remainder of 0, 1, or 2, which matches our server numbering exactly. If the remainder is 0, the image is cached on server 0; if the remainder is 1, on server 1; if it is 2, likewise on server 2. Then, whenever we want to access an image, we run the same computation on its name and immediately know which cache server it should be on. We only need to query that one server, and if the image is not there, it simply is not cached; there is no need to traverse the other servers. This spreads the 30,000 images roughly evenly across the 3 cache servers while letting every lookup go straight to the right server, which satisfies our requirements. We will call this the hash-modulo (or simply modulo) algorithm.
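As a minimal sketch, the modulo scheme above might look like this in Python (the function name `picture_server` and the choice of MD5 are illustrative assumptions, not part of the original):

```python
import hashlib

SERVER_COUNT = 3  # cache servers numbered 0, 1, 2

def picture_server(picture_name: str) -> int:
    """Return the number of the server that should cache this picture."""
    # MD5 gives a stable digest; Python's built-in hash() is salted per
    # process, so the same name could map to different servers across runs.
    digest = hashlib.md5(picture_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % SERVER_COUNT

print(picture_server("cat.jpg"))  # same name, same server, every time
```

Any stable hash function would do here; the essential property is that the same name always yields the same remainder.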

However, caching with this modulo algorithm has a serious flaw. Imagine that 3 cache servers are no longer enough to meet our caching needs. What then? Simple: add another cache server or two. Suppose we add one, growing from 3 servers to 4. If we now compute the location of the same image with the same method, the divisor has changed from 3 to 4 while the dividend (the hash) is unchanged, so the remainder will almost certainly be different, and the image's server number will differ from before. The result is that when the number of servers changes, the location of nearly every cached entry changes; in other words, effectively the entire cache is invalidated at once. While the cache is cold, applications cannot fetch data from it and every request falls through to the back-end servers. The same happens in reverse: if one of the 3 cache servers suddenly fails and can no longer serve the cache, we must remove the faulty machine, dropping the server count from 3 to 2. The cache location of every image inevitably changes, and the previously cached images lose their role and meaning. A large number of cache entries failing at the same time causes a cache avalanche: the front-end cache can no longer absorb its share of the load, the back-end servers come under enormous pressure, and the whole system may well be crushed. We want to avoid this situation, but with the modulo approach it is unavoidable by construction. The consistent hashing algorithm was born to solve exactly these problems.
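To see how badly the modulo scheme is hit by a change in server count, we can recompute every image's location after adding a fourth server (a sketch; the MD5-based hash and the key names are illustrative choices):

```python
import hashlib

def server_for(key: str, server_count: int) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % server_count

# Recompute the location of every image after growing from 3 to 4 servers.
keys = [f"picture-{i}.jpg" for i in range(30000)]
moved = sum(1 for k in keys if server_for(k, 3) != server_for(k, 4))
print(f"{moved / len(keys):.0%} of cache entries changed location")
```

Running this shows the large majority of entries change location, which is exactly the avalanche scenario described above.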

Let's review the problems that occur with the above algorithm:

    • Issue 1: When the number of cache servers changes, a cache avalanche can occur (a large number of cache entries expire at the same time), which may overload and crash the overall system.
    • Issue 2: When the number of cache servers changes, almost all cache locations change. How can the number of affected entries be minimized?

In fact, these two problems are really one problem. So, can the consistent hashing algorithm solve it? Let's take a look.

Redis Distributed Algorithms

Redis uses the consistent hashing algorithm:

    • "Consistent hashing" and "consistent hash" refer to the same algorithm.
    • The consistent hashing algorithm was proposed as early as 1997, in the paper "Consistent Hashing and Random Trees".

This algorithm is built on the concept of a ring hash space, so let's look at that first:

    • A hash function typically maps values into a 32-bit key space, i.e. the range 0 to 2^32 - 1. If we join the head and the tail of this range, the numbers form a circle; this circle is the ring hash space.

Let's take a look at how to map objects to the ring hash space:

    • For simplicity, consider only 4 objects, object1 through object4.
    • First, compute each object's hash value (its key) with the hash function. These hash values fall within the range of the ring hash space, and each object is mapped to the position on the ring that corresponds to its key. This is how objects are placed onto the ring hash space.
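As a sketch, mapping names to points on the 0 to 2^32 - 1 ring might look like this (MD5 and the truncation to 32 bits are illustrative choices, not from the original):

```python
import hashlib

RING_SIZE = 2 ** 32  # ring positions run from 0 to 2^32 - 1

def ring_position(name: str) -> int:
    """Map a name to a 32-bit position on the ring hash space."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # keep 32 bits of the digest

for obj in ["object1", "object2", "object3", "object4"]:
    print(obj, ring_position(obj))
```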

Next, the caches are mapped onto the ring hash space as well; here the caches are our Redis servers:

    • The basic idea is to map both the objects and the caches onto the same ring hash space, using the same hash algorithm for both. In pseudocode:

hash(cache A) = key A;
...
hash(cache C) = key C;

After each cache's hash value is computed, the cache is mapped to the corresponding position on the ring, just like the objects.

As you can see, both the caches and the objects are mapped onto the same ring hash space, so the next thing to consider is how to map an object to a cache. The rule is to walk clockwise around the ring from the object's position: the first cache encountered is the one the object maps to. For example, the first cache that key1 meets clockwise is cache A, so key1 maps to cache A; the first cache that key2 meets clockwise is cache C, so key2 maps to cache C; and so on.
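The clockwise walk can be sketched with a sorted list and a binary search (the names `build_ring` and `lookup` are mine, not a real Redis API):

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def build_ring(caches):
    """The ring, flattened: a sorted list of (position, cache) pairs."""
    return sorted((ring_position(c), c) for c in caches)

def lookup(ring, obj: str) -> str:
    """Walk clockwise from the object's position to the first cache."""
    positions = [p for p, _ in ring]
    i = bisect.bisect_right(positions, ring_position(obj))
    return ring[i % len(ring)][1]  # the modulo wraps past the top of the ring

ring = build_ring(["cache A", "cache B", "cache C"])
print(lookup(ring, "key1"), lookup(ring, "key2"))
```

The wrap-around in `lookup` is what makes the space a ring: an object placed past the last cache position maps to the first cache.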

If one of the caches is removed, each affected object simply continues clockwise to the next cache. For example, if cache B is removed, the objects that were mapped to cache B (such as object4) walk clockwise until they reach cache C and are mapped there instead.
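We can check that removing a node only remaps the objects that lived on it (a self-contained sketch; all names are illustrative):

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def lookup(caches, obj: str) -> str:
    """Clockwise-first cache for obj on a ring built from `caches`."""
    ring = sorted((ring_position(c), c) for c in caches)
    positions = [p for p, _ in ring]
    return ring[bisect.bisect_right(positions, ring_position(obj)) % len(ring)][1]

objects = [f"object-{i}" for i in range(10000)]
before = {o: lookup(["cache A", "cache B", "cache C"], o) for o in objects}
after = {o: lookup(["cache A", "cache C"], o) for o in objects}  # cache B removed

moved = [o for o in objects if before[o] != after[o]]
# Only objects that lived on cache B have to move; everything else stays put.
print(f"{len(moved) / len(objects):.0%} of objects were remapped")
```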

So when cache B is removed, the only objects affected are those in the arc between cache A and cache B (the objects that were mapped to cache B), which is a relatively small range.

The same holds when a cache node is added. For example, suppose a new node, cache D, is inserted between cache B and cache C. The first cache that object2 now encounters clockwise is cache D, so object2 is remapped to cache D.

Likewise, the only objects affected by adding the node are those in the arc between cache B and cache D, again a relatively small range.

What we have described so far is the ideal case, in which the cache nodes happen to be evenly distributed around the ring hash space. But as the saying goes, the ideal is full while reality is skeletal: the real situation rarely lines up so neatly.

In practice, therefore, the cache nodes may well not be evenly spaced around the ring hash space.

Nodes A, B, and C can end up squeezed together on one part of the ring. Computing clockwise, a large amount of data (objects) is then mapped to node A, possibly more than half of all of it, so node A comes under heavy pressure while nodes B and C are barely used and sit almost idle. This is the result of hash skew: there is no guarantee that nodes will be distributed absolutely evenly around the ring hash space.

To solve the hash-skew problem, the scheme introduces the concept of virtual nodes. A virtual node is a shadow, or clone, of an actual node, and there are usually many more virtual nodes than actual nodes. With virtual nodes, an object is no longer mapped directly to an actual cache node: it is first mapped to a virtual node, and the virtual node in turn resolves to its actual cache node. Virtual nodes thus amplify our actual nodes around the ring.
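A sketch of a ring with virtual nodes (the `#vn{i}` suffix scheme and the replica count of 100 are illustrative assumptions):

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def build_ring(nodes, replicas=100):
    """Place `replicas` virtual nodes per actual node on the ring."""
    ring = []
    for node in nodes:
        for i in range(replicas):
            # Each virtual node hashes to its own position on the ring
            # but resolves back to its actual node.
            ring.append((ring_position(f"{node}#vn{i}"), node))
    ring.sort()
    return ring

ring = build_ring(["A", "B", "C"])
positions = [p for p, _ in ring]

def lookup(obj: str) -> str:
    i = bisect.bisect_right(positions, ring_position(obj)) % len(ring)
    return ring[i][1]

counts = {"A": 0, "B": 0, "C": 0}
for i in range(10000):
    counts[lookup(f"picture-{i}.jpg")] += 1
print(counts)  # with 100 virtual nodes per actual node, the load evens out
```

With more replicas per node, each actual node's total share of the ring concentrates ever closer to an equal split.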

Represented on the ring hash space, the light-colored points are the virtual nodes and the dark-colored points are the actual nodes:

As can be seen, this distribution is much more uniform. This is only a demonstration; in practice there are more actual nodes, and the virtual nodes are kept at a certain ratio to the actual nodes. As the number of nodes on the ring grows, the distribution becomes more and more uniform, and removing or adding a cache affects fewer and fewer entries.

When new nodes join a consistent-hash ring, only a fraction of the keys are remapped. The fraction of cache entries whose location changes is:

(1 - n / (n + m)) * 100%

n = number of existing nodes
m = number of newly added nodes

Equivalently, the hit rate after the change is n / (n + m) * 100%: the larger the existing cluster, the less disruption each added node causes.
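Reading the formula as the proportion of cache entries whose location changes, a quick sanity check (a sketch; the function name is mine):

```python
def affected_fraction(n: int, m: int) -> float:
    """Fraction of keys remapped when m nodes join an n-node ring."""
    return 1 - n / (n + m)

# Growing from 3 nodes to 4: only about a quarter of the keys move,
# versus roughly three quarters under the modulo scheme.
print(f"{affected_fraction(3, 1):.0%}")  # → 25%
```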
