Hashing and consistent hashing

Source: Internet
Author: User

Hash: A hash function establishes a correspondence between keys and storage locations. During a lookup, keys are not compared one by one; the location of a key is computed directly. This is a way of trading space for time. Because the mapped address space is limited and the hash function is fixed, different keys can map to the same address. This is a conflict (also called a collision), and a method for handling conflicts is needed. In general, conflicts can only be minimized, not completely avoided. So what is a good hash? In layman's terms, a good hash distributes key addresses evenly and produces few conflicts.
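As a rough illustration of the idea above, the sketch below stores keys in a fixed number of slots and handles conflicts by chaining. Chaining is only one common method and is not necessarily the one the author has in mind. The class name ChainedHashTable and its methods are made up for this example, and Python's built-in hash() stands in for the hash function.

```python
class ChainedHashTable:
    """Toy hash table that resolves conflicts by separate chaining."""

    def __init__(self, size=8):
        # 'size' is the limited address space; each slot holds a list of pairs.
        self.slots = [[] for _ in range(size)]

    def _address(self, key):
        # Hash function: map the key directly to a slot index.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # new key, possibly sharing the slot

    def get(self, key, default=None):
        # Only the keys in one slot are compared, not every stored key.
        for k, v in self.slots[self._address(key)]:
            if k == key:
                return v
        return default


table = ChainedHashTable(size=4)
for n in range(10):  # 10 keys into 4 slots: conflicts are unavoidable
    table.put(n, n * n)

print(table.get(7))                    # 49
print([len(b) for b in table.slots])   # [3, 3, 2, 2] in CPython, where hash(n) == n for small ints
```

With 10 keys and 4 slots, at least one slot must hold several keys, which is why some conflict-handling method is always required.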

A good hash algorithm should have the following properties:

(1) Balance
Balance means that the hash addresses of keys are distributed evenly across the address space, so that the address space is fully used. This is a basic requirement of hash design.

(2) Monotonicity
Monotonicity means that when the address space grows, a key either keeps its original hash address or is mapped into the newly added address space; it is not remapped to a different address within the original space. Likewise, when the address space shrinks, keys are only remapped to addresses that are still valid. A simple hash function often does not meet this requirement. Take the commonly used division-remainder method, x = a mod P, where P is the size of the address space. It is not hard to see that when the address space changes (from P1 to P2), most of the original hash results change, so this method does not satisfy monotonicity.
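To make the x = a mod P example concrete, here is a small sketch that counts how many integer keys change address when P grows from 4 to 5. The function name moved_fraction and the key range are chosen only for illustration.

```python
def moved_fraction(keys, p1, p2):
    """Fraction of keys whose address (a mod P) changes when P goes from p1 to p2."""
    moved = sum(1 for a in keys if a % p1 != a % p2)
    return moved / len(keys)


keys = range(1200)
print(moved_fraction(keys, 4, 5))  # 0.8

# Keys that moved but landed on one of the ORIGINAL addresses 0..3.
# A monotonic hash would have none of these.
to_old = sum(1 for a in keys if a % 4 != a % 5 and a % 5 < 4)
print(to_old)  # 720
```

Only the keys that land on the new address 4 had to move; the other 720 moved keys were shuffled between the original addresses, which is exactly what monotonicity forbids.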

(3) Spread
Hashing is often used in distributed environments, where terminals store content in different buffers chosen by a hash function. A terminal may not be able to see all the buffers, only some of them. When terminals map content to buffers by hashing, the set of buffers seen by different terminals may differ, so the hash results can be inconsistent: the same content is mapped to different buffers by different terminals. This should be avoided because storing the same content in several buffers reduces the storage efficiency of the system. Spread is defined as the severity of this situation. A good hash algorithm should avoid such inconsistency as far as possible, that is, minimize spread.

(4) Load
The load problem looks at spread from another angle. Since different terminals may map the same content to different buffers, a particular buffer may also have different content mapped to it by different terminals. Like spread, this should be avoided, so a good hash algorithm should keep the load on each buffer as low as possible.
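Spread and load can be measured with a small simulation. The sketch below is only an illustration under stated assumptions: the three terminal views, the buffer names b0 to b3, the helper names h and pick_buffer, and the use of MD5 are all invented here, and each terminal uses plain hash modulo over the buffers it can see.

```python
import hashlib


def h(text: str) -> int:
    """Stable integer hash of a string (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


def pick_buffer(content: str, visible: list) -> str:
    """Hash modulo over the buffers that this terminal can see."""
    ordered = sorted(visible)
    return ordered[h(content) % len(ordered)]


# Hypothetical views: each terminal sees only part of the buffer set.
views = [
    ["b0", "b1", "b2", "b3"],
    ["b0", "b1", "b2"],
    ["b1", "b2", "b3"],
]
contents = [f"item-{i}" for i in range(1000)]

# Spread of a content item: number of distinct buffers it is mapped to.
spread = {c: len({pick_buffer(c, v) for v in views}) for c in contents}

# Load of a buffer: number of distinct items mapped to it by any terminal.
load = {}
for c in contents:
    for v in views:
        load.setdefault(pick_buffer(c, v), set()).add(c)

print("average spread:", sum(spread.values()) / len(contents))
print("load per buffer:", {b: len(items) for b, items in sorted(load.items())})
```

An average spread of 1 would mean that every terminal agrees on where each item lives; the further above 1 it is, the more duplicated storage and extra per-buffer load the scheme causes.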

A common way to meet the preceding conditions is consistent hashing, which is often used for caching in web applications. In large web applications, a cache is a standard part of the setup today, and large-scale caching has led to distributed cache systems. You may already know the basic principles of a distributed cache system. How can key-value pairs be distributed evenly across the cluster? The most common method is hash modulo: if the cluster has N available machines, a request for the key K is routed to the machine numbered hash(K) mod N. This scheme is simple and practical, but in rapidly growing web systems it has a defect. As access pressure increases, the cache system has to raise the throughput and data capacity of the cluster by adding machine nodes. Under the hash modulo method, adding a machine node changes N, so a large number of requests suddenly miss the cache. The cached data has to be rebuilt, or even migrated wholesale, which puts a very high load on the DB in an instant and may even bring the DB server down. What should we do?

  Use consistent hashing. The choice of machine node then depends not only on the hash of the key of the data to be cached, but also on the hash of the machine node itself.

(1) Hash machine nodes
First, compute the hash value of each machine node. (How? The IP address can be used as the hash input, and there are other options.) Then place the nodes on a ring that covers the values 0 to 2^32 - 1, arranged clockwise, as shown in Figure-1.


 

Figure-1

There are four machines in the cluster (represented by blue circles). A hash algorithm distributes the machines around the ring shown in Figure-1.
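Placing the machines on the ring might look like the following sketch. The IP addresses are hypothetical, and taking the first 4 bytes of an MD5 digest is just one way to get a value in the range 0 to 2^32 - 1; the article does not say which hash function is used.

```python
import hashlib

RING_SIZE = 2 ** 32


def ring_position(name: str) -> int:
    """Map a string to a point on the ring, in the range 0 .. 2**32 - 1."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 4 bytes give a 32-bit value


# Hypothetical IP addresses for the four machines.
node_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

# Sorting by position gives the clockwise order of the nodes on the ring.
ring = sorted((ring_position(ip), ip) for ip in node_ips)
for position, ip in ring:
    print(f"{position:>10}  {ip}")
```

The positions depend only on the node identifiers, so every client that knows the same node list computes the same ring.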

(2) Access Method

Suppose a request writes data to the cache with the key K. Compute the hash value hash(K), which corresponds to a point on the ring in Figure-1. If that point is not the position of a machine node, search clockwise until the first machine node is found; that node is the target node. If no node is found before passing 2^32, wrap around and use the first machine node on the ring.
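The clockwise search can be sketched with a sorted list and a binary search. This is one possible implementation, not necessarily how any particular cache client does it; the HashRing class, the MD5-based ring_position helper, and the IP addresses are assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a point in 0 .. 2**32 - 1 (MD5 is an assumption)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, nodes):
        # Sort nodes by position so a clockwise walk becomes a forward scan.
        self.points = sorted((ring_position(n), n) for n in nodes)
        self.positions = [p for p, _ in self.points]

    def node_at(self, position: int) -> str:
        """First node at or clockwise after 'position', wrapping past the end."""
        idx = bisect.bisect_left(self.positions, position)
        if idx == len(self.points):
            idx = 0  # passed the last node: wrap to the first node on the ring
        return self.points[idx][1]

    def node_for(self, key: str) -> str:
        """Target node for a cache key."""
        return self.node_at(ring_position(key))


ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
print(ring.node_for("user:42"))
```

bisect_left returns the index of the first node whose position is greater than or equal to the key's position, so a key that lands exactly on a node is served by that node, and everything else goes to the next node clockwise.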

(3) Add nodes

After a machine node is added (as shown in Figure-2), the access policy is unchanged: requests are still routed as described in (2). Only the keys that fall between the new node and the first node counterclockwise from it are affected; they now map to the new node instead of the next node clockwise. All other keys are still served by the same nodes as before.

Figure-2
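The effect of adding a node can be checked with a short experiment. The sketch below assumes the same MD5-based ring as a stand-in for the unspecified hash function, uses hypothetical IP addresses and key names, and compares where 10,000 keys are routed before and after a fifth node joins.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a point in 0 .. 2**32 - 1 (MD5 is an assumption)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ring(nodes):
    """Return (positions, names), both sorted by ring position."""
    points = sorted((ring_position(n), n) for n in nodes)
    return [p for p, _ in points], [n for _, n in points]


def lookup(ring, key: str) -> str:
    positions, names = ring
    # First node clockwise from the key; the modulo wraps past the last node.
    idx = bisect.bisect_left(positions, ring_position(key)) % len(names)
    return names[idx]


old_nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # hypothetical
new_node = "10.0.0.5"                                          # hypothetical
keys = [f"key-{i}" for i in range(10000)]

old_ring = build_ring(old_nodes)
new_ring = build_ring(old_nodes + [new_node])
before = {k: lookup(old_ring, k) for k in keys}
after = {k: lookup(new_ring, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print("fraction of keys moved:", len(moved) / len(keys))
print("all moved keys went to the new node:",
      all(after[k] == new_node for k in moved))
```

Every key that changes node moves to the new node, and all of them come from the single node that sits clockwise of it; keys on the other nodes are untouched, in contrast to the hash modulo method.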

