Distributed Cache Technology Memcached Learning Series (IV): The Principle of the Consistent Hash Algorithm


Contents

    • Introduction to the distributed consistent hash algorithm
    • Background of the distributed consistent hash algorithm
    • Ring hash space
    • Mapping keys to the ring hash space
    • Mapping server nodes to the hash space
    • Mapping keys to server nodes
    • Adding a server node
    • Removing a server node
    • Introduction of virtual nodes
    • Data redistribution when nodes change
    • Comparison between the consistent hash algorithm and the modulo algorithm
    • Reference documents
Introduction to the distributed consistent hash algorithm

When you first see the term "distributed consistent hash algorithm", you may ask: what does distributed mean, what does consistent mean, and what is a hash? Before analyzing the principle of the algorithm, let's look at these three concepts.

Distributed

Distributed means that the different service modules of a system are deployed on multiple different servers, which cooperate through remote calls and together provide services to the outside.

Suppose an existing system consists of the service modules ModelA, ModelB, ModelC and so on. Let's look at how it would be deployed in a centralized (clustered) manner and in a distributed manner.

Figure Centralized deployment

Figure Distributed Deployment

From the two deployment diagrams above we can see that centralized deployment puts all the service modules of a system on each of several servers, which form a cluster and provide services to the outside through a load balancing device. Centralized deployment is like a pantry with several identical water dispensers: the same service is deployed redundantly. Distributed deployment splits the system into different service modules and then deploys those modules on different servers.

Distributed deployment is not limited to distributed services; there are also distributed data storage, distributed static resources, distributed computing and so on. At this point you may recall that memcached is itself a distributed cache system. That's right: memcached is distributed data storage, and the "distributed" in "distributed consistent hash algorithm" refers to the cached data being distributed across servers.

Consistency

Once you understand distribution, consistency is easy to understand. Data that is stored in a distributed way must also be retrieved in a distributed way. A consistent hash ensures that, in a distributed environment, the mapping between keys and nodes is affected as little as possible when nodes are added or removed. Wikipedia defines consistent hashing as follows:

Consistent hashing is a special kind of hashing such that when a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.

In other words, consistent hashing is a special hashing scheme: when the hash table is resized, on average only a fraction (K/n) of the keys need to be remapped to new slots, rather than almost all of them as in a traditional hash table.

Hash

A hash (also called a hash function or digest algorithm) is an algorithm that compresses a message (data) of any length into a fixed-length message digest. Common hash algorithms include MD5 and SHA. A good hash algorithm has several important characteristics: irreversibility (the original message cannot feasibly be recovered from the hash value), collision resistance (given a message M1, it is infeasible to find another message M2 such that hash(M1) = hash(M2)) and uniform distribution (hash values are spread evenly over the output range). Memcached relies on hash mapping to store and retrieve data: the same key always hashes to the same value, and different keys are very unlikely to collide, so each key in the cache can be located reliably.
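
To make the "fixed-length digest" property concrete, here is a minimal Python sketch using the standard hashlib module. The two sample messages are arbitrary illustrations and have nothing to do with memcached's own internals.

```python
import hashlib

# Two messages of very different lengths (arbitrary sample data).
short_msg = b"key1"
long_msg = b"x" * 10_000

for name in ("md5", "sha1"):
    short_digest = hashlib.new(name, short_msg).hexdigest()
    long_digest = hashlib.new(name, long_msg).hexdigest()
    # The digest length depends only on the algorithm, not on the input length.
    print(name, len(short_digest), len(long_digest))

# Hashing the same input twice gives the same digest (the function is deterministic).
same_again = hashlib.md5(short_msg).hexdigest() == hashlib.md5(short_msg).hexdigest()
```

An MD5 digest is always 128 bits (32 hexadecimal characters) and a SHA-1 digest is always 160 bits (40 hexadecimal characters), whatever the size of the input.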

With these three terms introduced, why does memcached use a distributed consistent hash algorithm? Read on.

Background of the distributed consistent hash algorithm

We already know that memcached's distribution is implemented mainly by the client's distribution algorithm. The memcached client acts like a router in a network: using a specific algorithm, it spreads data across the memcached servers when storing, and locates the right server when retrieving. In practice, the two common algorithms for this are the modulo algorithm and the consistent hash algorithm.

The principle of the modulo algorithm is:

hash(key) % N

Here key is the key of the data and N is the number of memcached servers. The result of the modulo operation identifies the memcached server that the client should use. The weakness of the modulo algorithm is obvious: the result depends heavily on N. When the number of servers N increases or decreases, the locations of almost all existing cached data become invalid. A failed cache lookup means querying the database again, which can be fatal for a highly concurrent system. The consistent hash algorithm was proposed for this reason; its goal is to minimize the effect on the location of existing cached data when a memcached server is added or removed.
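
The sensitivity to N can be checked with a small experiment. The sketch below is only an illustration: the MD5-based 32-bit hash32 helper and the synthetic key names are assumptions chosen for the example, not what any particular memcached client uses.

```python
import hashlib

def hash32(key: str) -> int:
    """Map a string to a 32-bit unsigned integer (MD5-based; an illustrative choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")

def pick_server_mod(key: str, n_servers: int) -> int:
    """Modulo algorithm: hash(key) % N is the index of the server to use."""
    return hash32(key) % n_servers

def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose server index changes when N goes from old_n to new_n."""
    moved = sum(1 for k in keys if pick_server_mod(k, old_n) != pick_server_mod(k, new_n))
    return moved / len(keys)

keys = [f"key{i}" for i in range(10_000)]  # synthetic keys
fraction = moved_fraction(keys, 3, 4)
print(f"{fraction:.0%} of keys map to a different server after growing from 3 to 4 servers")
```

With a uniform hash, a key keeps its server index only when hash % 3 equals hash % 4, which holds for 3 of every 12 consecutive hash values, so roughly three quarters of the keys should be relocated by adding a single server.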

With the introduction and background covered, the term "distributed consistent hash algorithm" should no longer be unfamiliar. The following sections explain how the algorithm works.

Ring hash space

Typically, hashing the key of a piece of cached data yields a 32-bit value, that is, a value in the range 0 to 2^32-1. We can picture this range as a ring whose head and tail are joined (2^32-1 is followed by 0), and we call it the ring hash space, as shown below:

Figure Ring Hash Space

Mapping keys to the ring hash space

With the ring hash space in place, each cache key is hashed and the resulting value determines its position on the ring. Suppose there are four keys, key1, key2, key3 and key4; after hashing, they are mapped to the ring hash space as shown:

Figure key mapped to ring hash space
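
As a rough sketch of this step, the snippet below hashes four keys to positions in the range 0 to 2^32-1. The ring_position helper and its use of the first four bytes of an MD5 digest are assumptions made for illustration; a real client may use a different hash function.

```python
import hashlib

RING_SIZE = 2 ** 32  # positions run from 0 to 2**32 - 1

def ring_position(name: str) -> int:
    """Hash a name to a point on the ring (first 4 bytes of MD5; an assumption)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # 4 bytes, so always below RING_SIZE

positions = {key: ring_position(key) for key in ["key1", "key2", "key3", "key4"]}

# List the keys in clockwise order, starting from position 0.
for key, pos in sorted(positions.items(), key=lambda item: item[1]):
    print(f"{key}: {pos} ({pos / RING_SIZE:.1%} of the way around the ring)")
```

Because the position depends only on the key, every client that uses the same hash function places a given key at the same point on the ring.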

Mapping server nodes to the hash space

In the same way, each memcached server node can be hashed and mapped onto the ring hash space. Suppose there are three servers, server1, server2 and server3 (each identified by some unique information, such as its IP address); after hashing, they are mapped to the ring hash space as shown:

Figure Server nodes mapped to ring hash space

Mapping keys to server nodes

Now that both cache keys and server nodes have been mapped onto the ring hash space with the same hash function, we can establish the mapping between cache keys and server nodes. Starting from a cache key's position, move clockwise around the ring until a server node is encountered; the key's data is stored on that server node.

Figure key maps to the server node

With the mapping between keys, server nodes and the hash space understood, we now know how cached data is distributed across the memcached servers. When looking up cached data, the same mapping is used to locate it.
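
The clockwise lookup described above can be sketched with a sorted list and a binary search. This is a minimal illustration, not the code of any memcached client: ring_position, build_ring and locate are hypothetical helper names, and the MD5-based hash is an assumption.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Hash a key or server name to a point in [0, 2**32) (MD5-based; an assumption)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def build_ring(servers):
    """Return (position, server) points sorted by position, i.e. in clockwise order."""
    return sorted((ring_position(s), s) for s in servers)

def locate(ring, key: str) -> str:
    """Walk clockwise from the key's position to the first server point."""
    # (position, "") sorts before any (position, server_name), so bisect_left
    # returns the index of the first server whose position is >= the key's.
    idx = bisect.bisect_left(ring, (ring_position(key), ""))
    # Past the last point on the ring: wrap around to the first one.
    return ring[idx % len(ring)][1]

ring = build_ring(["server1", "server2", "server3"])
for key in ["key1", "key2", "key3", "key4"]:
    print(key, "->", locate(ring, key))
```

Storing and retrieving both call locate with the same key, which is why a lookup finds the server where the data was originally written, as long as the set of servers has not changed.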

Adding a server node

Now we know the strategy memcached clients use to store and access data. What happens to the cache hit rate when a server node is added to the cluster? For example, suppose a node server4 is added between server1 and server2.

Figure Adding the server4 node

As the figure shows, after server4 is added, only the cached data on the arc between server1 and server4 is redistributed. Those keys now map to the newly added server4, so their first lookup misses the cache and the data has to be fetched from the database again and stored on server4. Some cache misses still occur, but compared with the modulo algorithm the redistribution of keys is kept to a minimum.
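
The effect of adding a node can be observed with a short experiment. In this sketch the helper names, the MD5-based hash and the synthetic keys are all assumptions for illustration, and where server4 lands on the ring depends on its hash rather than on the position drawn in the figure.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Hash a name to a point in [0, 2**32) (MD5-based; an assumption)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def build_ring(servers):
    """Return (position, server) points in clockwise order."""
    return sorted((ring_position(s), s) for s in servers)

def locate(ring, key: str) -> str:
    """First server point at or clockwise after the key, wrapping at the end."""
    idx = bisect.bisect_left(ring, (ring_position(key), ""))
    return ring[idx % len(ring)][1]

keys = [f"key{i}" for i in range(10_000)]  # synthetic keys
old_ring = build_ring(["server1", "server2", "server3"])
new_ring = build_ring(["server1", "server2", "server3", "server4"])

before = {k: locate(old_ring, k) for k in keys}
after = {k: locate(new_ring, k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

print(f"{len(moved)} of {len(keys)} keys changed server")
print("servers that received moved keys:", {after[k] for k in moved})
```

Every key that changes server ends up on server4; keys owned by the other servers outside the affected arc keep their original location and still hit the cache.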

Removing a server node

Similarly, when the server2 node is removed from the cluster, only the cached data between server1 and server2 is affected. Those keys are remapped to server3, and the data has to be fetched from the database again and stored there, as shown below:

Figure Deleting the server2 node
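
Removal can be checked the same way. This sketch again uses hypothetical helpers and an MD5-based hash chosen for illustration; which server is the clockwise successor of server2 depends on the hash values, so it is not necessarily server3 as in the figure.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Hash a name to a point in [0, 2**32) (MD5-based; an assumption)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def build_ring(servers):
    """Return (position, server) points in clockwise order."""
    return sorted((ring_position(s), s) for s in servers)

def locate(ring, key: str) -> str:
    """First server point at or clockwise after the key, wrapping at the end."""
    idx = bisect.bisect_left(ring, (ring_position(key), ""))
    return ring[idx % len(ring)][1]

keys = [f"key{i}" for i in range(10_000)]  # synthetic keys
full_ring = build_ring(["server1", "server2", "server3"])
reduced_ring = build_ring(["server1", "server3"])  # server2 removed

before = {k: locate(full_ring, k) for k in keys}
after = {k: locate(reduced_ring, k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
successors = {after[k] for k in moved}

print(f"{len(moved)} keys were on server2 and now live on: {successors}")
```

Only keys that were stored on server2 move, and they all move to the single server that follows server2 clockwise on the ring.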

Introduction of virtual nodes

We already know that adding and removing nodes affects the distribution of cached data. Although a hash algorithm distributes values uniformly, when the cluster has only a few servers their positions on the ring may be uneven, so the cached data is not spread evenly across the servers. The solution is to use virtual nodes: each physical node (server) is assigned 100 to 200 points on the ring. With more points on the ring, the unevenness is suppressed. When locating the target server for a cache key, if the lookup lands on a virtual node, the data is actually stored on the physical server that the virtual node represents. In addition, if the servers have different load capacities, they can be given different weights, with the number of virtual nodes assigned in proportion to the weight.

The hash of a virtual node can be computed from the IP address of the corresponding physical node plus a numeric suffix. For example, assume the IP address of serverA is 127.0.0.1. Before virtual nodes are introduced, the hash value of serverA is computed as:

hash("127.0.0.1");

After virtual nodes are introduced, the hash values of the virtual nodes serverA1 and serverA2 are computed as:

hash("127.0.0.1#1");

hash("127.0.0.1#2");
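
A ring with virtual nodes might be built as in the following sketch. It reuses the "address#suffix" naming from the example above; the three server addresses, the MD5-based hash, the helper names and the choice of 150 virtual nodes are assumptions for illustration. With one virtual node per server the sketch hashes "address#1" rather than the bare address.

```python
import bisect
import hashlib
from collections import Counter

def ring_position(name: str) -> int:
    """Hash a name to a point in [0, 2**32) (MD5-based; an assumption)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def build_ring(virtual_counts):
    """virtual_counts maps a server address to its number of virtual nodes."""
    ring = []
    for server, count in virtual_counts.items():
        for i in range(1, count + 1):
            # Virtual node name: address plus numeric suffix, e.g. "127.0.0.1#1".
            ring.append((ring_position(f"{server}#{i}"), server))
    return sorted(ring)

def locate(ring, key: str) -> str:
    """Find the next virtual node clockwise and return its physical server."""
    idx = bisect.bisect_left(ring, (ring_position(key), ""))
    return ring[idx % len(ring)][1]

keys = [f"key{i}" for i in range(10_000)]  # synthetic keys
servers = ["127.0.0.1", "127.0.0.2", "127.0.0.3"]  # hypothetical addresses

for count in (1, 150):
    ring = build_ring({s: count for s in servers})
    load = Counter(locate(ring, k) for k in keys)
    print(f"{count:>3} virtual node(s) per server:", dict(load))

# Weighted variant: the second server gets twice as many virtual nodes.
weighted_ring = build_ring({"127.0.0.1": 100, "127.0.0.2": 200})
```

Comparing the two printed load tables shows how much the per-server key counts differ with and without virtual nodes. In the weighted ring, a server with twice the virtual nodes owns about twice the share of the ring on average.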

Data redistribution when nodes change

The node changes discussed above cause part of the cached data to be redistributed. The consistent hash algorithm has an important property here (often called monotonicity): when a node is added, the cached data that needs to be redistributed is mapped only to the new server node, and is not shuffled among the existing nodes.

Comparison between the consistent hash algorithm and the modulo algorithm

The modulo algorithm is simple and distributes data reasonably evenly, but its main disadvantage is that the cost of cache remapping is very high when a server node is added or removed: the remainder changes for most keys, so lookups no longer reach the server where the data was stored, and the cache hit ratio drops. The consistent hash algorithm minimizes the impact of node changes: a change affects only the data on part of the ring, and the algorithm ensures that the data that needs to be redistributed is mapped to the new server node.
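
The two algorithms can be compared side by side by counting how many keys change server when a cluster grows from three to four servers. This is a sketch under assumptions: the MD5-based hash, the helper names, the synthetic keys and the 150 virtual nodes per server are illustrative choices, not a reference implementation.

```python
import bisect
import hashlib

def hash32(name: str) -> int:
    """Map a string to a 32-bit unsigned integer (MD5-based; an assumption)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "big")

def mod_locate(servers, key: str) -> str:
    """Modulo algorithm: hash(key) % N selects the server."""
    return servers[hash32(key) % len(servers)]

def build_ring(servers, vnodes: int = 150):
    """Consistent hash ring with vnodes virtual nodes per server."""
    return sorted((hash32(f"{s}#{i}"), s) for s in servers for i in range(1, vnodes + 1))

def ring_locate(ring, key: str) -> str:
    """First ring point at or clockwise after the key, wrapping at the end."""
    idx = bisect.bisect_left(ring, (hash32(key), ""))
    return ring[idx % len(ring)][1]

keys = [f"key{i}" for i in range(10_000)]  # synthetic keys
old_servers = ["server1", "server2", "server3"]
new_servers = old_servers + ["server4"]

modulo_moved = sum(
    mod_locate(old_servers, k) != mod_locate(new_servers, k) for k in keys
) / len(keys)

old_ring, new_ring = build_ring(old_servers), build_ring(new_servers)
consistent_moved = sum(
    ring_locate(old_ring, k) != ring_locate(new_ring, k) for k in keys
) / len(keys)

print(f"modulo algorithm:     {modulo_moved:.0%} of keys relocated")
print(f"consistent hash ring: {consistent_moved:.0%} of keys relocated")
```

On average the ring relocates only the share of keys taken over by the new server (about K/n, here about a quarter), while the modulo algorithm relocates most of them.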

Reference documents

http://blog.csdn.net/sparkliang/article/details/5279393

http://www.blogjava.net/hao446tian/archive/2013/01/29/394858.html

http://www.dexcoder.com/selfly/article/2388

http://www.cnblogs.com/lintong/p/4383427.html

http://blog.csdn.net/fdipzone/article/details/7170045

http://blog.jobbole.com/95588/




Itpsc
Source: http://www.cnblogs.com/hjwublog/