Memcached's distributed algorithm: Consistent Hashing
Preface:
We know that data is to be spread across m servers. The simplest approach is to take the remainder of the key's hash value (hash_value % m) and store the data on the corresponding server. The drawback is that when a server is added or removed, the cost of reorganizing the cache is huge: the remainders change dramatically, most keys no longer map to the server where they were stored, and the cache hit rate drops sharply.
The following article explains this very well. Combined with the features of memcached, the consistent hashing algorithm can be used to build a very complete distributed cache service.
I am Nagano of Mixi.
This article will not cover the internal structure of memcached; instead, it introduces how memcached is distributed.
1 distributed memcached
Although memcached is called a "distributed" cache server, the server itself has no "distributed" functionality. The server provides only the memory storage features described in the 2nd and 3rd installments, which makes it very simple to implement. Distribution in memcached is implemented entirely by the client library, and this style of distribution is memcached's biggest feature.
1.1 What does memcached's distributed architecture mean?
The term "distributed" has been used many times so far without being explained in detail. Here we briefly introduce its principle; the implementation in each client is essentially the same.
Assume there are three memcached servers, node1 through node3, and the application wants to save data under the keys "tokyo", "kanagawa", "chiba", "saitama", and "gunma".
Figure 1 distributed Introduction: Preparation
First, add "tokyo" to memcached. When "tokyo" is passed to the client library, the algorithm implemented on the client side decides, based on the key, which memcached server will store the data. Once the server is selected, the client sends it the command to save "tokyo" and its value.
Figure 2 distributed Introduction: when adding
Similarly, "kanagawa", "chiba", "saitama", and "gunma" are each saved to the server chosen by the algorithm.
Next, retrieve the saved data. The key to retrieve, "tokyo", is passed to the function library. The library selects a server based on the key, using the same algorithm as for storage. Because the algorithm is the same, the server selected is the one where the data was stored, so the client then sends it a get command. As long as the data has not been deleted for some reason, the saved value can be obtained.
Figure 3 distributed Introduction: Get
In this way, memcached distributes data by storing different keys on different servers. As the number of memcached servers grows, the keys spread out, and even if one memcached server becomes unreachable, the other caches are unaffected and the system keeps running.
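The set/get flow above can be sketched in a few lines of Python. This is only an illustration of the idea, not any particular client's code: the node names are placeholders, and CRC32 stands in for whatever hash function the client actually uses.

```python
import zlib

# Placeholder server list; a real client would hold "host:port" addresses.
NODES = ["node1", "node2", "node3"]

def server_for_key(key: str) -> str:
    """Pick the server for `key`; set and get must use the same rule."""
    h = zlib.crc32(key.encode("utf-8"))  # integer hash of the key
    return NODES[h % len(NODES)]         # same key -> same server, always

for key in ["tokyo", "kanagawa", "chiba", "saitama", "gunma"]:
    print(key, "=>", server_for_key(key))
```

Because storing and retrieving run the identical function, a get for "tokyo" is always sent to the same server that the earlier set chose.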
Next we introduce the distributed method implemented by Cache::Memcached, the Perl client library mentioned in the 1st installment.
2 The distributed method of Cache::Memcached
Perl's memcached client library, Cache::Memcached, is the work of Brad Fitzpatrick, the creator of memcached, so it can be considered the original client library.
· Cache::Memcached - search.cpan.org
This library implements the distributed functionality, and its approach is the standard distributed method for memcached.
2.1 Distribution based on remainder calculation
Cache::Memcached's distributed method, simply put, is "distribution based on the remainder of division by the number of servers": compute an integer hash value of the key, divide it by the number of servers, and select the server according to the remainder.
The relevant part of Cache::Memcached can be simplified to the following Perl script.
use strict;
use warnings;
use String::CRC32;

my @nodes = ('node1', 'node2', 'node3');
my @keys  = ('tokyo', 'kanagawa', 'chiba', 'saitama', 'gunma');

foreach my $key (@keys) {
    my $crc    = crc32($key);             # CRC value of the key
    my $mod    = $crc % ( $#nodes + 1 );  # remainder by the number of nodes
    my $server = $nodes[$mod];            # select the server based on the remainder
    printf "%s => %s\n", $key, $server;
}
Cache::Memcached uses CRC32 to calculate the hash value.
· String::CRC32 - search.cpan.org
First the CRC value of the string is obtained, then the server is determined by the remainder of dividing that value by the number of server nodes. Running the code above produces the following result:
tokyo    => node2
kanagawa => node3
chiba    => node2
saitama  => node1
gunma    => node1
As the result shows, "tokyo" is distributed to node2 and "kanagawa" to node3. When the selected server cannot be connected, Cache::Memcached mixes the number of connection attempts into the key, computes the hash value again, and tries another server. This behavior is called rehash. If you do not want rehashing, specify the "rehash => 0" option when constructing the Cache::Memcached object.
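The rehash behavior can be sketched as follows. The exact way Cache::Memcached perturbs the key is internal to that library, so folding an attempt counter into the key here is an assumed stand-in, and the down-server set is invented for the demonstration.

```python
import zlib

NODES = ["node1", "node2", "node3"]
DOWN = {"node2"}  # pretend node2 is currently unreachable

def pick_with_rehash(key: str, max_tries: int = 20) -> str:
    """Try the normally selected server; if it is down, fold a retry
    counter into the key and hash again (the 'rehash' idea)."""
    for attempt in range(max_tries):
        candidate = key if attempt == 0 else f"{attempt}{key}"
        node = NODES[zlib.crc32(candidate.encode()) % len(NODES)]
        if node not in DOWN:
            return node
    raise RuntimeError("no reachable server found")

print(pick_with_rehash("tokyo"))
```

Disabling this (rehash => 0) means a down server simply produces cache misses instead of shifting its keys elsewhere.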
2.2 Disadvantages of distribution based on remainder calculation
The remainder calculation method is simple and disperses data very well, but it has a drawback: when a server is added or removed, the cost of reorganizing the cache is huge. After adding a server, the remainders change dramatically, so most keys no longer map to the server where they were stored, which hurts the cache hit rate. The following Perl snippet verifies this cost.
use strict;
use warnings;
use String::CRC32;

my @nodes = @ARGV;
my @keys  = ('a' .. 'z');
my %nodes;

foreach my $key (@keys) {
    my $hash   = crc32($key);
    my $mod    = $hash % ( $#nodes + 1 );
    my $server = $nodes[$mod];
    push @{ $nodes{$server} }, $key;
}

foreach my $node ( sort keys %nodes ) {
    printf "%s: %s\n", $node, join ",", @{ $nodes{$node} };
}
This Perl script distributes the keys "a" through "z" over the servers given on the command line and prints the assignment. Save it as mod.pl and execute it.
First, when there are only three servers:
$ mod.pl node1 node2 node3
node1: a,c,d,e,h,j,n,u,w,x
node2: g,i,k,l,p,r,s,y
node3: b,f,m,o,q,t,v,z
As the result shows, node1 stores a, c, d, e, ..., node2 stores g, i, k, ..., and each server stores 8 to 10 keys.
Next we will add a memcached server.
$ mod.pl node1 node2 node3 node4
node1: d,f,m,o,t,v
node2: b,i,k,p,r,y
node3: e,g,l,n,u,w
node4: a,c,h,j,q,s,x,z
Only d, i, k, p, r, and y remain on their original servers; all the other keys have moved. After adding a node, the keys distributed to each server change dramatically: only 6 of the 26 keys still map to the same server as before, so the hit rate drops to about 23%. When memcached is used in a Web application, adding memcached servers therefore causes a sharp drop in cache efficiency, concentrating load on the database servers and possibly making it impossible to provide normal service.
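The same experiment can be repeated numerically. The sketch below uses Python's zlib.crc32, which computes the standard CRC-32, and counts how many of the keys a-z keep their server when the cluster grows from 3 to 4 nodes:

```python
import string
import zlib

def assign(keys, n_nodes):
    """Map each key to a node index by remainder, as mod.pl does."""
    return {k: zlib.crc32(k.encode()) % n_nodes for k in keys}

keys = string.ascii_lowercase
before = assign(keys, 3)   # three servers
after = assign(keys, 4)    # one server added

surviving = sum(1 for k in keys if before[k] == after[k])
print(f"{surviving}/26 keys still map to their original server")
```

Whatever the exact count for a given hash function, the pattern is the same: changing the divisor reshuffles the large majority of keys.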
This problem also existed in mixi's Web applications and made it impractical to add memcached servers. It is solved by a newer distributed method, with which memcached servers can be added easily. This distributed method is called Consistent Hashing.
3 Consistent Hashing
The idea of Consistent Hashing has been introduced in many places, such as the development blog of mixi, so here it is only briefly described.
· mixi Engineers' Blog
· Consistent Hashing
3.1 A brief description of Consistent Hashing
Consistent Hashing works as follows:
1) First, compute the hash value of each memcached server (node) and place it on a circle (continuum) spanning 0 to 2^32.
2) Then compute the hash value of the key of the data to store, using the same method, and map it onto the circle.
3) Starting from the position the data maps to, search clockwise and save the data to the first server found. If no server is found after passing 2^32, the data is saved to the first memcached server.
Figure 4 Consistent Hashing: Basic Principle
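The three steps above can be sketched with a sorted list of positions and a binary search. This is a bare-bones illustration (no virtual nodes yet), with CRC32 standing in for the hash function and the node names invented for the example:

```python
import bisect
import zlib

class ContinuumSketch:
    """Steps 1-3: hash nodes onto a 0..2^32 circle, then for each key
    walk clockwise to the first node, wrapping around at the end."""

    def __init__(self, nodes):
        # step 1: one point on the circle per node
        self._ring = sorted((zlib.crc32(n.encode()), n) for n in nodes)
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        h = bisect.bisect_right(self._points, zlib.crc32(key.encode()))
        if h == len(self._points):   # wrapped past 2^32: use the first node
            h = 0
        return self._ring[h][1]      # step 3: first node clockwise

ring = ContinuumSketch(["node1", "node2", "node3"])
print(ring.node_for("tokyo"))
```

The wrap-around branch is exactly rule 3: a key hashed beyond the last node's position falls through to the first server on the circle.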
Now consider adding a memcached server. The remainder-based distributed algorithm hurts the cache hit rate because the server storing each key changes drastically; with Consistent Hashing, however, only the keys on the first server counterclockwise from the added server's position on the continuum are affected.
Figure 5 Consistent Hashing: Add a server
Therefore, Consistent Hashing minimizes the redistribution of keys. Moreover, some Consistent Hashing implementations also adopt the idea of virtual nodes. With an ordinary hash function, the servers' mapping positions on the continuum are distributed unevenly, so each physical node is assigned 100 to 200 points on the continuum. This suppresses the uneven distribution and minimizes the cache redistribution when servers are added or removed.
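The effect of virtual nodes can be observed by placing each server at many points on the circle and counting how keys spread. The 150 points per node below is simply a value in the 100-200 range the text mentions, and the key names are synthetic:

```python
import bisect
import zlib
from collections import Counter

def build_ring(nodes, points_per_node):
    """Place every node at `points_per_node` virtual positions."""
    ring = []
    for node in nodes:
        for i in range(points_per_node):
            ring.append((zlib.crc32(f"{node}-{i}".encode()), node))
    return sorted(ring)

def node_for(ring, key):
    points = [p for p, _ in ring]
    i = bisect.bisect_right(points, zlib.crc32(key.encode()))
    return ring[i % len(ring)][1]   # wrap around the circle

nodes = ["node1", "node2", "node3"]
for points in (1, 150):
    ring = build_ring(nodes, points)
    load = Counter(node_for(ring, f"key{i}") for i in range(10000))
    print(f"{points:3d} point(s) per node -> {dict(load)}")
```

With a single point per node the split is usually lopsided; with 150 virtual points per node the three counts come out much closer to even.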
For the memcached client libraries using Consistent Hashing described below, given the number of servers (n) and the number of servers added (m), the approximate hit rate immediately after the addition is:
n / (n + m) * 100 (%)
In other words, (1 - n / (n + m)) * 100 percent of the keys are redistributed.
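To keep the two forms of the formula straight: n / (n + m) * 100 is the percentage of keys that stay put (the surviving hit rate), while (1 - n / (n + m)) * 100 is the percentage redistributed. A quick check against the earlier example:

```python
def surviving_hit_rate(n: int, m: int) -> float:
    """Percentage of keys that keep their server when a
    Consistent-Hashing cluster grows from n to n + m servers."""
    return n / (n + m) * 100

# Growing from 3 to 4 servers: 75% of keys stay, versus the
# roughly 23% (6/26) observed in the remainder-based experiment.
print(surviving_hit_rate(3, 1))
```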
3.2 Libraries supporting Consistent Hashing
Although memcached itself does not support Consistent Hashing, several client libraries support this newer distributed algorithm. The first memcached client implementation to support Consistent Hashing with virtual nodes was libketama, developed by last.fm.
· libketama - a consistent hashing algo for memcache clients - Users at Last.fm
As for Perl clients, Cache::Memcached::Fast and Cache::Memcached::libmemcached, both introduced in the 1st installment, support Consistent Hashing.
· Cache::Memcached::Fast - search.cpan.org
· Cache::Memcached::libmemcached - search.cpan.org
Both have interfaces almost identical to Cache::Memcached, so if Cache::Memcached is already in use they can be swapped in conveniently. Cache::Memcached::Fast reimplements the functionality of libketama; to use Consistent Hashing, specify the ketama_points option when creating the object.
my $memcached = Cache::Memcached::Fast->new({
    servers       => ["192.168.0.1:11211", "192.168.0.2:11211"],
    ketama_points => 150,
});
In addition, Cache::Memcached::libmemcached is a Perl module that uses libmemcached, the C library developed by Brian Aker. libmemcached supports several distributed algorithms, including Consistent Hashing, and its Perl binding supports Consistent Hashing as well.
· Tangent Software: libmemcached
Summary
This article introduced memcached's distributed algorithm: distribution is implemented mainly by the client library, and the Consistent Hashing algorithm disperses data efficiently. Next time we will introduce some of mixi's experience operating memcached in applications, along with related compatible applications.