Introduction to Hash Distribution and Consistent Hashing Algorithms

Source: Internet
Author: User
Tags: crc32, checksum

Preface
In everyday web application development, memcached can be regarded as part of the standard stack, and its basic principles are widely understood. Although memcached is a distributed application service, the distribution logic is determined by the client API: given the storage key and the known server list, the API hashes the key and uses the result to decide which server in the list the key is stored on.

Basic Principles and Distribution
Here we usually take the remainder of the key's hash value divided by the number of servers (hash(key) % server count) to determine which server the content for that key is sent to. Put in one sentence, a hash algorithm is a mapping function between two sets: set A can be understood as arbitrary combinations of letters, digits, and so on (here, the keys used for storage), and each record in A is mapped to a record in set B (for example, the range 0 to 2^32). The well-known problem is collision or conflict, as with MD5: multiple records in A may map to the same record in B.
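As a minimal sketch of this selection step (our own illustration, not the memcached client's actual code, with a made-up server list and CRC32 standing in for whatever hash the client uses):

import java.util.zip.CRC32;

public class ModuloSelector {
    // Hypothetical server list; a real client reads this from its configuration.
    private static final String[] SERVERS = {
        "10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"
    };

    // Any hash function works here; CRC32 is used only to keep the sketch short.
    private static long hash(String key) {
        CRC32 checksum = new CRC32();
        checksum.update(key.getBytes());
        return checksum.getValue();
    }

    // Pick a server by taking the hash modulo the number of servers.
    public static String selectServer(String key) {
        return SERVERS[(int) (hash(key) % SERVERS.length)];
    }

    public static void main(String[] args) {
        // The same key always resolves to the same server, as long as the list is unchanged.
        System.out.println(selectServer("user:42"));
    }
}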

Practical Application
Obviously, in our applications the records of set A should be distributed as evenly as possible across set B, so that data does not pile up on a single server. In the original memcached client from Danga (written in Perl), the hashing algorithm, rewritten here in Java, looks like this:
private static int origCompatHashingAlg(String key) {
    int hash = 0;
    char[] cArr = key.toCharArray();

    // Classic "times 33" string hash: hash = hash * 33 + c for each character.
    for (int i = 0; i < cArr.length; ++i) {
        hash = (hash * 33) + cArr[i];
    }

    return hash;
}
A later, improved algorithm uses a CRC32 checksum:
// Uses java.util.zip.CRC32.
private static int newCompatHashingAlg(String key) {
    CRC32 checksum = new CRC32();
    checksum.update(key.getBytes());
    int crc = (int) checksum.getValue();

    // Take bits 16-30 of the checksum, giving a non-negative 15-bit hash.
    return (crc >> 16) & 0x7fff;
}

Distribution Rate Test
Some tests have been run: with 5 servers and randomly chosen keys, the proportion of keys that each server in the group received was:
origCompatHashingAlg:
    server 0: 10%
    server 1: 9%
    server 2: 60%
    server 3: 9%
    server 4: 9%

newCompatHashingAlg:
    server 0: 19%
    server 1: 19%
    server 2: 20%
    server 3: 20%
    server 4: 20%
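The percentages above come from the original tests; a rough harness along the following lines (our own sketch, not the original test code) can reproduce the measurement for the CRC32-based algorithm:

import java.util.Random;
import java.util.zip.CRC32;

public class DistributionTest {
    private static final int SERVERS = 5;
    private static final int KEYS = 100_000;

    private static int newCompatHashingAlg(String key) {
        CRC32 checksum = new CRC32();
        checksum.update(key.getBytes());
        return ((int) checksum.getValue() >> 16) & 0x7fff;
    }

    public static void main(String[] args) {
        int[] counts = new int[SERVERS];
        Random random = new Random();

        // Hash random keys and tally which server each one would land on.
        for (int i = 0; i < KEYS; i++) {
            String key = "key-" + random.nextLong();
            counts[newCompatHashingAlg(key) % SERVERS]++;
        }

        for (int s = 0; s < SERVERS; s++) {
            System.out.printf("server %d: %.1f%%%n", s, 100.0 * counts[s] / KEYS);
        }
    }
}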

Obviously, the old algorithm leaves the third memcached server (server 2) far more heavily loaded, while the new CRC32-based algorithm spreads the load across the servers much more evenly.
Other common hash algorithms include FNV-1a, RS hash, JS hash, PJW hash, ELF hash, AP hash, and so on. Readers interested in the details can see:

http://www.partow.net/programming/hashfunctions/

http://hi.baidu.com/algorithms/blog/item/79caabee879ece2a2cf53440.html

Summary
So far, we have seen that in our applications the mapping should be distributed as evenly as possible, so that the load across the services stays reasonable and balanced.

New Problem
We still face a problem: when a service instance itself changes, the server list changes, and a large share of cache requests will suddenly miss; under modulo hashing, almost all data has to migrate to a different instance. When a large service goes through such a change, the instantaneous pressure on the backend database or disks can bring the whole service down.
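A quick back-of-the-envelope check (our own illustration) shows how severe this is under modulo hashing: when the server count changes from 5 to 6, roughly 5/6 of the keys map to a different server and therefore miss:

import java.util.zip.CRC32;

public class RemapFraction {
    private static long hash(String key) {
        CRC32 checksum = new CRC32();
        checksum.update(key.getBytes());
        return checksum.getValue();
    }

    public static void main(String[] args) {
        int keys = 100_000;
        int moved = 0;

        // Count keys whose owning server changes when the server count goes from 5 to 6.
        for (int i = 0; i < keys; i++) {
            long h = hash("key-" + i);
            if (h % 5 != h % 6) {
                moved++;
            }
        }

        // Expect roughly 5/6 of the keys (about 83%) to move.
        System.out.printf("moved: %.1f%%%n", 100.0 * moved / keys);
    }
}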

Consistent Hashing
Here a different approach solves the problem: server selection no longer depends only on the hash of the key; the configuration of each service instance (node) is hashed as well. The procedure, with a sketch after the list, is:

    1. First, hash each service node and place it on a ring (the "continuum") covering the range 0 to 2^32.
    2. Next, hash the key to be stored with the same method and place it on the same ring.
    3. Starting from the key's position, search clockwise and store the data on the first service node found. If no node is found before passing 2^32, the data is stored on the first node on the ring (the ring wraps around).
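A minimal sketch of this ring lookup in Java (our own illustration, not libketama; CRC32 stands in for the real hash purely for brevity):

import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ConsistentHashRing {
    // The ring: node hash -> node name. A sorted map gives us the clockwise
    // search via tailMap/firstKey in O(log n).
    private final TreeMap<Long, String> ring = new TreeMap<>();

    private static long hash(String value) {
        CRC32 checksum = new CRC32();
        checksum.update(value.getBytes());
        return checksum.getValue();
    }

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    public String getNodeFor(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        // Clockwise search: the first node hash at or after the key hash,
        // wrapping back to the start of the ring if none is found.
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        Long nodeHash = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(nodeHash);
    }
}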

[Figure: keys and service nodes placed around the ring]

When a service node is added:

[Figure: a new service node inserted into the ring]

Only the keys that fall between the newly added node and the first existing node counter-clockwise from it move to the new node; all other keys keep their original node.

Summary
The consistent hashing algorithm keeps the redistribution of keys to a minimum when the list of service nodes changes. A further improvement found in some implementations is the addition of virtual nodes: each physical service node is mapped to multiple points on the ring, which smooths out uneven distribution and further reduces cache redistribution when nodes are added or removed, as sketched below.
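A hedged sketch of the virtual-node variant (again our own illustration; the replica count of 100 is an arbitrary assumption):

import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class VirtualNodeRing {
    private static final int REPLICAS = 100;  // assumed value; tune per deployment
    private final TreeMap<Long, String> ring = new TreeMap<>();

    private static long hash(String value) {
        CRC32 checksum = new CRC32();
        checksum.update(value.getBytes());
        return checksum.getValue();
    }

    // Place several points per physical node by hashing "node#replicaIndex",
    // but map every point back to the physical node name.
    public void addNode(String node) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String getNodeFor(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return ring.get(tail.isEmpty() ? ring.firstKey() : tail.firstKey());
    }
}

With many points per physical node, the arcs owned by each node are interleaved around the ring, so both the key distribution and the redistribution caused by adding or removing a node become much smoother.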

Application
In practice with memcached, although the official client does not support consistent hashing, there are implementations of consistent hashing with virtual nodes; the first was libketama, developed by Last.fm (a popular overseas music platform).
The hash function from the Java client (based on MD5) looks like this:
/**
 * Calculates the ketama hash value for a string.
 * @param key the key to hash
 * @return the 32-bit hash used to position the key on the continuum
 */
public static long md5HashingAlg(String key) {
    if (md5 == null) {
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            log.error("+++ no MD5 algorythm found");
            throw new IllegalStateException("+++ no MD5 algorythm found");
        }
    }

    md5.reset();
    md5.update(key.getBytes());
    byte[] bKey = md5.digest();

    // Build a 32-bit value from the first four bytes of the MD5 digest.
    long res = ((long) (bKey[3] & 0xFF) << 24)
             | ((long) (bKey[2] & 0xFF) << 16)
             | ((long) (bKey[1] & 0xFF) << 8)
             | (long) (bKey[0] & 0xFF);
    return res;
}
In a consistent hashing setup, tests show that both timing and key distribution are satisfactory. See:
http://www.javaeye.com/topic/346682

Summary
In the distributed cache layer of our web applications, the hash algorithm plays a key role in the overall architecture.
A well-chosen distribution algorithm balances load across service nodes, avoiding wasted resources and overloaded servers as far as possible.
A consistent hashing algorithm minimizes the cost and risk of data migration when the service hardware environment changes.
With sensible configuration policies and algorithms, the distributed cache can serve the whole application more efficiently and reliably.

Outlook
Consistent hashing as an algorithm and policy originated in P2P networks. Many P2P application scenarios have a great deal in common with ours, so there is much we can borrow from that field.
Reference article:
Comparison and Analysis of Mainstream Distributed Hash Algorithms in P2P Networks
http://www.ppcn.net/n3443c38.aspx

Other references:
http://www.taiwanren.com/blog/article.asp?id=840
http://www.iwms.net/n923c43.aspx
http://tech.idv2.com/2008/07/24/memcached-004/

Related code:
Ketama code from Last.fm:
http://static.last.fm/rj/ketama.tar.bz2

Consistent hashing implementation in PHP: Flexihash
Home page:
http://paul.annesley.cc/articles/2008/04/flexihash-consistent-hashing-php
Code homepage on Google Code:
http://code.google.com/p/flexihash/
It has since moved to GitHub:
http://github.com/pda/flexihash/
