I have just finished reading, word for word, Chapter 2 of "Large-Scale Distributed Web Site Architecture: Design and Practice," and I am still not done digesting it! As the title says, this is a book of real, practical material on distributed systems. Compared with other distributed-systems books I have read (such as "Large-Scale Web Systems and Java Middleware Practice"), this book combines theory with concrete technologies and gave me new directions.
What impressed me most is how much practical "dry goods" the book introduces about distributed systems: memcache for distributed caching, horizontal/vertical database splitting, HBase/Redis for distributed storage, ActiveMQ as a message channel, Lucene/Solr for search, and so on. Of course, no single book can cover every one of these technologies in depth, but the author at least points us in a direction for further study, which I appreciate very much.
This is only Chapter 2; later chapters cover topics I am very interested in, such as log processing, data warehousing, and load balancing. Barring the unexpected, I will keep studying the book. It suits both beginners and advanced readers: it introduces distributed theory, but also the concrete implementation methods for each technology, which can save us many detours.
The book describes the consistent hashing algorithm. I personally admire this classic algorithm, so below I summarize what I learned about it.
Background: distributed caching does not mean that every server stores the entire cache; rather, each server stores an even share of it. To let the servers share and update the cache, two points need attention: ① what algorithm to use to partition the cache so that a specified entry can be found quickly; ② when cache entries or servers are added or removed, how to synchronize the changes to the corresponding servers — this is also critical.
Traditional algorithm: the most traditional approach divides the cache evenly, in the form "total number of caches / number of servers." Each cache key has its own hash value, and the server holding a specified cache is determined by the remainder hash(key) % N (where N is the number of servers). This achieves a basic balance.
The lookup flow of the traditional algorithm is shown in the figure: when a client needs a cache entry, it first sends a request to a cache indexer (the diamond in the figure); the indexer uses the remainder form hash(key) % N to find the cache server holding the entry, fetches the cached value, and returns it.
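The modulo-based routing described above can be sketched in a few lines of Python. This is only an illustration; the server names and key are made up, and a stable hash (MD5) is used because Python's built-in `hash()` is randomized per process:

```python
# Minimal sketch of traditional modulo-based cache routing: hash(key) % N.
import hashlib

def stable_hash(key: str) -> int:
    # MD5 gives a stable integer hash across runs and machines.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def route(key: str, servers: list) -> str:
    # Pick a server by taking the remainder of the key's hash.
    return servers[stable_hash(key) % len(servers)]

servers = ["server0", "server1", "server2", "server3"]
print(route("user:42", servers))  # the same server every time, while N is fixed
```

As long as the server list does not change, every lookup for the same key lands on the same server, which is all this scheme needs.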
This algorithm satisfies the first requirement, "what algorithm to use to partition the cache so a specified entry can be found quickly," but it hits a wall on the second: "when caches are added or deleted, how to synchronize the changes to the corresponding servers." Suppose Server1 suddenly goes down: all caches on that server become invalid, and every cache must be recalculated and redistributed over the remaining servers. Rebuilding the cache takes time, and during that window requests that miss the cache fall through to the data source instead. A large wave of requests constantly hitting the database is likely to cause a system crash — the so-called "avalanche effect."
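The scale of this problem is easy to demonstrate: with modulo routing, shrinking the cluster from 4 servers to 3 remaps most keys, not just the downed server's share. A small sketch (key names made up for illustration):

```python
# Show how many keys change servers when N drops from 4 to 3 under hash(key) % N.
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(10000)]
before = {k: stable_hash(k) % 4 for k in keys}  # 4 servers
after = {k: stable_hash(k) % 3 for k in keys}   # one server goes down -> 3 servers

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly 75%
```

Only about a quarter of the keys keep their server, so almost the entire cache must be rebuilt at once — exactly the avalanche scenario described above.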
Consistent hashing: the consistent hashing algorithm was proposed at MIT in 1997 to solve hot-spot problems on the Internet, and it is now widely affirmed in cluster caching applications.
The theoretical principle of consistent hashing is shown in the figure.
>> First, we need a ring, and its size has a sound basis: a cache object's hash value is generally 32 bits, so in theory there are 2^32 possible hash values (the numbers 0 through 2^32 − 1). Joining the two ends, 0 and 2^32 − 1, turns this range into a ring of numbers.
>> Suppose there are 9 cache objects. Each is hashed to a number and placed at the corresponding point on the ring, so the 9 caches each get their own position (the pink dots in the figure are the cache hashes).
>> Suppose there are 4 cache servers. We likewise compute a 32-bit hash for each server and place it at the corresponding position on the ring (the light blue circles). Together these make up the physical structure diagram of caches and cache servers. [By the way, the server's hash is generally computed from the machine's IP address or host name.]
>> How is a cache mapped to its server? Consistent hashing uses a one-way proximity rule: starting from the cache's hash position on the ring and walking clockwise until a server node is met, the object is stored on that node. Because both the cache's hash and the servers' hashes are fixed, the mapping is unique and deterministic, so it is easy to find the server responsible for a cache object from its hash value.
In this way, lookups stay fast and simple: we only need to maintain the nodes' hash values. If a node goes down, only the caches it maintained move to the next node clockwise; if a new node is added, only the caches nearest to it (between it and its predecessor) move onto it, so the impact of a change is very small. This is how consistent hashing satisfies the second requirement mentioned earlier, "how to synchronize a changed cache to the corresponding server when caches are added or deleted."
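The clockwise lookup above amounts to a binary search over the sorted node hashes. Here is a minimal ring sketch (node addresses are made up; `bisect` finds the first node clockwise from a key's position, wrapping around at the end of the ring):

```python
# Minimal consistent-hash ring (no virtual nodes yet).
import bisect
import hashlib

def h32(key: str) -> int:
    # Stable 32-bit hash, matching the 0 .. 2**32 - 1 ring in the text.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes):
        # Each node sits at its hash position; keep positions sorted.
        self.ring = sorted((h32(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        # First node clockwise from the key's position (wrap past the top).
        positions = [p for p, _ in self.ring]
        i = bisect.bisect_right(positions, h32(key)) % len(self.ring)
        return self.ring[i][1]

    def remove(self, node: str):
        self.ring = [(p, n) for p, n in self.ring if n != node]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
owner = ring.lookup("user:42")
ring.remove(owner)
# The key simply moves to the next node clockwise; keys owned by the
# other three nodes keep their servers.
print(ring.lookup("user:42"))
```

Removing a node only reassigns that node's keys, which is the small, local impact the text describes.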
Of course, the picture above is the ideal situation, where the nodes are evenly distributed around the ring. In practice, when there are few nodes, their distribution may be very uneven, skewing data access so that a large number of keys map to the same server. To avoid this, we can introduce the virtual node mechanism: compute multiple hash values for each physical server, with each hash value placing one position on the ring; these positions are called virtual nodes. Key mapping is unchanged — there is just one extra step, mapping from the virtual node back to the real node. If the number of virtual nodes is large enough, keys are distributed fairly evenly even when there are very few physical nodes.
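The virtual node mechanism can be sketched by hashing each server several times under derived names (the `"server#i"` naming is my own convention for illustration, not from the book) and counting how evenly keys land:

```python
# Virtual nodes: each physical server occupies many positions on the ring.
import bisect
import hashlib
from collections import Counter

def h32(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(nodes, replicas=100):
    # Each (position, node) pair is a virtual node; sort by position.
    return sorted((h32(f"{n}#{i}"), n) for n in nodes for i in range(replicas))

def lookup(ring, positions, key):
    i = bisect.bisect_right(positions, h32(key)) % len(ring)
    return ring[i][1]  # map the virtual node back to its physical node

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = build_ring(servers, replicas=100)
positions = [p for p, _ in ring]

counts = Counter(lookup(ring, positions, f"key-{i}") for i in range(10000))
print(counts)  # each of the 3 servers should hold roughly 3300 keys
```

With only 3 physical positions the split could easily be badly skewed; with 100 virtual nodes per server the shares come out close to even, which is exactly the point of the mechanism.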
[Reposted from: http://blessht.iteye.com/blog/2124630]
Large-Scale Distributed Web Site Architecture: Design and Practice