Come with Me: Data Mining (17) -- Distributed Cache


Distributed Cache Architecture

Let's look at the architecture diagram first:

Figure 1

A request first reaches the HTTP server and then the application server's resources; the application server calls the backend database. On the first access the database is queried directly, and the result is then placed in the memcached cluster, depending on the size of the cached content. On the second access the data is read straight from the cache, with no database operation at all. This is suitable for scenarios where the data changes infrequently, such as lists displayed on a website, reading rankings, and so on.
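As a minimal sketch of this read-through flow, the snippet below assumes a memcached server on localhost:11211 and the pymemcache client library; query_database() is a hypothetical stand-in for the real backend query.

    # Cache-aside sketch: read from memcached first, fall back to the database on a miss.
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def query_database(key):
        # Hypothetical placeholder for the real database call (e.g. a ranking-list query).
        return "rows for %s" % key

    def get_with_cache(key, ttl=300):
        value = cache.get(key)                 # try the memcached cluster first (bytes or None)
        if value is not None:
            return value.decode()              # second and later accesses are served from cache
        value = query_database(key)            # first access: go to the backend database
        cache.set(key, value, expire=ttl)      # put the result into memcached for later reads
        return value

    print(get_with_cache("reading_ranking"))   # first call queries the database
    print(get_with_cache("reading_ranking"))   # second call reads straight from the cache

On the first call the database is queried and the result is stored in memcached; the second call is served straight from the cache.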

The 48-hour most-read list on Cnblogs (the "blog park") works in a similar way.

Of course, caching architectures can be used in more ways than one. In a data mining system they can serve infrequently updated or offline data; the specific application requires architectural design for each scenario.

Consistent Hash

Consistent hashing is a distributed hash table (DHT) algorithm proposed by MIT in 1997, designed to address hot spot problems on the Internet, with an intent similar to CARP. Consistent hashing corrects the problems caused by the simple hashing algorithm used by CARP, so that DHT can really be applied in a peer-to-peer environment.

The consistent hashing algorithm proposes four properties for judging how good a hash algorithm is in a dynamically changing cache environment:

1. Balance: balance means that the results of the hash should be distributed across all buffers as evenly as possible, so that all buffer space is used. Many hash algorithms can satisfy this condition.

2. Monotonicity: monotonicity means that if some content has already been assigned to buffers by hashing and a new buffer is then added to the system, the hash should guarantee that the originally assigned content is remapped only to its original buffer or to the new buffer, and never to another buffer in the old buffer set.

3. Spread: in a distributed environment, a terminal may not see all of the buffers, only some of them. When a terminal maps content to buffers through hashing, different terminals may see different buffer ranges, so the hash results are inconsistent and the same content is mapped to different buffers by different terminals. This should obviously be avoided, because it stores the same content in multiple buffers and reduces the efficiency of the system's storage. Spread is defined as the severity of this situation; a good hash algorithm should keep spread as low as possible.

4. Load: the load problem looks at spread from another angle. Since different terminals may map the same content to different buffers, a particular buffer may also be mapped to different content by different users. Like spread, this situation should be avoided, so a good hash algorithm should keep the load on the buffers as low as possible.

In a distributed cluster, adding and removing machines, and having a machine leave the cluster automatically after it fails, are the most basic functions of distributed cluster management. If the commonly used hash(object) % n algorithm is used, then after a machine is added or removed much of the original data can no longer be found, which seriously violates the monotonicity principle, as the sketch below illustrates.
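Before the design itself, here is a small sketch (Python, with md5 as an illustrative hash) of how badly hash(object) % n behaves when one machine is added: most keys change owners, so most cached data can no longer be found.

    import hashlib

    def machine_for(key, n):
        # Simple modulo placement: hash the key and take the result modulo the machine count.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % n

    keys = ["key-%d" % i for i in range(10000)]
    moved = sum(1 for k in keys if machine_for(k, 5) != machine_for(k, 6))
    print("keys remapped after adding one machine: %d of %d" % (moved, len(keys)))
    # Roughly n/(n+1) of the keys (about 5/6 here) move to a different machine.

The steps below explain how the consistent hashing algorithm is designed to avoid this.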

1. Hashing the machine nodes

First compute the hash value of each machine node (how do we hash a machine node? The IP address can be used as the hash input; of course there are other options), and then place it on a ring covering 0 to 2^32, distributed clockwise. As shown below:

Figure 2

The cluster contains five machines, A, B, C, D, and E; through some hash algorithm we distribute them around the ring as shown.

2. Access method

Suppose a write-cache request arrives with key K. Compute its hash value hash(K), which corresponds to a point on the ring in Figure 2. If that point does not map directly to a machine node, search clockwise until the first mapped machine node is found; that node is the target node. If the search goes past 2^32 without finding a node, it wraps around and hits the first machine node. For example, if hash(K) falls between A and B, the machine node hit is B.

3. Adding a node

Starting from Figure 2, suppose machine F is added to the original cluster. The process is as follows:

Compute the hash value of the new machine node and map it to a point on the ring, as shown below:

Figure 3

After machine node F is added, the access strategy does not change; accesses still follow the method in (2). Cache misses are still unavoidable: the data that can no longer be hit is the data whose hash(K) fell between C and F before the node was added. Although adding a node still causes some misses, compared with the traditional modulo hashing method, consistent hashing reduces the affected data to a minimum.
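The following is a minimal sketch of steps (1) to (3), assuming md5 as the ring hash and reusing the node names A to F from the figures; it is an illustration rather than a production implementation. It hashes each machine onto the 0 to 2^32 ring, walks clockwise to find the node for a key, and then adds node F to show that only the keys between F's predecessor and F change owners.

    import bisect
    import hashlib

    RING_SIZE = 2 ** 32

    def ring_hash(value):
        # Map a string (machine IP/name or cache key) to a point on the 0..2^32 ring.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

    class ConsistentHashRing:
        def __init__(self, nodes=()):
            self._points = []            # sorted hash points on the ring
            self._owner = {}             # hash point -> machine node name
            for node in nodes:
                self.add_node(node)

        def add_node(self, node):
            point = ring_hash(node)
            bisect.insort(self._points, point)
            self._owner[point] = node

        def get_node(self, key):
            if not self._points:
                return None
            point = ring_hash(key)
            # Walk clockwise: first node point >= hash(key); wrap past 2^32 to the first node.
            idx = bisect.bisect_left(self._points, point) % len(self._points)
            return self._owner[self._points[idx]]

    ring = ConsistentHashRing(["A", "B", "C", "D", "E"])
    keys = ["key-%d" % i for i in range(10000)]
    before = {k: ring.get_node(k) for k in keys}

    ring.add_node("F")                   # step (3): add machine F to the ring
    moved = sum(1 for k in keys if ring.get_node(k) != before[k])
    print("keys remapped after adding F: %d of %d" % (moved, len(keys)))

In practice, implementations usually place several virtual points per physical machine on the ring to improve balance, but the lookup logic stays the same.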

Memcached

Memcached is a high-performance distributed memory object caching system for dynamic Web applications, used to relieve database load. It speeds up dynamic, database-driven websites by caching data and objects in memory to reduce the number of database reads. Memcached is based on a hashmap that stores key/value pairs. Its daemon is written in C, but clients can be written in any language and communicate with the daemon through the memcached protocol. The cache topology is shown in Figure 4:

Figure 4

Features of memcached include:

    • Fully in-memory operation
    • Data stored as key/value pairs in a hash table
    • A simple text protocol for data communication
    • Operates only on character (string) data
    • Other data types are interpreted by the application, which serializes and deserializes them (see the sketch after this list)
    • Clustering is controlled by the client application, typically with a consistent hashing algorithm
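To make a few of these points concrete, here is a hedged sketch that talks to memcached directly over its text protocol, with the application handling JSON serialization itself. It assumes a memcached server on localhost:11211; memcached_set and memcached_get are illustrative helper names, not part of any library.

    import json
    import socket

    def memcached_set(sock, key, obj, exptime=0):
        # The application serializes the object itself (JSON here) before storing it as text.
        data = json.dumps(obj).encode()
        cmd = b"set %s 0 %d %d\r\n%s\r\n" % (key.encode(), exptime, len(data), data)
        sock.sendall(cmd)
        return sock.recv(1024)           # b"STORED\r\n" on success

    def memcached_get(sock, key):
        sock.sendall(b"get %s\r\n" % key.encode())
        reply = sock.recv(65536)         # b"VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"
        if not reply.startswith(b"VALUE"):
            return None
        payload = reply.split(b"\r\n", 1)[1].rsplit(b"\r\nEND", 1)[0]
        return json.loads(payload)       # the application deserializes the value itself

    sock = socket.create_connection(("localhost", 11211))
    memcached_set(sock, "user:42", {"name": "alice", "reads": 128})
    print(memcached_get(sock, "user:42"))
    sock.close()

A real client would additionally pick the target server with a consistent hash across the cluster, as described in the previous section.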
Introduction to memcached variants

Many products based on memcached have been developed at home and abroad. These products support the full memcached protocol while focusing on different application scenarios, so you can choose a suitable memcached variant according to your application's needs. Several memcached variants are described below.

1. Memcachedb

Memcachedb is an open source project developed by Sina based on memcached. By adding Berkeley DB's persistent storage mechanism and an asynchronous master-slave replication mechanism to memcached, it gives memcached transaction recovery, persistence, and distributed replication capabilities. It is well suited to applications that need very high read/write performance together with persistent storage; for example, Memcachedb is applied to the management of Sina blogs. If you need persistence on top of memcached, consider using Memcachedb.

2. repcached

Repcached is a memcached-based patch developed in Japan that implements memcached replication. It supports replication between multiple memcached instances to address memcached disaster recovery. If you have cache disaster-recovery requirements, you can try this feature.

3. Memcached_functions_mysql

Memcached_functions_mysql provides the equivalent of MySQL UDFs (User Defined Functions), which update memcached through triggers in MySQL. This allows data to be written to MySQL and then read from memcached, relieving pressure on the database while reducing the amount of development work.

The use and experience of Memcached_functions_mysql will be described in more detail in the next section.

4. Memcacheq

Memcacheq implements message queuing on top of memcached. The following uses a PHP client as an example to describe how Memcacheq implements a message queue.

Push a message onto the tail of the queue: memcache_set

Pop a message from the head of the queue: memcache_get

The biggest advantage of Memcacheq is that it is based on memcached and can be operated with the usual memcached commands, so applications built on memcached do not need any modification at all.
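As a rough Python equivalent of the PHP memcache_set / memcache_get calls above (using pymemcache in place of the PHP memcache extension, and assuming a MemcacheQ server on its usual port 22201, where the key names a queue), the pattern looks like this:

    from pymemcache.client.base import Client

    queue = Client(("localhost", 22201))

    # Producer: each set appends a message to the tail of the "jobs" queue.
    queue.set("jobs", "message-1")
    queue.set("jobs", "message-2")

    # Consumer: each get pops the message at the head of the queue (FIFO order).
    print(queue.get("jobs"))   # b"message-1"
    print(queue.get("jobs"))   # b"message-2"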

Summary

Distributed caches are typically used for data that is accessed frequently or cannot be processed in real time, including the following scenarios:

1. Scenarios where the volume of reads far exceeds the volume of updates.

2. Scenarios where data needs to be processed offline.

3. Scenarios where the back end is a column-oriented database.

There are many other scenarios, such as integrating with the model layer of a data mining system to provide fast access to model processing results. In short, using caching well in an architectural design will greatly improve application performance and make the whole system more robust.

