From http://my.oschina.net/u/142836/blog/171196
Memcached is a high-performance distributed cache system. Thanks to its simple, convenient operation and its stable, reliable performance, it is widely used in Internet applications. There is plenty of introductory material about memcached online; the classic reference is the document "Memcached Comprehensive Analysis" (original: http://gihyo.jp/dev/feature/01/memcached/0001; one of the many Chinese translations: http://tech.idv2.com/2008/08/17/memcached-pdf/), which is very well written and easy to read. Below I summarize some common application scenarios and their solutions.
1. Cache Storage Design
Depending on the scenario, two designs are common:
Scenario One: Cache database SQL query results in memcached; reads hit memcached first, shielding the database from query requests.
Advantages: Caching can be handled uniformly in the development framework, transparently to business development, which reduces intrusion into the business logic code. The cache is also easier to preheat in this design: we can replay the database's storage log (e.g., the MySQL binlog) to warm it up.
Disadvantage: There is a hidden danger: if a single front-end request involves multiple SQL query results, memcached must be queried multiple times. Under high concurrency, the network I/O overhead and the concurrent pressure on memcached can become a bottleneck.
Scenario Two: Cache the final result of the business processing, so the cached result can be returned directly when the client makes a request.
Advantages: Data is returned quickly; a single memcached access suffices, reducing both network I/O and processing cost.
Disadvantages: The cache must be handled explicitly in the business logic, the stored data structures are more complex, and regenerating the cache after a data update can be cumbersome. This design is better suited to computation-intensive, high-concurrency scenarios.
2. Cache Update Policy
There are two common strategies, each with its own pros, cons, and applicable scenarios:
Scenario One: Lazy loading. The client first queries memcached; on a hit it returns the result, and on a miss (no data, or expired) it loads the latest data from the database, writes it back to memcached, and returns the result.
Advantages: Simple and easy to use.
Disadvantage: If a cache entry expires under high concurrency, the backend database comes under instantaneous pressure. Of course, we can add a lock to control the concurrency, but that also affects the application.
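As an illustration, here is a minimal Java sketch of this lazy-loading (cache-aside) pattern. It assumes the spymemcached client (net.spy.memcached), which the article does not prescribe, and loadFromDatabase stands in for a hypothetical DAO call:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class LazyCache {
    private final MemcachedClient client;

    public LazyCache() throws Exception {
        // One node for the sketch; a real deployment would list the whole cluster.
        client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
    }

    // Cache-aside read: hit -> return; miss or expired -> load from DB, write back.
    public Object get(String key) {
        Object value = client.get(key);
        if (value == null) {
            value = loadFromDatabase(key);   // hypothetical database query
            client.set(key, 3600, value);    // write back with a 1-hour TTL
        }
        return value;
    }

    private Object loadFromDatabase(String key) {
        return "...";                        // placeholder for the real SQL query
    }
}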
Scenario Two: Active update. Data in the cache never expires; when the data changes, a separate program updates the cache.
Advantages: Cached data is always reliable (barring LRU eviction), the front end can respond quickly, and the backend database is spared concurrent query pressure.
Disadvantages: The program structure becomes more complex; a separate update program must be maintained, and the two programs have to share one cache configuration. (PS: Some business scenarios actually need this, e.g., a portal's content system and site system sharing the same data, one responsible for writing it and the other for displaying it.)
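A minimal sketch of the updater side, under the same spymemcached assumption; exp = 0 means "never expire" in the memcached protocol, matching the never-expire policy above (key names are illustrative):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheUpdater {
    private final MemcachedClient client;

    public CacheUpdater() throws Exception {
        client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
    }

    // Called by the separate updater program whenever the source data changes;
    // the cache entry is overwritten in place and never expires (exp = 0).
    public void onDataUpdated(String key, Object freshValue) {
        client.set(key, 0, freshValue);
    }
}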
3. Bulk Delete (or Update) Problems
In memcached, most operations are single-key add/set/delete/get operations, which are very convenient to use. Sometimes, however, we run into a bulk delete (or update) problem. For example, suppose a mobile app is ordered by the network regulator to delete all information related to some sensitive content. Because of the many phone models and versions, the cache keys for this content come in many shapes: we cannot easily obtain all the keys, and even if we could enumerate them, memcached does not support a bulk delete operation, which is troublesome. How do we solve this? Below I take a portal site deleting sensitive news as an example. Assume every news item has content in many dimensions: the news item is identified by newsid, each dimension is identified by prop, and there is a common prefix, so a complete key has the format: key_{newsid}_{prop}
Scheme One:
Use a set (collection) to maintain each class of keys. When a bulk delete (or update) is needed, simply take all the keys out of this collection and perform the operation on each of them. This is fairly straightforward:
First, whenever we add a new key/value pair to memcached, we also put the key into the set. For example, a news item has the following pairs in memcached:
Key_{newsid}_{prop1}:value1
Key_{newsid}_{prop2}:value2
Key_{newsid}_{prop3}:value3
......
Key_{newsid}_{propn}:valuen
And in our set we store all the keys associated with this news item:
Keyset_{newsid}: Key_{newsid}_{prop1}, Key_{newsid}_{prop2}, ......, Key_{newsid}_{propn}
This way, when we want to clear the cache for this news item, we fetch the key set, traverse the keys, and delete each of them from memcached, achieving the bulk delete.
Now, how exactly do we store and maintain the key set mentioned here?
One way is to join all the keys with commas into one large string stored in memcached as the keyset's value, or to organize them with a set structure provided by the development language and store that in memcached; a sketch of this approach follows.
Another way is to keep the keys in a storage structure more convenient for this, such as Redis's set structure; of course, this is not recommended, as it adds complexity to the existing system.
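Here is a minimal Java sketch of the first approach (a comma-joined key list kept in memcached), again assuming the spymemcached client. Note that the read-modify-write on the key set is not atomic; a production version would need CAS (gets/cas) or append to avoid lost updates:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class KeySetCache {
    private final MemcachedClient client;

    public KeySetCache() throws Exception {
        client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
    }

    // Store one dimension of a news item and record its key in the key set.
    public void put(String newsId, String prop, Object value, int ttl) {
        String key = "key_" + newsId + "_" + prop;
        client.set(key, ttl, value);

        String setKey = "keyset_" + newsId;
        String keys = (String) client.get(setKey);          // not atomic: see note above
        client.set(setKey, ttl, keys == null ? key : keys + "," + key);
    }

    // Bulk delete: fetch the key set, then delete every key it lists.
    public void deleteAll(String newsId) {
        String setKey = "keyset_" + newsId;
        String keys = (String) client.get(setKey);
        if (keys != null) {
            for (String key : keys.split(",")) {
                client.delete(key);
            }
            client.delete(setKey);
        }
    }
}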
Scheme Two:
Implement it by dynamically updating the keys: each key is composed of the original key plus a version number, and when a bulk delete or update is needed we only have to bump the version number. Concretely, how is this done?
First, we maintain a version number for the news item in memcached, like so:
key_version_{newsid}:v1.0 (the version number can be replaced with a timestamp or any other meaningful content)
Pseudo code
$memcacheClient->set("key_version_{newsid}", "v1.0");
Then, whenever we want to save or read data related to this news item, we first fetch the version number and use it to generate the actual key, as follows:
Pseudo code
$version = $memcacheClient->get("key_version_{newsid}");
$key = "key_{newsid}_{prop}_" . $version;
Then we use this new key to save (or read) the real content, so that everything related to this news item in memcached looks like the following:
Key_{newsid}_{prop1}_v1.0:value1
Key_{newsid}_{prop2}_v1.0:value2
Key_{newsid}_{prop3}_v1.0:value3
......
Key_{newsid}_{propn}_v1.0:valuen
When we need to delete (or update) all the keys related to this news item, we only need to bump the version number, as follows:
Pseudo code
$memcacheClient->set("key_version_{newsid}", "v2.0");
Now, the next time we access this news item's cache, the version number has been bumped, so everything under the new keys is empty and the new content must be loaded from the database (or an empty result is returned). The old keys are reclaimed once their expiration time passes. This achieves our bulk delete or update.
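Putting Scheme Two together, here is a minimal Java sketch (spymemcached assumed, a timestamp used as the version, and the method names are ours):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class VersionedCache {
    private final MemcachedClient client;

    public VersionedCache() throws Exception {
        client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
    }

    // Fetch the current version for a news item, creating one if absent.
    private String versionOf(String newsId) {
        String version = (String) client.get("key_version_" + newsId);
        if (version == null) {
            version = String.valueOf(System.currentTimeMillis());
            client.set("key_version_" + newsId, 0, version);
        }
        return version;
    }

    // Every read and write goes through a key that embeds the current version.
    public Object get(String newsId, String prop) {
        return client.get("key_" + newsId + "_" + prop + "_" + versionOf(newsId));
    }

    public void put(String newsId, String prop, Object value, int ttl) {
        client.set("key_" + newsId + "_" + prop + "_" + versionOf(newsId), ttl, value);
    }

    // "Bulk delete": bump the version so every old key becomes unreachable;
    // the stale entries are reclaimed later by TTL/LRU.
    public void invalidateAll(String newsId) {
        client.set("key_version_" + newsId, 0, String.valueOf(System.currentTimeMillis()));
    }
}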
The two schemes above are both fairly simple and practical, though each has its drawbacks: Scheme One's key set costs extra maintenance, while Scheme Two cannot clean up the old versions of the data promptly, leaving garbage in the cache. Choose flexibly according to the actual application scenario; in effect there is really little difference between the two.
4. Failover and Expansion Problems
Strictly speaking, memcached is not a distributed system but a single-point system; the so-called distribution is implemented entirely by the client. So it has none of the high availability of the well-known open-source distributed systems. Let's talk about how memcached avoids single points of failure and handles online expansion. (PS: memcached really keeps things minimal; its biggest feature is simplicity, and many auxiliary functions are left to the client.)
Consistent hashing: This should be the simplest and most common mechanism. Thanks to the properties of consistent hashing, a node failure, or adding nodes during expansion, has little impact on the cluster, which satisfies most application scenarios. But note: in the period right after nodes are adjusted, some cache entries are lost and requests penetrate to the backend database, so in high-concurrency applications apply concurrency control to avoid overwhelming the database. A minimal sketch of such a ring follows.
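For illustration only, a minimal consistent-hash ring in Java; all names here are ours, and real clients such as spymemcached ship their own ketama implementation, so this just shows the mechanism:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring. Each server appears at many virtual points,
// so adding or removing one node only remaps a small fraction of the keys.
public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 160;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addServer(String server) throws Exception {
        for (int i = 0; i < VIRTUAL_NODES; i++)
            ring.put(hash(server + "#" + i), server);
    }

    public void removeServer(String server) throws Exception {
        for (int i = 0; i < VIRTUAL_NODES; i++)
            ring.remove(hash(server + "#" + i));
    }

    // Walk clockwise from the key's position to the first server point
    // (assumes at least one server has been added).
    public String serverFor(String key) throws Exception {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);  // first 8 digest bytes
        return h;
    }
}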
Double-write mechanism: The client maintains two clusters; every update writes both copies, and each read goes to a random (or fixed) copy. The availability and stability of the cache are then very high: changes can be made painlessly, and a node failure or expansion affects neither the cache nor the backend database. Of course, this comes at a price. One issue is consistency between the two copies, though for a cache this small amount of inconsistency can be tolerated. The other is wasted memory: using fully redundant data to reduce the failure rate is very expensive, so this does not suit large-scale Internet applications. A sketch follows.
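A minimal sketch of the double-write idea (spymemcached assumed; the host names are placeholders):

import java.net.InetSocketAddress;
import java.util.Random;
import net.spy.memcached.MemcachedClient;

public class DoubleWriteCache {
    private final MemcachedClient copyA;
    private final MemcachedClient copyB;
    private final Random random = new Random();

    public DoubleWriteCache() throws Exception {
        copyA = new MemcachedClient(new InetSocketAddress("cache-a.example.com", 11211));
        copyB = new MemcachedClient(new InetSocketAddress("cache-b.example.com", 11211));
    }

    // Every update is written to both clusters.
    public void set(String key, int ttl, Object value) {
        copyA.set(key, ttl, value);
        copyB.set(key, ttl, value);
    }

    // Reads go to a randomly chosen copy; either cluster can serve alone
    // while the other is failed or being expanded.
    public Object get(String key) {
        return (random.nextBoolean() ? copyA : copyB).get(key);
    }
}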
Twemproxy: This is a proxy open-sourced by Twitter that can front both Redis and memcached. It removes a lot of maintenance cost (mainly on the client side) and also makes failover and online expansion convenient. For details, refer to: https://github.com/twitter/twemproxy
5. Some Minor Details Related to Optimization
Bulk read (multiget): Some more complex business requests may perform multiple memcached operations in a single request; the network round trips this costs, and the concurrent pressure it puts on the memcached nodes, are considerable. Here we can consider bulk reads to cut the number of network I/O round trips, return the data in one shot, and simplify the client's business logic. For example:
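With spymemcached (assumed here, as above), a multiget is a single getBulk call; the key names are illustrative:

import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

public class MultigetExample {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));

        // One round trip fetches all three values instead of three get() calls.
        Map<String, Object> values =
                client.getBulk(Arrays.asList("key_1_title", "key_1_body", "key_1_author"));

        System.out.println(values.get("key_1_title"));
        client.shutdown();
    }
}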
There is a famous multiget bottomless-pit problem here (the "multiget hole"), discovered in Facebook's deployment; see http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html, which also proposes solutions. In practice we can also avoid the problem by distributing a multiget's keys to a single node, which requires customizing the memcached client to route a class of keys (for example, those sharing a prefix) to the same node by some rule. Besides avoiding the hole, this improves performance, since we no longer wait on data from multiple nodes.
Change the serialization method: Don't use Java's native object serialization (haha, I'm only speaking for Java here); serialize it yourself, turning the object to be cached into a byte array or a string before saving it. This pays off well both in memory savings and in network transmission.
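One common option (an assumption here, not something the article prescribes) is to serialize to a compact JSON string, for example with Gson; the output is usually much smaller than java.io.Serializable's and can also be read by non-Java clients:

import com.google.gson.Gson;

public class NewsSerializer {
    private static final Gson GSON = new Gson();

    public static class News {        // illustrative cached type
        String title;
        String body;
    }

    // Cache the JSON string instead of the natively serialized object.
    public static String toCacheValue(News news) {
        return GSON.toJson(news);
    }

    public static News fromCacheValue(String value) {
        return GSON.fromJson(value, News.class);
    }
}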
Data preheating: In some scenarios we need to preheat the cached data for the application (for example, a node expansion requires redistributing the data). As mentioned in the cache design section above, we can use the database's update log to warm the cache; this relies mainly on the cached content being consistent with what the database stores. Otherwise, we can place a cluster of empty nodes in front of the existing cache and gradually read the old cache into the new cluster to preheat it. This is a bit more trouble and needs cooperation from the application side.
Growth factor: Tuning memcached's growth factor sensibly (the -f startup option; the default is 1.25) can effectively limit wasted memory.
Handling empty results: In some scenarios the database lookup finds nothing and the cache is empty as well. Here we need to store the empty result in the cache for a short time to block frequent front-end requests and spare the database the pressure. A sketch follows.
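A minimal sketch of this negative caching, extending the lazy-loading example above (spymemcached assumed; the sentinel value and TTLs are arbitrary choices):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class NegativeCache {
    private static final String EMPTY = "__EMPTY__";  // sentinel for "no such row"
    private final MemcachedClient client;

    public NegativeCache() throws Exception {
        client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
    }

    public Object get(String key) {
        Object value = client.get(key);
        if (EMPTY.equals(value)) {
            return null;                      // known miss: skip the database
        }
        if (value == null) {
            value = loadFromDatabase(key);    // hypothetical database query
            if (value == null) {
                client.set(key, 60, EMPTY);   // cache the empty result briefly
                return null;
            }
            client.set(key, 3600, value);
        }
        return value;
    }

    private Object loadFromDatabase(String key) {
        return null;                          // placeholder for the real SQL query
    }
}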
Memcached is very simple to use and performs very well. The above are scenarios we actually encounter in business development; choosing the right solution for the actual scenario brings a lot of convenience to later development and maintenance.