First, some background on my own experience with caches, so that readers can better understand what follows.
I am mainly a cache implementer rather than a cache user. To provide cache support for an ORM (such as a JPA implementation), I have had to wrap other open-source caches and evaluate their features.
I know the working principles of these open-source caches well, but I know little about their specific configuration and usage details.
This article focuses on the features and working principles of caches. It is not an introductory manual for installation, configuration, and usage.
It briefly covers the general features of caches and then describes advanced features such as distributed caches, associated-object caching, and the POJO cache.
It assumes basic knowledge of clusters, ORM, and database transactions, and does not explain these concepts.
-------------------------------------------------------
Cache features
First, let's look at the common caches.
This link lists the common open-source Java caches.
http://java-source.net/open-source/cache-solutions
Memcached, JBoss Cache, SwarmCache, OSCache, JCS, and Ehcache are the open-source projects that come up most often.
Memcached differs from the others, as detailed later.
JBoss Cache is large and comprehensive. It can be regarded as the all-in-one cache, with support for almost every aspect.
The rest are lightweight. SwarmCache, OSCache, and JCS support clustering. Ehcache does not support clustering.
The basic features of the cache are listed below.
1. Time record
The time at which the data entered the cache.
2. Timeout (expiration time)
How long the data stays in the cache before it expires.
3. Eviction policy
Which data to clear when the cache is full, for example the least frequently used data or the least recently used data.
4. Hit rate
The proportion of lookups that are served from the cache.
5. Hierarchical cache
Some caches have a notion of hierarchy. For example, almost all caches support regions: you can specify that one type of data is stored in a particular region. JBoss Cache supports more levels.
6. Distributed cache
A cache distributed across different computers.
7. Locks, transactions, and data synchronization
Some caches provide full lock and transaction support.
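Features 1 through 4 in the list above can be sketched in a few lines. This is only an illustration built on java.util.LinkedHashMap, not code from any of the caches named here; the class name SimpleLruCache and its methods are hypothetical, and the current time is passed in explicitly so that the behavior is easy to follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class, not part of any cache library.
class SimpleLruCache<K, V> {
    private static final class Slot<V> {
        final V value;
        final long storedAt; // feature 1: time the data entered the cache
        Slot(V value, long storedAt) {
            this.value = value;
            this.storedAt = storedAt;
        }
    }

    private final long ttlMillis; // feature 2: expiration time
    private final Map<K, Slot<V>> map;
    private long hits;            // feature 4: counters for the hit rate
    private long lookups;

    SimpleLruCache(final int capacity, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder = true keeps the least recently used entry first (feature 3)
        this.map = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
                return size() > capacity;
            }
        };
    }

    void put(K key, V value, long now) {
        map.put(key, new Slot<>(value, now));
    }

    V get(K key, long now) {
        lookups++;
        Slot<V> slot = map.get(key);
        if (slot == null) {
            return null;
        }
        if (now - slot.storedAt >= ttlMillis) { // expired: drop it and report a miss
            map.remove(key);
            return null;
        }
        hits++;
        return slot.value;
    }

    double hitRate() {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }
}
```

Swapping the eviction policy (for example to least frequently used) would mean replacing the LinkedHashMap with a structure that tracks access counts.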
Most of the features above have corresponding cache APIs. These APIs are intuitive and easy to use, so this article does not describe them.
This article describes the advanced features of memcached and JBoss Cache, including distributed cache support.
-------------------------------------------------------
Memcached
http://www.danga.com/memcached/
Memcached is a remote cache implementation with a client-server structure.
The server is written in C, and client APIs are available in many languages, including Java, C#, Ruby, Python, PHP, Perl, and C.
Memcached is mainly used in shared-nothing architectures. Applications use a client API to store data in and read data from the memcached server.
A typical application is using memcached as a database cache.
Memcached is also often used to store HTTP session data. The approach is to wrap the session interface and intercept the setAttribute() and getAttribute() methods.
MemcachedSessionWrapper {
    Object getAttribute(key) {
        return memcachedClient.get(session.getId() + key);
    }
    void setAttribute(key, value) {
        memcachedClient.setObject(session.getId() + key, value);
    }
}
Applications on different computers access the memcached server through one IP address.
The data for a given key exists only in the memory of one memcached server.
Memcached servers can also be deployed on multiple computers. Memcached uses the hash code of the key to decide which memcached server to access. So the data for a given key still exists only in the memory of one memcached server.
Therefore, memcached has no data synchronization problem. This property is critical. Data synchronization comes up again later, in the discussion of cluster caches.
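The key-to-server mapping described above can be sketched as follows. This is a minimal illustration, assuming a plain modulo over String.hashCode(); real memcached clients may use other hash functions or consistent hashing. The class name ServerSelector and the addresses used with it are made up.

```java
import java.util.List;

// Hypothetical helper: every client that uses the same server list
// maps a given key to the same server.
class ServerSelector {
    private final List<String> servers;

    ServerSelector(List<String> servers) {
        this.servers = servers;
    }

    String serverFor(String key) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        int index = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(index);
    }
}
```

Because the mapping depends only on the key and the server list, no server ever needs to tell another one about a change.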
Because memcached is a remote cache, both the keys and the values put in the cache must be serializable.
With a remote cache, the biggest worry is network communication overhead. According to people with experience, memcached's network overhead is low.
Memcached's API design is also friendly to remote communication. It provides coarse-grained methods such as getMulti(), which fetch data in batches and so reduce the number of network round trips.
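To make the round-trip argument concrete, here is a small sketch. It does not use a real memcached client; CountingRemoteCache is a hypothetical in-process stand-in that only counts how many calls would cross the network.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a remote cache: each method call counts as one round trip.
class CountingRemoteCache {
    private final Map<String, Object> store = new HashMap<>();
    int roundTrips;

    void set(String key, Object value) {
        roundTrips++;
        store.put(key, value);
    }

    Object get(String key) {
        roundTrips++;
        return store.get(key);
    }

    // One call, one round trip, however many keys are requested.
    Map<String, Object> getMulti(Collection<String> keys) {
        roundTrips++;
        Map<String, Object> found = new HashMap<>();
        for (String key : keys) {
            Object value = store.get(key);
            if (value != null) {
                found.put(key, value);
            }
        }
        return found;
    }
}
```

Fetching three keys one at a time costs three round trips here, while a single getMulti call costs one.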
-------------------------------------------------------
JBoss Cache
http://www.jboss.org/products/jbosscache
There is a commercial cluster cache called Tangosol.
JBoss Cache is the only open-source cache I know of that can rival Tangosol.
Data synchronization in a cluster cache requires network communication. Therefore, the data put into the cache must normally be serializable.
JBoss Cache introduces the concept of a POJO cache: data that is not serializable can still be synchronized across the cluster.
JBoss POJO Cache uses an AOP mechanism to support object synchronization, object attribute synchronization, associated-object caching, inheritance, collections, queries, and transactions at various levels. It is practically a small in-memory database.
Each of these is explained below.
The most puzzling question is how POJO synchronization across the cluster is implemented.
JBoss POJO Cache uses AOP to take over the work of communicating and propagating POJOs. There is no free lunch: since the POJO does not support serialization, the framework itself has to do that job, namely marshalling and unmarshalling. For example, it translates the Java object into XML and sends it out, and the receiving side translates the XML back into a Java object.
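The marshal and unmarshal step can be sketched with plain reflection. This is not how JBoss POJO Cache is implemented; it is a minimal illustration in which a flat map of field values stands in for the XML form, and the names PlainUser and FieldMarshaller are hypothetical.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

// A POJO that deliberately does not implement java.io.Serializable.
class PlainUser {
    String name;
    int age;

    PlainUser() { }

    PlainUser(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Hypothetical marshaller: the framework, not the POJO, does the conversion work.
class FieldMarshaller {
    // Copies every instance field into a transportable map.
    static Map<String, Object> marshal(Object pojo) {
        Map<String, Object> wire = new LinkedHashMap<>();
        try {
            for (Field field : pojo.getClass().getDeclaredFields()) {
                if (Modifier.isStatic(field.getModifiers())) {
                    continue;
                }
                field.setAccessible(true);
                wire.put(field.getName(), field.get(pojo));
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return wire;
    }

    // Rebuilds an object of the given type on the receiving side.
    static <T> T unmarshal(Map<String, Object> wire, Class<T> type) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T pojo = ctor.newInstance();
            for (Map.Entry<String, Object> entry : wire.entrySet()) {
                Field field = type.getDeclaredField(entry.getKey());
                field.setAccessible(true);
                field.set(pojo, entry.getValue());
            }
            return pojo;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The sketch handles only flat objects with a no-argument constructor; nested objects, inheritance, and collections are where the real work lies.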
As mentioned above, JBoss POJO Cache is like a small storage container. Its object management resembles that of Hibernate, JDO, JPA, and other ORM tools, and it likewise has the concepts of detach and attach.
Attach is put: it places the object in the cache. Detach removes the object from the cache. Why the extra names?
The reason is that what you put in is a clean POJO, but what you get back out is an enhanced object containing a good deal of interceptor code that listens for method calls on the object.
When you operate on this object, the JBoss AOP framework is notified and can respond accordingly, for example by synchronizing data.
JBoss POJO Cache also supports AOP on collection types. Likewise, you must first attach (put) the collection into the cache and get it back out. Only then are operations on the collection intercepted by JBoss AOP.
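The difference between the clean POJO that goes in and the enhanced object that comes out can be illustrated with a JDK dynamic proxy. JBoss AOP works differently (it instruments classes rather than wrapping interfaces), so this is a swapped-in technique that shows only the idea of interception; Account, PlainAccount, and AttachSketch are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;

interface Account {
    String getOwner();
    void setOwner(String owner);
}

// The clean POJO that the application creates.
class PlainAccount implements Account {
    private String owner;

    public String getOwner() {
        return owner;
    }

    public void setOwner(String owner) {
        this.owner = owner;
    }
}

class AttachSketch {
    // Returns an "enhanced" object: every setter call is forwarded to the target
    // and also recorded in changeLog, where a real cache would replicate it.
    static Account attach(final Account target, final List<String> changeLog) {
        InvocationHandler handler = new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                Object result = method.invoke(target, args);
                if (method.getName().startsWith("set") && args != null) {
                    changeLog.add(method.getName() + "=" + args[0]);
                }
                return result;
            }
        };
        return (Account) Proxy.newProxyInstance(
                Account.class.getClassLoader(), new Class<?>[] {Account.class}, handler);
    }
}
```

Changes made through the original PlainAccount reference bypass the handler, which is why the object has to be fetched back from the cache before it is modified.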
JBoss POJO Cache is built on the JBoss tree cache. The tree cache is a data structure similar to an XML DOM tree.
JBoss Cache uses a fully qualified name as the cache key, similar to an XPath. For example, a/b/c/d.
When you delete a/b, all keys under a/b and their data, such as a/b/c/d, are deleted as well.
The findObjects method of JBoss Cache can find a whole series of objects. For example, given a/b/c/d, findObjects can find the four objects a, b, c, and d and return them in a map.
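A tree-shaped cache with path-like keys is simple to sketch. The following is not the JBoss Cache API; TreeCacheSketch and its methods are hypothetical, and the sketch covers only put, get, and the subtree removal described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tree cache: each path segment is one level of the tree.
class TreeCacheSketch {
    private static final class Node {
        Object data;
        final Map<String, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    void put(String path, Object data) {
        Node node = root;
        for (String part : path.split("/")) {
            node = node.children.computeIfAbsent(part, k -> new Node());
        }
        node.data = data;
    }

    Object get(String path) {
        Node node = find(path);
        return node == null ? null : node.data;
    }

    // Removing a node drops every key below it as well.
    void remove(String path) {
        int cut = path.lastIndexOf('/');
        Node parent = cut < 0 ? root : find(path.substring(0, cut));
        if (parent != null) {
            parent.children.remove(path.substring(cut + 1));
        }
    }

    private Node find(String path) {
        Node node = root;
        for (String part : path.split("/")) {
            node = node.children.get(part);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

Removing a/b detaches one child from its parent, so the cost does not depend on how many keys sit below it.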
See the API documentation for specific usage, because JBoss POJO Cache offers many modes of behavior.
This hierarchical cache feature is very useful and not hard to implement, but I think it does not go far enough. Since it supports XPath-like keys, it might as well support XPath-style conditional queries, for example a[name="N"]/b/c. Of course, implementing this is expensive: you need to traverse the entire cache tree, just as XPath needs to traverse the DOM nodes.
Finally, like Tangosol, JBoss Cache supports lock mechanisms and transactions, features that I consider to be of marginal value. The transaction support offers four isolation levels, similar to a database.
In my opinion, this kind of support exists only to attract attention. It is overkill, and it is not what a cache is for. If you want the cache to act as a database, it would be better to put the effort into the batch query feature mentioned above.
-------------------------------------------------------
Cluster Synchronization
Cache synchronization within a cluster can be implemented in several ways, for example JMS, RMI, or client-server sockets. The most widely used is multicast as implemented by the open-source JGroups project. Configuring a cluster cache usually amounts to configuring JGroups, so you need to read the JGroups configuration documentation.
Cache operations include get, put, remove, and clear.
For a cluster cache, the read operation (get) must be local: data is read only from the memory of the current computer. The write operations remove and clear must be remote: they have to be synchronized with the other computers in the cluster. Put can be either local or remote.
With remote put, when one computer places data in the cache, the data is also transmitted to the other computers in the cluster. The advantage is that the cache on every computer in the cluster is filled promptly. The disadvantage is that a large volume of data has to be transmitted, which is relatively costly.
With local put, when one computer places data in the cache, the data is not transmitted to the other computers in the cluster. The advantage is that no data needs to be transmitted. The disadvantage is that the caches on the other computers are not filled promptly. This is not a serious problem: when data is not found in the cache, fetching it from the database is perfectly normal.
Local put has clear advantages over remote put, so cluster caches generally adopt the local put policy. Caches generally provide a local put configuration option. If yours does not, consider using another cache.
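The local put policy can be sketched with a few in-process objects standing in for cluster nodes. This is only an illustration: ClusterNodeSketch is a hypothetical class, and the direct call on each peer stands in for a network notification such as a JGroups multicast.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical cluster node: get and put are local, remove is propagated.
class ClusterNodeSketch {
    private final Map<String, Object> local = new HashMap<>();
    private final List<ClusterNodeSketch> peers = new ArrayList<>();

    // Wires the given nodes together so that each one knows all the others.
    static void join(ClusterNodeSketch... nodes) {
        for (ClusterNodeSketch a : nodes) {
            for (ClusterNodeSketch b : nodes) {
                if (a != b) {
                    a.peers.add(b);
                }
            }
        }
    }

    Object get(String key) {
        return local.get(key); // local memory only
    }

    void put(String key, Object value) {
        local.put(key, value); // local put: nothing is sent to the peers
    }

    void remove(String key) {
        local.remove(key);
        for (ClusterNodeSketch peer : peers) {
            peer.local.remove(key); // stands in for a remote notification
        }
    }
}
```

Only the key travels on remove, which is much cheaper than shipping the cached value on every put.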
-------------------------------------------------------
Center vs Cluster
Memcached can be viewed as a center cache.
The center cache and the cluster cache compare as follows.
The center cache has no synchronization issues. It therefore has the advantage for remove/clear, which do not need to be broadcast to several computers.
However, every center cache operation (get/put/remove/clear) is remote, whereas the get/put operations of a cluster cache are local. The cluster cache therefore has the advantage for get/put.
Local get/put is a clear advantage when assembling and splitting associated objects.
Here is what associated objects means.
For example, a topic object has several post objects under it, and each post object has a user object.
When the topic object is stored in the cache, the associated objects must be split out and stored in the regions of their own entity types.
Topic region -> topic ID -> topic object
Post region -> post ID -> post object
User region -> user ID -> user object
In this case, put may be called many times, so the overhead of remote put is relatively large.
The get process is similar: several gets are needed to assemble a complete topic object.
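The splitting and reassembly can be sketched as follows. The classes are hypothetical and the cached state is reduced to a few fields; the point is only that one logical putTopic turns into several region puts, counted in putCount.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one region per entity type, keyed by entity ID.
class RegionSplitSketch {
    static final class User {
        final int id; final String name;
        User(int id, String name) { this.id = id; this.name = name; }
    }
    static final class Post {
        final int id; final String text; final User user;
        Post(int id, String text, User user) { this.id = id; this.text = text; this.user = user; }
    }
    static final class Topic {
        final int id; final String title; final List<Post> posts;
        Topic(int id, String title, List<Post> posts) { this.id = id; this.title = title; this.posts = posts; }
    }

    // region name -> (entity ID -> cached state)
    final Map<String, Map<Integer, Object>> regions = new HashMap<>();
    int putCount;

    private void put(String region, int id, Object state) {
        putCount++;
        regions.computeIfAbsent(region, r -> new HashMap<>()).put(id, state);
    }

    // Splits the object graph: associations are stored as IDs, not as objects.
    void putTopic(Topic topic) {
        List<Integer> postIds = new ArrayList<>();
        for (Post post : topic.posts) {
            put("user", post.user.id, post.user.name);
            put("post", post.id, new Object[] {post.text, post.user.id});
            postIds.add(post.id);
        }
        put("topic", topic.id, new Object[] {topic.title, postIds});
    }

    // Reassembles the topic with one region lookup per entity.
    Topic getTopic(int topicId) {
        Object[] topicState = (Object[]) regions.get("topic").get(topicId);
        List<Post> posts = new ArrayList<>();
        for (Object postId : (List<?>) topicState[1]) {
            Object[] postState = (Object[]) regions.get("post").get(postId);
            int userId = (Integer) postState[1];
            User user = new User(userId, (String) regions.get("user").get(userId));
            posts.add(new Post((Integer) postId, (String) postState[0], user));
        }
        return new Topic(topicId, (String) topicState[0], posts);
    }
}
```

A topic with two posts already costs five puts; if each of them were a remote call, the cost would grow with the size of the object graph.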
-------------------------------------------------------
Expired data
A cache can be used anywhere, for example as a page cache. However, the most common use of a cache is inside an ORM such as Hibernate, JDO, or JPA.
There is one principle for using an ORM cache: do not put uncommitted data into the cache. This prevents dirty reads.
Database transactions come in two kinds: read transactions, which do not modify data, and write transactions, which do.
A write transaction proceeds as follows:
db.commit();
cache.remove(key); // this step clears the cached data and records the time as removeTime
A read transaction proceeds as follows:
readTime = current time;
data = cache.get(key);
if (data is null) {
    data = db.load(key);
    cache.put(key, data, readTime); // readTime is required here
}
Note that put requires the readTime parameter.
This readTime is compared with the last removeTime.
If readTime > removeTime, the put succeeds and the data is cached. Otherwise the put is rejected.
This ensures that stale data is not put into the cache and that database changes are reflected promptly.
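The readTime check can be sketched as a small class. The name TimestampedCache is hypothetical, and times are passed in explicitly so that the ordering is easy to follow; a real implementation would take them from a clock or a logical counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a cache that rejects puts based on stale reads.
class TimestampedCache {
    private final Map<String, Object> data = new HashMap<>();
    private final Map<String, Long> removeTimes = new HashMap<>();

    synchronized Object get(String key) {
        return data.get(key);
    }

    // Called by a write transaction right after db.commit().
    synchronized void remove(String key, long removeTime) {
        data.remove(key);
        removeTimes.put(key, removeTime);
    }

    // Called by a read transaction; readTime is when the transaction started reading.
    // Returns false when the data was read before the last remove and is therefore rejected.
    synchronized boolean put(String key, Object value, long readTime) {
        Long removeTime = removeTimes.get(key);
        if (removeTime != null && readTime <= removeTime) {
            return false;
        }
        data.put(key, value);
        return true;
    }
}
```

For example, a reader that started at time 10 and tries to put after a writer removed the key at time 20 is turned away, while a reader that started at time 30 is accepted.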
Note also that the cache.remove(key) event needs to be propagated to the other computers in the cluster, telling them to clear their caches.
Why is this notification required?
It is important to understand that its purpose is not to avoid concurrent modification conflicts. To avoid those, an optimistic locking (version control) mechanism is needed.
A possible misunderstanding is that, with optimistic version control in place, the cache.remove notification is no longer needed. This is not correct.
The main purpose of the cache.remove notification is to make sure the cache cleans up stale data promptly and reflects data changes. This ensures that the application does not show stale data to users most of the time.
In addition, there is a small chance that another transaction runs between the two calls db.commit() and cache.remove(key). During this very short window, read committed cannot be guaranteed, and stale data may be visible very briefly.
It is brief because cache.remove then cleans up the stale data.
If you are paranoid enough that even such an improbable and short-lived event cannot be tolerated, then before db.commit() take a pessimistic lock on the cache so that other transactions cannot put the data into the cache. That prevents this low-probability, low-impact event.
JBoss Cache and Tangosol provide pessimistic locks, a feature of marginal value. It is a typical case of misallocated development resources: effort goes into features nobody needs while useful features are left undone.
-------------------------------------------------------
ORM query cache
ORM caches generally come in two kinds. One is the ID cache (called the second-level cache in ORM documentation), which stores the entity object for each entity ID. The other is the query cache, which stores the result set for a query statement.
The ID cache is very intuitive. As described above, each entity class corresponds to a region, and entities are stored in their region.
The query cache is complex and has huge potential. It deserves a careful explanation.
Query cache support in existing ORMs is not ideal.
For example, Hibernate places the entire result set directly in the query cache. As a result, the query cache must be cleared whenever any database write occurs.
A better way is to store a list of IDs in the query cache. On each lookup, you first obtain the ID list and then load the list of entities by ID. The query cache is cleared according to the table names involved in the query: when one of those tables is modified, the query cache is cleared, depending on the kind of modification.
For example, select t2.* from t1, t2 where t1.id = t2.foreign_id and t1.name = 'A'
insert into t1, delete from t1, insert into t2, and delete from t2 all clear this query cache.
A statement such as update t1 set also clears this query cache.
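The ID-list approach can be sketched as follows. IdListQueryCache is a hypothetical class, and the sketch makes a simplifying assumption: any modification of a table clears every cached query that involves that table, without distinguishing kinds of modification.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical query cache: stores entity IDs per query, invalidated by table name.
class IdListQueryCache {
    private final Map<String, List<Integer>> idsByQuery = new HashMap<>();
    private final Map<String, Set<String>> tablesByQuery = new HashMap<>();

    // tables lists every table the query reads from.
    void put(String query, Set<String> tables, List<Integer> ids) {
        idsByQuery.put(query, ids);
        tablesByQuery.put(query, tables);
    }

    // Returns the cached ID list, or null on a miss. The entities themselves
    // are then loaded by ID, typically through the ID cache.
    List<Integer> get(String query) {
        return idsByQuery.get(query);
    }

    // Called after an insert, delete, or update on the given table.
    void tableModified(String table) {
        Iterator<Map.Entry<String, Set<String>>> it = tablesByQuery.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> entry = it.next();
            if (entry.getValue().contains(table)) {
                idsByQuery.remove(entry.getKey());
                it.remove();
            }
        }
    }
}
```

Queries that touch unrelated tables survive a write, which is the gain over clearing the whole query cache on every write.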
Why does Hibernate not do this? The query cache is complicated. The selected result set may contain more than one entity type, or only a few fields.
A lot of work remains to be done in this area. It is worth the effort, because the query cache can improve performance greatly.
-----------------------------------------------------------
Query key
Query cache performance has several aspects to consider, for example the query key. The query key consists of two parts: the query string (SQL, HQL, EQL, or OQL) and the parameters.
Looking up the data for a query key compares keys in two steps: first hash, then equals. The hashCode and equals methods of the query key are therefore very important, especially equals.
The equals method has to compare long query strings. On a miss the strings differ, and the cost is very small, because unequal strings usually differ in length or in their first few characters. The greatest cost arises when the query strings are equal and must be compared from beginning to end.
Some techniques can speed up the string comparison. For example, most queries are static, so we can use singleton strings. Comparing strings with the same reference is fast. For an ORM, it is best to use the outermost HQL, EQL, or OQL as the query key rather than the generated SQL, because the generated SQL is a new string with a different reference, so the whole string must be compared on a cache hit.
Dynamically assembled query strings are hard to optimize, because the final result is a new string. One approach I use is to assemble a String[] dynamically. If two String[] are equal, their element strings have equal references. This results from the JVM's optimization of string constants in a class.
For example,
String[] a = {
    "select * from t where",
    " a = 1",
    " and b = 2"
};
String[] b = {
    "select * from t where",
    " a = 1",
    " and b = 2"
};
Therefore, comparing a and b requires only three string reference comparisons.
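A query key built on this idea might look like the sketch below. QueryKey is a hypothetical class. It assumes the fragments are string literals, which Java interns, so equal fragments normally share one reference; the fallback to String.equals keeps the key correct when they do not.

```java
import java.util.Arrays;

// Hypothetical query key: query fragments plus parameters, with a precomputed hash.
final class QueryKey {
    private final String[] parts;  // query fragments, ideally string literals
    private final Object[] params; // query parameters
    private final int hash;

    QueryKey(String[] parts, Object[] params) {
        this.parts = parts;
        this.params = params;
        this.hash = 31 * Arrays.hashCode(parts) + Arrays.hashCode(params);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof QueryKey)) {
            return false;
        }
        QueryKey other = (QueryKey) o;
        if (hash != other.hash || parts.length != other.parts.length) {
            return false;
        }
        for (int i = 0; i < parts.length; i++) {
            String mine = parts[i];
            String theirs = other.parts[i];
            // Identical references settle the comparison without reading any characters.
            // String.equals makes the same check internally; it is spelled out here for clarity.
            if (mine != theirs && !mine.equals(theirs)) {
                return false;
            }
        }
        return Arrays.equals(params, other.params);
    }
}
```

The hash is computed once in the constructor, so a lookup in a hash map pays for it only when the key is created.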
Regarding the statement that Ehcache does not support clustering:
This is incorrect. Clustering has been supported since Ehcache 1.2.
Another point is that the description of the query cache is not quite accurate.
1. Query cache is actually a very broad category. It should not be used only at the ORM layer, nor is it limited to using SQL statements as keys. That is just how the query cache is defined in an ORM, which is one case among several.
2. Query result sets can also be cached, but this is useful only when queries are run frequently with the same parameters. The query cache does not cache the exact state of the entities contained in the result set. It caches only the identifier values of those entities and the results of value types. It compares the issued SQL statement with the latest cache content and applies the difference to the cache. The query cache is therefore usually used together with the second-level cache. Although the query cache can improve performance, according to Hibernate in Action the scenarios where it applies are fairly few.
If we broaden the concept of the query cache, it can greatly improve performance for data that does not need to be highly up to date. For example, if the homepage is refreshed every 5 minutes, the query cache can play its part.