An amazing database caching solution that supports 100 million PV/day

Source: Internet
Author: User

An open-source architecture that supports 100 million PV/day
Claiming 100 million PV/day may sound exaggerated, and yes, it is partly meant to draw you in. But if you read this carefully, you will not be disappointed. Of course, some "experts" will sneer at it; that doesn't matter. There are always people who love to criticize others while never seeing themselves clearly.

If you really want to support me and Chinese open-source projects, post this article to your blog or add it to your favorites, and remember to include this document. Thank you!

The system I am describing is an efficient database cache system built on Hibernate, including a distributed option. It has been running in production on the Internet without any major problems, and I believe it is powerful enough to handle applications with millions of IPs per day. Some people will surely doubt this; in fact, how many IPs per day the system can support depends not on the system itself, but on the people using it.

The code looks simple, but it is actually the distillation of two years of experience. Many difficulties came up along the way and were solved one by one, so please respect the fruits of others' work. The system is simple and easy to use: the main class, BaseManager.java, is under 1000 lines of code, and it would not be an exaggeration to call it excellent. Those 1000 lines implement object caching, list and length caching, hash-by-field caching, delayed updates, automatic list cache clearing, and more. That is enough to build the vast majority of application websites: forums, blogs, ICP filing sites, dating communities, and so on.

I ran a stress test under ideal conditions. A JSP page with no database operations (something like a news site homepage) completed more than 2000 requests per second (in normal operation roughly 1 in 1000 requests hits the database, and the other 999 are served straight from the cache); an item detail page completed more than 3000 requests per second; a purely static HTML page completed more than 7000 requests per second. I stress-tested the homepage for three hours and it completed 24,850,800 requests; Java held up without any trouble and memory usage did not grow. At 2000 requests per second over a 15-hour day, 3600 * 15 * 2000 = 108 million requests can be completed per day. Of course, that is the ideal case; even if the real figure is an order of magnitude lower, it can still handle 10 million PV/day. And mind you, this is just an ordinary server costing about 10,000 yuan: 4 GB of RAM, 2 CPUs, Linux AS4, with an Apache 2.0.63 / Resin 2.1.17 / JDK 6.0 environment.

................................. Now, on to the main topic .................................

Why cache? If you have to ask, you are a newbie. Database throughput is limited, and 5000 reads and writes per second is already impressive. Without a cache, suppose a page requires 100 database operations: at 5000 operations per second the database can serve only 50 pages per second, so at most 50 * 3600 * 15 = 2.7 million PV per day can be supported, and the database server is worked to exhaustion. For pure caching, my system is stronger than memcached alone; it is effectively a second cache level on top of memcached. We all know memcached is strong, but its throughput is still limited, around 20,000 gets and puts per second, which becomes a bottleneck in ultra-large-scale applications. A local HashMap, by contrast, can do hundreds of thousands of puts and gets per second, so its performance cost is almost negligible. Tip: do not run in distributed mode when you do not need it; when you do need it, switch to memcached. My cache system already supports this with a configuration change. If you are interested, test it carefully!

In my opinion, database caches fall into four types. First, the cache of a single object (an object being one row in the database): use a HashMap; slightly more complex, wrap a HashMap with an LRU algorithm; more complex still, use distributed memcached. Second, list caching, such as the list of posts in a forum. Third, length caching, for example the number of posts in a forum board, used for pagination. Fourth, complex group, sum, and count queries, for example the list of the most popular posts in a forum ranked by clicks. The first type is straightforward to implement; the last three are harder, and there seems to be no general solution. For now I will analyze list caching (the second type).
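For the "LRU-wrapped HashMap" mentioned for single-object caching, a minimal sketch in Java can use LinkedHashMap's access-order mode. The class name and the eviction style are my illustrative choices, not part of the original BaseManager.java:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU object cache: LinkedHashMap in access order evicts the
// least recently used entry once the configured capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // accessOrder = true moves an entry to the tail on every get()
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put(); returning true evicts the eldest
        // (i.e. least recently used) entry
        return size() > capacity;
    }
}
```

For a single-object cache, the key would be the database id and the value the row object; a bounded LRU keeps hot rows in memory without growing without limit.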

In general list caching, as implemented at the bottom layers of MySQL and Hibernate, list results are cached by query condition. But as soon as any record in the table changes (insert, delete, or update), the list cache must be cleared. So if a table's records change frequently (which is usually the case), the list cache becomes almost useless and the hit rate is far too low.

I found a way to improve the list cache: when a record in the table changes, traverse all list caches and delete only those affected, instead of clearing everything. For example, when a post is added in forum board id = 1, only the list caches for board id = 1 need to be cleared, not those for board id = 2. The advantage of this approach is that it can cache lists under all kinds of query conditions (equal to, greater than, not equal to, less than); the drawback is the potential performance cost of the traversal. If the maximum size of the list cache is set to 10,000, two 4-core CPUs can manage only 300-odd traversals per second, so if there are more than about 300 insert/update/delete operations per second, the system cannot keep up.
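The traverse-and-selectively-clear strategy might be sketched like this. The key format and the matching logic are assumptions for illustration; BaseManager.java may do this differently:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: when a record changes, walk every cached list key and drop
// only the entries whose query condition mentions the affected board.
class ListCacheInvalidator {
    // key: serialized query condition, e.g. "topicId=1#start=0#length=15"
    private final Map<String, List<Long>> listCache = new ConcurrentHashMap<>();

    void put(String conditionKey, List<Long> ids) {
        listCache.put(conditionKey, ids);
    }

    List<Long> get(String conditionKey) {
        return listCache.get(conditionKey);
    }

    // On insert/update/delete of a record under topicId, remove only
    // the list caches whose key contains that exact condition token.
    int invalidateForTopic(long topicId) {
        String marker = "topicId=" + topicId;
        int removed = 0;
        Iterator<Map.Entry<String, List<Long>>> it = listCache.entrySet().iterator();
        while (it.hasNext()) {
            String key = it.next().getKey();
            for (String part : key.split("#")) {
                if (part.equals(marker)) { // exact token match avoids 1 vs 10 clashes
                    it.remove();
                    removed++;
                    break;
                }
            }
        }
        return removed;
    }
}
```

The cost the article describes is visible here: invalidation is O(number of cached lists), which is why this approach breaks down past a few hundred writes per second.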

Since neither of the previous two solutions is perfect, after several weeks of thought my colleagues and I finally came up with a way to cache lists hashed by certain table fields. This method needs no large-scale traversal, so the CPU cost is tiny, and because the list caches are hashed by field, the hit rate is extremely high. The idea is as follows. Each table has three cache Maps (key-value stores). The first is object cache A: the key is the database id and the value is the database object (one row of data). The second is the general list cache B, with a maximum size of around 1000: the key is a string built from the query condition (such as "start=0#length=15#active=0#state=0"), and the value is a list of all ids matching that condition. The third is hash cache C: the key is a hash-field string (for example "userId=109"), and the value is an inner HashMap shaped just like B. Only Map B ever needs to be traversed. If this sounds abstract, the example should make it clear. I will use a forum reply table T, with fields such as id, topicId, and postUserId (topicId is the id of the post, postUserId the id of the replier).
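The three Maps described above could look like this for the reply table T. Class and field names here are my own; the actual BaseManager.java may organize them differently:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The three per-table caches: A (object), B (general list), C (hash).
class TableCache {
    // A: object cache — database id -> row object
    final Map<Long, Object> objectCache = new HashMap<>();

    // B: general list cache — query-condition string -> list of ids,
    //    e.g. "start=0#length=15#active=0#state=0" -> [11, 22, 133]
    final Map<String, List<Long>> generalListCache = new HashMap<>();

    // C: hash cache — hash-field string (e.g. "topicId=2008") -> an
    //    inner map shaped just like B, holding all lists for that bucket
    final Map<String, Map<String, List<Long>>> hashCache = new HashMap<>();
}
```

Note that only B ever needs traversing on invalidation; C is cleared one bucket at a time by key, which is what keeps the CPU cost low.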

The first case is also the most common case, that is, to obtain the reply corresponding to a post. The SQL statement should be like
Select id from T where topicId = 2008 order by createTime desc limit
Select id from T where topicId = 2008 order by createTime desc limit 5
Select id from T where topicId = 2008 order by createTime desc limit
Obviously, topicId is the right hash field here: the lists above (there can be N of them) are all hashed into the inner Map whose key is topicId=2008. When post 2008 gets a new reply, the system automatically clears the hash Map with key topicId=2008. Since this hash needs no traversal, its maximum size can be set very high, say 100,000, so the reply lists of 100,000 posts can all be cached at once. When one post gets a new reply, the cached reply lists of the other 99,999 posts are untouched, and the cache hit rate is extremely high.
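The clear-one-bucket-only behavior might be sketched as follows (names and key formats are assumed for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a new reply to topic 2008 clears only the bucket keyed
// "topicId=2008"; every other topic's cached reply lists survive.
class ReplyListCache {
    private final Map<String, Map<String, List<Long>>> hashCache = new HashMap<>();

    void cacheList(long topicId, String pageKey, List<Long> ids) {
        hashCache.computeIfAbsent("topicId=" + topicId, k -> new HashMap<>())
                 .put(pageKey, ids);
    }

    List<Long> getList(long topicId, String pageKey) {
        Map<String, List<Long>> bucket = hashCache.get("topicId=" + topicId);
        return bucket == null ? null : bucket.get(pageKey);
    }

    // Called when a new reply is inserted: a single O(1) removal,
    // no traversal of the other buckets.
    void onNewReply(long topicId) {
        hashCache.remove("topicId=" + topicId);
    }
}
```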

The second case is that the background needs to display the latest reply, and the SQL statement should be like
Select id from T order by createTime desc limit 0, 50
No hashing is needed in this case: few people access the back end and there are not many such general lists, so they can go straight into general list cache B.

In the third case, obtain a user's reply. The SQL statement is like
Select id from T where userId = 2046 order by createTime desc limit
Select id from T where userId = 2046 order by createTime desc limit
Select id from T where userId = 2046 order by createTime desc limit
This case is similar to the first one; just hash by userId instead.

In the fourth case, obtain a user's reply to a post. The SQL statement is like
Select id from T where topicId = 2008 and userId = 2046 order by createTime desc limit
Select id from T where topicId = 2008 and userId = 2046 order by createTime desc limit 15, 15
This case is rare. Generally it is grouped under topicId, so it goes into the hash Map whose key is topicId=2008.

The final cache structure should look like this:

Cache A is:
Key (long type)     Value (type T)
11                  T object with id = 11
22                  T object with id = 22
133                 T object with id = 133
......

List cache B is:
Key (String type)                                   Value (ArrayList type)
from T order by createTime desc limit ..., 50       ArrayList of all retrieved ids
from T order by createTime desc limit 50, 50        ArrayList of all retrieved ids
from T order by createTime desc limit ..., 50       ArrayList of all retrieved ids
......

Hash cache C is:
Key (String type)    Value (HashMap)

userId=2046          Key (String type)                     Value (ArrayList)
                     userId=2046#..., 5                    ArrayList of ids
                     userId=2046#5, 5                      ArrayList of ids
                     userId=2046#15, 5                     ArrayList of ids
                     ......

userId=2047          Key (String type)                     Value (ArrayList)
                     userId=2047#..., 5                    ArrayList of ids
                     userId=2047#5, 5                      ArrayList of ids
                     userId=2047#15, 5                     ArrayList of ids
                     ......

userId=2048          Key (String type)                     Value (ArrayList)
                     userId=2048#topicId=2008#0, 5         ArrayList of ids
                     userId=2048#5, 5                      ArrayList of ids
                     userId=2048#15, 5                     ArrayList of ids
                     ......

......

Summary: this caching method can hold very large numbers of lists with a high hit rate, so it can withstand ultra-large-scale applications. Engineers do need to configure which fields to hash based on their own business logic; generally a table's index fields are used (mind the order, and put the most selective fields first). Taking userId as the example, if M lists are cached for each of N users and one user's data changes, only that user's lists are cleared; the lists of the other N-1 users stay cached. The above covers list caching. Length caching works the same way: for example, the result of select count(*) from T where topicId = 2008 is also placed in the hash Map whose key is topicId=2008. Combined with MySQL memory tables and memcached, plus F5 devices for distributed load balancing, the system is enough to handle applications on the scale of 10 million IPs/day; apart from search engines, ordinary application websites never reach that scale.
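The idea of keeping the length cache in the same hash bucket as the lists, so that both are invalidated together, could be sketched like this (all names and key formats are illustrative assumptions, not the original code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: one bucket per hash key ("topicId=2008") holds both the paged
// id lists and the cached count(*) for that condition, so clearing the
// bucket invalidates the lists and the length cache in one step.
class BucketedCache {
    static class Bucket {
        final Map<String, List<Long>> lists = new HashMap<>(); // page key -> ids
        Integer count; // cached count(*); null means not cached
    }

    private final Map<String, Bucket> buckets = new HashMap<>();

    private Bucket bucket(String key) {
        return buckets.computeIfAbsent(key, k -> new Bucket());
    }

    void putList(String bucketKey, String pageKey, List<Long> ids) {
        bucket(bucketKey).lists.put(pageKey, ids);
    }

    void putCount(String bucketKey, int count) {
        bucket(bucketKey).count = count;
    }

    Integer getCount(String bucketKey) {
        Bucket b = buckets.get(bucketKey);
        return b == null ? null : b.count;
    }

    // One removal invalidates the lists and the count together.
    void invalidate(String bucketKey) {
        buckets.remove(bucketKey);
    }
}
```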

Let me state it once more: whether the system is powerful depends not on the system itself, but on the people who use it!

This cache system is the summary of years of my colleagues' and my experience. It looks simple, but it is not that simple. I am open-sourcing it for three reasons: first, I genuinely hope many people will use it; second, I hope more people will improve and refine it; third, I hope everyone will join in and contribute to a general, efficient database cache architecture. After all, database operations are the most common operations in every application, and the most prone to performance bottlenecks.

The zip package contains the configuration instructions and the JSP pages used for testing. Just configure it as a web application and you can quickly try it out and see the strength of the cache for yourself.

The configuration guide is in the docs directory (STARTUP configuration .txt).

Finally, once again: if you really want to support me and Chinese open-source projects, post this article to your blog or add it to your favorites, and remember to include this document. Thank you, and good luck.

QQ: 24561583

The formatting got a bit messy in copying. Download the package and take a look; there is a Word document inside that is clearly written.
