1. Question 2. Handling 3. Follow-up 4. Analysis 5. Summary
1. The question
This morning, customers of the standalone edition of the system widely reported that the application was crashing. The architect investigated and found that Redis was full. Login session information is stored in Redis, so once it filled up, new login information could not be written. We use the Aliyun Redis service. An upgrade had already been ordered as part of the renewal, but it only takes effect in one month. We tried to expand the instance immediately, but the upgrade failed because the earlier order already existed.
2. Handling
As an emergency measure, we first flushed all data in Redis, after which users could log in to the standalone edition again. However, some functions were still unusable: only about 80% of the system's functionality had been restored. Investigation showed that some page templates were also stored in Redis, and after the flush they had not been reloaded. Restarting the template project wrote the templates back to Redis and fully restored the system.
3. Follow-up
Because this was a serious emergency, follow-up actions and a summary are necessary:
1, the development team analyzes the problem and finds out how Redis is actually being used;
2, the emergency handling process above is written up as a step-by-step procedure on the wiki;
3, at the same time, Redis usage-rate monitoring with SMS notification must be added.
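The usage-rate monitoring in item 3 could be sketched roughly as follows. This is a minimal sketch, not the team's actual tooling: it assumes the memory statistics arrive as a dictionary shaped like the Redis `INFO memory` output (with `used_memory` and `maxmemory` in bytes), and the `notify` callback is a hypothetical stand-in for whatever SMS gateway is used. The 90% threshold is the figure proposed in the summary.

```python
from typing import Callable, Mapping, Optional

# Threshold proposed in the summary section; adjust to your own capacity plan.
ALERT_THRESHOLD = 0.90


def memory_usage_ratio(info: Mapping[str, int]) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is configured.

    `info` is assumed to look like the INFO memory section, e.g. the dict
    a Redis client returns for it.
    """
    limit = info.get("maxmemory", 0)
    if not limit:
        return None  # maxmemory == 0 means "no limit" in Redis
    return info["used_memory"] / limit


def check_and_alert(
    info: Mapping[str, int],
    notify: Callable[[str], None],  # hypothetical SMS sender
    threshold: float = ALERT_THRESHOLD,
) -> bool:
    """Call notify() and return True when usage is at or above the threshold."""
    ratio = memory_usage_ratio(info)
    if ratio is None or ratio < threshold:
        return False
    notify(f"Redis memory usage at {ratio:.0%} (threshold {threshold:.0%})")
    return True
```

Run on a schedule (for example every minute from cron), a check like this would have raised the alarm well before the instance filled up.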
4. Analysis
Redis runs in a master-slave configuration with 2 GB of memory. The total capacity is small, but since only about 100 MB had been in use, we did not expand immediately. Instead, we configured an automatic upgrade to 4 GB to take effect with the renewal one month later.
Starting 3 weeks ago, Redis usage soared. The problem went unnoticed because there was no alerting.
The standalone edition and the chain edition share one Redis instance. db0 is used by the standalone edition, and its usage is small. db1, used by the chain edition, holds 2.29 million keys (occupying 1.7 GB), which is what pushed the usage rate too high.
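Per-database key counts like these can be read from the `INFO keyspace` section, which reports one line per database in the form `dbN:keys=...,expires=...,avg_ttl=...`. The following is a small sketch of parsing that output to find the database holding the most keys. The function names are hypothetical, and apart from the 2.29 million keys in db1, the sample numbers in the usage note are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def parse_keyspace_line(line: str) -> Tuple[str, Dict[str, int]]:
    """Parse one INFO keyspace line such as 'db1:keys=10,expires=5,avg_ttl=0'."""
    db, _, fields = line.strip().partition(":")
    stats: Dict[str, int] = {}
    for pair in fields.split(","):
        name, _, value = pair.partition("=")
        stats[name] = int(value)
    return db, stats


def busiest_db(lines: Iterable[str]) -> Tuple[str, Dict[str, int]]:
    """Return (db name, stats) for the database with the most keys."""
    # Skip the '# Keyspace' header and any blank lines.
    parsed = dict(parse_keyspace_line(l) for l in lines if l.startswith("db"))
    return max(parsed.items(), key=lambda item: item[1]["keys"])
```

For example, `busiest_db(["# Keyspace", "db0:keys=12000,expires=11000,avg_ttl=0", "db1:keys=2290000,expires=2290000,avg_ttl=0"])` picks out `db1`.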
There are 3 kinds of data in Redis:
- Login session information, used to share sessions across multiple machines; it has an expiration time: 1 day for the standalone edition, 30 days for the chain edition;
- Page template data, a few hundred entries used to render pages; this is static data with no expiration time, and it occupies very little space;
- Business data, such as temporary data created while an operation is in progress; this data is removed eventually, but not on a timer (it is deleted once the business operation completes).
Each time a user explicitly logs in, a new session ID is created, which adds a new record to Redis, so the one-day expiration used in the standalone edition is reasonable. Users shut down when they leave work each day and open the system at least once the next working day, so there is at least one login per day. In fact there are far more: according to our statistics, each login is used for 18 minutes on average, and active users may log in 5-10 times a day.
5. Summary
Observation shows that login-information keys grow by roughly 80,000 per day. Because these keys only expire after one month, they kept piling up, which caused the problem.
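A back-of-the-envelope check makes the effect of the expiration time concrete. The sketch below uses only figures from this write-up (80,000 new keys per day, 2.29 million keys occupying 1.7 GB). The function names are illustrative, and the per-key size is a rough average that assumes all of db1's keys are about the same size.

```python
def steady_state_keys(new_keys_per_day: int, ttl_days: int) -> int:
    """Keys alive at any time once creation and expiry have balanced out."""
    return new_keys_per_day * ttl_days


def average_bytes_per_key(total_bytes: float, key_count: int) -> float:
    """Rough average memory footprint of one key."""
    return total_bytes / key_count


NEW_KEYS_PER_DAY = 80_000          # observed daily growth
OBSERVED_KEYS = 2_290_000          # keys found in db1
OBSERVED_BYTES = 1.7 * 1024 ** 3   # 1.7 GB occupied by db1

keys_30_day_ttl = steady_state_keys(NEW_KEYS_PER_DAY, 30)  # 2,400,000 keys
keys_1_day_ttl = steady_state_keys(NEW_KEYS_PER_DAY, 1)    # 80,000 keys

avg_size = average_bytes_per_key(OBSERVED_BYTES, OBSERVED_KEYS)
print(f"30-day TTL: {keys_30_day_ttl:,} keys, ~{keys_30_day_ttl * avg_size / 1024 ** 3:.2f} GB")
print(f" 1-day TTL: {keys_1_day_ttl:,} keys, ~{keys_1_day_ttl * avg_size / 1024 ** 2:.0f} MB")
```

With a 30-day expiration the steady state is about 2.4 million keys, which is close to the 2.29 million observed and nearly fills a 2 GB instance. With a 1-day expiration the same traffic would hold only 80,000 keys at a time.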
This shows that our use of caching lacks the necessary guidelines, capacity expectations, and monitoring tools.
1, Monitoring and alerting: send alert notifications to the architecture group and the developers when usage exceeds 90%.
2, Design principles: in principle, all data in Redis must have an expiration time, except static data (templates, etc.). Every piece of data added without an expiration time permanently reduces the available resources, and over time the resources will run out, so be sure to set an expiration time. One day is enough for login session information. For temporary data such as staged data, 2-3 days is enough: if a customer has filled in half a form and still has not dealt with it after 2 days, it is certainly not important to them, and after 3 days they will have forgotten it themselves, so clearing the data has little impact on the customer. If data must be kept for a long time, consider persistent storage such as MongoDB or MySQL.
3, Emergency strategy: develop an emergency handling procedure that spells out the steps for clearing the chain edition's db1 data.
4, Capacity strategy: every time a feature that uses cache resources goes live, estimate the capacity it will use so that expansion can be done ahead of time.
5, Isolation strategy: cached data for different applications and different purposes should use different Redis instances, to avoid interference between systems.
6, Eviction policy: change the policy Redis applies when it exceeds maximum memory from LRU restricted to keys with an expiration time (in Redis terms, presumably volatile-lru) to LRU across all keys (allkeys-lru), so that when memory is full, the least recently used cold data is evicted first and business is not affected.
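The design principle in item 2 can be enforced in code by routing every cache write through one helper that looks up the expiration time by data kind. The following is a minimal sketch under stated assumptions: `cache_set`, `TTL_SECONDS`, and the kind names are hypothetical, and `client` is assumed to be any object with a `set(name, value, ex=...)` method, as the redis-py client provides. The TTL values follow the figures given above.

```python
from typing import Any, Dict, Optional

DAY = 24 * 60 * 60  # seconds

# Expiration per data kind, following the design principle above.
# None marks static data that is explicitly exempt from expiry.
TTL_SECONDS: Dict[str, Optional[int]] = {
    "session": 1 * DAY,   # login session information
    "staging": 3 * DAY,   # temporary, half-finished business data
    "template": None,     # static page templates
}


def cache_set(client: Any, kind: str, key: str, value: str) -> Optional[int]:
    """Write a value with the TTL registered for its kind.

    Unknown kinds are rejected, so no key can be written without
    someone having decided how long it should live.
    """
    if kind not in TTL_SECONDS:
        raise ValueError(f"no expiration policy registered for kind {kind!r}")
    ttl = TTL_SECONDS[kind]
    if ttl is None:
        client.set(key, value)          # static data: no expiry
    else:
        client.set(key, value, ex=ttl)  # expires after ttl seconds
    return ttl
```

Rejecting unregistered kinds is a deliberate choice: adding a new kind of cached data then forces a decision about its lifetime, which is also a natural moment to do the capacity estimate from item 4.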