On the database pitfalls we hit in Momo Hegemony (the Redis part)
Note: I was not involved in the detailed design of the database side of Momo Hegemony, but I took part in some of the discussions and offered a few opinions. When the problems described here occurred, it was also Fat Dragon, Xiao Jing, and aply who investigated and resolved them. So my judgments about Redis mostly come from their discussions, plus some speculation of my own; I have not read the Redis documentation or source code carefully. Although we did eventually solve the problems, the technical details in this article may well contradict the facts, so please read with a critical eye.
We had not used Redis on a large scale before Momo Hegemony; we just had an intuitive feeling that Redis fit our architecture well. Our game does not rely on the database to process any data for us. The total amount of data is large, but its growth rate is limited. Because a single machine's processing capacity is limited and the game cannot be split into zones, a player logging in at any time and from any place sees only one world. So we need a data center that is independent of the game system, and this data center only has to be responsible for passing data through and persisting it to disk. Redis looked like the best choice, and all the game system asks of it is to index player data by player ID.
We split the data center into 32 databases, partitioned by player ID, so that data belonging to different players is completely independent. At design time I firmly opposed accessing the data center through a single point, and insisted that every game server node keep direct connections to every data warehouse, because there is simply no need for a single point here.
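To make the layout concrete, here is a small sketch of what such routing might look like. The 32-way modulo, the addresses, and the ports are all assumptions for illustration; the post does not give the actual partitioning scheme.

    -- Sketch of player-ID sharding across 32 Redis instances.
    local NSHARD = 32

    -- Hypothetical layout: 4 physical machines, 8 instances each.
    local shards = {}
    for i = 0, NSHARD - 1 do
        shards[i] = {
            host = string.format("10.0.0.%d", 1 + i % 4),  -- hypothetical addresses
            port = 6379 + math.floor(i / 4),               -- hypothetical ports
        }
    end

    -- Every game server node holds a direct connection to every shard;
    -- routing a request is just a modulo on the player ID.
    local function shard_of(player_id)
        return shards[player_id % NSHARD]
    end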
Based on our prior estimate of the amount of game data, we only needed to deploy the 32 data warehouses onto 4 physical machines, starting 8 Redis processes on each machine. At first we used machines with 64G of memory, and later upgraded to 96G. Measured in practice, each Redis instance used only a few gigabytes of memory, which looked more than sufficient.
Since we had only picked up Redis from its documentation, we did not know what pits we might step into, so to be safe we also provisioned 4 physical machines as slaves, keeping synchronized backups of the masters' data.
Redis supports two persistence strategies. One is snapshot mode: when the save command is issued, it forks a process that dumps the entire in-memory dataset to disk (BGSAVE). The other, called AOF, logs every write operation made to the database. AOF does not suit our game, because our writes are too frequent and the data volume is huge.
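For reference, the two persistence modes map to redis.conf directives roughly like this; the values are illustrative, not our actual configuration:

    # RDB snapshot mode: fork and dump the whole dataset when a save rule
    # fires, or when BGSAVE is issued explicitly. Since we drive BGSAVE from
    # a script, the automatic rules could also be disabled with: save ""
    save 900 1

    # AOF mode: append every write command to a log file. Not suitable for
    # us, because writes are too frequent and the payloads too large.
    appendonly yes
    appendfsync everysec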
The first accident happened on February 3rd, before the Chinese New Year holiday was over. Because the whole holiday had been uneventful, operations had grown somewhat slack.
At noon, one of the data service masters became unreachable from the game servers, which prevented some users from logging in. Attempts to repair the connection online were fruitless, and we had to start a downtime maintenance that lasted two hours.
During the maintenance the problem was tentatively identified: in the morning one of the slave machines had run out of memory, causing its database services to restart. When the slaves reconnected to the master, the shock of 8 Redis instances issuing SYNC at the same time knocked the master over.
There are two questions that we need to discuss separately:
Question one: the slave has exactly the same hardware configuration as the master, so why did the slave run out of memory first?
Question two: why did the re-SYNC operation overload the master?
We did not dig into question one at the time, because we had not accurately estimated user growth over the Chinese New Year and deployed the databases accordingly. The databases' memory requirements had grown to a critical point, so an out-of-memory incident could plausibly have happened first on either the master or the slave. That the slave went down first may have been mere coincidence (now I suspect it was not; the cold-backup script was probably the culprit). In the early days we ran BGSAVE in rotation; as the data volume grows, the BGSAVE interval should be enlarged accordingly, so that the Redis instances on the same physical machine do not BGSAVE at the same time and the extra forked processes do not consume too much memory. Since everyone had gone home for the New Year, this too was neglected.
Question two stems from how little we understood the master-slave synchronization mechanism:
Think about it: how would you implement synchronization? Reaching a synchronized state takes a certain amount of time, and synchronization should ideally not disturb normal service, so guaranteeing consistency with a lock is clearly out. So Redis also forks during synchronization, to guarantee that the slave receives a consistent snapshot up to the correct sync point. When our slave machine restarted, its 8 Redis instances began synchronizing at once, which amounted to forking 8 Redis processes on the master at the same moment, greatly increasing the chance that the master's Redis processes would be pushed into the swap partition.
After the accident we dropped the slave machines entirely. They complicated the deployment, added several destabilizing factors, and did not necessarily improve data safety. At the same time we improved the BGSAVE mechanism: instead of triggering it from timers, a script makes the multiple Redis instances on the same physical machine take turns. In addition, the cold-backup step that used to run on the slaves was moved to the masters. Fortunately we can use the script to control when the cold backup runs and stagger it away from the BGSAVE IO peaks.
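The actual operations script is not shown in the post; a rough Lua sketch of the take-turns idea might look like the following, driving redis-cli and waiting for each instance to finish before the next one forks. The ports and polling interval are assumptions.

    -- Sketch: make the 8 Redis instances on one physical machine take turns
    -- doing BGSAVE, instead of firing them from independent timers.
    local ports = { 6379, 6380, 6381, 6382, 6383, 6384, 6385, 6386 }

    local function redis_cli(port, cmd)
        local f = io.popen(string.format("redis-cli -p %d %s", port, cmd))
        local out = f:read("*a")
        f:close()
        return out
    end

    local function bgsave_in_progress(port)
        local info = redis_cli(port, "INFO persistence")
        return info:match("rdb_bgsave_in_progress:1") ~= nil
    end

    for _, port in ipairs(ports) do
        redis_cli(port, "BGSAVE")
        -- Wait until this instance finishes before the next one forks,
        -- so only one copy-on-write child is alive at a time.
        repeat
            os.execute("sleep 10")
        until not bgsave_in_progress(port)
    end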
The second incident happened quite recently, on February 27th.
We had adjusted the deployment of the Redis databases several times to ensure that the data servers had enough memory, but an accident still happened. In the end it was again triggered by memory shortage: one Redis process was pushed into the swap partition and its processing capacity dropped dramatically. With a large backlog of data piling up, an avalanche followed. Xiao Jing had added a fallback rule to the script that controls BGSAVE: if no BGSAVE instruction is received within 30 minutes, force one, so that the data is guaranteed to be persisted eventually (I personally disagreed with this rule). As a result, half an hour after the data server stopped responding to the outside world, several Redis services entered the BGSAVE state together and ate up all the memory.
It took a day to track down the real culprit of the accident: the cold-backup mechanism. We periodically copy the Redis database files and pack them up as backups. While the operating system copies the files, it appears to use a large amount of memory for the file cache and does not release it promptly. As a result, system memory usage during BGSAVE greatly exceeded the ceiling we had originally planned for.
This time we adjusted the operating system's kernel parameters and turned the cache off, which solved the problem for the time being.
After this accident I reflected on our data-persistence strategy. I do not think periodic BGSAVE is a good plan; at the very least it is wasteful, because every BGSAVE saves all the data, even though most of the data in the in-memory database has not changed. With the current save cycle of 10 to 20 minutes, the only data that changes belongs to the players who were online during that window and the players they attacked (roughly 1 to 2 attacks per 20 minutes), which is far fewer than the total number of players.
I wanted to back up only the data that had changed, but I did not want to use the built-in AOF mechanism, because AOF keeps appending operations on the same data, making disk usage grow too fast.
Nor did we want to insert an intermediate layer between the game services and the database service, because that sacrifices read performance, which is critical to the whole system. Simply forwarding write commands is also unreliable: lost writes and reordering between reads and writes can leave the data versions in a mess.
What if the game server, while writing data to Redis, simultaneously sent a copy of the data to a separate persistence service? First, we would need a versioning mechanism to detect write operations arriving out of order at the two destinations (I was reminded of the data-version corruption bug in Wild Blade); second, it doubles the write bandwidth between the game servers and the data servers.
In the end I thought of a simpler approach: run a guardian service on the data server's physical machine. When the game server has pushed data to the data service and confirmed success, it sends the ID of that piece of data to the guardian service at the same time. The guardian then reads the data back out of Redis and stores it locally.
Because this guardian service is deployed 1:1 with Redis on the same machine, and local disk write speed exceeds the network bandwidth, it can never be overloaded. As for Redis, it becomes a pure in-memory database and no longer runs BGSAVE at all.
The guardian process also handles the actual persistence. For that I chose UnQLite; a Lua wrapper for it takes only a few lines of code. It keeps everything in a single database file, which makes cold backups more convenient. LevelDB is of course also a good choice; if it were implemented in C rather than C++, I would have considered it.
For the part that talks to the game servers, I started a separate skynet process on the database machine, listening for the "sync this ID" requests. Since it only has to handle a handful of Redis operations, I hand-wrote the Redis command handling. In the end the whole service is a single Lua script, though it actually consists of three skynet services: one listens on the external port, one processes the Redis sync commands on each connection, and one is a single point that writes data into UnQLite. To make data recovery efficient, when saving player data I pre-assemble the Redis command needed to restore it, so that when recovery is needed we only have to read the player's data out of UnQLite and send it straight to Redis.
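The service itself is not published in the post, but its shape is easy to sketch. Below is a minimal single-service sketch under loud assumptions: the stock skynet.db.redis driver stands in for the hand-written Redis command handling, the unqlite module with open/store/fetch is a hypothetical wrapper rather than a real library API, and the player:<id> key layout is invented for illustration.

    -- Minimal sketch of the guardian ("data landing") service as one skynet
    -- service; the real version is three services with hand-written Redis
    -- command handling.
    local skynet = require "skynet"
    local redis = require "skynet.db.redis"
    local unqlite = require "unqlite"      -- hypothetical UnQLite Lua binding

    local db      -- Redis connection (same machine, paired 1:1 with this service)
    local store   -- local UnQLite database file

    local CMD = {}

    -- The game server sends a player ID after it has confirmed its write to
    -- Redis; we read the data back locally and persist it to disk.
    function CMD.sync(player_id)
        local key = "player:" .. player_id      -- key layout is an assumption
        local blob = db:get(key)
        if blob then
            store:store(key, blob)
        end
    end

    -- Recovery: a player evicted from Redis is restored on demand. With a
    -- plain string value the restore command is just SET; the real service
    -- pre-builds the restore command at save time so recovery is a replay.
    function CMD.restore(player_id)
        local key = "player:" .. player_id
        local blob = store:fetch(key)
        if blob then
            db:set(key, blob)
        end
    end

    skynet.start(function()
        db = redis.connect { host = "127.0.0.1", port = 6379, db = 0 }
        store = unqlite.open("landing.db")
        skynet.dispatch("lua", function(_, _, cmd, ...)
            skynet.ret(skynet.pack(CMD[cmd](...)))
        end)
    end)

Keeping the read-back on the same machine is what makes the extra copy cheap: it costs local disk bandwidth rather than a second copy of the data over the network.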
With this in place, the hot/cold data problem in Redis is solved as well. Players who have not logged in for a long time can be purged from Redis periodically; if such a player ever comes back, we just ask the guardian to restore their data.
Xiao Jing does not like my implementation depending on skynet. He first wanted to build the same thing in Python, then became interested in Go and wanted to use this need as an excuse to play with Go. So to this day we still have not deployed this new mechanism to the production environment.