Let's talk about the database pitfalls of Momo Warcraft (ranking)

Source: Internet
Author: User

Why do most network services require a database to support the entire system in the background?

This is usually because most systems have a short running cycle. For a traditional website service, one running cycle lasts from receiving an HTTP request to returning the result of that request.

During that cycle the dataset that may be processed can be very large, and there is usually no time (or even space) to load all the data into memory, process a small part of it, save it back to disk, and then exit.

When the data volume is large, every algorithm and data structure that operates on the data must be carefully designed. This is not a task that just any programmer can complete easily. In particular, when the data volume exceeds memory capacity, many of the algorithms and data structures involved are unfamiliar to most programmers outside this field. Following the principle that professional work should be left to professionals, systems generally hand this work to an independent database.

Data operations can be robust only when the abstraction is simple enough. That is why the SQL abstraction exists: it lets data management stand on its own. And if you are willing to sacrifice some features for performance, you can choose one of the many NoSQL databases that have become popular in recent years.

In the field of MMO game servers, the situation is a little different.

Data is closely tied to business logic and changes very frequently. An MMO server must respond to user requests continuously and quickly. It is almost impossible to put all the data in an independent database: the position of a player in the virtual world and the list of other players he affects, the various attribute changes during a battle, the state changes of the NPCs that interact with players, and so on.

The biggest contradiction is that changes to the dataset in an MMO game can no longer be expressed by simple SQL statements, so they cannot be handed over to the database server. No database of any type is designed for this kind of application. If you insist on applying the application model of other fields, the game server can only read data from the database over and over, change it according to the game logic, and write it back. The database becomes a very inefficient data transfer station. Whether or not you use an in-memory database, you cannot change this inefficient nature.

I have heard of countless bad systems designed by programmers who moved into game development from other fields. In the end they treat the database merely as a reliable place to store data and as the point where data is sharded. They think that once the so-called important data is written into the database everything is fine, and then they awkwardly read the data back out of the database to use it. Such a system is full of strange asynchronous callbacks to the database, added to improve response speed, yet the system itself is in a precarious state. Getting things right at all is already the limit of what it can do. What is more, for a game system it is not enough that input and output are correct. If the response time exceeds its limit, everything is wrong.

To make the system robust, the architect separates it into modules and tries to keep the communication rules between modules simple. Only then can the quality of each module be verified independently, and a module be replaced if necessary. Almost no one writes application code into the OS kernel for the sake of efficiency or development convenience.

Each module is responsible only for its input data and for ensuring that its output is correct. Testing usually covers only this correctness. What is most easily overlooked is that each module has an upper limit on the speed at which it can process its input, that is, its throughput.

Once the input rate exceeds the processing rate, it no longer matters that the module is implemented correctly, because the output will never arrive.

For most modules this is not a problem as long as memory capacity is sufficient. In systems that actually run in production, input rarely arrives at a sustained high volume. In the long run the total input is smaller than the processing capacity, and data that cannot be processed right away simply accumulates in memory.
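The difference between a temporary burst and sustained overload can be illustrated with a small simulation. This is only a sketch: the function name `simulate_backlog` and all the numbers are made up for illustration and do not come from any real system.

```python
def simulate_backlog(arrivals, capacity):
    """Return the number of unprocessed requests left after each tick.

    arrivals: requests arriving in each tick (hypothetical numbers)
    capacity: requests the module can process per tick (hypothetical)
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog += arriving                 # new input piles up in memory
        backlog -= min(backlog, capacity)   # the module processes what it can
        history.append(backlog)
    return history


# A short burst: the backlog grows, then drains once the input drops again.
burst = simulate_backlog([5, 20, 20, 5, 5, 5], capacity=10)

# Sustained overload: the backlog grows on every tick and never drains.
overload = simulate_backlog([15] * 6, capacity=10)

print(burst)
print(overload)
```

In the first case memory only has to hold the peak of the burst. In the second case no amount of memory is enough, which is the exception a robust system has to handle explicitly.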

Everything has exceptions, and a robust system must handle them. A database working in server mode deals with this one as follows: it supports concurrent query connections, and concurrent queries share computing resources fairly and affect one another as little as possible (at least in an ideal design). However, the operating system or the database itself limits the total number of concurrent connections, and once that limit is reached, further connections are refused. In this way, input beyond the processing capacity is kept outside the module. Under this design, no input (as long as it can get in) will go forever without a response.

Unfortunately, the cost of doing so is that the modules must handle failed requests between one another. Systems that were not designed carefully are the ones most likely to neglect this kind of error handling. They always expect that every module can correctly process any request from the level above.

By the way, why is the 12306 ticket booking system completely unusable under high load? Because this is not handled well. I mean, a correctly implemented system would not fail to render its web pages at all; it would give users a proper prompt, even if it is only an error prompt. Its effective processing capacity should not drop sharply under high load. Once a user has entered the normal flow, he should be able to complete at least one step successfully, rather than suddenly getting stuck there without any response.

I seem to be going off topic again. What I really wanted to say with all this is: easy to say, hard to do. In the next article I will write about an accident we had before the Chinese New Year that has something to do with this.

Gossip time:

Momo Dance Troupe is the second game launched on the Momo game platform. Our Momo Warcraft was still under development when this game was about to go live. My knowledge of this product is limited and is all hearsay, so if it differs from what actually happened, I hope those more familiar with the situation will forgive me. For technical problems, I think the exact truth is not that important: correct the mistakes if you have made them, and guard against them if you have not.

The Dance Troupe brand originally belonged to South Koreans, but the game has long been very popular in China, and its long-time agent in China bought the IP (the intellectual property) and made its own mobile version. As far as I know, Momo Dance Troupe was developed entirely in Shanghai and has nothing to do with the South Koreans.

This is a relatively simple game. At least the server part is very simple: record scores, look up rankings, and handle payment. It is essentially a standalone game and does not need the server to play an important role.

Thanks to the Dance Troupe brand and the huge user base of Momo, the game shot to the top of the free iOS chart as soon as it was launched. If the Penguin company had not immediately launched Rhythm Master, it would probably have stayed on the chart longer. It later turned out that the launch of Rhythm Master was also very hasty and aimed purely at blocking a competitor: its server was unstable too and soon crashed, which is far from the quality a large company should deliver.

After Momo Dance Troupe successfully pulled in users, the server ran into trouble on the very first day. Restarting the server several times did not solve the problem at all, so they decided to stop the service. The stop lasted three days. At the time I was wondering what kind of small bug would take three days to fix. This had to be a structural issue. Our own project was scheduled to launch half a month later, so we were very busy, while Momo's people all flew to Shanghai.

Momo Dance Troupe only came back online one week later, far longer than the expected three days. Afterwards, the technical team of Momo, from the CTO down, all flew to Guangzhou to hold a meeting with us and asked us to take server stability seriously. The meeting mainly stressed how massive the volume of users imported from the Momo platform is in the early stage, and they wanted to understand our design details to make sure there would be no major problems. The gossip I know was heard during this time.

Momo Dance Troupe used MongoDB. It seems to be quite popular with game developers, mainly, I think, because it is easy to use. Game practitioners who have no development experience in other fields mostly have only a half-understanding of databases. Especially those who come from client development: their usual habit is to read the API documentation, find a usage that looks correct, then go live and test it. If it seems to work, the job is considered done. Even when there is stress testing, it is hard to make it match the production environment.

It is said that the two sides had communicated before the launch. Momo wanted to confirm whether the system could be scaled horizontally. The answer was: just add hardware. I imagine the developers of Momo Dance Troupe thought like this: our server system is very simple, so isn't all the performance in the database? MongoDB has been verified by many people and will not have problems with such a simple business. As for load, isn't there mongos? Don't worry, it will be fine.

In the end the problem turned out to be the ranking list. When the number of online users reached 20 thousand (that's right, only 20 thousand), the large number of users querying the rankings blocked the database. As a result, not only could the rankings not be displayed, but the top-up business was affected as well, and the big spenders could not pay for the game. Eventually an avalanche occurred: the whole database stopped working properly and the game system could not work either.


Why did it take so long to fix this bug?

The person responsible for the server development of Momo Dance Troupe left after the project was completed. Think about how bad it is to hand a problematic system to people who did not design it for maintenance. Any sober programmer knows that at this point even a rewrite is easier than a modification. Momo's staff made the right decision: they sent their own people to be stationed in Shanghai and rewrote the server.

Momo's technical background is Redis. Their systems are built with Redis, so in the rewrite MongoDB was replaced with Redis. I do not want to discuss here whether Redis or MongoDB is better. The key is the people and what they are familiar with; any database can cope with a business this small. What matters is whether you use the database correctly.

Redis happens to have a Sorted Set data structure. After you insert data with ZADD, it is naturally ordered. Inserting M elements into a set of N elements has a time complexity of O(M * log(N)), which basically meets the requirements, and querying the list with ZRANGE only requires O(log(N) + M), where M is the number of elements returned.

So is using Redis with a sorted set our only choice for a ranking system? Absolutely not. Nor do we have to choose Redis as our database just for this feature. What this example does show is that when a database provides built-in features for operating on a dataset, we need to know how those operations perform. Their performance has to match the performance expectations of the entire system.

Using the built-in sorting of MongoDB for the rankings was not in itself a big problem for Momo Dance Troupe. The poor performance may simply have been because the implementers were not familiar with MongoDB. Now that the system has been rewritten, we can no longer look into the details. The core question, however, is this: why does an improper implementation of a ranking system affect the stability of the entire system?

Here is my guess:

To increase throughput to the database, many programmers do not create a connection for every database transaction and close it when they are done, because creating a TCP connection is costly. Maintaining too many connections is also an overhead for the system. So people like to create something called a connection pool, and they use this pool wherever the rest of the system talks to the database. As long as an old connection has not been dropped, requests are sent to the database over these fixed connections and then wait for the reply.

This module is easy to implement correctly while the database throughput meets the needs of the system. Once demand exceeds it, however, requests start to pile up in the connection pool and database queries slow down. The module calling the database does not regard this as a problem.

The correct behavior is to let the connection pool give feedback quickly: disconnect and discard the requests that cannot be processed, and let the requester report the error upward, level by level, until the traffic is limited to a reasonable range. Then the entire system does not crash. When the error finally has to be reported to the player, what he mostly sees is that the ranking query failed, and other functions are not affected.
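The fail-fast behavior described above might look roughly like the following sketch. It is not the code of any real driver or of either game: `FailFastPool`, `PoolExhausted`, `query_ranking` and the timeout value are all hypothetical names and numbers, and the "connections" are plain dictionaries standing in for real database connections.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class FailFastPool:
    """A hypothetical connection pool that rejects requests instead of queueing them forever."""

    def __init__(self, connections, timeout=0.05):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self._timeout = timeout

    def run(self, query):
        try:
            # Wait only briefly for a free connection.
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            # Fail fast: discard the request and tell the caller.
            raise PoolExhausted("database busy, request dropped") from None
        try:
            return query(conn)
        finally:
            # Always hand the connection back to the pool.
            self._idle.put(conn)


def query_ranking(pool):
    """The caller turns the failure into a message for the player."""
    try:
        return pool.run(lambda conn: conn["ranking"])
    except PoolExhausted:
        return "ranking temporarily unavailable"


# Stand-in for a real connection: a dict holding canned data.
healthy = FailFastPool([{"ranking": ["alice", "bob"]}])
print(query_ranking(healthy))

# A pool with no free connection rejects the request after the timeout.
saturated = FailFastPool([], timeout=0.01)
print(query_ranking(saturated))
```

The design choice is that an overloaded ranking query costs the player one error message, while the connections that payment and other functions depend on are not held up behind a growing queue.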

How does Momo Warcraft do its rankings?

After the previous article, some people asked: if you do not use a database, how do you make a ranking list? In fact, I already touched on this in the previous article:

"The server keeps producing new data and lets the data flow in memory. It does not need to read data from the outside. If memory were infinitely large and the server never went down, there would be no need for a database to exist."

The ranking list is also just data. When the game server starts serving, no player has any ranking information. As player scores change, the list gradually takes shape. We only need to synchronize each change into the list when a player's score changes. A player query merely reads the already ordered list.

See, this process has nothing to do with the database, does it? What you need is to design the algorithm that adjusts the list and the data structure that holds it, so that the list performs well enough. Because players change rank far less often than network packets arrive, the minimum throughput this module needs is easy to reach, and we do not have to worry about it falling behind.

This is how we do it in Momo Warcraft:

In Momo Warcraft the range of scores used for ranking is not large: 0 to 5000 points. The number of participants is large, in the millions. Doing an insertion sort over millions of users is unacceptable even at O(N) per insertion. But in fact a large number of players have the same score, and they all share the same rank. Therefore we only need to create 5000 buckets, and each bucket only records the number of people with that score.

When a player's score changes, the count in the old bucket is decreased by one and the count in the new bucket is increased by one. This operation is O(1).

A ranking query only needs to add up the buckets with higher scores to obtain the rank of the player who asks. With millions of players, it is meaningless to see who is tied with you. Although the query is O(n), n is bounded by the 5000-point range, which is not large. A cache can also be added to cope with queries being much more frequent than updates.
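The bucket scheme can be sketched as follows. This is only an illustration of the idea in Python, not the C code used in the game; `BucketRanking` and its method names are hypothetical, and the sketch uses 5001 slots so that both 0 and 5000 are valid scores.

```python
MAX_SCORE = 5000  # score range taken from the text: 0 to 5000 points


class BucketRanking:
    """Hypothetical sketch: one counter per possible score."""

    def __init__(self):
        # counts[s] is the number of players whose score is exactly s.
        self.counts = [0] * (MAX_SCORE + 1)

    def add_player(self, score):
        self.counts[score] += 1

    def change_score(self, old_score, new_score):
        # O(1): one decrement and one increment.
        self.counts[old_score] -= 1
        self.counts[new_score] += 1

    def rank_of(self, score):
        # Rank is 1 plus the number of players with a strictly higher score,
        # so players with the same score share the same rank.
        # O(n) in the number of buckets, never in the number of players.
        return 1 + sum(self.counts[score + 1:])


ranking = BucketRanking()
for s in (100, 100, 300, 50):
    ranking.add_player(s)

print(ranking.rank_of(300))   # the only player with the highest score
print(ranking.rank_of(100))   # two players tied behind one better player

ranking.change_score(50, 400)
print(ranking.rank_of(400))
```

If queries turn out to be far more frequent than updates, the running totals could be cached and refreshed periodically, at the price of slightly stale ranks.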

The only players whose names really need to be known are the top 200, who are shown on the list. An insertion sort over just the top 200 is also very fast, so it causes no performance problems.
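A minimal sketch of such a named top list, kept ordered by insertion, is shown below. `TopList` and its methods are hypothetical names. One simplification is assumed: when a listed player's score drops, the sketch does not pull in the next best unlisted player, which a real implementation would have to handle.

```python
import bisect

TOP_N = 200  # list size taken from the text


class TopList:
    """Hypothetical sketch: names of the best players, kept sorted on insert."""

    def __init__(self, limit=TOP_N):
        self.limit = limit
        self._keys = []   # negated scores, ascending, so scores are descending
        self._names = []  # player names, parallel to _keys

    def update(self, name, score):
        # Drop the player's old entry, if any (a linear scan over at most `limit` items).
        if name in self._names:
            i = self._names.index(name)
            del self._keys[i]
            del self._names[i]
        # Find the insertion point; ties go after earlier entries.
        i = bisect.bisect_right(self._keys, -score)
        if i >= self.limit:
            return  # not good enough to be listed
        self._keys.insert(i, -score)
        self._names.insert(i, name)
        # Keep only the best `limit` entries.
        del self._keys[self.limit:]
        del self._names[self.limit:]

    def entries(self):
        return [(name, -key) for name, key in zip(self._names, self._keys)]


top = TopList(limit=3)  # a tiny limit, just for the demonstration
for name, score in [("a", 10), ("b", 30), ("c", 20), ("d", 5), ("e", 25)]:
    top.update(name, score)
print(top.entries())
```

With a limit of 200, every update touches at most a couple of hundred list slots, regardless of how many millions of players exist.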

We maintain the ranking list at a single point in the system, with no external database operation at all. It is just a small piece of C code that operates on ordinary in-memory structures. Yet this single point is far from becoming a hotspot for the entire system.

When the system shuts down temporarily, we persist the already sorted list and restore it at the next startup. However, you do not have to trust the persisted data: an offline script can traverse the entire database and generate a correct list. So the list in the database is only a cache. It does not need to be written to the database while the system is running, and there is no need to worry about losing it.

Well, we still haven't got to the pitfalls we actually stepped into, and it is time to eat again :(.

Tomorrow I will write about the first database accident that Momo Warcraft encountered in operation. It is related to mongos. I will also talk about some MongoDB-related pitfalls that we helped solve while we were the agent for Mad Blade.

Original address. BKJIA is authorized to be reproduced by the author.
