An atypical back-end path to tens of millions of users

Source: Internet
Author: User
Tags: database sharding

Three years ago I was an unremarkable data engineer, but a very enthusiastic one. The company was in a difficult transition at the time: the old product was not improving, and the new product's future was unknown. Nobody imagined that such a small amount of data could yield anything interesting, and the new product's data would not reach real scale anytime soon. In the spirit of learning as much as possible, I boldly asked my boss for a transfer and took up the banner of back-end development on the product line. That one small decision bought me half a year of sleepless nights: I watched the user count grow from the very first user toward five million, watched the servers grow from 1 to 4 and then 10, and watched the back end grow from simple playback into playing, uploading, and a complete set of social interactions. At the beginning we crashed every few days and had one incident after another; by the end, I no longer feared the marketing colleagues launching new campaigns — I knew every corner of the system and nothing made me panic. Looking back, I am still a little shocked: I was not a trained back-end engineer, I had no standing to argue about languages and frameworks, and my grasp of network programming fundamentals was hardly satisfactory. Yet with primitive tools, constant thinking, and constant experimenting, we built a back-end framework supporting tens of millions of users. From that half year, five hard-won lessons remain.

Data read/write is the core of server performance

A complete back-end service has only three components: access, logic, and data. It is like running a restaurant, with the back-end engineer as the owner. Below ten thousand guests (users), the service logic comes first, and the owner is busy writing business logic. Between ten thousand and a hundred thousand, the design of the access component matters most: one storefront's capacity is limited, so the owner opens several branches to spread the traffic, and the access component decides which branch each guest is sent to. Once users exceed a hundred thousand, however, data read/write capability determines the capacity of this super restaurant. No matter how many branches open, the data must stay consistent: reads must be fast and accurate, and writes must not drag down read performance. How to design the table structure, how to distribute the database (master/slave, read/write splitting, database sharding, and table sharding), and how to place the caches become the owner's most important work (and if it goes well, the business card can be upgraded to a grander title: Architect).

Once the number of users exceeds 100,000, it is unrealistic to let a single database carry the world, and the cache — physically living in memory, inherently faster than database reads and writes — is the wild horse that satisfies our need for speed. Caching brings two profound changes to the server architecture. First, the separation of hot and cold data: hot data is accessed by many people, so the cache stands in front and absorbs the enormous read pressure on the database, and from the product's perspective hot data must get fast responses. Second, the bar for data consistency rises: when the database is updated, the cache must be updated too, and if the cache update fails, the database change must be rolled back — serving guests stale food is no joke. What to cache and how to store it is also a deep subject; I will come back to it in the next section. The importance of caching comes down to one sentence: running without a cache is out of the question. Whether you choose the old workhorse memcached or the red-hot Redis, deploy it before the database feels the pressure, and prepare a cache backup and recovery plan. While the cache is there you will not feel its benefits — it is like someone who quietly reminds you to eat well and drink warm water; only when she leaves do you watch response times multiply and the server collapse, and wish you could find a piece of tofu to bang your head against.
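The read-through and update-with-rollback flow described above can be sketched roughly as follows. This is a minimal illustration with plain dictionaries standing in for memcached/Redis and the database; all names (`cache`, `db`, `read_user`, `update_user`) are hypothetical, not from the original system.

```python
cache = {}                                        # stand-in for memcached/Redis
db = {"user:1": {"name": "Alice", "plays": 10}}   # stand-in for the database

def read_user(key):
    """Read-through: serve hot data from the cache, fall back to the database."""
    if key in cache:
        return cache[key]        # cache hit: no pressure on the database
    value = db[key]              # cache miss: read the database once
    cache[key] = value           # populate the cache for later readers
    return value

def update_user(key, value):
    """Write path: update the DB, then the cache; roll back the DB if the
    cache update fails, so readers never see inconsistent data."""
    old = db.get(key)
    db[key] = value
    try:
        cache[key] = value       # keep cache consistent with the database
    except Exception:            # illustrative: a real cache write can fail
        db[key] = old            # roll back the database change
        raise
```

In a real deployment the `except` branch matters: a failed cache write with no rollback is exactly the "stale food" problem the text warns about.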

Lists, entities, and redundancy

In the web era, users were not sensitive to changes in a list, because each page turn switched to a new page (if content was inserted while paging, you only had to ensure the current segment contained no duplicates). The infinite-scroll list on mobile, however, is a nightmare for every back-end engineer: the user pulls to load more while new content is being added, sees two adjacent duplicate items, and explodes — "what kind of app is this!" A whole book could be written on the "duplicate list" problem. Because of this requirement, we had to abandon the auto-increment ID and use timestamps to fetch list fragments. Simply put: each time, the client reports the timestamp of the last item on its current page, and the server returns several items older than that time. I must thank the author of Redis for providing such a rich set of APIs for caching; in my view the best thing about Redis is that it anticipated every use case a list can have.
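The cursor scheme above can be shown in a few lines. This is a sketch with an in-memory list standing in for what would be a Redis sorted set (queried with something like `ZREVRANGEBYSCORE`); `feed` and `get_page` are illustrative names, not the original API.

```python
# (timestamp, item_id) pairs; the timestamp is the paging cursor.
feed = [(100, "a"), (200, "b"), (300, "c"), (400, "d"), (500, "e")]

def get_page(before_ts, page_size=2):
    """Return up to `page_size` items strictly older than `before_ts`,
    newest first. The client reports the timestamp of the last item it
    already has, so newly inserted items can never cause duplicates."""
    older = [item for item in feed if item[0] < before_ts]
    older.sort(key=lambda x: x[0], reverse=True)
    return older[:page_size]
```

Even if new items land in `feed` between two pulls, the second page is anchored on the last-seen timestamp rather than an offset, so the adjacent-duplicate bug cannot occur.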

Entities are hot data, and caching hot data raises two questions. First, in what format should it be stored? Some will say: easy — serialize the whole struct to JSON and save it. But this is a real problem: when your server faces hundreds of thousands of users at once, converting tens of thousands of JSON blobs back and forth into structs within a moment is far from efficient, so it is worth considering a faster serialization scheme (I will keep you in suspense on that one). Second, how should it be refreshed? The avalanche effect is not rare: the moment the source data changes and the cached copy is invalidated, many threads call the cache-rebuild path at the same time, and the server is instantly congested. Rather than put back-end engineers out of work, I quietly added a lock.

Xiao Zhang is a waiter. For a single order he picks up shredded potato at the cold-dish station, Dongpo pork at the meat station, a plate of cabbage at the vegetarian station, and finally two bottles of juice at the beverage station. Sounds inefficient, right? Data access works the same way. The primary consideration in table design is classification: for example, user information goes in one table, and user-group relationships go in another. If reading a user also requires the group he last visited, the database must be read twice. Once this scenario becomes frequent, appropriate data redundancy (adding the last-visited group ID to a field in the user table) can reduce the read pressure on the database. So table design must — and the important thing deserves saying three times — consider the business scenarios.
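The redundancy trade-off can be made concrete with two tiny tables. This is a sketch using SQLite; the table and column names (`users`, `groups`, `last_group_id`, `last_group_name`) are illustrative, not the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE groups (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT,
                     last_group_id   INTEGER,  -- redundant copy of the relation
                     last_group_name TEXT);    -- avoids a join on the hot path
INSERT INTO groups VALUES (7, 'jazz');
INSERT INTO users  VALUES (1, 'alice', 7, 'jazz');
""")

# Hot read path: one single-table lookup instead of users JOIN groups.
row = conn.execute(
    "SELECT name, last_group_name FROM users WHERE id = 1").fetchone()
```

The cost is that updating a group name now touches two places — which is exactly why the text insists the decision must follow the business scenario: denormalize only reads that are genuinely frequent.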

Is it truly asynchronous?

Some friends come to ask me: "I chose the best server framework, asynchronous and multithreaded, with more than ten thousand concurrent connections per process — why is it still slow?" I say: "asynchronous" is not a word to use lightly. The framework's underlying layer is asynchronous, but is every step in your processing asynchronous? Database reads and writes, cache reads and writes, calls to external interfaces — can those not block? And since something is blocking and you still don't know where, shouldn't you add logging quickly? Let me share the case that drove me closest to collapse: a server blew up, plenty of logs were written, and no cause of the blocking could be found anywhere. The final answer? The logging component (log4j) itself was not asynchronous, and the logging step was what got stuck.
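The fix for that last incident is to move logging I/O off the request thread. A minimal sketch using Python's standard `QueueHandler`/`QueueListener` pair (the `StringIO` sink is a stand-in for a slow disk or network appender like the log4j one in the story):

```python
import io
import logging
import logging.handlers
import queue

sink = io.StringIO()                     # stand-in for the real, slow log sink
log_queue = queue.Queue(-1)              # unbounded handoff queue

# The listener drains the queue on a background thread, so a slow sink
# can no longer block the threads that serve requests.
listener = logging.handlers.QueueListener(
    log_queue, logging.StreamHandler(sink))
listener.start()

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))  # non-blocking enqueue

logger.info("request served")            # returns immediately; I/O is off-thread
listener.stop()                          # flushes the queue before shutdown
```

The same idea exists in the Java world (e.g. async appenders); the point is that "asynchronous" must hold for every step in the chain, logging included.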

Log, monitoring, and lossy services

A high-class hotel has chefs, a lobby manager, waiters, and cashiers — but do not forget the security guards. They are not the core of the hotel's success, yet without them a crisis cannot be handled. The three buddies below are the server's security guards: logs, monitoring, and lossy services.

First, logs. Logging is a subtle art: too many logs hurt performance and eat disk space; too few, and the cause of a key problem cannot be found. So what is mandatory? I think there are three things. First, the basic attributes of a behavior — nothing more than who, when, and where: time, user ID, IP address, and client version (besides troubleshooting, these are also useful for data analysis). Second, the round-trip parameters, especially those reported by the client; the data the server returns can be very large, so we do not recommend printing it all — print a summary instead, such as how many groups were returned. Third, the error information: the bottom layer must catch every error and write it into a separate log.
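The three mandatory pieces above fit naturally into one structured log entry. A sketch, with entirely hypothetical field names:

```python
import json

def log_entry(ts, user_id, ip, version, groups_returned, error=None):
    """Build one structured log line: basic attributes (when/who/where/which
    build), a round-trip summary instead of the full payload, and any error."""
    entry = {
        "ts": ts,                            # when
        "user_id": user_id,                  # who
        "ip": ip,                            # where
        "version": version,                  # which client build
        "groups_returned": groups_returned,  # summary, not the whole response
    }
    if error is not None:
        entry["error"] = error               # would also go to a separate error log
    return json.dumps(entry, sort_keys=True)

line = log_entry(1700000000, 42, "10.0.0.8", "2.3.1", groups_returned=5)
```

Keeping the fields machine-readable is what lets the same log serve both troubleshooting and data analysis.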

Next, monitoring. Logs are the tool that helps us find the cause once a problem has been discovered — but what helps us discover the problem in the first place? The answer is monitoring and alarms. Unlike logs, monitoring should focus on the core numbers, and there should not be many of them. I suggest three: the number of concurrent users, the average read response time, and the average write response time. For alarming, add the count of server crashes and restarts, plus the host performance indicators (CPU, memory, hard disk, and so on).
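Tracking those three numbers requires very little machinery. A minimal sketch (class and method names are illustrative):

```python
from collections import defaultdict

class Metrics:
    """Tracks the three suggested core numbers: concurrent users and
    average read/write response times."""
    def __init__(self):
        self.inflight = 0                            # current concurrency
        self.totals = defaultdict(lambda: [0.0, 0])  # op -> [total_secs, count]

    def record(self, op, seconds):
        t = self.totals[op]
        t[0] += seconds
        t[1] += 1

    def avg_ms(self, op):
        total, count = self.totals[op]
        return 1000.0 * total / count if count else 0.0

m = Metrics()
m.record("read", 0.010)
m.record("read", 0.030)
m.record("write", 0.050)
```

In production these counters would be sampled periodically and shipped to whatever alarm system watches the thresholds; the point is that a handful of numbers, watched continuously, beats a mountain of logs read after the fact.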

"You can't do anything right — are you only good for eating?" When a server crashes at an unlucky moment, I often use this TVB line to tease my friends. In truth, no matter how much you prepare in advance, a growing app will always run into server problems. In my limited experience, server problems rarely originate in the server itself; they are caused by the components it depends on. For example: the memcached machine dumps, the transcoding service queue blocks, or the image storage fills up. So before the problem is fixed, should we just stare at it and watch the users complain? Instead, ask: which scenario can the current business least afford to lose? For us, playback is the most basic service, so we must guarantee that the crash of any external component cannot affect the playback of popular content. We therefore load this small but critical set of hot data into the server's own memory; whatever happens to external storage, the server still has a bowl of rice of its own. After all, a real man does his own work rather than living off his ancestors' land.
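The "lossy service" idea above — degrade rather than die — can be sketched as a simple fallback. All names here (`hot_set`, `get_play_url`) are hypothetical stand-ins for the real playback path:

```python
# Small, critical hot set preloaded into process memory (illustrative data).
hot_set = {"video:top1": "play-url-1"}

def get_play_url(key, external_store):
    """Prefer the external store; on failure, degrade to the in-memory
    hot set so playback of popular content survives the outage."""
    try:
        return external_store(key)
    except Exception:
        return hot_set.get(key)   # lossy: long-tail content may be missing

def broken_store(key):
    """Simulates the dependency failing (e.g. the memcached machine is down)."""
    raise ConnectionError("external storage unavailable")
```

The service is "lossy" because the long tail is sacrificed during an outage — but the scenario the business cannot afford to lose keeps working.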

Service separation and replication

As the server system grows, the first thing to do is split it — like a father giving his grown son a piece of land: once he becomes a little king, he fights his own battles. So data read/write was abstracted into a service, serving both the app and the web front end; encoding and decoding were abstracted into a service (they serve UGC uploaders, and people who want to be stars can afford to wait); log storage and parsing were also abstracted into a service (we don't mind losing a little of it). On the surface this fragments the server and adds network latency — an uneconomical bargain, it seems — but it actually helps the stability of the whole system. Why? First, the kingdom has been split into small kingdoms: problems are easier to locate, data is easier to migrate and replicate, and if read/write pressure appears — no problem, add two more instances. Second, every link in the entire chain is multi-point. As the saying goes, don't put all your eggs in one basket: a dump on any single server can no longer take us down.

These are my takeaways from six months of high-speed server growth, and the five points above are what matter most. The architecture and emphasis will certainly differ somewhat across business scenarios, but these five points are close to universal — a cornerstone, and at times a lifesaver. May your hotel business flourish. Congratulations, boss!

For more, follow the WeChat official account "Code Farm Cafe".
