The paralysis of the Olympic ticket booking system: a discussion of FastCGI and its architecture

Last Update:2018-07-27 Source: Internet

Author: User

Tags ticket microsoft iis

For the people of the capital, nothing in 2008 was bigger than the Olympic Games, and buying a good ticket became the dream of many. But when people went to the official Olympic website to buy tickets, that dream was easily shattered by the paralysis of the online ticketing system, and many enthusiastic fans were left deeply frustrated. Because Sohu hosted the official website of the Olympic Games and I had worked there for a long time, many friends who failed to grab tickets decided that the system Sohu had developed was terrible and came to me to complain. I was frustrated too: first, Sohu did not develop this system, and second, I no longer work at Sohu. Even so, some friends in the industry began asking me how to solve this kind of problem. I have explained it many times. So that more readers can understand the causes and the mechanisms behind them, I am writing it down here; discussing it together will probably work better. Of course, I am not claiming that the architecture I describe will solve the problem. It is just a discussion.

Before I talk about architecture, I'll start with an old technology, FastCGI. It plays a very important role in the architecture described later. I had assumed that many people knew it, but that turned out not to be the case.

I will not dwell on the history of FastCGI; it has been around since the mid-1990s. YouTube, the most popular video site, uses a FastCGI module in its architecture. FastCGI is supported by many HTTP servers. The official website lists quite a few, such as Apache, AXESW3, Microsoft IIS, and Zeus. Lighttpd, which has appeared in recent years, is not on the list, but this newer httpd supports it as well. Personally, I feel the best support is probably in Apache.

Let's start with how FastCGI works, which differs from the usual way requests are run. Wikipedia has a pair of terms that describe the difference, which I borrow here:

Short lifetime applications

Long lifetime applications

The mechanism of CGI is this: each time a client requests a CGI program, the web server asks the operating system to create a new CGI process, and the process is terminated once it has served the request. The server repeats this for every client request.

FastCGI works differently: once a FastCGI program has been started, it keeps running and keeps serving client requests until it is explicitly terminated. If you want to improve performance by handling requests in parallel, you can have the web server run multiple FastCGI processes.

CGI is therefore the so-called short lifetime application, and FastCGI the long lifetime application.

Because a FastCGI program does not need to keep spawning new processes, it greatly reduces the load on the server and runs very efficiently. The servlet technology of the popular Java language follows the same long-lived design idea.

FastCGI can generally be configured to run in three ways, all of which are handled by Apache's mod_fastcgi.

1. Standalone FastCGI server. First, the FastCGI program has to be started as a separate daemon:

$ script/myapp_fastcgi.pl -l /tmp/myapp.socket -n 5

The parameters of this FastCGI daemon are:

-d -daemon # run as a daemon
-p -pidfile # name of the file the manager process's PID is written to
-l -listen # socket path, machine name:port, or port
-n -nproc # initial number of processes accepting requests

Then add the following to Apache's httpd.conf:

FastCgiExternalServer /tmp/myapp -socket /tmp/myapp.socket
Alias /myapp/ /tmp/myapp/

# Or, to serve the application at the site root
Alias / /tmp/myapp/

# Optionally (using the rewrite module)
RewriteRule ^/myapp$ myapp/ [R]

Then restart Apache and you are done.

2. Static mode. This is generally used for a single, fixed application. Add the following to Apache's httpd.conf:

FastCgiServer /usr/local/apache/count/count.fcg -processes 1
Alias /c /usr/local/apache/count/count.fcg

It is recommended to also add a rewrite rule that matches the whole URL, so that the address looks like a static page:

RewriteRule read-(.)-(.+)-(.+).html$ /c?id=$1&sid=$2&port=$3 [L]
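To make the effect of the rewrite rule concrete, here is a minimal Python sketch that applies the same pattern to a made-up URL. It only illustrates the mapping and is not how mod_rewrite is implemented; the sample path and the function name `rewrite` are assumptions for the example.

```python
import re

# Pattern copied from the RewriteRule above; the URL used below is hypothetical.
PATTERN = re.compile(r"read-(.)-(.+)-(.+).html$")


def rewrite(path: str) -> str:
    """Return the internal URL the rule would produce, or the path unchanged."""
    match = PATTERN.search(path)
    if match is None:
        return path
    # Python writes back-references as \1, \2, \3 where mod_rewrite uses $1, $2, $3.
    return match.expand(r"/c?id=\1&sid=\2&port=\3")


print(rewrite("/read-1-22-80.html"))
```

The visitor sees an address that looks like a static page, while the request is actually served by the FastCGI program mapped to /c.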

3. Dynamic mode. This lets you run a variety of FastCGI programs. Add a line like the following to httpd.conf:

AddHandler fastcgi-script .fcgi

There is also a key setting:

<Directory /path/to/MyApp>
Options +ExecCGI
</Directory>

This configuration is best applied to a directory such as cgi-bin.

Note that in the second mode, the number of server processes is controlled by -processes 1, so you can decide for yourself how many to run. We will use this mode in a key module below.

Here is a piece of FastCGI C code to illustrate:

#include <fcgi_stdio.h>

int main(void) {
    int count = 0;

    /* FCGI_Accept() blocks until the next request arrives. */
    while (FCGI_Accept() >= 0) {
        /* Header, blank line, then the start of the page. */
        printf("Content-type: text/html\r\n"
               "\r\n"
               "<html>"
               "<head>"
               "<title>FastCGI</title>"
               "<meta http-equiv=\"Content-Type\" "
               "content=\"text/html; charset=utf-8\">"
               "</head>"
               "<body>"
               "Hello world!<br>");
        /* count survives between requests because the process keeps running. */
        printf("Request number %d", count++);
        printf("</body></html>");
    }

    return 0;
}

This is a very simple example, a plain counter. Pay attention to this statement: while (FCGI_Accept() >= 0)

This is the biggest difference from an ordinary short-lifetime program. An ordinary CGI program exits when it has finished. A FastCGI program, after handling a request, returns to its initial state and waits for the next one. If this program is configured to start only one process, then no matter who visits the page, the count increases from the previous value and no new process is created. Of course, many readers will also notice that the process contains an infinite loop. If the program is more complex and has a memory leak, the consequences are more serious than with ordinary CGI, so FastCGI also demands more of the programmer.
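The difference in lifetime can be imitated without a web server. The following Python sketch is only an analogy: the two function names and the request list are made up for illustration, and a real CGI program would be a separate process rather than a function call.

```python
def handle_short_lifetime(_request: str) -> str:
    """CGI style: every request gets a fresh process, so state starts over."""
    count = 0                      # re-initialised for each request
    reply = f"Request number {count}"
    count += 1                     # lost when the process exits
    return reply


def serve_long_lifetime(requests):
    """FastCGI style: one worker loops over requests and keeps its state."""
    count = 0                      # initialised once, when the worker starts
    for _request in requests:      # stands in for `while (FCGI_Accept() >= 0)`
        yield f"Request number {count}"
        count += 1


requests = ["/c", "/c", "/c"]      # three hypothetical hits on the same page
cgi_replies = [handle_short_lifetime(r) for r in requests]
fcgi_replies = list(serve_long_lifetime(requests))
print(cgi_replies)
print(fcgi_replies)
```

The short-lifetime version always reports request 0, while the long-lifetime worker counts upward. The same persistence is what makes a leak in a long-lived worker accumulate.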

This approach should be among the most efficient and fastest of all web application solutions. According to the official figures it is about 15 times faster than ordinary CGI. In tests on my machine it handled roughly 2,400 requests per second.

Back to our topic, the paralysis of the Olympic ticketing system. The traffic at the time was reported as 8 million visits per hour, which averages out to more than 2,200 per second. That is indeed a severe test for a booking system. Under such conditions the database certainly cannot withstand this volume of access. How to design the architecture is the problem we all have to face.
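The per-second figure follows directly from the hourly one. A quick check in Python, using only the 8 million per hour quoted above:

```python
VISITS_PER_HOUR = 8_000_000        # figure quoted in the article
SECONDS_PER_HOUR = 60 * 60

average_per_second = VISITS_PER_HOUR / SECONDS_PER_HOUR
print(f"{average_per_second:.0f} requests per second on average")
```

This is an average over the whole hour. The real load at the moment sales open would be expected to peak well above it.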

To design an architecture for such high load and high traffic, first consider what the system has to do. The actual process is fairly simple:

1. User authentication

2. View the events and the number of tickets that can be booked

3. Select events and add them to the shopping cart

4. Confirm and submit the order

5. Charge for the successful order

The process is simple, but there is actually a lot involved.

The user base is large, with more than several million registered users, and for this system the 80/20 rule that applies to ordinary applications should not be assumed. On the day tickets are released, most registered users will log in, and within a very concentrated period, so concurrency will be very high. If you had enough budget, you could put 10,000 servers on the job, write a distribution algorithm, and have each server handle no more than 100,000 users, which would fully guarantee the user experience. But I don't think any company or system would actually do that, not even the wealthy Olympic organizers.

At this point many people may think that an efficient program like the FastCGI one above is exactly the solution for this situation. That is a very common mistake. I think the ticketing system was paralyzed because part of the design was too efficient while another part could not possibly be efficient. The login module, for example, is very efficient: login only compares a user name and password against the database, the data is rarely updated, and the work can be spread over a distributed database. But once users have logged in, all the pressure falls on the functions behind it, and the system is paralyzed. With that many people, no matter how efficient the front is, the complex purchase function behind it becomes a bottleneck. And if you really deployed 10,000 servers, distributing and synchronizing the data while still truly serving first come, first served would be very hard. If designed badly, it would be no different from a lottery.

So the design strategy should be: while protecting the user experience, reasonably control the number of users admitted to the system. The pressure on the design and development of everything behind it is then much smaller, and costs are clearly controlled.

The rest then becomes clear: the focus of the system is user login, not the ticket purchase and submission that is generally assumed to be the core. As for how to control the number of people entering, I think the design can follow what banks do: the system first issues the user a number, and only when it knows that resources are available does it let the user log in.

This architecture centers on the call center and the distribution of serial numbers.

1. Serial number distribution center. The technical focus here is efficiency and uniqueness: when a huge number of users arrive at once, each of them must be assigned a unique serial number very quickly. Many technologies cannot meet this demand. In my view, the FastCGI introduced at the beginning is the only choice for this module. When deploying it, we can use the single-process mode, so the serial numbers it assigns are guaranteed to be unique. Because FastCGI is so efficient, a user can be given a number quickly and move on. Of course, if that does not feel safe enough, you can put a load balancer in front, spread the load across several servers, and give each machine a different starting number with the same step size. For example, with 2 machines, the first starts at 1, the second starts at 2, and each adds 2 every time. Users on different machines still get unique numbers, and throughput doubles.
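The start-and-step scheme can be sketched in a few lines of Python. The class name `SerialAllocator` is an assumption for illustration, and like the single-process FastCGI program it assumes each allocator handles one request at a time.

```python
import itertools


class SerialAllocator:
    """Hands out queue numbers start, start + step, start + 2 * step, ..."""

    def __init__(self, start: int, step: int) -> None:
        self._numbers = itertools.count(start, step)

    def next_number(self) -> int:
        return next(self._numbers)


# Two machines behind a load balancer, as in the example above.
machine_a = SerialAllocator(start=1, step=2)
machine_b = SerialAllocator(start=2, step=2)

issued_a = [machine_a.next_number() for _ in range(3)]
issued_b = [machine_b.next_number() for _ in range(3)]
print(issued_a, issued_b)
```

With N machines the step is N and machine k starts at k, so the sequences never collide. Numbers stay in arrival order only approximately, since one machine may run ahead of the other.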

As for how to record the user's serial number, you can store it in a cookie on the client, in encrypted form. Once the number is recorded, the user enters the call center, which compares the number in hand with the number currently being called and tells the user how many people are ahead in the queue. For example, if you are automatically assigned a number past 30 million and more than 20 million people are ahead of you, I think any reasonable person will not say the system is terrible. They can only say that they got up too late, sigh that there are too many people in China, and will not keep logging in over and over.
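One way to keep users from forging a better number is to protect the cookie value with a server-side secret. The sketch below signs the number with an HMAC rather than encrypting it, which is a substitution on my part: the number stays readable but cannot be altered unnoticed. The key, the cookie layout, and the function names are all assumptions.

```python
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"   # hypothetical server-side key


def _sign(serial_text: str) -> str:
    return hmac.new(SECRET_KEY, serial_text.encode(), hashlib.sha256).hexdigest()


def make_cookie(serial: int) -> str:
    """Build a cookie value of the form '<serial>.<signature>'."""
    return f"{serial}.{_sign(str(serial))}"


def read_cookie(cookie: str) -> Optional[int]:
    """Return the serial number, or None if the cookie was tampered with."""
    serial_text, _, signature = cookie.partition(".")
    if not hmac.compare_digest(signature.encode(), _sign(serial_text).encode()):
        return None
    return int(serial_text)


cookie = make_cookie(30_000_001)
forged = "1" + cookie[cookie.index("."):]    # same signature, smaller number
print(read_cookie(cookie), read_cookie(forged))
```

Because every front-end server can verify the cookie with the shared key, no database lookup is needed to trust the number.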

2. Call center. This is probably the most troublesome part, and also the most important. Because the booking system has a browser/server (B/S) structure and the action happens on the server side, how to notify the client is a key point. In other words, when someone finishes booking and exits the system, the central control server learns of it and tells the call center to call the next number. There are two ways for the call center to work out that a number should be called, and either can be implemented with Ajax partial refreshes.

The first compares the user's number with the number being called; if they match, the client is notified to enter the system.

The second checks how many people are queued ahead of the user; if the count is zero, entry into the system is triggered.
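Both checks reduce to a comparison between the user's number and the number currently being called. A minimal sketch, assuming numbers are issued consecutively and that `called_number` is whatever the call center last announced (the function names and reply format are hypothetical):

```python
def is_called(my_number: int, called_number: int) -> bool:
    """First approach: the user's number equals the number being called."""
    return my_number == called_number


def people_ahead(my_number: int, called_number: int) -> int:
    """Second approach: count the numbers still ahead of the user."""
    return max(my_number - called_number, 0)


def poll(my_number: int, called_number: int) -> dict:
    """What one Ajax refresh could return to the browser."""
    if my_number < called_number:
        return {"status": "expired", "ahead": 0}
    ahead = people_ahead(my_number, called_number)
    return {"status": "enter" if ahead == 0 else "wait", "ahead": ahead}


print(poll(my_number=120, called_number=100))
print(poll(my_number=100, called_number=100))
```

The second approach has the side benefit that the same reply tells waiting users how long the queue in front of them is.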

Another point to watch is the refresh interval and the expiry of numbers. If the interval is too short, the pressure on the servers will be very high. If it is too long, users will see no change for a long time and the experience will be poor. Personally I think an interval tuned between 5 and 15 seconds is appropriate. The load then needs to be shared out, which means running multiple call servers. User refreshes will therefore hit different servers, which requires special handling for data synchronization. The structure is as follows:
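The trade-off in the refresh interval is easy to quantify. In the sketch below, the one million waiting users is an assumed figure chosen only to show the scale; the intervals are the 5 to 15 seconds suggested above.

```python
def polls_per_second(waiting_users: int, interval_seconds: float) -> float:
    """Average refresh requests per second if every user polls once per interval."""
    return waiting_users / interval_seconds


WAITING_USERS = 1_000_000            # assumed number of users in the queue

for interval in (5, 10, 15):
    rate = polls_per_second(WAITING_USERS, interval)
    print(f"{interval:>2} s interval: {rate:,.0f} requests per second")
```

Tripling the interval cuts the polling load to a third, which is why the interval and the number of call servers have to be chosen together.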

The message acceptance module can exchange information with the central control server in two modes: short connections, where a connection is made every time information is passed, and long connections, where a long-lived notification channel is established between the message acceptance module and the central control server. Because the information must be timely, long connections are more appropriate.

Serial numbers have to be exchanged between the message acceptance module and the central control server. Because you do not know which server the user holding a given number will hit, the expiry mechanism has to run on several servers at once. In other words, when a user exits, the central control server learns of it, determines the next number after the last one admitted, and sends it to all front-end servers. The front ends must make sure the user is notified. If the user does not log in or authenticate within a given number of notifications, the front end tells the back end that this number is invalid, and the system assigns the next number to the front ends for notification.

For a more refined design, you can also set up message notification between the front-end servers. When one server finds that the holder of the number is connected to it, it notifies the other front ends to stop checking for this number, saving as many resources as possible.
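The expiry rule described above can be sketched as a small state machine. The class name `CallCenter`, the limit of three notifications, and the single-server simplification are all assumptions; the text above spreads this logic over several front ends and a central control server.

```python
class CallCenter:
    """Calls one number at a time and skips holders who never respond."""

    def __init__(self, max_notifications: int = 3) -> None:
        self.max_notifications = max_notifications
        self.current = 1              # number currently being called
        self.notified = 0             # notifications sent for that number

    def notify(self) -> int:
        """Send one more notification and return the number it was sent for."""
        if self.notified >= self.max_notifications:
            self.current += 1         # holder never showed up: number expires
            self.notified = 0
        self.notified += 1
        return self.current

    def confirm(self, number: int) -> bool:
        """A user tries to log in; only the number being called is accepted."""
        return number == self.current


center = CallCenter(max_notifications=3)
calls = [center.notify() for _ in range(4)]
print(calls, center.confirm(1), center.confirm(2))
```

After three unanswered notifications for number 1, the fourth call moves on to number 2, and a late login with number 1 is rejected.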

3. Central control server. I used this approach when developing community and live chat room systems, and I use it here too. In this system, however, the central control server does not have to be a separate physical server. It can be just a module whose main purpose is to notify the front-end servers. Because the data is simple, distributing the control messages is easy and no particularly complex protocol is needed.

4. Authentication center. The only change is that it must also check that the user's serial number is valid and genuine.

5. Ticketing center. There are many ways to distribute it and many architectures to learn from, so I will not go into them here. In this framework the only thing to establish is how many people it can support buying online at the same time.

The first three parts are the core of this framework. Because the number of people can be controlled, the system behind can even be the old booking system; you only need to decide how many people to let in at once. In other words, the ticket window has not changed, but we no longer swarm it. We are civilized people: please line up.
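The idea of letting in only as many users as the booking system can carry can be sketched as a gate with a fixed capacity. `AdmissionGate` and its method names are hypothetical, and a real deployment would keep this state in the central control module rather than in one object.

```python
from collections import deque
from typing import Optional


class AdmissionGate:
    """Admits users up to a fixed capacity and queues everyone else."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.inside = set()           # numbers currently in the booking system
        self.waiting = deque()        # numbers queued in arrival order

    def arrive(self, number: int) -> bool:
        """Return True if the user goes straight in, False if queued."""
        if len(self.inside) < self.capacity and not self.waiting:
            self.inside.add(number)
            return True
        self.waiting.append(number)
        return False

    def leave(self, number: int) -> Optional[int]:
        """A user exits; return the number called in next, if any."""
        self.inside.discard(number)
        if self.waiting and len(self.inside) < self.capacity:
            called = self.waiting.popleft()
            self.inside.add(called)
            return called
        return None


gate = AdmissionGate(capacity=2)
results = [gate.arrive(n) for n in (1, 2, 3)]
called_next = gate.leave(1)
print(results, called_next, sorted(gate.inside))
```

While there is spare capacity, taking a number is invisible and users walk straight in. Only when the limit is reached does anyone see a queue.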

Of course, the structure behind can also be optimized and designed to maximize capacity, and the ticketing function itself can borrow this model. Basketball, for example, is a sport with a large audience. Everyone wants to watch Kobe Bryant dunk in person, so everyone may rush to grab those tickets first, causing a local data hot spot to collapse and affecting the whole system. This module can be embedded there as well. When few people are buying, taking a number is invisible and you go straight in. Once the number of buyers reaches the limit, sorry, please line up.

Once entry is restricted, the people who have not entered and the people who are buying are not in the same system, so those waiting do not get in the way of those inside. Buyers are served quickly and can complete their orders fast. After an order is submitted, if the system finds that the person cannot book any other tickets, it can let another person in, or, to be a little more ruthless, kick that person out immediately to save resources.

Also, because you can control the number of users entering, the other parts of the system are much simpler to design. Do as much as the money allows: if management wants things to go faster and the budget is sufficient, let more people in; if you are unsure, let very few in. Or make a rough estimate and issue only a fixed quantity of numbers. For example, to sell 100,000 tickets, issue 500,000 numbers and stop when they run out. Users who arrive late do not even get a number and can only regret not being on time, which is much better than the system collapsing.
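Capping the quantity of numbers is a small extension of the number issuer. A minimal sketch, where `LimitedIssuer` is a hypothetical name and the 500,000 limit is the example figure from the paragraph above:

```python
from typing import Optional


class LimitedIssuer:
    """Issues numbers 1..limit, then refuses everyone who comes later."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.issued = 0

    def take_number(self) -> Optional[int]:
        if self.issued >= self.limit:
            return None               # all numbers are gone
        self.issued += 1
        return self.issued


issuer = LimitedIssuer(limit=500_000)   # 100,000 tickets, 500,000 numbers
first = issuer.take_number()
print(first, issuer.limit - issuer.issued, "numbers left")
```

Refusing a late user costs one comparison, so the system stays responsive even after every number has been handed out.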

For this architecture, the design focus is keeping the overall resources of the system in a controllable state. Many similar systems, such as registrations, exams, and flash sales, can be handled in a similar way. A good architecture does not solve every problem; it makes clear what you can and cannot do.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.