As the National Day holiday approached, discussion of the Ministry of Railways' 12306 site grew louder and louder. This article starts from the 12306 website and broadens into a discussion of website performance in general, and is a strong reference for entrepreneurs and technology enthusiasts alike. The author, Chen Hao, has 14 years of software development experience and 8 years of project and team management experience.
When 12306.cn went down, it was cursed by people all over the country. I have been thinking about this for two days, and I would like to offer a rough discussion of the site's performance. It is hasty, and based entirely on my limited experience and understanding. I will discuss only performance; I will not touch the UI, the user experience, or design decisions such as separating payment from the ticket-purchase step.
Business
No technology exists apart from its business requirements, so before explaining the performance problem we first need to talk about the business.
First, some people compare this system with QQ or online games. I think the two are not comparable: in online games and QQ, a login mostly accesses the user's own data, while a booking system accesses a central pool of ticket data, which is not the same thing. Don't assume that because online games or QQ can handle their load, this system can too. Relative to an e-commerce system, the back-end load of online games and QQ is actually simple.
Second, some people say that booking train tickets for Spring Festival is like a website flash sale. The two are indeed similar, but if you look past the surface you will find differences. Buying a train ticket involves a great many queries: check the times, check the seats, check the berths; if one train has nothing, check another. It is accompanied by a flood of query operations, and placing the order requires database transactions. A flash sale, by contrast, just kills directly: you can accept only the first N user requests (touching no back-end data at all, merely logging each user's order action), and as long as your servers' clocks are accurately synchronized, no database operation is needed at sale time. Once enough orders have come in, stop the sale, bulk-write the orders to the database afterwards, and then tell the buyers whether they got the item. Train tickets are far more complicated than a flash sale.
Third, some people compare this system with the Olympic ticketing system. I think they are still different. The Olympic ticketing system also collapsed the moment it went live, but the Olympics used a lottery: there was no first-come-first-served scramble. After the draw, the system only needed to collect information beforehand; it did not need to guarantee data consistency at request time, there were no locks, and it was easy to scale horizontally.
Fourth, a booking system resembles an e-commerce order system: both must manage inventory through 1) occupying inventory, 2) payment (optional), and 3) deducting inventory. This requires consistency checks, i.e. locking the data under concurrency. E-commerce operators basically handle this asynchronously: your order is not processed immediately but with a delay, and only when processing succeeds does the system send you a confirmation email. I believe many of you have received emails saying your order did not go through. In other words, data consistency is the bottleneck under concurrency.
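The occupy/pay/deduct flow described above can be sketched with a lock guarding the consistency check. This is a minimal single-process illustration of why the lock serializes concurrent buyers, not the actual implementation of any real booking system:

```python
import threading

class Inventory:
    """Toy inventory implementing the occupy -> pay -> deduct flow."""

    def __init__(self, stock):
        self.stock = stock                  # tickets still unsold
        self.held = 0                       # tickets occupied by unpaid orders
        self.lock = threading.Lock()        # the consistency lock that limits concurrency

    def occupy(self, n):
        """Step 1: reserve n tickets; fails if not enough remain."""
        with self.lock:
            if self.stock - self.held < n:
                return False
            self.held += n
            return True

    def deduct(self, n):
        """Step 3: after successful payment, turn the hold into a deduction."""
        with self.lock:
            self.held -= n
            self.stock -= n

    def release(self, n):
        """Payment failed or timed out: give the hold back."""
        with self.lock:
            self.held -= n

inv = Inventory(stock=2)
assert inv.occupy(1)          # first buyer reserves a ticket
assert inv.occupy(1)          # second buyer reserves the last one
assert not inv.occupy(1)      # third buyer is refused
inv.deduct(1)                 # first buyer pays
inv.release(1)                # second buyer's payment times out
assert inv.occupy(1)          # the released ticket is available again
```

Every buyer, successful or not, must pass through that one lock; that is the bottleneck the asynchronous, delayed-processing design works around.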
Fifth, the railway's ticket-release model is truly perverse: tickets are released in sudden batches, and there are nowhere near enough to go around, so everyone has to scramble — ticket-grabbing as a business with Chinese characteristics. The moment a batch is released, millions or even tens of millions of people swarm in to query and order. Within ten minutes a website can receive tens of millions of visits, which is terrifying. It is said 12306's peak was 1 billion PV a day, concentrated between 8 and 10 a.m., with PV in the tens of millions per second at the peak.
Let me say a few more things:
Inventory is the nightmare of B2C, and inventory management is remarkably complex. If you don't believe me, ask any traditional or online retailer how hard it is for them to manage inventory; otherwise there would not be so many people asking about the inventory problem. (Or read the biography of Steve Jobs and you'll know why Tim Cook could take over as Apple's CEO: he solved Apple's inventory problem.)
For a website, a heavy load of page views is easy to handle; query load is somewhat harder but can still be handled by caching query results; the hardest part is order load, because placing an order must touch inventory. That is why ordering is basically done asynchronously. During last year's Double 11, Taobao handled roughly 600,000 orders per hour; JD.com could support about 400,000 per day (surprisingly worse than 12306); Amazon five years ago could support 700,000. Clearly, order-placing throughput is not as high as we tend to imagine.
Taobao is much simpler than a B2C site, because it has no warehouses, and therefore no need to query and update the same item's inventory across N warehouses. When an order is placed, a B2C site must find a warehouse that is close to the user and has stock, which takes a lot of computation. Imagine you buy a book in Beijing and the Beijing warehouse is out of stock: the system has to pull from the surrounding warehouses, checking whether Shenyang or Xi'an has the goods, and failing that, the warehouse in Jiangsu, and so on. Taobao has none of this: each merchant keeps its own inventory, the number is the merchant's own responsibility, and that is good for performance.
Data consistency is the real performance bottleneck. Some say nginx can handle 100,000 static requests per second, and I don't doubt it. But that is only for static requests, and it is a theoretical figure: as long as bandwidth and I/O are strong enough, the server has the computing capacity, and the supported connection count can sustain 100,000 TCP connections, there is no problem. In the face of data consistency, however, that 100,000 becomes a completely unreachable theoretical value.
I have said all this to make one point: from the business side, we need to truly understand how perverse the Spring Festival train-booking business is.
Front-End Performance Optimization Techniques
There are a number of well-known approaches to performance problems, and I believe that with these techniques the 12306 site could achieve a qualitative leap in performance.
1. Front-end load balancing
Spread user traffic evenly across multiple web servers through a DNS load balancer, usually combined with load-based redirection at the router. This reduces the request load on any single web server. Because HTTP requests are short jobs, a very simple load balancer can do this. It is best to also have a CDN so that users connect to the server nearest them (a CDN usually comes with distributed storage). (For a more detailed description of load balancing, see "Back-end Load Balancing.")
2. Reduce the number of front-end requests
I took a look at 12306.cn: opening the home page requires more than 60 HTTP requests, and the booking page more than 70, even though browsers now issue requests concurrently. With just 1 million users, that is 60 million requests — far too many. A single login/query page would suffice. Combine the JS into one file, the CSS into one file, and the icons into one sprite image displayed with CSS positioning. Minimize the number of requests.
3. Reduce page size, increase bandwidth
Few companies in the world dare to run an image service, because images consume enormous bandwidth. In the broadband era it is hard to remember how, back in the dial-up era, pages could barely use images at all (the mobile web faces the same constraint today). I looked at 12306's front page: the total download is around 900 KB. If you have visited before, the browser caches most of it and you only download about 10 KB.
But imagine an extreme case: 1 million users visiting simultaneously, all for the first time, each needing to download 1 MB. If everything must be delivered within 120 seconds, the bandwidth required is 1 MB × 1M / 120 s × 8 ≈ 66 Gbps. Astonishing. So I estimate that on that day, 12306's congestion was basically network bandwidth, which is why what you mostly saw was no response at all. Later, browser caches helped 12306 shed a great deal of bandwidth consumption, so the load reached the back end, and the back end's data-processing bottleneck surfaced — which is why you then saw many HTTP 500-style errors, meaning the servers had collapsed.
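The 66 Gbps figure above is worth checking; a back-of-the-envelope version of the same arithmetic:

```python
# Sanity-check the bandwidth estimate from the text.
users = 1_000_000      # simultaneous first-time visitors
page_mb = 1            # megabytes each must download
window_s = 120         # seconds in which everyone should finish

# MB -> megabits (x8), spread over the window, then megabits -> gigabits.
gbps = users * page_mb * 8 / window_s / 1000
print(round(gbps, 1))  # -> 66.7
```

Any one of the three inputs dominates: halve the page size or double the window and the requirement halves, which is exactly why caching and smaller pages matter so much here.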
4. Static front-end pages
Make the pages and data that rarely change static, and gzip them. Another trick is to put these static pages in SHM (a memory-backed directory such as /dev/shm): files are read and returned straight from memory, avoiding expensive disk I/O.
5. Optimize queries
Many people issue the same query, so a reverse proxy can merge these identical concurrent queries. The technique relies mainly on a query-result cache: the first query goes to the database, the result is placed in the cache, and subsequent identical queries hit the cache directly. Hash each query, and use NoSQL technology to implement this. (The same technique can also be used to generate static pages.)
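The query-result cache described above can be sketched in a few lines. This is a minimal in-process version; `query_database` is a hypothetical stand-in for the real database call, and a production version would live in a shared store such as a NoSQL cache rather than a local dict:

```python
import hashlib
import time

cache = {}        # query-hash -> (result, timestamp)
TTL = 10          # seconds a cached result stays valid

def query_database(sql):
    """Hypothetical stand-in for the real (expensive) database call."""
    return f"result of {sql}"

def cached_query(sql):
    """Hash the query text; serve repeated queries from the cache."""
    key = hashlib.md5(sql.encode()).hexdigest()
    hit = cache.get(key)
    if hit and time.time() - hit[1] < TTL:
        return hit[0]                        # cache hit: no database work
    result = query_database(sql)             # cache miss: one database trip
    cache[key] = (result, time.time())
    return result
```

With a million users asking "are there seats on train X," the first request pays for the database trip and the rest read memory, which is the whole point of merging identical queries.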
As for seat-count queries, I personally feel there is no need to display a number: showing "available" or "sold out" is enough. This greatly simplifies the system and improves performance.
6. Caching issues
Caches can hold dynamic pages or query results. A cache usually raises a few problems:
1) Cache updating, also called cache/database synchronization. There are roughly two approaches: one is to let cache entries time out, expire, and be re-queried; the other is for the back end to notify the front end whenever data changes. The former is simple to implement but not very real-time; the latter is more complex to implement but much more real-time.
2) Cache eviction (paging). Memory may not be enough, so some inactive data must be swapped out, much like the operating system swaps memory pages. FIFO, LRU, and LFU are the classic eviction algorithms; see Wikipedia's article on caching algorithms.
3) Cache rebuilding and persistence. A cache lives in memory, which the system reclaims, so the cache can be lost. If it is lost it must be rebuilt, and when the data volume is large, rebuilding is slow and will hurt the production environment. Persisting the cache therefore also needs consideration.
Many capable NoSQL products support all three of these caching concerns well.
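Of the three problems above, eviction (item 2) is the most mechanical. A minimal LRU sketch using the standard library — the same policy the classic algorithms list names, not any particular product's implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU eviction: discard the least-recently-used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()           # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used

c = LRUCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")                 # "a" becomes most recently used
c.put("c", 3)              # evicts "b", the least recently used
assert c.get("b") is None and c.get("a") == 1
```

FIFO would skip the `move_to_end` on reads; LFU would track hit counts instead of recency. The data structure choice is the whole difference between the three.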
Back-End Performance Optimization Techniques
The previous section covered front-end performance optimization, after which the front end may no longer be the bottleneck; the performance problem then moves to the back-end data. Here are a few common back-end optimization techniques.
1. Data redundancy
Data redundancy means denormalizing the database — duplicating data to avoid relatively expensive operations such as table joins — at the cost of data consistency, which carries real risk. Many people use NoSQL stores, which are fast largely because of data redundancy, but they carry the same consistency risk. This has to be analyzed and handled per business case. (Note: migrating from a relational database to NoSQL is easy; going back from NoSQL to relational is hard.)
2. Data mirroring
Almost all major databases support mirroring, i.e. replication. The benefit of mirroring is load balancing: the load on a single database can be spread across multiple machines while preserving data consistency (e.g. Oracle's SCN). Most importantly, it also gives high availability: if one machine dies, another is still serving.
Data consistency across mirrors can be a complex problem, so we can partition within a single datum: split a best-selling item's inventory across different servers. For example, if a hot item has 10,000 units of stock, set up 10 servers, each holding 1,000 units — just like a B2C site's multiple warehouses.
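The idea above — splitting one hot item's stock into independent shards so no single counter is contended — can be sketched as follows. Here each "server" is just a counter in a list; in reality each shard would be its own database holding its own lock:

```python
import random

class ShardedInventory:
    """One hot item's stock split across several independent shards."""

    def __init__(self, total, shards):
        self.shards = [total // shards] * shards   # e.g. 10,000 units over 10 shards

    def buy(self):
        """Try a random shard first; fall back to any shard with stock left."""
        order = list(range(len(self.shards)))
        random.shuffle(order)                      # spreads buyers across shards
        for i in order:
            if self.shards[i] > 0:
                self.shards[i] -= 1                # each shard contends only with itself
                return True
        return False                               # genuinely sold out everywhere

inv = ShardedInventory(total=10_000, shards=10)
sold = sum(inv.buy() for _ in range(10_000))
assert sold == 10_000 and not inv.buy()            # all stock sells; then refusal
```

The trade-off is visible in `buy`: a buyer may have to probe several shards near sell-out, but for most of the sale ten buyers proceed in parallel where one lock would have serialized them.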
3. Data partitioning
One problem mirroring cannot solve is a data table with too many records, which makes database operations too slow. So, partition the data. There are several common approaches:
1) Partition the data by some logical category. A train booking system, for example, could partition by railway bureau, by train type, by departure station, or by destination... In any case, the result is a set of tables with the same fields but different categories, so the tables can live on different machines and share the load.
2) Split the data by field, i.e. vertical partitioning. For example, put a table's rarely changed fields in one table and its frequently changed fields in another, turning one table into a 1-to-1 pair. This reduces the number of fields per table and can improve performance; too many fields can also cause one record to span multiple storage pages, which hurts both read and write performance. But it brings a lot of complicated bookkeeping.
3) Partition the data evenly. Because the first method does not necessarily divide data evenly (some category may still be huge), another approach is uniform distribution, e.g. splitting the table by ranges of the primary-key ID.
4) Partition the same datum. This was mentioned above under data mirroring: split the same item's inventory value across different servers — e.g. 10,000 units across 10 servers, each table holding 1,000 — and then load-balance among them.
Each of these partitioning schemes has pros and cons; the first is the most commonly used. Once data is partitioned, you need one or more schedulers so the front-end program knows where to find each piece of data. Partitioning the train-ticket data by origin and destination city would give a system like 12306 a very significant performance boost.
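The "scheduler" mentioned above is, at its simplest, a routing table. A sketch of primary-key-range routing (scheme 3); the shard names and boundaries here are invented for illustration:

```python
# Route a record to its partition by primary-key range.
# (low inclusive, high exclusive, server) -- names are hypothetical.
PARTITIONS = [
    (0,         1_000_000, "db-shard-0"),
    (1_000_000, 2_000_000, "db-shard-1"),
    (2_000_000, 3_000_000, "db-shard-2"),
]

def route(record_id):
    """The scheduler: tell the front end which server holds this record."""
    for low, high, server in PARTITIONS:
        if low <= record_id < high:
            return server
    raise KeyError(f"no partition covers id {record_id}")

assert route(42) == "db-shard-0"
assert route(1_500_000) == "db-shard-1"
```

Scheme 1 (partition by category) looks the same, except the routing key is a category such as railway bureau or destination rather than an ID range.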
4. Back-end system load balancing
Data partitioning, as described above, reduces load to some extent, but it cannot reduce the load on hot items — for train tickets, think of the main routes out of the major cities. That is where data mirroring is needed to spread the load. And with data mirroring, you are bound to need load balancing; at the back end, a router-style load balancer is hard to use, because it balances traffic, and traffic does not reflect how busy a server is. So we need a task-allocation system that also monitors the load on every server.
A task-allocation server faces several difficulties:
Load is hard to define. What does "busy" mean — high CPU, high disk I/O, high memory usage, high concurrency, or a high memory-paging rate? You may need to consider all of them. This information is sent to the task allocator, which picks the server with the lightest load.
The task-allocation server keeps a task queue, and tasks must not be lost, so the queue needs persistence. It should also be able to hand tasks to compute servers in batches.
What if the task-allocation server itself dies? That calls for high-availability techniques such as live-standby or failover. We also need to handle how the persisted task queue is transferred to another server.
I have seen many systems distribute tasks statically — some by hash, some simply round-robin. Neither is good enough: for one, the balancing is imperfect; worse, the fatal flaw of any static scheme is that when a compute server crashes, or a new server is added, the allocator has to find out.
Another approach is preemptive (pull-based) load balancing: the downstream compute servers fetch tasks from the task server themselves, each deciding on its own whether it wants another task. The benefit is that this simplifies the system and lets you remove or add compute servers in real time. The one drawback is that if some tasks can only be handled by certain kinds of servers, it introduces some complexity. Overall, though, this is probably the better way to balance load.
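The pull-based scheme above can be sketched with a shared queue and worker threads. Threads stand in for compute servers, and the in-memory queue stands in for the task server's (in reality persisted) queue:

```python
import queue
import threading

tasks = queue.Queue()            # the task server's queue (persisted in a real system)
results = []                     # what each "compute server" processed
results_lock = threading.Lock()

def compute_server(name):
    """A worker that pulls tasks whenever it decides it has capacity."""
    while True:
        try:
            job = tasks.get(timeout=0.2)   # preemptive: the worker asks for work
        except queue.Empty:
            return                          # nothing left; the worker retires itself
        with results_lock:
            results.append((name, job))
        tasks.task_done()

for i in range(100):
    tasks.put(i)

# Add or remove workers freely -- the allocator needs no reconfiguration.
workers = [threading.Thread(target=compute_server, args=(f"w{i}",)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

assert len(results) == 100       # every task handled exactly once
```

Note what is absent: the task server never tracks which workers exist or how busy they are. A slow worker simply pulls less often, which is the self-balancing property the text describes.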
5. Asynchronous processing, throttling, and batching
Asynchronous processing, throttling (throttle valves), and batching all require queuing the concurrent requests.
Asynchronous processing, in business terms, generally means collecting requests and handling them later. Technically, it lets each handler run in parallel and scale horizontally. But asynchrony brings a few problems: a) returning results to the caller involves communication between processes or threads; b) if the program needs to roll back, the rollback can get complicated; c) asynchrony usually comes with multithreading and multiprocessing, whose concurrency control is tiresome; d) many asynchronous systems are built on messaging, and lost or reordered messages can also be thorny problems.
Throttling does not actually improve performance; it only keeps the system from collapsing under more traffic than it can handle. It is really a protection mechanism. Throttling is typically applied in front of systems you cannot control, such as the banking system connected to your site.
Batching bundles a pile of essentially identical requests into one operation. For example, if everyone is buying the same item at the same moment, there is no need to write the database once per purchase; collect a certain number of requests and apply them in one operation. The technique has many uses. Consider conserving network bandwidth: we all know the network MTU (maximum transmission unit) — 1500 bytes for Ethernet, over 4,000 bytes on some fiber links. If a packet does not fill the MTU, bandwidth is wasted, because the NIC driver is only efficient when reading whole blocks. So when sending, we should accumulate enough data before doing the network I/O; that is batching too. Batching's enemy is low traffic, so a batching system usually sets two thresholds: a batch size and a timeout. As soon as either is met, the batch is submitted for processing.
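The two-threshold rule at the end of that paragraph (flush on batch size or on timeout, whichever comes first) can be sketched directly. A minimal single-threaded version; a real one would run `poll` on a timer thread:

```python
import time

class Batcher:
    """Flush when either the batch-size or the timeout threshold is hit."""

    def __init__(self, max_items, max_wait, flush):
        self.max_items = max_items    # threshold 1: operation volume
        self.max_wait = max_wait      # threshold 2: seconds before a partial batch ships
        self.flush = flush            # callback receiving the whole batch at once
        self.items, self.first_at = [], None

    def add(self, item):
        if not self.items:
            self.first_at = time.monotonic()
        self.items.append(item)
        if len(self.items) >= self.max_items:
            self._flush()

    def poll(self):
        """Call periodically; ships a partial batch once it has waited long enough."""
        if self.items and time.monotonic() - self.first_at >= self.max_wait:
            self._flush()

    def _flush(self):
        self.flush(self.items)
        self.items, self.first_at = [], None

batches = []
b = Batcher(max_items=3, max_wait=0.05, flush=batches.append)
for i in range(7):
    b.add(i)               # the size threshold fires at 3 and 6 items
time.sleep(0.06)
b.poll()                   # the lone leftover item ships on timeout
assert batches == [[0, 1, 2], [3, 4, 5], [6]]
```

The timeout threshold is what defeats batching's "enemy," low traffic: without it, a trickle of requests would wait in the buffer forever.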
So, wherever there is asynchrony there is generally a throttling mechanism and a queue; wherever there is a queue there is persistence; and the system then generally processes in batches.
The "queuing system" designed by Yunfeng is exactly this technique. It works like an e-commerce order system: my system has received your ticket request but has not actually processed it yet; it throttles the flood of requests down to its own processing capacity and works through them bit by bit. When processing completes, it emails or texts the user: you can now actually go buy your ticket.
Here I would like to examine Yunfeng's queuing system from the angle of business and user requirements, because while it seems to solve the problem technically, there may still be things worth thinking through on the business side:
1) DoS attacks on the queue
First, is this queue a simple first-come-first-served line? That is not good enough, because it cannot eliminate scalpers, and a bare ticket_id invites DoS attacks. For example, I request N ticket_ids, enter the purchase flow, and simply never buy — tying you up half an hour each time — and I can easily keep genuine buyers from getting tickets for days. Some say users should queue with their ID-card numbers, with the same ID required at purchase, but that still cannot eliminate scalpers queuing on others' behalf, since they can register N accounts, queue, and just not buy. At that point the scalpers only need to do one thing: make the site unusable for normal people, so users can only buy through them.
2) Consistency of the queue
Does operating this queue require a lock? As soon as there is a lock, performance will not scale. Imagine 1 million people all asking you to assign them a queue number: the queue itself becomes the performance bottleneck. A database certainly cannot deliver this performance, so the result may be even worse than the present system.
3) Waiting time in the queue
Is half an hour enough time to buy a ticket? What if the user happens to have no internet access just then? If the window is short, users complain; if it is long, the people queued behind them complain. This method may run into many problems in practice. Besides, half an hour is far too long and completely unrealistic — let's use 15 minutes as an example: with 10 million users and only 10,000 admitted at a time, each batch taking 15 minutes to complete its operations, processing all 10 million users takes 1000 × 15 min = 250 hours, about ten and a half days. The train will have long since departed. (I am not exaggerating: according to Ministry of Railways experts, about 1 million orders are placed per day on average these days, so handling 10 million users would take 10 days.) The arithmetic may be simplistic; my point is only that with such low throughput, queuing may not solve the problem.
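The throughput arithmetic above, spelled out:

```python
# The queue's throughput ceiling, using the text's example numbers.
users = 10_000_000            # people waiting to buy
slots = 10_000                # users admitted at a time
minutes_per_batch = 15        # time each batch gets to finish buying

batches = users // slots                       # 1,000 sequential batches
hours = batches * minutes_per_batch / 60
print(hours, hours / 24)                       # 250.0 hours, about 10.4 days
```

The structure of the formula is the real lesson: total time scales with `users / slots`, so without more parallelism (more simultaneous slots, i.e. more queues), shortening the per-user window barely helps.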
4) Distribution of the queue
Is a single queue enough for this queuing system? Not good enough. If everyone admitted is buying the same kind of ticket (say, sleeper berths on one train) — that is, everyone is grabbing the same ticket — then the system load will likely concentrate on one server. The better approach is to queue users by their needs, i.e. by origin and destination. Then there can be many queues, and with multiple queues the system can scale horizontally.
I think online shopping is a perfectly good model to borrow from. While queuing (placing the order), collect the user's information and desired tickets, and let the user set purchase priorities — for example, if no sleeper on train A, take a sleeper on train B; failing that, a hard seat; and so on. Have the user pre-pay the required amount, and then let the system process the orders fully automatically and asynchronously. On success or failure, send the user a text message or email.
Such a system not only saves that half hour of interactive user time by processing automatically and quickly, it can also merge identical ticket requests for batch processing (fewer database operations). Its greatest advantage is knowing what the queued users actually want: you can not only optimize the queues and distribute users across them, but also — like Amazon's wish lists — let the Ministry of Railways schedule and adjust trains accordingly. (Finally, the queuing system must persist its queue to a database or elsewhere, not keep it only in memory — otherwise one machine failure and they will be cursed all over again.)
Summary
Having written all this, let me summarize:
0) However you design it, your system must be able to scale horizontally with ease. That is, every link in your entire data flow must scale horizontally. Then, when your system hits a performance problem, "add three times the servers" will not be a joke.
1) None of the techniques above can be mastered overnight; without long-term accumulation, there is basically no hope. And as we have seen, whichever technique you adopt will introduce some complexity into your design.
2) Centralized ticket sales are hard to handle. The techniques above could give the booking system a several-hundred-fold performance improvement; but building separate sites in the various provinces and cities and selling tickets separately is the best way to give the existing system a qualitative upgrade.
3) The business model of the pre-Spring-Festival ticket rush — supply falling far short of demand, with tens of millions or even hundreds of millions of people logging in at 8 a.m. sharp to grab tickets at the same moment — is perverse beyond perverse. This perverse business form guarantees that no matter what they do, they will be cursed.
4) Building out a system for two weeks of peak use a year, idle the rest of the time, is rather a pity — but only the railways could produce something like this.