Flash sales and snap-up promotions in e-commerce are nothing new to us. From a technical point of view, however, they are a severe test for a web system. When a web system receives tens of thousands of requests or more within one second, its optimization and stability become critical. This article focuses on the technical implementation and optimization of flash sales and snap-up purchases, and along the way explains, at a technical level, why train tickets are always so hard to grab.
I. Challenges posed by large-scale concurrency
In previous work I dealt with a flash-sale feature under 50,000 concurrent requests per second, and in the process the whole web system ran into many problems and challenges. If a web system is not optimized specifically for this load, it easily falls into an abnormal state. Let us now discuss the ideas and methods for optimization.
1. Reasonable design of the request interface
A flash-sale or snap-up page usually has two parts: static content such as HTML, and the backend web request interface that handles participation in the sale.
The static content is usually served through a CDN and is generally not under much pressure; the core bottleneck is the backend request interface. This interface must support highly concurrent requests and, just as importantly, must be as "fast" as possible, returning the result of the user's request in the shortest possible time. To achieve this, the interface's backend storage should preferably use memory-level operations. Going directly to storage such as MySQL is inappropriate; if complex business logic requires it, asynchronous writes are recommended.
Of course, some flash sales and snap-ups use "delayed feedback": the user does not learn the result immediately and can only see on the page, some time later, whether the purchase succeeded. This is a "lazy" approach, however; it makes for a poor user experience and is easily seen by users as a "black-box operation".
2. The high-concurrency challenge: it must be "fast"
We usually measure the throughput of a web system in QPS (Queries Per Second, the number of requests processed per second). This metric is critical for handling high-concurrency scenarios with tens of thousands of requests per second. As an example, assume the average response time of a business request is 100 ms, and that the system has 20 Apache web servers, each configured with MaxClients set to 500 (the maximum number of Apache connections).
So the theoretical peak QPS of our web system (an idealized calculation) is:
20*500/0.1 = 100000 (100,000 QPS)
Our system looks very powerful: it can handle 100,000 requests in one second, so a flash sale at 50,000 requests per second seems to be a "paper tiger". Reality, of course, is not so ideal. Under high concurrency the machines are heavily loaded, and the average response time increases greatly.
On the web server side, the more connection processes Apache opens, the more context switches the CPU has to handle, which adds CPU overhead and directly increases the average response time. The MaxClients value above should therefore be chosen by weighing CPU, memory, and other hardware factors together; more is definitely not better. You can benchmark with ab (ApacheBench), the tool that ships with Apache, and pick a suitable value. For storage, we choose Redis, which operates at memory level, because storage response time is critical under high concurrency. Network bandwidth is also a factor, but these request packets are generally small and rarely become the bottleneck. Cases where load balancing becomes the system bottleneck are relatively rare and are not discussed here.
Here is the problem. Suppose that under 50,000 requests per second, the average response time of our system rises from 100 ms to 250 ms (in practice, possibly even more):
20*500/0.25 = 40000 (40,000 QPS)
So our system is left with 40,000 QPS against 50,000 requests per second, a shortfall of 10,000 requests per second.
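The two calculations above can be reproduced with a short Python sketch. The function name `theoretical_qps` is made up for illustration, and the model is the same idealized one used in the text: every connection slot is always busy and each request holds a slot for exactly the average response time.

```python
def theoretical_qps(servers: int, max_clients: int, avg_response_ms: int) -> float:
    """Idealized peak QPS: total connection slots divided by the time one request holds a slot."""
    return servers * max_clients * 1000 / avg_response_ms


incoming = 50_000                             # requests per second during the sale
ideal = theoretical_qps(20, 500, 100)         # 100 ms average response time
loaded = theoretical_qps(20, 500, 250)        # 250 ms average response time under load
shortfall = incoming - loaded                 # requests per second with no free connection

print(f"ideal={ideal:.0f} loaded={loaded:.0f} shortfall={shortfall:.0f}")
```

The only input that changes between the two cases is the response time, which is why keeping the interface "fast" matters more than adding connection slots.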
And this is where the real nightmare begins. Take a highway junction as an example: 5 cars arrive every second and 5 cars pass every second, so the junction operates normally. Suddenly only 4 cars can pass per second while the incoming traffic stays the same, and the result is inevitably a big traffic jam (it feels like 5 lanes suddenly becoming 4).
Likewise, within a given second all 20 * 500 available connection processes are working at full capacity, yet there are still 10,000 new requests with no connection process available, so it is no surprise that the system falls into an abnormal state.
In fact, something similar happens in ordinary, non-high-concurrency business scenarios: when one business request interface has a problem and responds very slowly, the response time of the whole web request is stretched out, the web server's available connections gradually fill up, and other, normal business requests find no connection process available.
Even more frightening is a characteristic of user behavior: the less available the system is, the more frequently users click. This vicious circle eventually leads to an "avalanche" (one web machine goes down, its traffic spreads to the other machines that are still working, those machines go down in turn, and the cycle repeats), and the entire web system is dragged down.
3. Restart and overload protection
If the system suffers an "avalanche", hastily restarting services will not solve the problem. The most common outcome is that a service goes down again right after it starts up. In this situation it is best to reject traffic at the ingress layer first and then restart. If services such as Redis or memcache have also gone down, pay attention to "warming up" when restarting, which is likely to take a long time.
In flash-sale and snap-up scenarios, traffic often exceeds what our system was prepared for, or even what we imagined. Overload protection is therefore necessary: if the system is fully loaded, rejecting requests is itself a protective measure. Filtering on the front end is the simplest approach, but it is a practice users widely condemn. It is more appropriate to place the overload protection at the CGI entry layer, so that direct client requests are returned quickly.
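One way to picture entry-layer overload protection is a counter of in-flight requests that rejects immediately once a fixed capacity is reached. The Python sketch below is only an illustration under that assumption: `OverloadGuard`, `handle`, and the 503 response are hypothetical names and choices, not part of any particular CGI framework.

```python
import threading


class OverloadGuard:
    """Admit at most `capacity` in-flight requests; reject the rest immediately."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self) -> bool:
        with self._lock:
            if self._in_flight >= self._capacity:
                return False          # fully loaded: shed this request
            self._in_flight += 1
            return True

    def leave(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(guard: OverloadGuard, work):
    """Entry point: return quickly with 503 when overloaded, otherwise run the work."""
    if not guard.try_enter():
        return 503, "busy, please retry"
    try:
        return 200, work()
    finally:
        guard.leave()
```

The rejection path does no storage access at all, so a rejected request costs almost nothing and the requests that were admitted keep their normal response time.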
II. Cheating: offense and defense
The "massive" volume of requests received during flash sales and snap-ups is in fact heavily inflated. To "grab" the merchandise, many users rely on ticket-grabbing tools and similar helpers that send as many requests to the server as possible on their behalf. A subset of advanced users write powerful automated request scripts. The reasoning is simple: among the requests taking part in the sale, the more of them are yours, the higher your probability of success.
These are all means of cheating. But where there is offense there is defense; this is a battle without gunsmoke.
1. One account sending multiple requests at once
Some users use browser plug-ins or other tools to send hundreds of requests or more from their own account the moment the flash sale starts. Such users undermine the fairness of flash sales and snap-ups.
In systems that lack data-safety handling, such requests can cause another kind of damage: certain condition checks get bypassed. Take a simple claim flow: first check whether the user already has a participation record; if not, the claim succeeds; finally, write the participation record. The logic is very simple, but under high concurrency it hides a deep pitfall. Multiple concurrent requests are distributed by the load balancer to several web servers on the intranet. Each first sends a query to storage, and in the gap before any one request has successfully written the participation record, the other requests all get back "no participation record". The logical check is thus at risk of being bypassed.
At the entry point of the program, each account is allowed only 1 accepted request, and all other requests are filtered out. This not only solves the problem of one account sending N requests but also keeps the subsequent logic flow safe. One way to implement it is to write a flag to an in-memory cache service such as Redis (only 1 request is allowed to write successfully, using the optimistic locking provided by WATCH); the request whose write succeeds may continue to participate.
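The "only one request may write the flag" idea can be sketched without a Redis server. In the Python example below, an in-memory set guarded by a lock stands in for the Redis flag so that the code runs on its own; `FlagStore`, `set_if_absent`, and `admit` are hypothetical names, and a real deployment would perform the flag write in the shared cache instead.

```python
import threading


class FlagStore:
    """In-memory stand-in for a shared cache that supports an atomic 'write once' flag."""

    def __init__(self) -> None:
        self._flags = set()
        self._lock = threading.Lock()

    def set_if_absent(self, key: str) -> bool:
        """Return True only for the first caller that writes this key."""
        with self._lock:
            if key in self._flags:
                return False
            self._flags.add(key)
            return True


def admit(store: FlagStore, sale_id: str, account_id: str) -> bool:
    """Let exactly one request per account continue into the sale logic."""
    return store.set_if_absent(f"{sale_id}:{account_id}")


store = FlagStore()
results = [admit(store, "sale-1", "alice") for _ in range(100)]
print(results.count(True))   # only the first of the 100 requests is admitted
```

Because the check and the write happen as one atomic step, the gap between "query" and "write" described above no longer exists for requests from the same account.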
Alternatively, implement a service yourself that puts all requests from the same account into one queue and processes them one after another.
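A minimal sketch of the per-account queue idea follows, assuming a single process and made-up names (`PerAccountQueue`, `claim_once`). It only shows the ordering guarantee; a real service would also need to be shared by all web servers.

```python
from collections import defaultdict, deque


class PerAccountQueue:
    """Holds requests per account and processes each account's requests strictly in order."""

    def __init__(self) -> None:
        self._pending = defaultdict(deque)

    def submit(self, account_id: str, request: dict) -> None:
        self._pending[account_id].append(request)

    def process(self, account_id: str, handler) -> list:
        """Run handler on this account's requests one at a time, oldest first."""
        results = []
        pending = self._pending[account_id]
        while pending:
            results.append(handler(pending.popleft()))
        return results


claimed = set()


def claim_once(request: dict) -> str:
    # Hypothetical handler: check-then-write is safe here because requests
    # from one account never run concurrently.
    account_id = request["account"]
    if account_id in claimed:
        return "rejected"
    claimed.add(account_id)
    return "accepted"


account_queue = PerAccountQueue()
for _ in range(3):
    account_queue.submit("alice", {"account": "alice"})
outcomes = account_queue.process("alice", claim_once)
```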
2. Multiple accounts sending multiple requests at once
Many companies' account registration features were almost unrestricted in their early stages, so registering large numbers of accounts was easy. As a result, specialized "studios" appeared that used automatic registration scripts to accumulate huge numbers of "zombie accounts", tens of thousands or even hundreds of thousands of them, dedicated to all kinds of abuse (this is where "zombie followers" on Weibo come from). For example, if a Weibo campaign holds a prize draw among users who repost it, sending tens of thousands of zombie accounts to repost greatly increases the chance of winning.
Used in flash sales and snap-ups, such accounts work on the same principle. Examples include snap-up purchases of iPhones on the official site and train ticket scalping.
This scenario can be handled by monitoring the request frequency of each machine IP: if an IP is found to be sending requests at a very high rate, you can either show it a CAPTCHA or simply block its requests.
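Per-IP frequency detection can be sketched as a sliding-window counter. The Python example below is one possible shape, not the method of any specific product: `IpRateLimiter`, the limit of 3 requests, and the 1-second window are all illustrative assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding-window limiter: at most `max_requests` per IP within `window_s` seconds."""

    def __init__(self, max_requests: int, window_s: float) -> None:
        self._max = max_requests
        self._window = window_s
        self._hits = defaultdict(deque)   # ip -> timestamps of recent accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False   # too frequent: the caller may show a CAPTCHA or block
        hits.append(now)
        return True


limiter = IpRateLimiter(max_requests=3, window_s=1.0)
burst = [limiter.allow("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3)]
```

In this run the fourth request inside the same second is refused, while a request from a different IP, or from the same IP after the window has moved on, is accepted again.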
3. Multiple accounts sending requests from different IPs
As the saying goes, for every measure there is a countermeasure. Where there is offense there will be defense, and the contest never ends. Once these "studios" discover that you limit the request frequency of a single machine IP, they come up with a "new attack plan" aimed at exactly that: constantly changing IPs.
Some readers may wonder where these random IP services come from. Some organizations hold a batch of independent IPs themselves and build them into a random proxy IP service that they sell to the "studios". Others are shadier: they compromise ordinary users' computers with a Trojan. The Trojan does not disrupt the normal operation of the computer; it does only one thing, which is to forward IP packets, turning the ordinary user's computer into an IP proxy exit. In this way the hackers obtain a large number of independent IPs and set them up as a random IP service, purely to make money.
Frankly, requests in this scenario are basically identical to real user behavior and very hard to tell apart. Tightening restrictions further can easily "hurt" real users. At this point we can usually only limit such requests by raising the business threshold for participation, or clean the accounts out in advance through "data mining" of account behavior.
Zombie accounts also share some common traits: the account numbers often fall in the same number range or are even consecutive, and the accounts tend to be inactive, low-level, and have incomplete profiles. Based on these traits, set an appropriate participation threshold, for example a minimum account level for taking part in the flash sale. Such business measures can also filter out some of the zombie accounts.
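A participation threshold of this kind is ordinary business logic. The Python sketch below shows one possible form; the `Account` fields and the thresholds (`min_level=3`, `min_days_active=7`) are invented for illustration, and real values would have to come from analysis of actual account data.

```python
from dataclasses import dataclass


@dataclass
class Account:
    account_id: str
    level: int
    days_active: int
    profile_complete: bool


def is_eligible(account: Account, min_level: int = 3, min_days_active: int = 7) -> bool:
    """Hypothetical rule: require a minimum level, some activity, and a complete profile."""
    return (
        account.level >= min_level
        and account.days_active >= min_days_active
        and account.profile_complete
    )


regular = Account("u1001", level=5, days_active=120, profile_complete=True)
zombie = Account("u9001", level=1, days_active=0, profile_complete=False)
```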
4. Buying train tickets
Having read this far, do you now understand why you cannot get a train ticket? If you just try to grab tickets honestly, it really is hard. Using multiple accounts, train ticket scalpers occupy a large share of the tickets, and some of the more capable scalpers are a step ahead when it comes to handling CAPTCHAs.
Advanced scalpers use real people to recognize CAPTCHAs while grabbing tickets. A relay software service sits in the middle and displays the CAPTCHA images; people view the images in real time, type in the correct codes, and return them to the relay software. This defeats the protection offered by CAPTCHAs, and there is currently no good solution.
Because train tickets are tied to real-name identity cards, there is also a ticket transfer scheme. Roughly, it works like this: a ticket-grabbing tool is started with the buyer's identity card number and sends requests continuously; the scalper's account then refunds the ticket; and the buyer's identity card immediately succeeds in buying it. When a train is sold out, not many people are watching it, and the scalpers' ticket-grabbing tools are very powerful, so even if we see the refunded ticket we may not be able to beat them to it.
In the end, the scalper has successfully transferred the ticket to the buyer's identity card.
There is no good solution here either. The only approach that comes to mind is "data mining" on account data: scalper accounts also share common traits, such as frequently grabbing and refunding tickets or being unusually active around holidays. Analyze these accounts, then process and screen them further.
III. Data security under high concurrency
We know that when multiple threads write to the same file, a "thread safety" problem arises (when multiple threads run the same piece of code at the same time, the code is thread-safe if every run produces the same result as a single-threaded run and matches expectations). With a MySQL database, its built-in locking mechanism solves the problem well, but MySQL is not recommended in large-scale concurrent scenarios. Flash sales and snap-ups have a further problem, "overselling": if this is not controlled carefully, more items are sold than exist. We have all heard of e-commerce snap-up events where buyers successfully placed an order but the merchant refused to recognize the order as valid and refused to ship. The cause is not necessarily a dishonest merchant; it may be overselling caused by a technical flaw in the system.
1. Causes of overselling
Suppose a snap-up event has only 100 items in total, and at the last moment 99 have been sold and only one is left. At this point several concurrent requests arrive, all of them read the same stock figure (99 sold, one remaining), and all of them pass the stock check, which ends in overselling. (This is the same kind of scenario as the bypassed check described earlier in the article.)
In the diagram for this case, concurrent user B also "succeeds" in the purchase, so more people obtain the item than there are items in stock. This scenario arises very easily under high concurrency.
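The race behind overselling can be replayed deterministically. The Python sketch below does not use real threads; it simply interleaves the steps of two requests by hand, in the order that causes the problem. `Inventory`, `read_stock`, and `commit_purchase` are hypothetical names for illustration.

```python
class Inventory:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.sold_to = []


def read_stock(inventory: Inventory) -> int:
    """Step 1 of a purchase: read the remaining stock."""
    return inventory.stock


def commit_purchase(inventory: Inventory, user: str, observed_stock: int) -> bool:
    """Step 2: decide using the value read earlier, which may already be stale."""
    if observed_stock > 0:
        inventory.stock -= 1
        inventory.sold_to.append(user)
        return True
    return False


inventory = Inventory(stock=1)               # only the last item is left
seen_by_a = read_stock(inventory)            # user A reads 1
seen_by_b = read_stock(inventory)            # user B also reads 1, before A has written
ok_a = commit_purchase(inventory, "A", seen_by_a)
ok_b = commit_purchase(inventory, "B", seen_by_b)

print(ok_a, ok_b, inventory.stock)           # both succeed and the stock goes negative
```

The check and the write are separate steps, so any request that reads between another request's read and write passes the same check.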
2. The pessimistic locking approach
There are many ways to achieve thread safety; let us start the discussion with pessimistic locking.
A pessimistic lock means that data is locked while it is being modified, shutting out modifications from other requests. A request that encounters the lock must wait.
While this approach does solve the thread safety problem, don't forget that our scenario is "high concurrency". There will be a great many such modification requests, each of which has to wait for the "lock"; some threads may never get a chance to acquire it, and their requests die there. Because there are so many such requests, the system's average response time rises, the available connections are exhausted, and the system falls into an abnormal state.
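As a minimal single-process illustration of pessimistic locking, the Python sketch below guards the check and the decrement with one `threading.Lock`. The names `LockedInventory` and `run_sale` are made up, and a lock inside one process is only an analogy for a database or distributed lock shared by many web servers.

```python
import threading


class LockedInventory:
    """Every purchase takes an exclusive lock around check-and-decrement."""

    def __init__(self, stock: int) -> None:
        self._stock = stock
        self._lock = threading.Lock()

    def buy(self) -> bool:
        with self._lock:              # other buyers block here until the lock is free
            if self._stock > 0:
                self._stock -= 1
                return True
            return False

    @property
    def stock(self) -> int:
        return self._stock


def run_sale(stock: int, buyers: int):
    """Start one thread per buyer and return (successful purchases, remaining stock)."""
    inventory = LockedInventory(stock)
    results = []

    def worker() -> None:
        results.append(inventory.buy())   # list.append is thread-safe in CPython

    threads = [threading.Thread(target=worker) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), inventory.stock
```

Calling `run_sale(10, 50)` never sells more than 10 items, but all 50 buyers pass through the lock one at a time, which is exactly the waiting cost described above.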
3. The FIFO queue approach
Well then, let us change the scheme slightly and put requests directly into a queue, processed FIFO (First In, First Out), so that no request is left forever unable to acquire the lock. At this point, doesn't it feel a bit like forcing multithreading into single-threaded processing?
Now the lock problem is solved: all requests are handled in FIFO queue order. But a new problem appears. Under high concurrency there are so many requests that the queue's memory may be "blown up" in an instant, and the system again falls into an abnormal state. Designing a huge in-memory queue is another option, but the rate at which the system processes requests from the queue simply cannot keep up with the rate at which requests flood in. The more requests accumulate in the queue, the worse the web system's average response time becomes, and the system still ends up in an abnormal state.
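The FIFO idea and its memory problem can both be seen in a small Python sketch built on the standard `queue` module. The size cap (`max_pending`) is an added assumption, not something the article prescribes; it is one way to keep the queue from growing without bound, at the price of rejecting requests when it is full. `BoundedRequestQueue` is a hypothetical name.

```python
import queue


class BoundedRequestQueue:
    """FIFO request queue with a hard cap on the number of pending requests."""

    def __init__(self, max_pending: int) -> None:
        self._queue = queue.Queue(maxsize=max_pending)

    def submit(self, user: str) -> bool:
        """Enqueue a request; return False immediately if the queue is full."""
        try:
            self._queue.put_nowait(user)
            return True
        except queue.Full:
            return False

    def process_all(self, stock: int) -> list:
        """Single consumer: serve requests in arrival order until the queue is empty."""
        winners = []
        while True:
            try:
                user = self._queue.get_nowait()
            except queue.Empty:
                break
            if stock > 0:
                stock -= 1
                winners.append(user)
        return winners


requests = BoundedRequestQueue(max_pending=3)
accepted = [requests.submit(user) for user in "ABCDE"]
winners = requests.process_all(stock=2)
```

With a single consumer no lock is needed around the stock, but throughput is limited to what that one consumer can process.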
4. The optimistic locking approach
At this point we can turn to "optimistic locking". Compared with pessimistic locking, optimistic locking is a more relaxed mechanism, mostly implemented as updates carrying a version number. Every request is eligible to modify the data, but each obtains the data's version number; only a request whose version number matches can update successfully, and the others are told the purchase failed. This way we do not need to worry about the queue problem, but CPU overhead increases. On balance, though, this is a better solution.
Many software products and services support optimistic locking; WATCH in Redis is one of them. With this implementation we keep the data safe.
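Version-number optimistic locking can be sketched in a few lines of Python. In this illustration an in-memory object plays the role of the data store, and its internal lock exists only to make the "compare version, then write" step atomic, as a real store would; `VersionedStock` and `update_if_version` are hypothetical names, not the Redis API.

```python
import threading


class VersionedStock:
    """Stock value with a version number; writes succeed only if the version is unchanged."""

    def __init__(self, stock: int) -> None:
        self._stock = stock
        self._version = 0
        self._lock = threading.Lock()   # makes the compare-and-write step atomic

    def read(self):
        with self._lock:
            return self._stock, self._version

    def update_if_version(self, new_stock: int, expected_version: int) -> bool:
        with self._lock:
            if self._version != expected_version:
                return False            # someone else updated first: this request fails
            self._stock = new_stock
            self._version += 1
            return True


store = VersionedStock(stock=1)
stock_a, version_a = store.read()       # user A reads stock 1, version 0
stock_b, version_b = store.read()       # user B reads the same snapshot
ok_a = store.update_if_version(stock_a - 1, version_a)
ok_b = store.update_if_version(stock_b - 1, version_b)

print(ok_a, ok_b, store.read())
```

Both users read the same snapshot, as in the overselling scenario, but only the first write matches the stored version, so user B is told the purchase failed instead of the item being sold twice. No request waits for another; the losers simply fail fast.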
The internet is developing rapidly, and the more users an internet service has, the more common high-concurrency scenarios become. E-commerce flash sales and snap-ups are two typical high-concurrency scenarios. Although the specific technical solutions may vary widely, the challenges encountered are similar, and so are the ideas for solving them.
Xu Hanbin: Large-scale concurrency in web systems: e-commerce flash sales and snap-ups (technology implementation)