Large-Scale Concurrency in Web Systems: E-Commerce Flash Sales and Rush Purchases
E-commerce flash sales and rush purchases are familiar to all of us. From a technical point of view, however, they are a huge test for a Web system. When a Web system receives tens of thousands of requests or more within one second, system optimization and stability become crucial. This time, we will focus on the technical implementation and optimization of flash sales and rush purchases, and along the way reveal the technical reasons why it is never easy to get a train ticket.
I. Challenges brought by large-scale concurrency
In my past work, I once faced a high-concurrency flash sale feature drawing 50,000 requests per second. In the process, the entire Web system ran into many problems and challenges. If the Web system is not optimized in a targeted way, it easily falls into an abnormal state. Let's discuss the optimization ideas and methods together.
1. Reasonable design of the request interface
A flash sale or rush-purchase page is usually divided into two parts: one is the static content (HTML and so on), the other is the back-end request interface that the flash sale itself goes through.
Generally, the static content is deployed through a CDN, so the pressure there is low; the real bottleneck is the back-end request interface. This back-end interface must support highly concurrent requests and, just as importantly, return the result to the user as quickly as possible. To achieve that, the interface's back-end storage is best served by memory-level operations. Pointing it directly at storage such as MySQL is not appropriate; if the business genuinely needs that kind of complexity, asynchronous writing is recommended.
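As an illustration of asynchronous writing, here is a minimal sketch in Python, assuming a local Redis instance as the memory-level buffer; the queue name `order_queue` and the `persist_order()` helper are hypothetical names introduced for this example, not part of any specific system.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def handle_order(user_id: str, item_id: str) -> None:
    """Fast path: append the order to an in-memory queue and return at once."""
    r.rpush("order_queue", json.dumps({"user": user_id, "item": item_id}))

def persist_order(order: dict) -> None:
    """Hypothetical slow write, e.g. an INSERT into a MySQL orders table."""
    ...

def persist_worker() -> None:
    """Background worker: drains the queue into durable storage at its own pace."""
    while True:
        _, raw = r.blpop("order_queue")  # blocks until an order arrives
        persist_order(json.loads(raw))
```

The user-facing request only pays for the in-memory append; the expensive database write happens later, off the critical path.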
Of course, some flash sale systems adopt "lagging feedback", meaning the user does not learn the outcome right away and only later sees on the page whether the purchase succeeded. This, however, is a "lazy" behavior that gives the user a poor experience and is easily perceived as a "black-box operation".
2. The challenge of high concurrency: be "fast"
We usually use QPS (Queries Per Second, the number of requests processed per second) to measure the throughput of a Web system, and this indicator is critical when handling tens of thousands of concurrent requests per second. For example, assume that the average response time of a business request is 100 ms, that the system has 20 Apache Web servers, and that MaxClients is set to 500 (the maximum number of Apache connection processes).
Then the theoretical peak QPS of our Web system is (an idealized calculation):

20 × 500 / 0.1 = 100,000 (100,000 QPS)
Wow, our system seems very powerful: it can handle 100,000 requests in one second, so the 50,000/s flash sale looks like a "paper tiger". Of course, things are not that ideal. In a high-concurrency scenario, the machines are all in a high-load state, and the average response time increases greatly.
For the Web server, the more connection processes Apache spawns, the more context switches the CPU must handle and the more CPU it consumes, which drives the average response time up. Therefore, the MaxClients value above must be weighed against CPU, memory, and other hardware factors; it is definitely not a case of the bigger the better. You can use the `ab` benchmarking tool that ships with Apache to test and find a suitable value.

Next, we choose Redis, a memory-level storage option, because in the high-concurrency state the storage's response time is crucial. Network bandwidth is also a factor, but such request packets are generally small and rarely become the bottleneck. It is also rare for the load balancer to become the system bottleneck, so we will not discuss that here.
Then the problem arises. Assume that in the 50,000/s high-concurrency state, the average response time of our system rises from 100 ms to 250 ms (or, in practice, even more):
20 × 500 / 0.25 = 40,000 (40,000 QPS)
So our system is left with only 40,000 QPS. Facing 50,000 requests per second, there is a gap of 10,000 in the middle.
And this is where the real nightmare begins. Take a highway junction as an analogy: five cars pass through it per second and the highway works normally. Then suddenly the junction can only let four cars through per second while the traffic volume stays the same, and the result is a huge traffic jam (five lanes abruptly becoming four).
Similarly, within some one second, all 20 × 500 available connection processes are working at full capacity, yet 10,000 new requests still arrive with no connection process available, and the system predictably falls into an abnormal state.
In fact, something similar also happens in normal, non-high-concurrency business scenarios: some request interface has a problem and its response time becomes extremely slow; the response time of the whole Web request is dragged out, the Web server's available connections gradually fill up, and the other, normal business requests find no connection process available.
The scarier problem is a behavioral trait of users: the less available the system is, the more frequently users click. This vicious cycle eventually leads to an "avalanche" (one Web machine goes down, its traffic is spread over the other, still-normal machines, which are then overwhelmed in turn, and the cycle repeats), dragging down the entire Web system.
3. Restart and overload protection
If the system has gone through an avalanche, rashly restarting services will not solve anything; the most common symptom is crashing again immediately after startup. At that point, it is best to reject traffic at the entry layer first and then restart. If a service such as Redis or Memcached is down as well, pay attention to "warming up" the cache during the restart, which may take quite a while.
In flash sale and rush-purchase scenarios, the traffic often exceeds whatever our system prepared for or imagined, so overload protection is necessary. When a full-load state is detected, rejecting requests is itself a protective measure. Filtering on the front end is the easiest way, but it is the kind of behavior users universally condemn. Placing the overload protection at the CGI entry layer is more appropriate, so that the client's requests can be rejected quickly.
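A minimal sketch of entry-layer overload protection, assuming a Redis counter shared across Web servers; the key name and the 10,000-in-flight threshold are illustrative assumptions, not tuned values.

```python
import redis

r = redis.Redis()
MAX_INFLIGHT = 10000  # assumed full-load capacity of the backend

def accept_request() -> bool:
    """Reject fast when the system is already at full load."""
    inflight = r.incr("inflight_requests")
    if inflight > MAX_INFLIGHT:
        r.decr("inflight_requests")
        return False  # return an error page to the client immediately
    return True

def finish_request() -> None:
    """Call when a request completes, freeing its slot."""
    r.decr("inflight_requests")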
II. Cheating methods: attack and defense
A flash sale or rush purchase receives a "massive" number of requests, but a lot of that volume is padding. To "grab the goods", many users employ helper tools such as ticket-grabbing software to send as many requests to the server as they can, and more advanced users write powerful automatic request scripts. The reasoning is simple: the more of the participating requests are yours, the higher your probability of success.
These are all "cheating methods". But where there is "attack" there is "defense"; this is a battle without gunsmoke.
1. One account sends multiple requests at a time
Some users use browser plug-ins or other tools to send hundreds or more requests with their own account the moment the flash sale begins. In practice, such users undermine the fairness of the flash sale.
In systems that lack proper data-safety handling, this kind of request can also cause another type of damage: certain judgment conditions can be bypassed. Take a simple claim logic as an example: first check whether the user has a participation record; if not, the claim succeeds, and then a participation record is written. The logic is simple, but in a high-concurrency scenario it hides a deep vulnerability. Multiple concurrent requests are distributed by the load balancer to several Web servers on the intranet; they all first send the query to storage, and within the window before any one of them successfully writes a participation record, the other requests all read back "no participation record". Here, the logical check is at risk of being bypassed.
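To make the flaw concrete, here is a deliberately unsafe check-then-write sketch in Python; the in-process set stands in for real storage, purely for illustration.

```python
records = set()  # stands in for the participation-record storage

def has_record(user_id: str) -> bool:
    return user_id in records

def write_record(user_id: str) -> None:
    records.add(user_id)

def claim(user_id: str) -> bool:
    if has_record(user_id):   # several concurrent requests may all
        return False          # read "no record" here...
    write_record(user_id)     # ...and then all write, so the check
    return True               # is bypassed under high concurrency
```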
Solution:
At the program entry point, accept only one request per account and filter out the others. This not only solves the problem of the same account sending N requests, but also secures the subsequent logic. The implementation can write a flag through a memory-cache service such as Redis (only one request is allowed to write it successfully, for example by combining it with the optimistic-locking feature of WATCH); whoever writes the flag successfully may go on to participate.
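A minimal sketch of the flag idea, assuming a local Redis instance; this version uses SET with the NX option (only the first writer succeeds) rather than WATCH, and the key name and 60-second expiry are illustrative values.

```python
import redis

r = redis.Redis()

def try_enter(account_id: str) -> bool:
    """Only the first request from this account sets the flag; the rest fail."""
    return bool(r.set(f"seckill:flag:{account_id}", 1, nx=True, ex=60))
```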
Alternatively, implement a small service yourself that puts the requests of the same account into a queue and processes them one at a time.
2. Multiple accounts send multiple requests at a time
In their early stages, the account-registration features of many companies had almost no restrictions, so it was easy to register many accounts. As a result, special "studios" emerged that accumulated large numbers of "zombie accounts" by writing automatic registration scripts, tens of thousands or even hundreds of thousands of them (this is also where the "zombie followers" on Weibo come from). For example, using tens of thousands of zombie accounts in a Weibo retweet lottery can greatly increase the probability of winning.
Such accounts are also used in flash sales and rush purchases, for example the rush for iPhones on the official website, or by train-ticket scalpers.
Solution:
In this scenario, the problem can be solved by detecting the request frequency of each machine's IP address. If an IP's request frequency is found to be very high, show it a CAPTCHA or simply block its requests.
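A minimal sketch of per-IP frequency detection, assuming a local Redis instance; the 10-second window and 100-request limit are illustrative values, not tuned numbers.

```python
import redis

r = redis.Redis()

def allow_ip(ip: str, limit: int = 100, window: int = 10) -> bool:
    """Count requests per IP in a fixed window; over the limit -> CAPTCHA or block."""
    key = f"seckill:ip:{ip}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window)  # start the window on the first hit
    return count <= limit
```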
3. Multiple accounts sending requests from different IP addresses
As the saying goes, however high virtue climbs, vice climbs higher: where there is attack there is defense, and the contest never ends. After discovering that single-machine IP request frequency was being controlled, these "studios" came up with a "new attack scheme" for this scenario: constantly changing the IP address.
You may be curious where these random-IP services come from. Some organizations occupy batches of independent IP addresses and build random proxy-IP services to sell to these "studios". A darker method is for hackers to compromise ordinary users' computers with Trojans. Such a Trojan does not disrupt the computer's normal operation; it does only one thing: forward IP packets, turning ordinary users' computers into IP proxy exits. In this way, the hacker obtains a large number of independent IPs and builds random-IP services from them to make money.
Solution:
To be honest, the requests in this scenario are basically identical to the behavior of real users, and distinguishing them is very difficult. Tightening restrictions further easily causes "collateral damage" to real users. At this point, such requests can usually only be limited by setting a high business threshold, or by screening the accounts out in advance through "data mining" of account behavior.
Zombie accounts do share some common features: for example, they may belong to the same number range or even be numbered consecutively, with low activity, low level, and incomplete profile information. Based on these features, one can set an appropriate participation threshold, such as limiting the account level allowed to join the flash sale. Such business rules can also filter out a portion of the zombie accounts.
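As an illustration of such a business-threshold filter, here is a minimal sketch; the Account fields and the specific thresholds are invented for the example and would have to come from real account data.

```python
from dataclasses import dataclass

@dataclass
class Account:
    level: int              # account level
    days_active: int        # account age / activity proxy
    profile_complete: bool  # whether profile information is filled in

def eligible(acc: Account) -> bool:
    """Filter obvious zombie accounts by level, age, and profile completeness."""
    return acc.level >= 3 and acc.days_active >= 30 and acc.profile_complete
```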
4. The rush for train tickets
At this point, do you see why you can't get a train ticket? If you only grab tickets honestly, it really is very hard. Through the multiple-account approach, train-ticket scalpers occupy many of the slots, and some capable scalpers have "superior" means when it comes to handling CAPTCHAs.
Advanced scalpers do not crack the CAPTCHA at all: they use real people to recognize it. They build an intermediary software service that displays the CAPTCHA images; a real person browses the image, fills in the genuine code, and returns it through the intermediary software. With this, the protection the CAPTCHA provided is rendered void, and at present there is no good solution to it.
Because a train ticket is tied to the real name on an ID card, there is also a "ticket transfer" operation. The general approach is to use the buyer's ID number to start a ticket-grabbing tool that keeps sending requests, while the scalper's account selects a refund on the ticket it holds; the scalper's buyer then grabs the refunded ticket with his own ID number. When a carriage has no tickets left, not many people are watching it; moreover, the scalper's grabbing tools are powerful, so even if we happen to see a refunded ticket, we may not beat them to it.

In the end, the scalper has successfully "transferred" the train ticket to the buyer's ID.
Solution:
There is no good solution here either. The only thing that comes to mind is "data mining" of account data: these scalper accounts also share common features, such as frequently grabbing and refunding tickets and being unusually active around holidays. Analyzing them allows further screening and handling.
III. Data safety under high concurrency
We know that when multiple threads write to the same file, there is a "thread safety" problem (multiple threads run the same piece of code at the same time; if every run produces the same result as a single-threaded run, the results match expectations and the code is thread-safe). For a MySQL database we can rely on its built-in locking mechanism, but MySQL is not recommended in large-scale concurrency scenarios. Flash sale and rush-purchase scenarios have another problem, "overselling": if it is controlled carelessly, more goods are sold than exist. We have all heard of e-commerce companies running flash sales where, after buyers placed their orders, the merchant refused to acknowledge the orders as valid and refused to ship. The problem there may not be seller fraud at all, but the oversell risk at the technical level of the system.
1. Causes of overselling
Assume that in a rush-purchase scenario we have 100 items in total. At the final moment, 99 items have been consumed and only the last one remains. At that instant the system receives several concurrent requests; each of these requests reads a consumed count of 99, that is, one item remaining, so they all pass the remaining-stock check, which ultimately leads to overselling. (This is the same kind of scenario described earlier.)
In this scenario, concurrent user B is also "successful", so one more person obtains the item than there are items. Such a scenario arises very easily under high concurrency.
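A minimal sketch that reproduces this race with two threads; the sleep() deliberately widens the read-check-write window so the oversell shows up reliably.

```python
import threading
import time

stock = 1  # the last remaining item

def buy() -> None:
    global stock
    if stock > 0:          # both threads read stock == 1 ...
        time.sleep(0.01)   # ... inside the race window ...
        stock -= 1         # ... and both decrement: stock becomes -1

threads = [threading.Thread(target=buy) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(stock)               # prints -1: one item has been oversold
```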
2. The pessimistic-locking approach
There are many ways to solve thread safety, and we can start from the direction of "pessimistic locking".
Pessimistic locking means that while data is being modified, it is held in a locked state, and modifications from outside requests are rejected; those requests must wait in line for the lock.
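A minimal single-process sketch of the idea using a mutex; a real deployment with multiple Web servers would need a distributed lock (for example, one built on Redis), which this example does not attempt.

```python
import threading

stock = 1
stock_lock = threading.Lock()

def buy() -> bool:
    global stock
    with stock_lock:       # all other requests block here and wait
        if stock > 0:
            stock -= 1     # only the lock holder may modify the stock
            return True
        return False       # sold out
```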
The solution above does solve the thread-safety problem, but do not forget that our scenario is "high concurrency". There will be very many such modification requests, each waiting its turn for the "lock"; some requests may never get the chance to grab it and will die waiting there. Meanwhile, the pile-up of such requests instantly drives up the system's average response time, the available connections are exhausted, and the system falls into an abnormal state.
3. The FIFO-queue approach
Well then, let's modify the scenario slightly: we put the requests directly into a queue and adopt FIFO (First In, First Out), so that no request is forever starved of the "lock". Does this look a bit like forcibly turning multithreading into a single thread?
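A minimal sketch of the FIFO idea: the Web layer only enqueues, and a single worker consumes requests in arrival order, so stock updates are serialized; the queue and variable names are illustrative.

```python
import queue
import threading

stock = 100
req_queue: "queue.Queue[str]" = queue.Queue()

def worker() -> None:
    """Single consumer: processes requests strictly first in, first out."""
    global stock
    while True:
        user = req_queue.get()  # FIFO order
        if stock > 0:
            stock -= 1          # one consumer -> no race on stock
        req_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
req_queue.put("user-42")        # the Web layer just enqueues and returns
```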
This does solve the lock problem: all requests are handled in "first in, first out" queue order. But a new problem arises. In a high-concurrency scenario, the sheer number of requests can burst the queue's memory in an instant, and the system again ends up in an abnormal state. Even if we design an enormous memory queue, the rate at which the system processes requests out of the queue cannot compare with the rate at which requests pour in: the more the queue backs up, the higher the Web system's average response time climbs, and the system is still abnormal.
4. The optimistic-locking approach
Now we can discuss the idea of "optimistic locking". Optimistic locking uses a looser mechanism than pessimistic locking, mostly implemented as updates with a version number. This means every request for this piece of data is qualified to attempt the modification, but each also obtains the data's version number; only the request whose version number still matches can update successfully, and the others are returned a purchase failure. This way we need not worry about queues at all, though it does raise the CPU's computation overhead. On balance, however, it is a good solution.
Much software and many services support optimistic-locking features; for example, WATCH in Redis is one of them. With it, data safety can be guaranteed.
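A minimal sketch of the WATCH/MULTI/EXEC pattern with the redis-py client, assuming a local Redis where the stock counter was initialized beforehand (e.g. r.set("stock", 100)); the key name is illustrative.

```python
import redis

r = redis.Redis()

def buy() -> bool:
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch("stock")           # watch acts as the "version"
                remaining = int(pipe.get("stock"))
                if remaining <= 0:
                    pipe.unwatch()
                    return False              # sold out
                pipe.multi()
                pipe.decr("stock")
                pipe.execute()                # fails if stock changed meanwhile
                return True
            except redis.WatchError:
                continue                      # another request won; retry
```

If any other client modifies "stock" between the WATCH and the EXEC, the transaction is aborted and this request simply retries (or is failed), so the stock can never go negative.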
IV. Summary
The Internet is developing very fast, and the more users Internet services have, the more common high-concurrency scenarios become. E-commerce flash sales and rush purchases are two typical high-concurrency scenarios on the Internet. Although the specific technical solutions for each may differ greatly, the challenges they face are similar, and so, therefore, are the approaches to solving them.
Reprinted from: http://www.csdn.net/article/2014-11-28/2822858