[Editor's note] Xu Hanbin spent more than four years in technical R&D at Alibaba and Tencent, where he was responsible for upgrading and refactoring web systems handling over a hundred million requests a day. He is currently a startup founder at Xiaoman Technology, building the technology behind its SaaS services.
Flash sales ("seckill") and rush purchases on e-commerce sites are nothing new to us as shoppers. From a technical standpoint, however, they are an enormous test for a web system. When a web system receives tens of thousands of requests or more in a single second, optimization and stability become critical. This article focuses on how flash sales are implemented and optimized, and along the way explains, from a technical angle, why grabbing a train ticket is never easy.
I. The challenge of large-scale concurrency
In my past work I faced a flash-sale feature that had to handle 50,000 requests per second. In the process, the whole web system ran into many problems and challenges. If a web system is not optimized for this scenario, it easily falls into an abnormal state. Let's walk through the ideas and methods behind the optimization.
1. Design the request interface sensibly
A flash-sale or rush-purchase page usually divides into two parts: the static HTML and other static content, and the backend request interface that actually carries out the flash sale.
Static HTML and similar assets are usually deployed through a CDN, so they are generally not under pressure; the real bottleneck is the backend request interface. This backend interface must support highly concurrent requests, and at the same time it must be as fast as possible, returning the result to the user in the shortest possible time. To achieve that speed, the interface's backing store should operate at memory level; storage that hits MySQL directly is not appropriate. If complex business logic requires durable writes, asynchronous writes are recommended.
Of course, some flash sales use "delayed feedback": the user does not learn the result immediately, and only some time later can see on the page whether the purchase succeeded. This "lazy" behavior, however, gives users a poor experience and is easily perceived as a "black box operation".
2. The challenge of high concurrency: be sure to be "fast"
We usually measure the throughput of a web system in QPS (queries per second), and for a high-concurrency scenario of tens of thousands of requests per second, this metric is critical. For example, suppose the average response time of one business request is 100 ms, and the system has 20 Apache web servers, each configured with MaxClients = 500 (the maximum number of Apache connection processes).
The theoretical peak QPS of our web system is then (an idealized calculation):
20*500/0.1 = 100000 (100,000 QPS)
Huh? Our system seems strong: 100,000 requests handled in one second, so a 50,000/s flash sale looks like a "paper tiger". Reality, of course, is not so ideal. Under high concurrency the machines are under heavy load, and at that point the average response time rises sharply.
As far as the web server is concerned, the more connection processes Apache opens, the more context switches the CPU must handle, which adds CPU overhead and directly increases the average response time. So the MaxClients value is not "the more the better"; it should be set according to CPU, memory, and other hardware factors. You can test it with ab (ApacheBench, shipped with Apache) and pick a suitable value. We then choose Redis as the memory-level store, which is critical under high concurrency. Network bandwidth is also a factor, but these request packets are generally small and rarely become the bottleneck. Load balancing is rarely the system bottleneck either, so we will not discuss it here.
So the question becomes: suppose that under a high-concurrency load of 50,000 requests per second, our system's average response time rises from 100 ms to 250 ms (in practice, even more):
20*500/0.25 = 40000 (40,000 QPS)
Our system is left with 40,000 QPS. Facing 50,000 requests per second, there is a gap of 10,000 in the middle.
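The two capacity calculations above can be captured in one idealized formula (a simplification: real capacity also depends on CPU, memory, and I/O; the numbers are the example figures from this article):

```python
def peak_qps(servers: int, workers_per_server: int, avg_response_s: float) -> float:
    """Idealized peak throughput: total worker slots divided by how long each request holds a slot."""
    return servers * workers_per_server / avg_response_s

# 20 Apache servers, MaxClients = 500, 100 ms average response time
assert peak_qps(20, 500, 0.1) == 100_000
# Under heavy load the response time climbs to 250 ms and capacity falls below the 50,000/s demand
assert peak_qps(20, 500, 0.25) == 40_000
```

The model makes the key lesson visible: capacity is inversely proportional to response time, so a 2.5x slowdown under load cuts throughput by the same factor.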
And this is where the real nightmare begins. Picture a highway junction: five cars arrive per second and five cars pass per second, so the junction runs normally. Suddenly the junction can only let four cars through per second while traffic keeps arriving at the same rate; the result is bound to be a huge jam (it feels like five lanes suddenly narrowing to four).
Similarly, within one second, the 20 × 500 available connection processes are all fully occupied, yet 10,000 new requests still arrive with no connection process to serve them, and the system falls into an abnormal state, as expected.
Something similar happens even in normal, non-high-concurrency scenarios: one business interface develops a problem and responds very slowly, dragging out the response time of every web request, gradually exhausting the web server's available connections, and leaving other, perfectly normal business requests without a connection process.
The scarier problem is a behavioral trait of users: the less available the system is, the more frequently they click. This vicious circle eventually leads to an "avalanche": one web machine goes down, its traffic spills over onto the other working machines, causing them to go down in turn, and the cascade brings down the entire web system.
3. Restarts and overload protection
If the system "avalanches", hastily restarting services will not solve the problem. The most common phenomenon is that the service hangs again immediately after coming up. The right approach is to reject traffic at the entry layer first, and then restart. If a Redis/Memcache service has also gone down, pay attention to "warming up" the cache when restarting, which may well take a long time.
In flash-sale scenarios, traffic often exceeds anything the system was prepared for or that we imagined, so overload protection is essential. Rejecting requests once the system is detected to be at full load is itself a protective measure. Filtering on the front end is the easiest, but users will "condemn" that behavior. It is more appropriate to put overload protection at the CGI entry layer, so the overloaded system can quickly return rejections directly to clients.
II. The means of cheating: offense and defense
The "massive" traffic that flash sales receive is in fact heavily padded. Many users, in order to "grab" the merchandise, use "ticket-brushing tools" and similar aids to send as many requests as possible to the server, and a subset of advanced users write powerful automated request scripts. The rationale is simple: the larger the share of the flash-sale requests that are yours, the higher your probability of success.
These are "cheating means", but where there is offense there is defense; this is a battle without gun smoke.
1. Same account, many requests at once
Some users use browser plug-ins or other tools to send hundreds of requests, or more, from their own account the moment the flash sale starts. Such users destroy the fairness of the sale.
These requests can also cause another kind of damage in systems without proper data-safety handling: some judgment conditions get bypassed. Consider a simple claim logic: first check whether the user already has a participation record; if not, the claim succeeds; finally, write the participation record. The logic is very simple, but under high concurrency it has a deep vulnerability. Multiple concurrent requests are distributed by the load balancer to several web servers on the intranet, each of which first issues the query. In the window of time before one request succeeds in writing the participation record, the other requests all read "no participation record". Here lies the risk of the logical check being bypassed.
Response plan:
At the program's entry point, allow each account to have only one request accepted, and filter out the rest. This not only solves the problem of one account sending N requests, but also secures the subsequent logic. In implementation, you can write a flag bit through a memory cache service such as Redis (allow only one request to write it successfully, for example combined with the optimistic-lock semantics of WATCH); whoever writes the flag successfully may proceed to participate.
Alternatively, implement a service yourself that puts all requests from the same account into a queue, processing one before taking the next.
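The flag-bit idea can be sketched as follows. To keep the sketch self-contained, `FlagStore` is a tiny in-memory stand-in for Redis (the real equivalent would be something like `SET key 1 NX EX 60`, i.e. set-if-absent with an expiry); the key name is purely illustrative:

```python
class FlagStore:
    """In-memory stand-in for Redis: set_if_absent mimics `SET key value NX`."""
    def __init__(self):
        self._data = {}

    def set_if_absent(self, key, value):
        if key in self._data:
            return False          # someone already wrote the flag: filter this request out
        self._data[key] = value
        return True               # first writer wins and may participate

def try_enter_seckill(store, account_id):
    # Only the first request for a given account wins the flag; all others are rejected.
    return store.set_if_absent("seckill:entered:%s" % account_id, 1)

store = FlagStore()
results = [try_enter_seckill(store, "user42") for _ in range(100)]
assert results.count(True) == 1   # 100 requests from one account, exactly one gets through
```

A different account still gets its own flag, so fairness between accounts is preserved; only duplicates from the same account are dropped.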
2. Multiple accounts, many requests at once
Many companies placed almost no limits on account registration in their early days, which made it easy to register vast numbers of accounts. This gave rise to specialized "studios" that, by writing automatic registration scripts, accumulated huge stocks of "zombie accounts", tens of thousands or even hundreds of thousands of them, dedicated to all kinds of brushing behavior (this is the origin of Weibo's "zombie followers"). For example, if there is a retweet lottery on Weibo, using tens of thousands of zombie accounts to join the retweeting can greatly improve the odds of winning.
Used in flash sales, these accounts work on the same principle, for instance in the rush for iPhones on the official site, or by train-ticket scalpers.
Response plan:
This scenario can be handled by detecting the request frequency of each machine IP. If an IP is found to be making high-frequency requests, you can show it a captcha or simply ban its requests:
The core purpose of a pop-up captcha is to identify real users. That is why site captchas often look like "dancing ghosts" that we sometimes simply cannot read: distorting the image keeps it from being easily recognized, because powerful "automatic scripts" can otherwise recognize the characters in the image and fill in the captcha automatically. Some more innovative captchas actually work better, for instance asking you a simple question or having you perform a simple operation (like the Baidu Tieba captcha). Banning the IP outright is somewhat crude, because some real users on the same network share exactly the same egress IP, so there may be "collateral damage". Still, this measure is simple and efficient, and depending on the actual scenario it can achieve good results.
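The frequency detection itself can be sketched as a sliding-window counter per IP (a minimal illustration; the limit and window values are made up for the example, and a production version would live in shared storage such as Redis rather than process memory):

```python
import time
from collections import defaultdict, deque

class IpRateLimiter:
    """Sliding window: allow at most `limit` requests per `window_s` seconds per IP."""
    def __init__(self, limit, window_s):
        self.limit = limit
        self.window = window_s
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] >= self.window:
            q.popleft()                  # drop hits that fell out of the window
        if len(q) >= self.limit:
            return False                 # over threshold: show a captcha or reject
        q.append(now)
        return True

limiter = IpRateLimiter(limit=5, window_s=1.0)
# 10 requests from one IP, 100 ms apart: the first 5 pass, the rest are flagged
decisions = [limiter.allow("1.2.3.4", now=0.1 * i) for i in range(10)]
assert decisions == [True] * 5 + [False] * 5
```

Returning `False` here would be the trigger point for the captcha or the outright ban discussed above.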
3. Multiple accounts, different IPs sending different requests
As the saying goes: as virtue rises one foot, vice rises ten. Where there is attack there is defense, and neither side rests. Once these "studios" discovered the single-IP frequency controls, they devised a "new attack plan" for this scenario: constantly changing IPs.
Some readers wonder where these random-IP services come from. Some agencies hold blocks of independent IPs themselves and build random proxy-IP services on top, renting them to the "studios" for a fee. Darker still, some use trojans to compromise ordinary users' computers. The trojan does not damage the machine's normal operation; it does only one thing: forward IP packets, turning the user's computer into a proxy egress. In this way a hacker obtains a large number of independent IPs and builds a random-IP service out of them, purely to make money.
Response plan:
To be honest, requests in this scenario are already basically indistinguishable from real user behavior, and telling them apart is very difficult. Further restrictions easily cause "collateral damage" to real users. At this point, usually all you can do is limit requests by raising the business threshold, or clean the accounts up in advance through "data mining" of account behavior.
Zombie accounts do share some common traits: they often belong to consecutive or clustered number ranges, have low activity, low levels, incomplete profile data, and so on. Based on these traits you can set appropriate participation thresholds, for example limiting which account levels may join the flash sale. Such business-level measures can also filter out some of the zombie accounts.
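Such a business-level filter might score accounts against the traits listed above. This is a hedged sketch only: the field names and thresholds are invented for illustration, and a real system would tune them by mining actual account data:

```python
def looks_like_zombie(account: dict) -> bool:
    """Heuristic score over the zombie-account traits named in the text (thresholds illustrative)."""
    score = 0
    if account.get("level", 0) < 2:              score += 1  # low account level
    if account.get("active_days_90d", 0) < 3:    score += 1  # barely active
    if not account.get("profile_complete"):      score += 1  # incomplete profile data
    if account.get("sequential_id_block"):       score += 1  # registered in a numbered batch
    return score >= 3   # require several traits to reduce collateral damage to real users

assert looks_like_zombie(
    {"level": 1, "active_days_90d": 0, "profile_complete": False, "sequential_id_block": True})
assert not looks_like_zombie(
    {"level": 5, "active_days_90d": 40, "profile_complete": True})
```

Requiring several traits at once, rather than any single one, is the same "raise the threshold, avoid accidental injury" trade-off the text describes.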
4. Buying train tickets
Having read this far, do you now understand why you cannot get a train ticket? If you merely queue up honestly, getting a ticket really is hard. Using the multiple-accounts approach, train-ticket scalpers occupy a great many tickets, and some of the more capable scalpers are "a cut above" when it comes to handling captchas.
High-end scalpers use real humans to recognize captchas while brushing tickets: a relay service in the middle displays the captcha image, a real person views it and types in the genuine code, and the answer is returned to the relay software. This nullifies the protection the captcha provides, and at present there is no good solution.
Because train tickets are tied to ID cards, there is also a "ticket transfer" mode of operation. Roughly, it works like this: the buyer's ID number is loaded into a ticket-grabbing tool that keeps sending requests; the scalper's account then chooses to refund its ticket, and the grabbing tool immediately buys that ticket back under the buyer's own ID. When a train is sold out, few people are still watching, and the scalpers' grabbing tools are very powerful, so even if we happen to spot a refunded ticket, we may well not beat them to it.
In the end, the scalper has smoothly transferred the train ticket to the buyer's ID card.
Solution:
Here too there is no good solution. The only thing that comes to mind is "data mining" on account data: scalper accounts have some common traits, such as frequently grabbing tickets and refunding them, or being unusually active around holidays. Analyze them, then process and screen further.
III. Data security under high concurrency
We know that when multiple threads write to the same file there is a "thread safety" problem (code is thread-safe if running it on multiple threads simultaneously produces the same result as running it single-threaded, matching expectations). With a MySQL database you can use its built-in locking mechanisms to solve this, but MySQL is not recommended in large-scale concurrency scenarios. Flash-sale scenarios have another problem: "overselling". If this is controlled carelessly, more units get sold than exist. We have all heard of rush-purchase events where the buyer successfully placed an order but the merchant refused to recognize it as valid and declined to ship. The problem here is not necessarily merchant treachery; it may be an overselling risk at the technical level of the system.
1. The cause of overselling
Suppose in a rush-purchase scenario we have 100 items in total, and at the final moment 99 have been consumed, leaving only the last one. At that instant the system receives several concurrent requests, every request in the batch reads the remaining stock as 1, and they all pass the stock check, resulting in overselling. (This is the same race window described earlier in the article.)
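The race can be shown deterministically by separating the "read" from the "check and write", which is exactly what concurrent requests do to one another. This is a contrived single-threaded sketch of the interleaving, not real server code:

```python
stock = {"remaining": 1}   # the last item

def purchase_with_stale_read(snapshot: int) -> bool:
    """Flawed check-then-act: the check uses a snapshot read *before* the write."""
    if snapshot > 0:               # both requests saw remaining == 1 at read time
        stock["remaining"] -= 1
        return True
    return False

# Two concurrent requests both read the stock before either one writes it back.
snap_a = stock["remaining"]
snap_b = stock["remaining"]
assert purchase_with_stale_read(snap_a)    # request A succeeds
assert purchase_with_stale_read(snap_b)    # request B also "succeeds" -- oversold
assert stock["remaining"] == -1            # one more item sold than existed
```

The bug is not in either request individually; it is the gap between reading the stock and writing it back, which the locking approaches below close in different ways.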
In the diagram above (the illustration from the original article), the concurrent user B also "grabs" the item successfully, letting one extra person obtain the product. This scenario appears very easily under high concurrency.
2. The pessimistic-locking approach
There are many ways to solve thread-safety problems; we can begin the discussion from the direction of "pessimistic locking".
Pessimistic locking means that while modifying the data, we hold it in a locked state and exclude outside modification requests. Anything that encounters the locked state must wait.
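A minimal sketch of the pessimistic approach, using a process-local `threading.Lock` to stand in for whatever exclusive lock the real system would use (a database row lock, a distributed lock, etc.):

```python
import threading

stock = {"remaining": 100}
stock_lock = threading.Lock()
results = []

def purchase():
    # Pessimistic locking: hold an exclusive lock across the whole read-check-write
    # section, so no other request can read a stale stock value in between.
    with stock_lock:
        ok = stock["remaining"] > 0
        if ok:
            stock["remaining"] -= 1
    results.append(ok)

threads = [threading.Thread(target=purchase) for _ in range(300)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert stock["remaining"] == 0          # never oversold
assert results.count(True) == 100       # exactly 100 of 300 requests succeed
```

Correctness is preserved, but every request serializes on the same lock, which is precisely the waiting problem the next paragraph raises.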
The solution above does solve the thread-safety problem, but do not forget that our scenario is "high concurrency". That is, there are very many such modification requests, each waiting its turn for the "lock"; some threads may never get a chance to grab it, and those requests will die waiting there. Meanwhile, with so many requests outstanding, the system's average response time shoots up in an instant, available connections are exhausted as a consequence, and the system falls into an abnormal state.
3. The FIFO-queue approach
Well then, let's modify the scenario above slightly: put the requests into a queue and process them FIFO (First In, First Out), so that no request is left forever unable to obtain the lock. Seeing this, doesn't it feel a bit like forcibly turning multithreading into a single thread?
We have now solved the lock problem: all requests are handled through a "first in, first out" queue. But a new problem arrives: under high concurrency, with so many requests, the queue's memory may be "burst" in an instant, and the system again falls into an abnormal state. Designing a very large memory queue is one option, but the speed at which the system consumes requests from the queue cannot compare with the number madly flooding in. The more requests accumulate in the queue, the longer the web system's average response time grows, and the system is still abnormal.
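The queue idea, including the bounded-memory concern, can be sketched with the standard library's `queue.Queue` (the queue size and request counts are illustrative; a real deployment would use a separate queue service):

```python
import queue
import threading

requests = queue.Queue(maxsize=1000)   # bounded: a full queue rejects rather than eats memory
stock = {"remaining": 100}
results = []

def worker():
    # A single consumer serializes all purchases: strict FIFO order, no lock contention.
    while True:
        user = requests.get()
        if user is None:               # sentinel: shut the worker down
            break
        ok = stock["remaining"] > 0
        if ok:
            stock["remaining"] -= 1
        results.append((user, ok))

t = threading.Thread(target=worker)
t.start()
for u in range(300):                   # 300 buyers race for 100 items
    try:
        requests.put_nowait(u)         # producers never block; overflow is an explicit failure
    except queue.Full:
        results.append((u, False))     # queue burst: reject immediately instead of piling up
requests.put(None)
t.join()
assert stock["remaining"] == 0         # single consumer can never oversell
```

With only one consumer the oversell race is gone by construction, but so is parallelism, which is the throughput trade-off the paragraph above describes.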
4. The optimistic-locking approach
At this point we can discuss the idea of "optimistic locking". Optimistic locking is a looser locking mechanism than "pessimistic locking", mostly implemented with a version number on updates. The implementation: every request is eligible to modify the data, and each obtains the data's version number when reading it, but only the request whose version number still matches can update successfully; the others are told the purchase failed. This way we need not worry about queues, though it does increase the CPU's computational overhead. Overall, however, it is a better solution.
Many software systems and services support optimistic-locking-style operations; for example, WATCH in Redis is one such implementation. Through it we can guarantee the safety of the data.
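The version-number mechanism can be sketched as a compare-and-swap in a few lines (an in-memory illustration of the general technique, not the Redis WATCH API itself):

```python
class VersionedStock:
    """Optimistic locking: an update succeeds only if the version is unchanged since the read."""
    def __init__(self, remaining: int):
        self.remaining = remaining
        self.version = 0

    def read(self):
        return self.remaining, self.version

    def try_decrement(self, expected_version: int) -> bool:
        if self.version != expected_version:
            return False           # someone updated first: this purchase simply fails
        self.remaining -= 1
        self.version += 1          # every successful write bumps the version
        return True

s = VersionedStock(remaining=1)
_, ver_a = s.read()
_, ver_b = s.read()                # both requests read version 0, remaining 1
assert s.try_decrement(ver_a)      # A's compare-and-swap succeeds; version is now 1
assert not s.try_decrement(ver_b)  # B's stale version is rejected instead of overselling
assert s.remaining == 0
```

Compare this with the oversell sketch earlier: the same two stale reads occur, but the version check turns the second write into a clean failure rather than a negative stock count.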
IV. Summary
The Internet is developing at high speed; as more users use Internet services, high-concurrency scenarios become ever more common. E-commerce flash sales and rush purchases are two of the more typical Internet high-concurrency scenarios. While the specific technical solutions to these problems vary, the challenges we face are similar, and so are the approaches to solving them.