Architecture analysis and practice of a flash-sale ("seckill") system for limited-time buying

Source: Internet
Author: User
Tags: local time, lock, queue, redis, unique id, ticket


1 Flash-sale business analysis



The normal e-commerce purchase flow



(1) browse the product; (2) create an order; (3) deduct inventory; (4) update the order; (5) pay; (6) the seller ships.



Characteristics of the flash-sale business



(1) low price; (2) heavy promotion; (3) sells out instantly; (4) usually goes on sale at a scheduled time; (5) short duration, very high instantaneous concurrency.



2 Technical challenges of flash sales



Suppose a website runs a flash sale for a single product and expects 10,000 participants, i.e. up to 10,000 concurrent requests. The flash-sale system then faces the following technical challenges:



Impact on existing website business



A flash sale is only an add-on marketing activity for a website, but it is short and draws a burst of concurrent traffic. If it is deployed together with the existing application, it will inevitably affect normal business, and a small misstep can paralyze the whole site.



Solution: deploy the flash-sale system independently, even under a separate domain name, so that it is completely isolated from the main site.



Heavy application and database load under high concurrency



Before the flash sale starts, users keep refreshing the browser page so as not to miss the start. Under an ordinary web architecture these requests would reach the application servers and connect to the database, putting heavy load on both.



Solution: redesign the flash-sale product page rather than reuse the regular product-detail page; make its content static so that user requests do not have to pass through the application services.



Sudden increase in network and server bandwidth



If the product page is 200 KB (mostly product images), then 10,000 concurrent requests need about 2 GB of network and server bandwidth (200 KB x 10,000), all of it added by the flash sale on top of the site's usual bandwidth.



Solution: the extra bandwidth required by the flash sale must be purchased or leased from carriers. To relieve the web servers, the flash-sale product page should be cached on a CDN, which likewise means temporarily leasing extra egress bandwidth from the CDN provider.



Ordering directly through the order URL



The rule of the game is that purchases may only be made after the flash sale starts; before that moment users may only browse the product. But the order page is an ordinary URL: anyone who obtains it could place an order without waiting for the start.



Solution: to stop users from reaching the order-page URL directly, make the URL dynamic; even the developers of the flash-sale system should not be able to access the order page before the sale starts. The approach is to append a random number generated on the server side as a parameter to the order-page URL; it is only made available when the flash sale begins.
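As a rough sketch of this scheme (my illustration, not the article's code), assuming Redis is reached through the Jedis client; the key name, item id, and the /seckill/order path are illustrative assumptions:

<code>// Hedged sketch of the dynamic order URL: the random parameter lives in
// Redis and is only published when the flash sale starts.
import java.util.UUID;
import redis.clients.jedis.Jedis;

public class OrderUrlToken {
    private static final String KEY = "seckill:token:ITEM_ID"; // assumed key

    // Called when the flash sale starts: create the random parameter.
    public static String publishToken(Jedis jedis) {
        String token = UUID.randomUUID().toString();
        jedis.setex(KEY, 3600, token);          // expires after the event
        return "/seckill/order?token=" + token; // the now-valid order URL
    }

    // Called by the order page: reject requests carrying a wrong token.
    public static boolean verify(Jedis jedis, String token) {
        return token != null && token.equals(jedis.get(KEY));
    }
}
</code>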



How to control when the purchase button on the flash-sale product page lights up



The purchase button may only light up when the flash sale starts; before then it is gray. If the page were generated dynamically, the server could of course render the button gray or lit in the response, but to reduce server load it is better to use CDN and reverse-proxy optimizations: design the page as a static page and cache it on CDNs, reverse proxies, and even in users' browsers. When users refresh the page at the start of the sale, their requests then never reach the application servers at all.



Solution: control it with JavaScript. The static flash-sale product page references a JavaScript file that contains a sale-started flag, initially "no". When the sale starts, generate a new JavaScript file (same file name, different content) that sets the flag to "yes" and adds the order-page URL with its random-number parameter (only one random number is ever generated, so everyone sees the same URL; the server can keep it in a distributed cache such as Redis). The user's browser loads this file and uses it to update the flash-sale product page. The JavaScript file can be loaded with a random version number (for example xx.js?v=32353823) so that it is never cached by browsers, CDNs, or reverse proxy servers.



This JavaScript file is very small, so even if the browser fetches it on every refresh, it puts little pressure on the server cluster or the network bandwidth.



How to allow only the first submitted order to be sent to the order subsystem



Since only one user can ultimately win the product, the system must check, when a user submits an order, whether some order has already been submitted successfully. Once one has, the JavaScript file is updated: the sale-started flag goes back to "no" and the purchase button is grayed out. In fact, because only one order can ever succeed, the entrance to the order page can itself be throttled to reduce load: only a few users are let through to the order page, and everyone else goes straight to the "flash sale ended" page.



Solution: suppose the order server cluster has 10 machines, each accepting at most 10 order requests. If a server has already accepted its 10 requests before any order has succeeded, and some of them are still unprocessed, a poor experience can occur: a user clicks the buy button, lands on the "ended" page, then refreshes and happens to reach a server with spare capacity, and ends up on the order form after all. Cookie-based affinity can be used to keep each user on the same server, and a least-connections load-balancing algorithm makes this situation far less likely in the first place.



How to pre-check before ordering



The order server first checks the number of orders this machine has already handled:
If more than 10, return the "flash sale ended" page to the user directly;



If not more than 10, the user may proceed to the order form and confirmation page.



It then checks the number of globally submitted orders:



If it already exceeds the total flash-sale stock, return the "ended" page to the user;



If it does not exceed the total stock, submit the order to the order subsystem.
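A minimal sketch of this two-level check (my illustration, not the article's code), assuming a per-server counter plus a global counter kept in Redis through Jedis; the key name and both limits are assumptions:

<code>// Hedged sketch of the pre-check: local per-server limit plus a global
// counter in Redis.
import java.util.concurrent.atomic.AtomicInteger;
import redis.clients.jedis.Jedis;

public class OrderPreCheck {
    private static final AtomicInteger localOrders = new AtomicInteger(0);
    private static final int LOCAL_LIMIT = 10;   // per-server cap
    private static final int TOTAL_STOCK = 1;    // flash-sale stock

    /** Returns true if the user may proceed to the order form. */
    public static boolean allow(Jedis jedis) {
        // 1. Per-machine check: at most 10 order requests per server.
        if (localOrders.incrementAndGet() > LOCAL_LIMIT) {
            return false;                        // show the "ended" page
        }
        // 2. Global check: submitted orders must not exceed the stock.
        long submitted = jedis.incr("seckill:submitted:ITEM_ID");
        return submitted <= TOTAL_STOCK;         // else "ended" page
    }
}
</code>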



Flash sales usually start at a scheduled time



This can be implemented in many ways, but the better approach today is: configure the on-sale time in advance, so users can see the product on the front end but cannot click the "Buy Now" button. Since someone may bypass the front-end restriction and initiate a purchase directly through the URL, the clocks of the product page, the buy page, and the back-end database must all be synchronized; the more the checks sit at the back end, the higher the security.



For a timed flash sale, you should also prevent a seller's last-minute product edits from having unanticipated effects. Such changes need to be evaluated from many angles; generally editing is simply forbidden, and anything that must change goes through a data-correction process.



When to deduct inventory



There are two options: deduct inventory when the item is ordered, or deduct it when payment completes. The "deduct on order" approach is currently used; the deduction happens the moment the order is placed, which gives a better user experience.



Inventory races lead to "overselling": selling more than the stock



Concurrent updates to the inventory can keep decrementing it even after the real stock is exhausted, so the seller sells more than the flash sale intended. Solution: use an optimistic lock:


<code>update auction_auctions
set quantity = #inQuantity#
where auction_id = #itemId# and quantity = #dbQuantity#
</code>
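The #...# placeholders above are iBATIS-style. As a hedged illustration of driving the same statement from plain JDBC (the table and column names follow the SQL; everything else is my assumption):

<code>// Hedged JDBC sketch of the optimistic-lock update above; an update
// count of 0 means another request changed the quantity first.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class InventoryDao {
    /** Returns true if this request won the optimistic-lock update. */
    public static boolean reduce(Connection conn, long itemId,
                                 int dbQuantity, int inQuantity)
            throws SQLException {
        String sql = "update auction_auctions set quantity = ? "
                   + "where auction_id = ? and quantity = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, inQuantity);   // the new quantity
            ps.setLong(2, itemId);
            ps.setInt(3, dbQuantity);   // the quantity we read earlier
            return ps.executeUpdate() == 1;
        }
    }
}
</code>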


Dealing with flash-sale bots



Flash-sale bots place orders extremely quickly, and some can be identified from their purchase records. Verification codes can defend against them to a degree, provided the codes are secure enough not to be cracked; approaches in use include dedicated flash-sale captchas, captchas announced on TV, and flash-sale quiz questions.



3 Flash-sale architecture principles



Intercept requests as far upstream as possible



Traditional flash-sale systems die because every request crushes the back-end data layer: read/write lock contention is severe, concurrency makes responses slow, and almost all requests time out. The traffic is huge, but the effective traffic of successful orders is tiny ("a train has only 2,000 tickets and 2 million people try to buy; essentially nobody succeeds, and the effective request rate approaches 0").



Read-heavy, write-light: use caching heavily



This is a typical read-heavy, write-light scenario ("a train has only 2,000 tickets and 2 million people try to buy: at most 2,000 order successfully and everyone else is just querying inventory, so writes are 0.1% of traffic and reads 99.9%"), which is very well suited to caching.



4 Flash-sale architecture design



A flash-sale system is designed specifically for flash sales, which differ from ordinary shopping: participants care most about refreshing the product page quickly and reaching the order page first when the sale starts, not about product details or other aspects of the shopping experience, so the design should be as simple as possible.



The purchase button on the product page lights up only when the flash sale begins; before the start and after the product sells out, it stays gray and cannot be clicked.



The order form is as simple as possible: the quantity is fixed at one and cannot be changed; shipping address and payment method come from the user's defaults (and if there are no defaults they may simply be left blank and amended after the order is submitted); only the first successfully submitted order is sent to the site's order subsystem, and everyone else sees the "flash sale ended" page after submitting.



Such a system splits the business into two phases. The first, from some time before the sale until it starts, is the preparation phase, during which users wait. The second, from the start of the sale until every participant has a result, is the flash-sale phase.



4.1 Front-end layer design



First there must be a page displaying the flash-sale product, with a countdown to the start of the event. During the preparation phase users will open this page and may keep refreshing it. Two questions need consideration:



The first is serving the flash-sale page itself.



An HTML page is fairly large: even compressed, the HTTP headers and content can run to tens of kilobytes, plus CSS, JavaScript, images, and other resources. If tens of millions of people snap at one product simultaneously, and a typical data center has only 1-10 Gb of bandwidth, the network can easily become the bottleneck. So all of the page's static resources should be stored separately and pushed out to CDN nodes to spread the load: CDN nodes across the country absorb most of the pressure, and their bandwidth is cheaper than data-center bandwidth.



The second one is the countdown.



For performance, the countdown is usually driven by JavaScript using the client's local clock, which may disagree with the server clock; server clocks may also disagree with one another. Client-server skew is handled by having the client periodically synchronize with the server. The time-sync interface involves no back-end logic: it only returns the current web server time to the client, so it is very fast. In my earlier tests, a standard web server handled 20k+ QPS of this without difficulty, so even 1 million users refreshing at once (1M QPS) would need only about 50 web servers behind a hardware load balancer, and the web tier scales out easily (LB + DNS round-robin). The interface returns only a small piece of JSON, and unnecessary cookies and other HTTP headers can be trimmed, so the data volume is small and the network will generally not become the bottleneck; if it ever does, multiple data centers plus intelligent DNS are an option. Time synchronization between the web servers themselves can use a unified time server, for example every web server participating in the flash sale syncing with it once per minute.
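The real page would do this in browser JavaScript; as a hedged sketch of just the arithmetic (the sync endpoint and its payload are assumptions), in Java:

<code>// Hedged sketch of client-side clock synchronization: fetch the server
// time once, remember the offset, and derive "server now" locally.
public class ServerClock {
    private long offsetMillis; // serverTime - localTime at sync moment

    /** Call with the serverTime value returned by the sync endpoint. */
    public void sync(long serverTimeMillis) {
        offsetMillis = serverTimeMillis - System.currentTimeMillis();
    }

    /** Server-aligned current time, used to drive the countdown. */
    public long now() {
        return System.currentTimeMillis() + offsetMillis;
    }

    public long millisUntil(long seckillStartMillis) {
        return Math.max(0, seckillStartMillis - now());
    }
}
</code>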



Request interception at the browser layer



(1) At the product level: once the user clicks "query" or "buy ticket", gray the button out to stop repeated submissions;



(2) At the JavaScript level: limit the user to one submission every X seconds.



4.2 Site layer design



Front-end interception only stops novice users (though that is 99% of them). Skilled programmers are not fooled: they write a for loop and call your back-end HTTP interface directly. What then?



(1) For the same UID, limit the access frequency and cache the page: all requests reaching the site layer within X seconds get the same cached page back.



(2) For the same queried item, such as a phone number, do the same: requests within X seconds are all answered from the page cache.



With this throttling, another 99% of the traffic is intercepted at the site layer.
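A minimal sketch of the per-UID throttle (my illustration, not the article's), assuming Redis via Jedis; the key naming and window length are assumptions:

<code>// Hedged sketch: the first request per UID in each X-second window is
// processed freshly; later ones get the cached page.
import redis.clients.jedis.Jedis;

public class UidRateLimiter {
    private static final int WINDOW_SECONDS = 5; // the "X seconds"

    /** Returns true if this request should be processed freshly. */
    public static boolean allow(Jedis jedis, String uid) {
        String key = "seckill:rate:" + uid;
        long hits = jedis.incr(key);
        if (hits == 1) {
            jedis.expire(key, WINDOW_SECONDS); // start the window
        }
        return hits == 1; // otherwise serve the same cached page
    }
}
</code>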



4.3 Service layer design



Site-layer interception only stops ordinary programmers. A serious attacker who controls, say, 100k compromised machines (and suppose ticket buying requires no real-name verification) is not bothered by per-UID limits at all. Now what?



(1) Look: the service layer knows perfectly well that Xiaomi has only 10,000 phones and that a train has only 2,000 tickets, so what is the point of sending 100k requests through to the database? For write requests, put them into a request queue and let only a limited batch through to the data layer at a time; if they all succeed, release the next batch; once stock runs out, return "sold out" to every write request still in the queue.



(2) And read requests? Let the cache absorb them. Whether memcached or Redis, a single machine handling 100k requests per second should be no problem.



With this throttling, only a very small number of writes, plus the few reads that miss the cache, ever reach the data layer; 99.9% of requests are stopped.
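As a hedged sketch of letting the cache absorb reads (my illustration; the key name, TTL, and DB hook are assumptions), assuming Jedis:

<code>// Hedged sketch of a read-through cache for inventory queries.
import redis.clients.jedis.Jedis;

public class StockCache {
    /** Read stock from Redis; on a miss, load from the DB and cache it. */
    public static int getStock(Jedis jedis, long itemId) {
        String key = "seckill:stock:" + itemId;
        String cached = jedis.get(key);
        if (cached != null) {
            return Integer.parseInt(cached); // served by cache (99.9%)
        }
        int stock = loadFromDb(itemId);      // the rare cache miss
        jedis.setex(key, 3, String.valueOf(stock)); // short TTL
        return stock;
    }

    private static int loadFromDb(long itemId) {
        return 0; // placeholder: real code would query the database
    }
}
</code>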



User request dispatch module: use Nginx or Apache to distribute users' requests across machines.
User request preprocessing module: decide whether to process a request based on whether the product is still available.
User request processing module: wrap a preprocessed request into a transaction, submit it to the database, and return whether it succeeded.
Database interface module: the sole interface to the database; it handles all interaction with the database and provides RPC interfaces for querying whether the flash sale has ended and how many items remain.



User Request preprocessing module



After the HTTP servers fan the traffic out, the load on any single server is fairly low, but the total can still be very large. If the product has already been sold out, subsequent requests can simply be answered with "flash sale failed", without issuing further transactions. Sample code might look like this:


<code>package seckill;

import org.apache.http.HttpRequest;

/**
 * Preprocessing stage: dismiss unnecessary requests directly, and add the
 * necessary ones to the queue for the next stage.
 */
public class Preprocessor {

    // whether the product still has remaining stock
    private static boolean reminds = true;

    private static void forbidden() {
        // do something: e.g. answer with the "flash sale failed" page
    }

    public static boolean checkReminds() {
        if (reminds) {
            // remotely check whether stock remains; the RPC interface is
            // provided by the database server and is not strictly accurate
            if (!Rpc.checkReminds()) {
                reminds = false;
            }
        }
        return reminds;
    }

    /**
     * Every HTTP request goes through this preprocessing.
     */
    public static void preprocess(HttpRequest request) {
        if (checkReminds()) {
            // a concurrent queue
            RequestQueue.queue.add(request);
        } else {
            // no product left, simply dismiss the request
            forbidden();
        }
    }
}
</code>


Selection of concurrent queues



The Java concurrency package provides three common concurrent queue implementations: ConcurrentLinkedQueue, LinkedBlockingQueue, and ArrayBlockingQueue.



ArrayBlockingQueue is a blocking queue with a fixed initial capacity. We can use it in the database module as the queue of successful purchases: for example, with 10 items on sale we create an array queue of size 10.



ConcurrentLinkedQueue is a lock-free queue implemented with CAS primitives: enqueueing is very fast, while dequeueing is slightly slower.



LinkedBlockingQueue is also a blocking queue: both enqueue and dequeue take a lock, and a consumer thread blocks temporarily when the queue is empty.



Since our enqueue demand is far greater than our dequeue demand and the queue will generally not run empty, we can choose ConcurrentLinkedQueue as our request queue implementation:


<code>package seckill;

import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.http.HttpRequest;

public class RequestQueue {
    public static ConcurrentLinkedQueue<HttpRequest> queue =
            new ConcurrentLinkedQueue<HttpRequest>();
}
</code>


User request processing module


<code>package seckill;

import org.apache.http.HttpRequest;

public class Processor {
    /**
     * Sends a flash-sale transaction to the database queue.
     */
    public static void kill(BidInfo info) {
        DB.bids.add(info);
    }

    public static void process() {
        HttpRequest request = RequestQueue.queue.poll();
        if (request != null) {
            kill(new BidInfo(request));
        }
    }
}

class BidInfo {
    BidInfo(HttpRequest request) {
        // do something: extract user and item information from the request
    }
}
</code>


Database module



The database module mainly uses an ArrayBlockingQueue to hold the user requests that may succeed.


<code>package seckill;

import java.util.concurrent.ArrayBlockingQueue;

/**
 * DB should be the only interface to the database.
 */
public class DB {
    public static int count = 10;
    public static ArrayBlockingQueue<BidInfo> bids =
            new ArrayBlockingQueue<BidInfo>(10);

    public static boolean checkReminds() {
        // TODO: query whether stock remains
        return true;
    }

    // single-threaded operation
    public static void bid() {
        BidInfo info = bids.poll();
        while (count-- > 0) {
            // insert into bids values (item_id, user_id, bid_date, other)
            // select count(id) from bids where item_id = ?
            // if the row count reaches the total stock, the flash sale is
            // over: set the flag reminds = false
            info = bids.poll();
        }
    }
}
</code>
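As a hedged illustration of how these modules might be wired together (my sketch, not the article's), one thread drains the request queue through the processor while another runs the single-threaded DB.bid() loop:

<code>// Hedged wiring sketch; a real system would block or back off instead
// of busy-spinning when the request queue is empty.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SeckillRunner {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Drain preprocessed requests into the database queue.
        pool.execute(() -> {
            while (true) {
                Processor.process();
            }
        });
        // DB.bid() is designed to run on a single thread.
        pool.execute(DB::bid);
    }
}
</code>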


4.4 Database Design



4.4.1 Basic Concepts



Concept One "single library"






Concept two "fragmentation"






Sharding solves the problem of "too much data" and is usually called horizontal partitioning. Once sharding is introduced, the notion of "data routing" inevitably follows: which database should a given access go to? There are usually three routing methods:



Routing by range



Advantages: Simple, easy to expand



Disadvantage: uneven load across databases (the newest range is the most active)



Hash: Hash "most Internet companies use the program two: hash the library, Hashilu by"



Advantages: Simple, balanced data, uniform load



Disadvantage: migration is painful (expanding from 2 databases to 3 requires moving data)



Routing service: router-config-server



Advantages: Strong flexibility, decoupling of business and routing algorithms



Disadvantage: an extra lookup before every database access
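A minimal sketch of the hash-routing option (my illustration; the shard count is an assumption):

<code>// Hedged sketch of hash routing ("hash the databases, route by hash"):
// the shard index is derived from the uid.
public class HashRouter {
    private static final int DB_COUNT = 2; // number of shards (assumed)

    /** Maps a uid to the database that holds its data. */
    public static int route(long uid) {
        return (int) (uid % DB_COUNT); // uid mod 2 -> db0 or db1
    }
}
</code>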



Concept three "groupings"






Grouping solves the "availability" problem and is usually implemented with master-slave replication.



The actual database architecture of internet companies combines both: sharded and grouped.






4.4.2 Design Ideas



What does a database software architect usually have to design? At a minimum, the following four points:



How to guarantee data availability;
How to improve database read performance (most applications are read-heavy, so reads become the bottleneck first);
How to guarantee consistency;
How to improve scalability.



1. How to ensure the availability of data?



The way to solve availability problems is redundancy:



How do I ensure the availability of the site? Replicate sites, redundant sites



How do I guarantee the availability of a service? Replication services, redundant services



How do I ensure the availability of data? Copy data, redundant data



Redundant data has a side effect: it causes consistency problems (we set consistency aside for now and deal with availability first).



2. How to make database "reads" highly available?



Redundant read replicas






What is the side effect of redundant read replicas? Reads and writes are subject to a delay and may be inconsistent.



This (one write master plus read replicas) is the MySQL architecture of many internet companies. Writing is still a single point, so write availability is not guaranteed.



3. How to make database "writes" highly available?



Redundant write masters






Using dual masters has its own side effect: with both sides accepting writes and synchronizing to each other, the data may conflict (for example, auto-increment id conflicts). There are two common ways to resolve such synchronization conflicts:



Give the two write masters different initial values and the same increment step: master 1 generates ids 0, 2, 4, 6, ...; master 2 generates ids 1, 3, 5, 7, ... (a sketch follows below);
Do not use database-generated ids at all: have the business layer generate globally unique ids itself, so the data cannot conflict.
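As a hedged sketch of the first option (MySQL can do this natively with its auto_increment_increment and auto_increment_offset settings; the Java version below just shows the arithmetic and is my illustration):

<code>// Hedged sketch: each write master hands out ids with the same step but
// a different offset, so the two id streams never collide.
import java.util.concurrent.atomic.AtomicLong;

public class SteppedIdGenerator {
    private final AtomicLong counter = new AtomicLong(0);
    private final long offset; // 0 on master 1, 1 on master 2
    private final long step;   // 2 when there are two masters

    public SteppedIdGenerator(long offset, long step) {
        this.offset = offset;
        this.step = step;
    }

    /** master 1 yields 0, 2, 4, 6, ...; master 2 yields 1, 3, 5, 7, ... */
    public long nextId() {
        return offset + step * counter.getAndIncrement();
    }
}
</code>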
In practice, neither of the two architectures above is used to achieve read-write high availability; instead, the "dual master used as master-slave" approach is taken:






It is still a dual-master setup, but only one master serves traffic (reads + writes); the other is a "shadow master" that exists only for availability and normally serves nothing. When the master dies, the shadow master takes over (the VIP drifts; this is transparent to the business layer and needs no human intervention). The benefits of this approach are:



No delay in reading and writing;



High availability of reading and writing;



Shortcomings:



Read performance cannot be scaled by adding replicas;
Resource utilization is 50%, since the redundant master serves nothing.
So how do you improve read performance? That brings us to the second topic: scaling reads.



4. How to scale read performance



There are roughly three ways to improve read performance. The first is indexing. We will not expand on it here, except for one point: different databases can build different indexes:






The write master builds no indexes;



Online read replicas build indexes for online access, e.g. on uid;



Offline read replicas build indexes for offline access, e.g. on time.



The second way to scale reads is to add more replicas. This is widely used, but it has two drawbacks:



The more replicas there are, the slower replication becomes;



The slower replication becomes, the larger the window of data inconsistency grows.
In practice, this method is therefore not used to improve database read performance (no extra replicas); caching is added instead. The common caching architecture is as follows:






Upstream is the business application; downstream are the master, the replicas (read-write splitting), and the cache.



The actual practice: one set of service + database + cache






The business layer does not face the DB and cache directly; a service layer shields it from the complexity underneath. Why a service layer is introduced is a topic for another day; the point is that data access is provided as "service + database + cache in one set", with the cache carrying the read performance.



Whether reads are scaled with replicas or with caches, the data is necessarily replicated in multiple copies (master + replicas, DB + cache), and that inevitably causes consistency problems.



5. How to ensure consistency?



For master-replica consistency there are usually two solutions:



1. Middleware






If a key has been written, then during the inconsistency window the middleware routes reads of that key to the master as well. The drawback of this scheme is that database middleware has a high barrier to entry (Baidu, Tencent, Alibaba, and a few other companies have it).



2. Forced reads from the master






If the "dual master used as master-slave" architecture described above is used, there is simply no master-replica inconsistency problem.



The second type of inconsistency is the inconsistency between DB and cache:






With the common caching architecture above, the order of operations for a write is:



(1) evict the cache;



(2) write the database;



The order of the read operations is:



(1) read the cache; if it hits, return;



(2) on a cache miss, read from a replica;



(3) after reading from the replica, put the data back into the cache.



Under certain abnormal interleavings, a read may fetch stale data from a replica (replication not yet finished) and put that stale data into the cache, after which the data stays inconsistent for a long time. The workaround is "double cache eviction", which upgrades the write sequence to:



(1) evict the cache;



(2) write the database;



(3) after waiting out the master-replica replication delay window, issue a second, asynchronous cache-eviction request.



This way, even if stale data does get into the cache, it is evicted again after a small time window. The cost is one extra cache miss, which is negligible.
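A hedged sketch of "double cache eviction" (my illustration), assuming Jedis; the key handling, the assumed replication delay, and the shared connection are all simplifications:

<code>// Evict, write the DB, then schedule a second asynchronous eviction
// after the assumed master-slave replication delay window. A real
// system would use a pooled Redis connection in the scheduled task.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import redis.clients.jedis.Jedis;

public class DoubleEviction {
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();
    private static final long REPLICATION_DELAY_MS = 1000; // assumed window

    public static void write(Jedis jedis, String key, Runnable dbWrite) {
        jedis.del(key);    // (1) evict the cache
        dbWrite.run();     // (2) write the database
        // (3) evict again after the replication delay window
        SCHEDULER.schedule(() -> jedis.del(key),
                REPLICATION_DELAY_MS, TimeUnit.MILLISECONDS);
    }
}
</code>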



One further best practice: set a timeout on every item in the cache.



6. How to improve the scalability of the database?



Suppose data was hash-routed across 2 databases and the volume is still too large, so 3 are needed: that would normally force a data migration. But there is a very elegant scheme for "second-level database expansion".



How does second-level expansion work?



First, we never expand from 2 databases to 3; we expand from 2 to 4 (doubling), and later 4 -> 8 -> 16.






Each shard is a set of service + database (the cache is omitted here), and each database pair runs in dual-master mode.



Expansion steps:



Step one: promote the shadow master of each shard to an independent primary;



Step two: modify the routing configuration from 2 databases to 4 (what was mod 2 becomes mod 4). The expansion is complete.



Data that was at mod2 = 0 is now at mod4 = 0 or mod4 = 2, and data that was at mod2 = 1 is now at mod4 = 1 or mod4 = 3, so no data needs to migrate. Moreover, since each old dual-master pair was kept in sync, the two halves (one taking remainder 0, the other remainder 2) do not conflict. The expansion completes within seconds.
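A small sketch of why the doubling needs no migration (my illustration): every uid's new shard is either its old shard or that shard's synchronized twin, so the data is already in place.

<code>// Hedged sketch of mod-2 -> mod-4 routing during a doubling expansion.
public class DoublingRouter {
    /** Old routing: 2 shards. */
    static int oldShard(long uid) { return (int) (uid % 2); }

    /** New routing: 4 shards; shard 2 was shard 0's twin master and
     *  shard 3 was shard 1's twin master before the split. */
    static int newShard(long uid) { return (int) (uid % 4); }

    public static void main(String[] args) {
        for (long uid = 0; uid < 8; uid++) {
            // uid % 4 is always uid % 2 or uid % 2 + 2: a member of the
            // old synchronized pair, so the row is already there.
            System.out.println("uid " + uid + ": old shard "
                    + oldShard(uid) + " -> new shard " + newShard(uid));
        }
    }
}
</code>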



Finally, to do some finishing work:



Break the replication between the old dual masters;



Add new dual masters (the dual master guarantees availability; the shadow master normally serves nothing);



Delete the now-redundant data (for example, the primary serving remainder 0 can delete the rows with remainder 2).






In this way, the expansion from 2 databases to 4 is completed within seconds.



5 Challenges posed by high concurrency



5.1 Designing the request interface sensibly



A flash-sale or snap-up page usually has two parts: the static HTML content, and the web back-end request interface that actually takes part in the sale.



Static HTML and similar content are usually served from a CDN and are generally under little pressure; the core bottleneck is the back-end request interface. This interface must support highly concurrent requests and, just as importantly, be as "fast" as possible, returning the result in the shortest possible time. To achieve this, the interface's backing store should operate at memory level; storage that still hits MySQL directly is not appropriate, and if complex business data must be persisted, asynchronous writes are recommended.






Some flash sales do use "delayed feedback": the result is not known immediately, and only after a while can users see on the page whether they succeeded. But this "lazy" behavior gives users a poor experience and is easily perceived as a "black box".



5.2 The high-concurrency challenge: be "fast"



We usually measure the throughput of a web system in QPS (queries per second, the number of requests handled per second); for high-concurrency scenarios of tens of thousands of requests per second, this metric is critical. Suppose the average response time of a business request is 100 ms, and the system has 20 Apache web servers, each configured with MaxClients = 500 (the maximum number of Apache connections).



The theoretical QPS peak of our web system is then (an idealized calculation):





<code>20 * 500 / 0.1 = 100000 (100,000 QPS)
</code>





Eh? Our system seems very strong: 100,000 requests handled per second, so a 50k/s flash sale looks like a "paper tiger". Reality, of course, is not so ideal: under high concurrency the machines are heavily loaded, and the average response time rises sharply.



As for the web servers: the more connection processes Apache opens, the more context switching the CPU must do, adding CPU overhead and directly increasing the average response time. So MaxClients is not "the bigger the better"; it should be set according to CPU, memory, and other hardware, and tested with Apache's ab benchmark to find a suitable value. Next, we choose the memory-level store Redis, whose response time under high concurrency is critical. Network bandwidth is also a factor, but these request packets are small and rarely become the bottleneck; load balancing rarely becomes the bottleneck either, so we will not discuss them here.



So here is the problem: suppose that under the 50k/s high-concurrency load, the average response time of our system rises from 100 ms to 250 ms (in practice, even more):





<code>20 * 500 / 0.25 = 40000 (40,000 QPS)
</code>





Our system is now left with 40k QPS against 50k requests per second, a gap of 10k.



And this is where the real nightmare begins. Picture a highway junction where 5 cars arrive per second and 5 cars pass per second: the junction runs normally. Suddenly only 4 cars per second can get through while the traffic flow is unchanged: the result is certain to be a massive jam (5 lanes suddenly becoming 4).



Likewise, within some second all 20 x 500 available connection processes become fully busy, yet 10,000 new requests arrive and find no connection process free; the system can be expected to fall into an abnormal state.






Even in a normal, non-high-concurrency scenario something similar can happen: one problematic business interface responds very slowly, the whole web server's request response time stretches out, its available connections gradually fill up, and other, perfectly normal business requests find no connection process available.



The scarier problem is user behavior: the less available the system, the more frequently users click, a vicious circle that eventually triggers an "avalanche" (one web machine dies, its traffic spreads to the machines still working, they die in turn, and so on), bringing down the entire web system.



5.3 Reboot and overload protection



If the system does "avalanche", hastily restarting the services will not solve the problem; the most common phenomenon is that they die again immediately after coming up. It is best to reject traffic at the entry layer first, then restart. If Redis/memcache has gone down as well, note that the caches need "warming up" when restarted, which may take quite a while.



In flash-sale and snap-up scenarios, traffic often exceeds anything we prepared for or imagined, so overload protection is essential. Rejecting requests when the system detects full load is itself a protective measure. Setting a filter at the front end is the easiest way, but users resent that behavior; it is more appropriate to place the overload protection at the CGI entry layer, so the client's requests can be rejected quickly.



6 Cheating: offense and defense



The "massive" traffic received by flash sales and snap-ups actually contains a great deal of padding. Many users, in order to "grab" the goods, use "ticket-brushing tools" and similar aids to send as many requests to the server as they can, and a further subset of advanced users write powerful automated request scripts. The rationale is simple: the larger the share of the requests that are yours, the higher your probability of success.



These are "cheating means", however, there is "offensive" there is "defensive", this is a battle without the smoke of the war ha.



6.1 The same account sending multiple requests at once



Some users use browser plug-ins or other tools to send hundreds of requests, or more, from their own account the moment the flash sale starts. Such users destroy the fairness of flash sales and snap-ups.



Such requests can also do a second kind of damage in systems that lack concurrency-safe data handling, causing some checks to be bypassed. Take a simple prize-claiming logic: first check whether the user has a participation record; if not, grant success; finally write the participation record. It is very simple logic, but under high concurrency it has a deep hole. Multiple concurrent requests are spread across several intranet web servers by the load balancer; each first sends a query to the store, and in the time gap before any one request has successfully written its participation record, the others all see "no participation record". The logical check is at risk of being bypassed.






Countermeasures:



At the program entry point, allow each account to have only one request accepted and filter out the rest. This not only solves the one-account-many-requests problem but also secures the subsequent logic. It can be implemented by writing a flag through a memory cache service such as Redis (allow only one request to write it successfully, for example using the optimistic-lock semantics of WATCH); whoever writes the flag successfully gets to continue.
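A minimal sketch of this flag (my illustration), assuming Jedis; here SETNX ("set if not exists") is used as a compact alternative to the WATCH-based check-and-set the article mentions, and the key naming is an assumption:

<code>// Hedged sketch: only the first request per account wins the flag.
import redis.clients.jedis.Jedis;

public class OneRequestPerAccount {
    /** Returns true only for the first request of this account. */
    public static boolean tryEnter(Jedis jedis, String accountId) {
        String key = "seckill:entered:" + accountId;
        long ok = jedis.setnx(key, "1"); // 1 = we set it first
        if (ok == 1) {
            jedis.expire(key, 3600);     // clean up after the event
        }
        return ok == 1;
    }
}
</code>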






Alternatively, implement a small service yourself that puts the requests of the same account into a queue and processes them one at a time.



6.2 Multiple accounts sending requests at once



Many companies placed almost no restrictions on account registration in their early days, making it easy to register many accounts. This gave rise to special "studios" that run automatic registration scripts and accumulate huge stocks of "zombie accounts", tens or even hundreds of thousands of them, dedicated to all kinds of brushing (this is the origin of Weibo's "zombie followers"). For example, if a Weibo retweet lottery is on, retweeting with tens of thousands of zombie accounts greatly raises the odds of winning.



Used in flash sales and snap-ups, such accounts work the same way, for example in the iPhone official-site sales or among train-ticket scalpers.






Countermeasures:



This scenario can be handled by detecting the request frequency of each machine IP; if an IP's requests are found to be abnormally frequent, pop up a captcha for it or outright refuse its requests:



The core purpose of a pop-up captcha is to identify real users. That is why sites often show captchas that look like "dancing ghosts" that even we can barely read: the harder the image is to recognize, the harder it is for a powerful "automatic script" to OCR the characters and fill the captcha in automatically. Some more innovative captchas work even better, for example asking a simple question or requiring a simple interaction (like the Baidu Tieba captcha).
Outright banning an IP is rather crude, since some real users share exactly the same egress IP and may suffer "friendly fire". Still, the method is simple and efficient and can achieve good results depending on the actual scenario.



6.3 Multiple accounts sending requests from different IPs



As the saying goes, every measure breeds a countermeasure; offense and defense never rest. When these "studios" noticed per-IP frequency controls, they devised a "new attack plan" for this scenario: constantly changing IPs.






Some wonder where these random-IP services come from. Some outfits hold pools of independent IPs themselves and sell them as random-proxy services to the "studios". Darker ones use trojans to compromise ordinary users' computers; the trojan does not disrupt the machine's normal operation and does only one thing: forward IP packets, turning the machine into a proxy egress. In this way an attacker obtains a large number of independent IPs and builds a random-IP service from them, for profit.



Countermeasures:



Frankly, requests of this kind are basically indistinguishable from real user behavior, and telling them apart is very hard. Tightening the limits further would easily hit real users "by accident". At this point, usually all one can do is raise the business-level entry threshold, or clean such accounts out early through "data mining" over account behavior.



Zombie accounts do share some common traits: they are likely to have been registered in batches (consecutive or related account numbers), to be inactive, low-level, and to have incomplete profiles. Based on these traits one can set suitable participation thresholds, such as a minimum account level for joining the flash sale. These business-level measures can also filter out some of the zombies.



7 Data security under high concurrency



We know that writing to the same file from multiple threads raises "thread safety" problems (code is thread-safe if running it on multiple threads at once gives the same result as running it single-threaded, matching expectations). With a MySQL database one can rely on its own locking mechanisms, but at large concurrent scale MySQL is not recommended. Flash sales and snap-ups have a further problem: "overselling"; if this is not controlled carefully, too many units get sold. We have all heard of promotions where buyers ordered successfully but the merchant refused to recognize the orders as valid and would not ship. The problem is not necessarily merchant treachery; it may be an overselling risk at the technical level of the system.



7.1 Causes of overselling



Suppose a snap-up has 100 items in total; at the last moment 99 have been consumed and only one remains. At that instant the system receives a batch of concurrent requests, each of which reads a remaining stock of 99, and all of them pass the stock check, producing an oversell.






In the scenario just described, concurrent user B also "snaps up successfully", so one extra person obtains the product. Such a scenario appears very easily under high concurrency.



7.2 The pessimistic-lock approach



There are many ways to solve thread safety; one direction to discuss is the "pessimistic lock".



Pessimistic locking means holding an exclusive lock while the data is being modified, shutting out all other modification requests; whoever encounters the lock must wait.






That solution does solve thread safety, but do not forget that our scenario is "high concurrency": there are very many modification requests, each waiting on the "lock", and some threads may never get the chance to grab it, so their requests die waiting there. Meanwhile, the pile-up of such requests instantly drives up the system's average response time, the available connections are exhausted, and the system goes abnormal.



7.3 The FIFO-queue approach



Fine, then let's tweak the scenario slightly: put the requests into a queue and process them FIFO (first in, first out), so that no request is left waiting for a lock forever. Doesn't it look a bit like forcibly turning multithreading into a single thread?






That solves the lock problem: all requests are handled through the FIFO queue. But new problems arrive: in a high-concurrency scenario the flood of requests may well burst the queue's memory, and even a very large in-memory queue cannot be drained as fast as requests madly pour in. The deeper the backlog in the queue, the worse the web system's average response becomes, and the system is still abnormal.



7.4 The optimistic-lock approach



Now we can discuss the idea of the "optimistic lock", a looser locking mechanism than the pessimistic lock, mostly implemented with versioned updates. Every request for the data is allowed to attempt the modification, but each obtains the data's version number; only the attempt whose version number still matches succeeds in updating, and the others are told the snap-up failed. This way we no longer need to worry about queues, at the cost of some extra CPU computation. Overall, it is the better solution.






Many pieces of software and services support "optimistic locking"; Redis's WATCH is one of them. With this implementation we can guarantee the safety of the data.
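A hedged sketch of a WATCH-based stock decrement (my illustration), assuming Jedis; the key naming is an assumption, and callers would retry or fail on a false result:

<code>// The transaction only commits if the stock key is untouched between
// WATCH and EXEC; exec() returns null when the watch was violated.
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class OptimisticStock {
    /** Returns true if this request won one unit of stock. */
    public static boolean tryDecrement(Jedis jedis, String stockKey) {
        jedis.watch(stockKey);                   // start optimistic watch
        int stock = Integer.parseInt(jedis.get(stockKey));
        if (stock <= 0) {
            jedis.unwatch();                     // nothing left, give up
            return false;
        }
        Transaction tx = jedis.multi();
        tx.set(stockKey, String.valueOf(stock - 1));
        List<Object> result = tx.exec();         // null if key changed
        return result != null;
    }
}
</code>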



8 Summary



The internet is developing at high speed: the more users internet services have, the more common high-concurrency scenarios become. Flash sales and snap-ups are two typical high-concurrency internet scenarios. Although the specific technical solutions to these problems may vary, the challenges we face are similar, and so, therefore, are the ways of solving them.

