"Turn" Seconds to kill the road of optimization of business architecture

Original address: http://www.infoq.com/cn/articles/flash-deal-architecture-optimization/

I. Why the flash sale business is hard

In an IM system such as QQ, everyone mostly reads their own data (friend list, group list, personal profile).

In a microblog system such as Weibo, everyone reads the data of the people they follow: one person reads many people's data.

In a flash sale system, there is only one copy of the inventory, and everyone reads and writes this data at a concentrated point in time: many people read one piece of data.

For example, Xiaomi runs a phone flash sale every Tuesday: there may be only 10,000 phones, but the instantaneous traffic can be in the tens of millions of requests. Another example is 12306 ticket grabbing: tickets are limited, inventory is a single copy, instantaneous traffic is enormous, and everyone reads the same inventory. Read-write conflicts and lock contention are severe; this is what makes the flash sale business hard. So how do we optimize the architecture of a flash sale business?

II. Directions for optimization

There are two optimization directions:

    1. Intercept requests as far upstream in the system as possible (don't let lock conflicts fall onto the database). The reason traditional flash sale systems die is that requests overwhelm the backend data layer: read-write lock conflicts are severe, concurrency is high and responses are slow, and almost every request times out. The traffic is huge, but the effective traffic that actually places an order successfully is tiny. Take 12306: a train has only 2,000 tickets and 2,000,000 people try to buy them; basically nobody succeeds, so the effective request rate is close to 0.
    2. Make full use of the cache. Flash ticket selling is a typical read-heavy, write-light scenario: most requests are train queries and remaining-ticket queries, while placing orders and paying are the write requests. A train has only 2,000 tickets and 2,000,000 people try to buy; at most 2,000 people will order successfully, and everyone else only queries inventory. The write ratio is just 0.1% and the read ratio is 99.9%: perfectly suited to cache optimization. Next, let's walk through the details of "intercepting requests as far upstream as possible" and of "using the cache".
III. A common flash sale architecture

A common site architecture basically looks like this (especially for sites with hundred-million-level traffic):

    1. Browser side, the top layer, which executes some JS code
    2. Site layer, which accesses backend data and assembles the HTML page returned to the browser
    3. Service layer, which shields upstream callers from the underlying data details and provides data access
    4. Data layer, where the final inventory lives; MySQL is typical (and of course there is caching here too)

This picture is simple, but it vividly describes the architecture of a high-traffic flash sale business. Keep these four layers in mind.

The following sections explain how to optimize at each layer.

IV. Optimization details, layer by layer

Layer 1: How to optimize the client (browser layer, app layer)

Let me ask a question: everyone has played shake-to-grab-red-envelopes; does every shake send a request to the backend? Recall the order-placing, ticket-grabbing scene: you click the "Query" button, the system stalls, the progress bar crawls up slowly, and as a user you can't help clicking "Query" again, right? Click, click, click... Is it useful? The system load rises for nothing: if a user clicks 5 times, 80% of the requests were generated this way. What to do?

    • Product level: after the user clicks "Query" or "Purchase tickets", gray out the button to prohibit repeated requests;
    • JS level: limit the user to submitting at most one request every x seconds.

At the app level you can do something similar: even if the user shakes wildly, in fact only one request every x seconds is sent to the backend. This is what "intercept requests as far upstream as possible" means, and the further upstream, the better: stopping requests at the browser and app layers can block 80%+ of them. This method only stops ordinary users (but 99% of users are ordinary users); it cannot stop the high-end programmers in the crowd.
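
A minimal sketch of this client-side throttle in TypeScript (the 5-second window, button handling, and endpoint are illustrative assumptions, not the article's code):

```typescript
// Client-side interception: gray out the button while a request is
// pending, and accept at most one submission per WINDOW_MS.
const WINDOW_MS = 5000; // the article's "x seconds", assumed here as 5s
let lastSubmit = 0;

async function onQueryClick(button: HTMLButtonElement): Promise<void> {
  const now = Date.now();
  if (now - lastSubmit < WINDOW_MS) return; // drop the repeated click
  lastSubmit = now;

  button.disabled = true; // product level: gray out the button
  try {
    await fetch("/api/ticket/query"); // hypothetical endpoint
  } finally {
    button.disabled = false;
  }
}
```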

One packet capture with Firebug and you know exactly what the HTTP request looks like; JS absolutely cannot stop a programmer from writing a for loop that calls the HTTP interface directly. How do we deal with that part of the requests?

Layer 2: Request interception at the site layer

How to intercept? How do we stop the programmer writing a for-loop caller? Deduplicate by IP? By cookie-id? ... It gets complicated. This kind of business requires login, so just use the UID. At the site layer, count and deduplicate requests per UID. You don't even need unified storage for the counts; keep them directly in site-level memory (the count will be inaccurate, but it's the simplest). Allowing one UID only 1 request every 5 seconds stops 99% of the for-loop requests.
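
A minimal sketch of this site-level, in-memory per-UID counter (the 5-second window follows the text; as the article notes, a node-local count is inaccurate across machines but is the simplest option):

```typescript
// Per-UID counting in site-local memory: allow one request per UID
// per WINDOW_MS. Inaccurate across multiple site machines, but simple.
const WINDOW_MS = 5000;
const lastAccepted = new Map<string, number>(); // uid -> last accepted time

function allowRequest(uid: string, now: number = Date.now()): boolean {
  const prev = lastAccepted.get(uid);
  if (prev !== undefined && now - prev < WINDOW_MS) {
    return false; // intercepted: answer from the page cache instead
  }
  lastAccepted.set(uid, now);
  return true;
}
```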

Only 1 request gets through per 5 seconds; what about the rest? Cache them. Page caching: for the same UID, limit the access frequency and do page caching, so that all requests reaching the site layer within x seconds return the same page. For queries on the same item, such as the same train, do page caching as well: requests reaching the site layer within x seconds return the same page. This kind of rate limiting both gives the user a good experience (no 404s) and protects the robustness of the system (using the page cache to intercept requests at the site layer).

Page caching does not have to guarantee that all site machines return a consistent page; it can also live directly in each machine's local memory. The advantage is simplicity; the disadvantage is that HTTP requests landing on different site machines may see different remaining-ticket data. This is the site layer's request interception and cache optimization.
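
A sketch of the site-layer page cache under the same assumptions (keyed by the query, e.g. a train number; the TTL and the `render` callback are illustrative):

```typescript
// Site-local page cache: requests for the same query arriving within
// TTL_MS all receive the same rendered page.
const TTL_MS = 5000;
interface CachedPage { html: string; expiresAt: number }
const pageCache = new Map<string, CachedPage>(); // key: e.g. train number

async function getPage(key: string, render: () => Promise<string>): Promise<string> {
  const hit = pageCache.get(key);
  const now = Date.now();
  if (hit && hit.expiresAt > now) return hit.html; // intercepted at the site layer
  const html = await render(); // the rare request that falls through
  pageCache.set(key, { html, expiresAt: now + TTL_MS });
  return html;
}
```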

OK, this stops the programmer who writes for loops over HTTP. But some high-end programmers (hackers) control 100,000 zombie machines ("broilers") and have 100,000 UIDs on hand, sending requests simultaneously (let's not consider the real-name problem; Xiaomi's phone grabs don't require real-name registration). Now what? Per-UID rate limiting at the site layer can no longer stop them.

Layer 3: Interception at the service layer (in any case, don't let requests fall onto the database)

How does the service layer intercept? Big brother, I am the service layer: I know perfectly well that Xiaomi has only 10,000 phones and that a train has only 2,000 tickets. What is the point of sending 100,000 requests to the database? Right: a request queue!

For write requests, put them into a request queue, and each time let only a limited number of write requests through to the data layer (for write business such as placing an order or paying):

    • 10k phones: let only 10k order requests through to the DB;
    • 3k train tickets: let only 3k order requests through to the DB.

If a batch all succeeds, release the next batch; if the inventory is insufficient, all write requests remaining in the queue return "sold out".
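
A simplified, counter-based sketch of this service-layer interception (not a literal batch queue; `placeOrderInDb` is a hypothetical data-layer call):

```typescript
// Service-layer write interception: admit at most `remaining` order
// requests to the data layer; everything beyond that gets "sold out".
type Order = { uid: string };

class OrderGate {
  constructor(private remaining: number) {} // e.g. 10k phones, 3k tickets

  async submit(order: Order, placeOrderInDb: (o: Order) => Promise<void>): Promise<string> {
    if (this.remaining <= 0) return "sold out"; // never reaches the DB
    this.remaining -= 1;
    try {
      await placeOrderInDb(order); // one of the limited writes
      return "ordered";
    } catch {
      this.remaining += 1; // a failed write frees the slot again
      return "order failed, please retry";
    }
  }
}
```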

How to optimize read requests? Resist them with the cache: whether memcached or Redis, a single machine handling 100k requests per second should be no problem. With this rate limiting, only a very small number of write requests, plus the very few read requests that miss the cache, penetrate to the data layer; 99.9% of requests are stopped.
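
For reads, a cache-aside sketch (an in-process Map stands in for memcached/Redis here; the loader callback and TTL are assumptions):

```typescript
// Cache-aside for inventory reads: only cache misses fall through
// to the data layer.
const STOCK_TTL_MS = 1000;
const stockCache = new Map<string, { value: number; expiresAt: number }>();

async function readStock(itemId: string, loadFromDb: (id: string) => Promise<number>): Promise<number> {
  const hit = stockCache.get(itemId);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // 99.9% stop here
  const value = await loadFromDb(itemId); // the rare cache-miss read
  stockCache.set(itemId, { value, expiresAt: Date.now() + STOCK_TTL_MS });
  return value;
}
```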

Of course, there are also some optimizations at the business-rule level. Recall what 12306 does: staggered, time-divided ticket sales. Instead of selling all tickets at 10:00 sharp, they now release a batch every half hour at 8:00, 8:30, 9:00, ... spreading the traffic evenly.

Second, optimize the data granularity. Take the remaining-ticket query: do you really care whether 58 tickets are left, or 26? In fact we only care about "tickets" versus "no tickets". So when traffic is heavy, make a coarse-grained "tickets" / "no tickets" cache.
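
A tiny sketch of the coarse-grained cache idea (the labels are assumptions):

```typescript
// Coarse-grained remaining-ticket cache: users only care whether there
// are tickets at all, so cache "tickets" / "no tickets", not 58 vs 26.
type Availability = "tickets" | "no tickets";

function coarsen(stock: number): Availability {
  return stock > 0 ? "tickets" : "no tickets";
}

// e.g. cache coarsen(stockFromDb) per train instead of the exact count
```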

Third, make some business logic asynchronous: for example, separate the ordering business from the payment business. These optimizations all have to be combined with the business. I have shared this view before: "any architecture design divorced from the business is just hooliganism." Architecture optimization, too, must target the business.

Layer 4: Finally, the database layer

The browser intercepts 80%, the site layer intercepts 99.9% more and does page caching, and the service layer runs the write-request queue and the data cache, so every request that reaches the database layer is controllable. The DB is basically under no pressure and strolls along leisurely; a single machine can carry it. Again, the same sentence: inventory is limited and Xiaomi's production capacity is limited; letting so many requests through to the database is meaningless.

If everything went through to the database: 1,000,000 orders, 0 successes, effective request rate 0%. Let 3k through to the data layer: all succeed, effective request rate 100%.

V. Summary

The above should already describe things very clearly; there is nothing much to summarize. For flash sale systems, let me repeat my two personal architecture optimization ideas:

    1. Intercept requests as far upstream in the system as possible (the further upstream, the better);
    2. For read-heavy, write-light scenarios, use the cache as usual (the cache resists the read pressure).

Browser and app: rate limit. Site layer: rate limit per UID and do page caching. Service layer: queue write requests per the business to control the traffic, and do data caching. Data layer: stroll along. And optimize in combination with the business.

VI. Q&A

Question 1: According to your architecture, the most pressure actually falls on the site layer. Assuming the number of genuinely effective requests reaches 10 million, it is unlikely that you can limit the number of request connections, so how is that part of the pressure handled?

A: The per-second concurrency may not reach 10 million; but suppose it is 10 million, there are two solutions:

    1. The site layer can be scaled out by adding machines; at worst, a thousand machines.
    2. If machines are still not enough, discard requests, say 50% of them (that 50% directly gets "please try again later"). The principle is to protect the system; you must not let all users fail.

Question 2, "control 10w Broiler, have 10w uid in hand, and ask" How to solve the problem, huh?

A: As discussed above: the service layer's write-request queue controls it.

Question 3: Can the frequency-limiting cache also be used for searches? For example, if user A searches for "phone" and user B then searches for "phone", should B preferentially get the cached page generated by A's search?

A: Yes, that works. This method is often used for "dynamic" operational activity pages, for example pushing an activity to 40 million app users in a short time: do page caching.

Question 4: What if queue processing fails? What if the broilers burst the queue?

A: If processing fails, return "order failed" and let the user retry. The cost of the queue is very low; it would be hard to burst it. In the worst case, after a number of requests have been buffered, subsequent requests directly return "no tickets" (with 1,000,000 requests already waiting in the queue, there is no point accepting more).

Question 5: For site-layer filtering, are the per-UID request counts stored separately in each site machine's memory? If so, what happens when the load balancer distributes the same user's requests across different servers in a multi-server cluster? Or should the site-layer filtering be placed in front of the load balancer?

A: They can be kept in memory. It then looks as if each server limits a UID to one request per 5s, while overall (assuming 10 machines) the limit is actually 10 requests per 5s. Solutions:

    1. Tighten the limit (this is the recommended solution, and the simplest);
    2. Do layer-7 balancing at the nginx layer so that requests from one UID land on the same machine as far as possible (see the config sketch below).
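
A sketch of option 2 as an nginx upstream config (assuming the UID travels in a cookie named `uid`; `hash ... consistent` is nginx's built-in consistent hashing, and the server addresses are placeholders):

```nginx
upstream site_layer {
    # hash on the uid cookie so one user's requests land on
    # the same site machine as far as possible
    hash $cookie_uid consistent;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://site_layer;
    }
}
```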

Question 6: For service-layer filtering, is the queue one unified queue for the whole service layer, or one queue per server providing the service? If it is a unified queue, do the requests submitted by each server need lock control before entering the queue?

A: It does not have to be a unified queue; then each service instance lets through an even smaller number of requests (total tickets / number of service instances). That is simple. A unified queue is more complicated.

Question 7: After flash sale payment completes, or when an unpaid order's placeholder is cancelled, how do you update the remaining inventory in time?

A: There is a "not paid" state in the database. If it times out, for example after 45 minutes, the inventory is restored (the well-known "return to stock"). The tip this gives us for grabbing tickets: 45 minutes after the flash sale starts, try again; there may be tickets again.
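
A sketch of this "return to stock" mechanism (the 45-minute hold comes from the answer above; the names are assumptions, and a real system would scan order states in the database rather than keep holds in process memory):

```typescript
// "Return to stock": an unpaid order holds one unit of inventory for
// HOLD_MS; if payment doesn't arrive in time, the unit is restored.
const HOLD_MS = 45 * 60 * 1000;

interface Hold { orderId: string; expiresAt: number; paid: boolean }

class Stock {
  private holds: Hold[] = [];
  constructor(private remaining: number) {}

  placeOrder(orderId: string, now: number = Date.now()): boolean {
    this.expire(now);
    if (this.remaining <= 0) return false; // sold out
    this.remaining -= 1;
    this.holds.push({ orderId, expiresAt: now + HOLD_MS, paid: false });
    return true;
  }

  pay(orderId: string, now: number = Date.now()): boolean {
    this.expire(now);
    const hold = this.holds.find(h => h.orderId === orderId);
    if (!hold) return false; // hold already expired and returned to stock
    hold.paid = true;
    return true;
  }

  private expire(now: number): void {
    this.holds = this.holds.filter(h => {
      if (!h.paid && h.expiresAt <= now) {
        this.remaining += 1; // the well-known "return to stock"
        return false; // drop the expired hold
      }
      return true;
    });
  }
}
```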

Question 8: Different users browsing the same product land on different cache instances and may see completely different inventory. Teacher, how do you keep the cached data consistent, or are dirty reads allowed?

A: With the current architecture design, requests landing on different site machines may get inconsistent data (the page caches differ); this business scenario accepts that. But the real data at the database level is not a problem.

Question 9: Even with the business-level optimization of "3k train tickets, only 3k order requests go to the DB", won't those 3k orders still cause congestion?

A: (1) The database can resist 3k write requests just fine; (2) the data can be split; (3) if 3k cannot be carried, the service layer can control the number of concurrent requests let through, based on load-testing results. 3k is just an example.

Question 10: If backend processing fails at the site layer or service layer, do you need to consider replaying that failed batch of requests, or just throw them away?

A: Do not replay. Return "query failed" or "order failed" to the user; one of the principles of architecture design is "fail fast".

Question 11: For large flash sale systems such as 12306, with many flash sale activities running at the same time, how do you split the traffic?

A: Split vertically by business.

Question 12: A bonus question: is this whole process synchronous or asynchronous? If synchronous, there should be slow feedback in the response. But if asynchronous, how do you ensure the response result is returned to the correct requester?

A: The user level must be synchronous (the user's HTTP request is held open); the service level can be either synchronous or asynchronous.

Question 13: A flash sale follow-up: at which stage is inventory decremented? If inventory is locked when the order is placed, how do you handle a large number of malicious users placing orders to lock inventory without paying?

A: The write request volume at the database level is very low. Fortunately, an order that is not paid times out and the inventory "returns to stock", as mentioned before.

"Turn" Seconds to kill the road of optimization of business architecture

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.