Java Flash Sale (Seckill) Business Architecture: A Design Path
I. Why Are Flash Sales Hard?
In an IM system such as QQ, everyone mostly reads their own data (friends list, group list, personal profile).
In a system like Weibo, each person reads the data of the people they follow: many people read the data of a few people.
A flash sale system has a single inventory record. Everyone reads and writes the same data at the same instant: many users contending over one record.
For example, a Xiaomi flash sale every Tuesday may offer only 10,000 phones, yet the instantaneous incoming traffic can be tens of millions of requests. Likewise with grabbing a 12306 train ticket: tickets are limited, the inventory is one shared record, the instantaneous traffic is enormous, and everyone is reading the same stock. Read/write conflicts and lock contention are severe, and that is exactly what makes the flash sale business hard. So how do we optimize a flash sale architecture?
II. Optimization Directions
There are two optimization directions:
- Intercept requests as far upstream in the system as possible (the further upstream, the better), so that read/write conflicts never reach the database;
- Make full use of caching: flash sales are a typical read-heavy, write-light scenario, and most of the traffic is remaining-inventory queries that caches absorb well.
III. A Common Flash Sale Architecture
A common website architecture basically looks like this (especially for sites with traffic in the hundreds of millions): clients (browser, APP) at the top, then the site layer, then the service layer, and the data layer at the bottom.
The picture is simple, but it captures the skeleton of a high-traffic, high-concurrency flash sale architecture. Keep it in mind.
Next, we explain how to optimize each layer.
IV. Optimization Details at Each Layer
Layer 1: how to optimize the client (browser layer, APP layer)
Let me ask a question: have you ever used "shake" to grab a red envelope? Does every shake really send a request to the backend? Recall the ticket-grabbing scenario: you click the "query" button, the system lags and the progress bar crawls, and as a user you instinctively click "query" again. And again. And again... Does it help? Not at all; it only piles load onto the system for no reason. If a user clicks five times, 80% of the requests are generated this way. How do we solve that?
- At the product level, gray out the button after the user clicks "query" or "buy ticket", to prevent duplicate submissions;
- At the JS level, limit users to one submission every x seconds.
The APP layer can do the same thing: however frantically you shake, only one request every x seconds actually goes to the backend. This is the so-called "intercept requests as far upstream as possible": the further upstream, the better. Blocking at the browser and APP layers intercepts 80%+ of requests. But this only stops ordinary users (fortunately, 99% of users are ordinary users); it cannot stop the expert programmers in the crowd.
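As an illustration, here is a minimal Java sketch of such a client-side throttle; the class and method names are hypothetical, not from the original text. However often the user taps or shakes, at most one request per interval leaves the device:

```java
// Client-side (APP-layer) throttle: swallow taps that arrive inside the
// x-second window so only one request per window reaches the backend.
public class ClientThrottle {
    private final long intervalMillis;
    private long lastSentAt = 0;

    public ClientThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if the request may be sent; false means "drop it silently". */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - lastSentAt < intervalMillis) {
            return false; // still inside the window: do not hit the backend
        }
        lastSentAt = now;
        return true;
    }
}
```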
With Firebug capturing packets, they know exactly what the HTTP request looks like, and JS can never stop a programmer from writing a for loop that calls the HTTP interface directly. How do we handle those requests?
Layer 2: site-layer request interception
How do we intercept them? How do we stop the for-loop callers? Is there anything to deduplicate on? IP? Cookie ID? ... It's messy. This kind of business requires login, so use the uid. At the site layer, count and deduplicate requests per uid; you don't even need unified storage for the counts, node-local memory is enough (it makes the count slightly inaccurate, but it is the simplest option). Letting one request per uid through every 5 seconds blocks 99% of the for-loop requests.
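A minimal sketch of this uid-based site-layer interception, assuming node-local memory and a 5-second window (names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

// Site-layer uid deduplication: keep the last pass-through time per uid in
// local memory and let one request per uid through every 5 seconds. As noted
// above, node-local counting is slightly inaccurate but simplest; a real
// deployment would also evict idle uids to bound memory.
public class UidRateLimiter {
    private static final long WINDOW_MILLIS = 5_000;

    private final ConcurrentHashMap<Long, Long> lastSeen = new ConcurrentHashMap<>();

    /** Returns true if this uid's request may pass the site layer. */
    public boolean allow(long uid) {
        long now = System.currentTimeMillis();
        boolean[] allowed = new boolean[1];
        // compute() is atomic per key, so two concurrent requests from the
        // same uid cannot both take the slot.
        lastSeen.compute(uid, (k, last) -> {
            if (last == null || now - last >= WINDOW_MILLIS) {
                allowed[0] = true;
                return now;   // take the slot
            }
            return last;      // still inside the window: reject
        });
        return allowed[0];
    }
}
```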
What happens to the requests blocked within those 5 seconds? Cache. Page caching: for the same uid, limit the access frequency and serve a cached page; every request reaching the site layer within x seconds gets the same page back. For queries on the same item, such as remaining-ticket counts, do the same: page-cache the result and return the identical page to all requests arriving within x seconds. This kind of throttling keeps the user experience good (no 404 is returned) while keeping the system robust (the page cache intercepts requests at the site layer).
The page cache does not have to guarantee that all site nodes return the same page; it can live directly in each node's own memory. The trade-off is that HTTP requests landing on different nodes may see different ticket data. This is site-layer request interception and cache optimization.
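A minimal sketch of such a node-local page cache, assuming an x-second TTL and a hypothetical renderPage() that stands in for real page assembly:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Site-layer page cache: within the TTL, every request for the same item gets
// the identical pre-rendered page from this node's memory. Different nodes may
// cache different snapshots, which is exactly the trade-off described above.
public class PageCache {
    private static final long TTL_MILLIS = 1_000; // "x seconds"; tune per business

    private static final class Entry {
        final String html;
        final long createdAt;
        Entry(String html, long createdAt) { this.html = html; this.createdAt = createdAt; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public String getPage(String itemId) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(itemId);
        if (e != null && now - e.createdAt < TTL_MILLIS) {
            return e.html;                        // hit: same page for everyone
        }
        // Miss: regenerate. Concurrent threads may regenerate simultaneously
        // right at expiry; harmless for a sketch, avoidable with per-key locking.
        String html = renderPage(itemId);
        cache.put(itemId, new Entry(html, now));
        return html;
    }

    private String renderPage(String itemId) {
        return "<html>item " + itemId + "</html>"; // placeholder page assembly
    }
}
```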
Good: this stops the programmers writing for-loop HTTP requests. But some expert programmers (hackers) control 100,000 zombie machines and hold 100,000 uids (leave the real-name system aside; grabbing a Xiaomi phone requires no real name), and they send requests simultaneously. Now what? The site layer's per-uid throttling cannot block this.
Layer 3: service-layer interception (whatever happens, don't let requests fall through to the database)
How does the service layer intercept? Simple: I am the service layer, and I know that Xiaomi has only 10,000 phones and that a train has only 2,000 tickets. What is the point of passing 100,000 requests on to the database? Right: a request queue!
For write requests, use a request queue and pass only a limited number of write requests through to the data layer at a time (for write paths such as placing orders and paying):
- 10,000 phones: only 10,000 order requests go to the db;
- 3,000 train tickets: only 3,000 order requests go to the db.
If a batch all succeeds, release the next batch; once the inventory runs out, every write request still in the queue immediately returns "sold out".
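One simple way to realize this "let only stock-many writes through" idea is a counting gate in front of the database. This is a sketch of the interception logic, not the batch queue itself, and all names are illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Service-layer write interception: the service knows only `stock` items
// exist, so at most that many order requests ever reach the database; every
// later request gets "sold out" without touching the db at all.
public class OrderGate {
    private final AtomicInteger remainingPasses;

    public OrderGate(int stock) {
        this.remainingPasses = new AtomicInteger(stock);
    }

    public String placeOrder(long uid, String itemId) {
        if (remainingPasses.getAndDecrement() <= 0) {
            remainingPasses.incrementAndGet();  // undo the overshoot
            return "sold out";                  // never reaches the database
        }
        return writeOrderToDb(uid, itemId);     // one of the few let through
    }

    private String writeOrderToDb(long uid, String itemId) {
        // Placeholder for the real insert. If it fails, the pass could be
        // handed back so the next waiting request gets a chance.
        return "order accepted";
    }
}
```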
How do we optimize read requests? Absorb them with cache: whether memcached or redis, a single machine handling 100,000 requests per second should be no problem. With this interception, only a very small number of write requests and a very small number of reads on cache misses ever reach the data layer; 99.9% of requests are blocked.
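A sketch of this read-side caching, assuming the Jedis client for redis (memcached works the same way); in production a connection pool would replace the single connection:

```java
import redis.clients.jedis.Jedis;

// Service-layer read cache: inventory reads are absorbed by redis; only the
// rare miss falls through to the database.
public class StockReader {
    private static final int TTL_SECONDS = 1;   // stale by at most 1 second

    private final Jedis jedis = new Jedis("localhost", 6379); // sketch: no pool

    public int readStock(String itemId) {
        String cached = jedis.get("stock:" + itemId);
        if (cached != null) {
            return Integer.parseInt(cached);    // served from cache
        }
        int stock = readStockFromDb(itemId);    // rare cache miss
        jedis.setex("stock:" + itemId, TTL_SECONDS, String.valueOf(stock));
        return stock;
    }

    private int readStockFromDb(String itemId) {
        return 0; // placeholder for the real SELECT
    }
}
```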
Of course, there are also optimizations in the business rules. Look at what 12306 does now: time-sliced ticket sales. Instead of releasing everything at once, a batch goes on sale every 30 minutes (8:00, 8:30, and so on), spreading the traffic over time.
Second, data-granularity optimization: when you buy a ticket, does the remaining-ticket query really need to show whether 58 or 26 tickets are left? All most users care about is ticket versus no ticket. Under high traffic, serve a coarse-grained "has tickets" / "no tickets" cache.
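Following the same pattern, a coarse-grained variant (it could be added as a method to the StockReader sketch above) caches only the has-tickets flag, which flips far less often than the exact count:

```java
// Coarse-grained availability: cache "1"/"0" instead of the exact number.
public boolean hasTickets(String trainId) {
    String flag = jedis.get("hasTickets:" + trainId);
    if (flag == null) {                              // rare miss
        boolean has = readStockFromDb(trainId) > 0;
        jedis.setex("hasTickets:" + trainId, 5, has ? "1" : "0");
        return has;
    }
    return "1".equals(flag);
}
```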
Third, make some business logic asynchronous: for example, separate the order service from the payment service. These optimizations all follow from the business. I have shared this view before: "architecture design divorced from the business is hooliganism," i.e., meaningless. Optimization, likewise, must be tailored to the business.
Layer 4: the database layer
The browser intercepts 80%, the site layer intercepts another 99.9% and serves page caches, and the service layer runs write-request queues and data caches. Every request that finally reaches the database layer is controlled. The database is under almost no pressure, can handle it at a leisurely stroll, and a single machine is enough. Again: the inventory is limited and Xiaomi's production capacity is limited; passing more requests than that to the database is meaningless.
If all traffic were passed straight to the database, almost no orders could succeed and the request efficiency would be close to 0%. Pass only 3,000 through to the data layer and all of them succeed: request efficiency 100%.
V. Summary
The text above should already be clear, so there is little to summarize. For flash sale systems, let me repeat the two architecture optimization ideas from my own experience:
- Intercept requests as far upstream in the system as possible;
- For read-heavy, write-light data, lean on caching.
Browser and APP: rate limiting. Site layer: uid-based rate limiting plus page caching. Service layer: write-request queues to control traffic, plus business-level data caching. Data layer: a leisurely stroll (almost no pressure). And on top of all this: business-driven optimization.
VI. Q&A
Problem 1: In your architecture, the heaviest pressure actually falls on the site layer. Suppose the number of genuinely valid requests is 10 million; it is hardly possible to limit the request count itself, so how is that pressure handled?
A: The per-second concurrency may never reach 10 million, but suppose it does. There are two solutions:
- The site layer can be scaled out by adding machines; at worst, throw 1,000 machines at it.
- If machines are insufficient, shed load: discard a portion of the requests (say 50%, which immediately get "please try again later"). The principle is to protect the system so that some users succeed, rather than letting all users fail.
Problem2,"Controlled10 wBots in your hand10 wItemsUidAnd send requests at the same time."How can this problem be solved?
A: As discussed above, this is handled by the service layer's write-request queue and traffic control.
Problem 3: Can access-frequency-limited caching also be used for search? For example, user A searches for "mobile phone" and user B then searches for "mobile phone"; can B reuse the cached page generated by A's search?
A: Yes, this works. It is commonly used on "dynamic" operational activity pages, for example when 40 million app pushes drive users to an activity page within a short window; page caching absorbs it.
Problem 4: What if processing a queued request fails? And what if zombie machines burst the queue?
A: If processing fails, return an order failure and let the user retry. The cost of the queue is very low, so bursting it is hard. In the worst case, once enough requests have been buffered, all subsequent requests directly return "no tickets" (with a million requests already waiting in the queue, accepting more is pointless).
Problem 5: If the site layer filters by uid with the request counts stored separately in each node's memory, won't the load balancer distribute the same user's requests across different servers, making the counts wrong? Or should the filtering be moved in front of the load balancer?
A: Yes, it can be stored in node-local memory. As described, the limit then becomes loose: each server allows one request per uid per 5s, so globally (say, 10 servers) that is really 10 requests per 5s. Solutions:
- Tighten the per-node limit to compensate (divide the allowance by the number of nodes);
- Have the load balancer hash on uid (e.g., layer-7 balancing in nginx) so that one uid's requests land on the same machine, making the local count globally correct; a sketch follows.
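A minimal Java sketch of the uid-hash idea (the names are hypothetical): the balancer picks a node from the uid, so each uid's counter lives on exactly one machine:

```java
// Route requests by uid hash so one uid always lands on the same site node,
// making that node's local rate-limit counter globally correct.
public class UidRouter {
    public int pickNode(long uid, int nodeCount) {
        // floorMod keeps the index non-negative even if the hash is negative.
        return Math.floorMod(Long.hashCode(uid), nodeCount);
    }
}
```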
Problem 6: With service-layer filtering, is the queue a single unified queue for the whole service layer, or one queue per service node? If it is a unified queue, do requests submitted by different servers need lock control before entering it?
A: There is no need for a unified queue. Give each service node its own queue, with each node letting through a proportionally smaller number of requests (total tickets / number of service nodes); that is simple. A unified queue is complicated.
Problem 7: If a flash sale order is placed but never paid and the hold is cancelled, how is the remaining inventory promptly updated?
A: Keep an "unpaid" status on the order in the database. If it exceeds the time limit, say 45 minutes, the inventory is restored (the well-known "back to warehouse"). The lesson for ticket grabbers: 45 minutes after a flash sale starts, try again; there may be tickets back in stock.
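A sketch of the "back to warehouse" timer, assuming hypothetical isPaid/restoreStock helpers; a real system would persist the order status in the database and sweep it with a job rather than trust an in-process timer:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// When an order is placed, check 45 minutes later: if it is still unpaid,
// put the stock back on sale.
public class StockReclaimer {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public void onOrderPlaced(long orderId) {
        timer.schedule(() -> {
            if (!isPaid(orderId)) {      // payment deadline missed
                restoreStock(orderId);   // "back to warehouse"
            }
        }, 45, TimeUnit.MINUTES);
    }

    private boolean isPaid(long orderId) { return false; }   // placeholder query
    private void restoreStock(long orderId) { /* UPDATE stock ... */ }
}
```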
Problem 8: Different users browsing the same item may hit different cache instances and see completely different inventory numbers. How do you keep the cached data consistent, or are dirty reads simply allowed?
A: With the current architecture design, requests land on different site nodes, so the data (the page caches) may indeed be inconsistent. That is acceptable in this business scenario; the real data at the database level remains correct.
Problem 9: Even with the business optimization of "3,000 train tickets: only 3,000 order requests pass through to the db", won't those 3,000 simultaneous orders still cause congestion?
A: (1) The database can withstand 3,000 write requests; (2) the data can be split (sharded); (3) if 3,000 is still too many, the service layer can reduce the number of concurrent requests let through, sized by load testing; 3,000 is just an example.
Problem 10: If backend processing fails at the site layer or service layer, should the failed requests be replayed, or simply discarded?
A: Don't replay them. Return "query failed" or "order failed" to the user; one of the architecture design principles is "fail fast".
Problem 11: In a large flash sale system such as 12306, many flash sale activities run at the same time. How do you separate them?
A: Vertical split.
Problem 12: One more question comes to mind: is this whole process synchronous or asynchronous? If synchronous, responses may still be slow. If asynchronous, how do you make sure the correct response reaches the right requester?
A: At the user level it must be synchronous (the user's HTTP request blocks and waits); at the service level it can be either synchronous or asynchronous.
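A sketch of that split, using a CompletableFuture so the user's HTTP thread waits synchronously while the service works asynchronously (the names are illustrative; a real system might hand off via a message queue instead):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Synchronous to the user, asynchronous inside the service: the HTTP handler
// blocks on a future while the order work runs on another executor.
public class OrderFacade {
    public String handleHttpRequest(long uid, String itemId) throws Exception {
        CompletableFuture<String> result =
                CompletableFuture.supplyAsync(() -> processOrder(uid, itemId));
        // The user's request "hangs" here until a result or timeout arrives.
        return result.get(3, TimeUnit.SECONDS);
    }

    private String processOrder(long uid, String itemId) {
        return "order accepted"; // placeholder: queueing + db write
    }
}
```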
Problem 13: At which step is inventory decremented? If placing an order locks inventory, how do you deal with large numbers of malicious users who place orders to lock inventory without ever paying?
A: The volume of write requests at the database level is very low; and fortunately, an unpaid order times out and the stock is "returned to the warehouse", as mentioned earlier.