Source: https://mp.weixin.qq.com/s/pxNRzWs3sZmbr-K18FvnrA
Background
Every system has its core metrics. For example, in the merchant onboarding domain, the first priority is accuracy of the submitted data and the second is processing efficiency; for a clearing and settlement system, the first priority is paying the right amount and the second is paying on time. The system our team is responsible for is the core link of Meituan-Dianping's intelligent payment, carrying 100% of intelligent payment traffic; internally we call it "core transactions". Because it involves the flow of funds between users and all of Meituan-Dianping's offline merchants, for core transactions the first priority is stability, and the second priority is still stability.
Problem statement
As a platform department, our ideal is: in the first phase, rapidly support the business; in the second phase, take control of one direction; in the third phase, watch where the market is going and lead a major direction.
The ideal is rich, but the reality is stark. At the beginning of 2017 we handled hundreds of thousands of orders per day; by the end of the year daily orders had exceeded 7 million, and the system faced enormous challenges. Payment channels kept being added, the call chain grew longer, and system complexity rose accordingly. From the original POS machines to the later QR code products, the small white box, the small black box, instant pay... as the products diversified, the positioning of the system kept changing as well, and the system's response to change was like a tortoise racing a hare.
Because of the rapid growth of the business, accidents would occur out of the blue even when the system had no version upgrades at all, and with increasing frequency. Upgrading the system itself was also difficult: infrastructure upgrades and upstream or downstream upgrades often had a "butterfly effect", affecting us without any warning.
Problem analysis
At its root, the problem of how to achieve high availability is the problem of keeping core transactions stable.
Availability Metrics
In the industry, the standard for high availability is measured in terms of system downtime:
Because the industry standard is an after-the-fact metric, and considering how useful the number is for guiding day-to-day work, we usually use our service governance platform Octo to measure availability. The calculation method is:
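Octo's exact formula is not reproduced here; a typical interface-level definition of this kind of availability is: availability = (number of successful requests ÷ total number of requests) × 100%, computed per interface over a statistics window.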
Availability decomposition
Two key metrics are commonly used in the industry for system reliability:
Mean Time Between Failures (MTBF): the average time the system runs normally before a failure occurs.
Mean Time To Repair (MTTR): the average time needed to bring the system from a failed state back to a working state.
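Together these give the familiar steady-state relationship: availability = MTBF / (MTBF + MTTR). Raising availability therefore means both making failures rarer (a larger MTBF) and shortening them (a smaller MTTR).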
For core transactions, the ideal is availability with no failures at all. When a failure does occur, the factors that determine its impact include not only how long it lasts but also how far it reaches. The availability of core transactions is therefore broken down as follows:
Problem solving
1. Keep the failure frequency low: when others fail, we must not fail
1.1 Eliminate dependencies, weaken dependencies, and control dependencies
Let's use the STAR method to describe a scenario:
Situation
We are going to design a system A that does the following: using our Meituan-Dianping POS machines, payments are made to banks through system A, and we run promotions such as spend-and-save discounts and paying with loyalty points.
Task
Analyze the explicit and implicit requirements for system A:
1> Receive parameters from upstream, including business information, user information, device information, and promotion information.
2> Generate an order number and write the order information to the database.
3> Encrypt sensitive information.
4> Call the downstream bank's interface.
5> Support refunds.
6> Synchronize order information to other parties such as points redemption.
7> Provide an interface for merchants to view their orders.
8> Provide settlement of the collected funds to merchants.
Based on these requirements, analyze how to keep the most core link, "paying with the POS machine", stable.
Action
Analysis: requirements 1 to 4 form the essential path of a payment and can be handled in one subsystem, which we call the collection subsystem. Requirements 5 to 8 are relatively independent; each can be built as its own subsystem, depending on the number of developers, maintenance cost, and so on.
It is worth noting that requirements 5 to 8 have no functional dependency on the collection subsystem, only a data dependency: they all depend on the order data it generates.
The collection subsystem is the core of the whole system and its stability requirements are very high. The other subsystems may fail, but the collection subsystem must not be affected.
Based on this analysis, we need to decouple the collection subsystem from the other subsystems and hand the data over to the other systems in one unified place. We call this the data subscription and forwarding subsystem; the only requirement is that it must not affect the stability of the collection subsystem.
The rough diagram is as follows:
Result
As the figure shows, there is no direct dependency between the collection subsystem and the refund, settlement, information synchronization, and order viewing subsystems, which achieves the effect of eliminating dependencies. The collection subsystem does not need to depend on the data subscription and forwarding subsystem, while the data subscription and forwarding subsystem does need the collection subsystem's data. So we control the direction of the dependency: the data subscription and forwarding subsystem pulls data from the collection subsystem, rather than having the collection subsystem push data to it. That way, if the data subscription and forwarding subsystem goes down, the collection subsystem is unaffected.
There is also a choice in how the data subscription and forwarding subsystem pulls the data. For example, if the data lives in a MySQL database, it can be pulled by subscribing to the binlog. If a message queue is used to transfer the data, we take on a dependency on the message queue middleware. If we then design a disaster-recovery scheme, say, falling back to a direct RPC call to transfer the data when the message queue is down, we have achieved the effect of weakening the dependency on that message queue.
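A minimal sketch of that weakened dependency, in Java. The MessageQueueClient and RpcClient interfaces and the topic name are hypothetical placeholders rather than the actual middleware APIs used here; the point is only that a queue failure falls back to a direct RPC call, so the queue stays a weak dependency of the forwarding link.

    // Hypothetical client interfaces; real middleware APIs will differ.
    interface MessageQueueClient { void send(String topic, String payload) throws Exception; }
    interface RpcClient { void push(String payload); }

    public class OrderDataPublisher {
        private final MessageQueueClient mq;
        private final RpcClient rpc;

        public OrderDataPublisher(MessageQueueClient mq, RpcClient rpc) {
            this.mq = mq;
            this.rpc = rpc;
        }

        // Normal path: hand the order data to the message queue asynchronously.
        // Disaster-recovery path: if the queue is down, transfer the data by direct RPC.
        public void publish(String orderPayload) {
            try {
                mq.send("order-data", orderPayload);
            } catch (Exception queueUnavailable) {
                rpc.push(orderPayload);
            }
        }
    }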
1.2 Transactions do not contain external calls
External calls include calls to external systems and calls to underlying components. An external call has an uncertain return time; including one inside a transaction inevitably produces a large transaction. A large database transaction can prevent other requests from obtaining database connections, leaving every service that depends on that database stuck waiting, filling the connection pool, and taking multiple services down at once. If this is not handled well, the risk rating is five stars. The following figure shows how uncontrollable external call time can be:
Solutions:
Review the code of each system for time-consuming operations inside transactions, such as RPC calls, HTTP calls, message queue operations, cache access, and loops of queries. These should be moved outside the transaction; ideally, a transaction handles nothing but database operations (see the sketch after this list).
Add monitoring and alerting for large transactions, so that when one occurs we receive push messages and SMS reminders. For database transactions we generally alert at three levels: over 1 s, over 500 ms, and over 100 ms.
Configure transactions with annotations rather than XML. With XML configuration, first, readability is poor; second, the pointcuts tend to be configured too broadly, which easily leads to oversized transactions; third, nested rules are hard to handle.
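A minimal Spring sketch of the first and third points, assuming @Transactional annotations and hypothetical OrderDao and PointsRpcService beans: the transaction wraps only database writes, and the external RPC call runs after the transaction has committed (imports shown once for brevity).

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    // OrderStore.java
    @Service
    public class OrderStore {
        @Autowired
        private OrderDao orderDao;   // hypothetical DAO: database access only

        @Transactional   // the transaction contains nothing but database operations
        public void saveOrder(Order order) {
            orderDao.insertOrder(order);
            orderDao.insertPayRecord(order);
        }
    }

    // PaymentService.java
    @Service
    public class PaymentService {
        @Autowired
        private OrderStore orderStore;
        @Autowired
        private PointsRpcService pointsRpc;   // hypothetical external RPC client

        public void pay(Order order) {
            orderStore.saveOrder(order);      // the short transaction commits here
            pointsRpc.notifyPoints(order);    // external call stays outside the transaction
        }
    }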
1.3 Set reasonable timeouts and retries
We depend on external systems and on infrastructure components such as caches and message queues. If one of these dependencies suddenly has a problem, our system's response time becomes: internal processing time + dependency timeout × number of retries. If the timeout is set too long or there are too many retries, the system will not respond for a long time, the connection pool may fill up, and the system dies; if the timeout is set too short, 499 errors increase and the system's availability drops.
As an example:
Service A relies on data from two other services to complete an operation. Normally there is no problem, but if service B's response time becomes much longer, or it even stops serving, without your knowledge, and your client-side timeout is set too long, then the response time of the whole request becomes longer as well. If an incident hits at that moment, the consequences are serious.
Java servlet containers, whether Tomcat or Jetty, use a multi-threaded model in which a worker thread handles each request. The number of worker threads has a configurable upper limit; once requests occupy all the worker threads, the remaining requests go into a waiting queue. The waiting queue also has a limit, and once it is full the web server rejects requests and Nginx returns 502. If your service has high QPS, then in this kind of scenario your service will be dragged down too. And if your upstream has not set reasonable timeouts either, the failure keeps spreading upward. This process of fault amplification is the service avalanche effect.
Solutions:
First, find out what timeout the dependency itself uses when calling its own downstream; the caller's timeout should be greater than the time the dependency spends on its downstream calls.
Measure the 99th-percentile response time of the interface and set the timeout to that value plus 50%. If the interface depends on a third party and the third party fluctuates a lot, you can also base it on the 95th-percentile response time.
As for the retry count: if the service is of high importance, retry three times by default; otherwise, do not retry (a minimal sketch follows).
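An illustrative sketch of those two rules (timeout = 99th-percentile response time plus 50%, at most three retries for important calls). The RemoteCall interface and the p99 value passed in are assumptions for illustration, and retries should of course only be used for idempotent operations.

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class TimeoutRetryTemplate {
        interface RemoteCall<T> { T invoke() throws Exception; }

        private final ExecutorService pool = Executors.newCachedThreadPool();

        public <T> T call(RemoteCall<T> call, long p99Millis, int maxRetries) throws Exception {
            long timeoutMillis = p99Millis + p99Millis / 2;   // p99 + 50%
            Exception last = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {   // maxRetries = 3 or 0
                Future<T> future = pool.submit(call::invoke);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true);   // free the worker thread before the next attempt
                    last = e;
                }
            }
            throw last;
        }
    }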
1.4 Resolve slow queries
Slow queries reduce an application's response performance and concurrency. As business volume grows, the database's CPU utilization soars, until the database stops responding and restarting it is the only remedy. For more on slow queries, see our earlier technical blog post "MySQL indexing principles and slow query optimization".
Solutions:
Divide queries into real-time, near-real-time, and offline queries. Only real-time queries go through to the database; the others do not. You can use Elasticsearch to build a query center that handles near-real-time and offline queries.
Separate reads and writes: writes go to the primary, reads go to the replicas (see the routing sketch after this list).
Optimize indexes. Too many indexes hurt the database's write performance; too few make queries slow. Our DBAs recommend no more than four indexes per table.
Do not allow huge tables. Once a single MySQL table reaches tens of millions of rows, efficiency starts to drop sharply.
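For the read/write separation point, here is a minimal routing sketch using Spring's AbstractRoutingDataSource; the "primary"/"replica" keys and the ThreadLocal switch are illustrative assumptions, and the actual target DataSources must be configured elsewhere.

    import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

    public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
        private static final ThreadLocal<Boolean> READ_ONLY = ThreadLocal.withInitial(() -> false);

        public static void markRead()  { READ_ONLY.set(true); }    // route reads to the replica
        public static void markWrite() { READ_ONLY.set(false); }   // route writes to the primary
        public static void clear()     { READ_ONLY.remove(); }

        @Override
        protected Object determineCurrentLookupKey() {
            // Keys must match the target DataSources registered via setTargetDataSources(...)
            return READ_ONLY.get() ? "replica" : "primary";
        }
    }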
1.5 Circuit breaking
When a dependent service is unavailable, the caller should use technical means to degrade and still provide a reduced service upward, so that the business stays flexibly available. Without circuit breaking, if any downstream service on the call chain fails, whether because a code defect went live, a network problem, call timeouts, a traffic surge from a promotion, or insufficient service capacity, other businesses at the access layer may become unusable. The following fishbone diagram analyzes the effects of having no circuit breaking:
Solutions:
Automatic circuit breaking: you can use Netflix's Hystrix, or Rhino, developed in-house at Meituan-Dianping, to fail fast (see the sketch after this list).
Manual circuit breaking: once you confirm that a downstream payment channel is jittery or unavailable, you can manually close that channel.
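A minimal Hystrix sketch of the automatic case. ChannelClient, PayRequest, and PayResult are hypothetical stand-ins for a downstream payment channel; when the call fails, times out, or the circuit is open, execute() returns the fallback instead of blocking the caller.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class PayChannelCommand extends HystrixCommand<PayResult> {
        private final ChannelClient channel;   // hypothetical downstream channel client
        private final PayRequest request;

        public PayChannelCommand(ChannelClient channel, PayRequest request) {
            super(HystrixCommandGroupKey.Factory.asKey("PayChannel"));
            this.channel = channel;
            this.request = request;
        }

        @Override
        protected PayResult run() throws Exception {
            return channel.pay(request);       // the real call, protected by the circuit breaker
        }

        @Override
        protected PayResult getFallback() {
            return PayResult.degraded();       // fast failure / graceful degradation
        }
    }

    // Usage: PayResult result = new PayChannelCommand(channel, request).execute();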
2. Keep the failure frequency low: do not court death ourselves
Not courting death comes down to two points: first, do not ask for trouble; second, do not die.
2.1 Do not ask for trouble
On not asking for trouble, I have summarized the following seven points:
1> Do not be the guinea pig: use only mature technologies, and do not let problems in the technology itself affect the stability of the system.
2> Keep responsibilities single: do not weaken or crowd out the ability to fulfill the most important responsibility by coupling in other responsibilities.
3> Standardize processes: reduce the impact of human factors.
4> Automate processes: let the system run more efficiently and more safely.
5> Keep redundant capacity: to cope with users switching to our system when a competing system is unavailable, with big promotions, and so on, and for disaster-recovery reasons, guarantee at least twice the needed capacity.
6> Refactor continuously: continuous refactoring is an effective way to keep code from becoming something nobody dares to touch.
7> Patch vulnerabilities promptly: Meituan-Dianping has a security vulnerability management mechanism that reminds and urges every department to fix security vulnerabilities in time.
2.2 Do not die
There are five famously "undying" creatures on Earth: the tardigrade ("water bear"), which can shut down its metabolism under harsh conditions; the immortal jellyfish, which can revert to a juvenile state; the clam, which recuperates inside its hard shell; the planarian ("vortex worm"), which survives everywhere from water to land and even as a parasite; and the rotifer, with its ability to lie dormant. What their shared traits mean in the field of system design is fault tolerance. "Fault tolerance" here means the system's ability to tolerate faults, that is, the ability to keep completing the specified procedure when a failure occurs. To be precise, fault tolerance means tolerating faults, not tolerating errors.
3. Keep the failure frequency low: do not be killed by others
3.1 Rate limiting
In an open network environment, externally facing systems often receive large amounts of intentional or unintentional malicious traffic, such as DDoS attacks and abusive bursts of user requests. And although our upstream teammates are elites, we still have to protect ourselves rather than be hurt by an upstream oversight; after all, nobody can guarantee that a colleague will never write code that retries without limit when the downstream response is not what it expects. If these huge volumes of internal and external calls are not guarded against, they tend to spread to the backend services and may eventually bring the backend down. The following problem-tree diagram analyzes the effects of having no rate limiting:
Solutions:
A reasonably accurate maximum QPS can be determined by load-testing the service side's business performance.
The three most commonly used rate limiting algorithms are token bucket, leaky bucket, and counters. They can be implemented with Guava's RateLimiter: SmoothBursty is based on the token bucket algorithm, and SmoothWarmingUp is based on the leaky bucket algorithm (a minimal sketch follows this list).
On the core transactions side we use Octo, Meituan-Dianping's service governance platform, to throttle Thrift traffic. It supports per-interface quotas, single-machine and cluster quotas, quotas for specified consumers, a test mode, and timely alert notifications. In test mode it only alerts and does not actually throttle; with test mode turned off, requests exceeding the rate limit threshold are rejected by throwing an exception. The rate limiting policy can be switched off at any time.
You can also use Netflix's Hystrix or Meituan-Dianping's own Rhino for specific, targeted rate limiting.
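A minimal sketch with Guava's RateLimiter, as mentioned above. The 5 000 QPS threshold is an illustrative assumption; a real threshold should come from load testing.

    import com.google.common.util.concurrent.RateLimiter;

    public class ApiRateLimiter {
        // RateLimiter.create(qps) builds the bursty (token-bucket-style) limiter;
        // RateLimiter.create(qps, warmupPeriod, unit) builds the warming-up variant.
        private final RateLimiter limiter = RateLimiter.create(5000.0);

        public boolean tryHandle(Runnable handler) {
            if (limiter.tryAcquire()) {   // non-blocking: reject rather than queue up
                handler.run();
                return true;
            }
            return false;                 // over the limit: fail fast with a friendly error
        }
    }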
4. Keep the failure scope small: isolation
Isolation means separating systems or resources so that when a failure occurs, its propagation and impact are limited.
Principles of physical server isolation
① Internal vs. external: treat internal systems and the open platform differently.
② Internal isolation: isolate at the physical server level by channel from upstream to downstream, and consolidate low-traffic services.
③ External isolation: isolate by channel so that channels do not affect one another.
Thread Pool Resource Isolation
Hystrix uses the command pattern to wrap each type of business request in a corresponding command. Each command is mapped to a thread pool, and the created thread pools are kept in a ConcurrentHashMap.
Note: although thread pools provide thread-level isolation, the client code must still set timeouts; it cannot be allowed to block indefinitely, or the thread pool will become saturated.
Semaphore Resource Isolation
Developers can use Hystrix to limit the maximum number of concurrent requests the system makes to a dependency, which is essentially a rate limiting strategy. Every call to the dependency checks whether the semaphore limit has been reached and is rejected if it has.
5. Recover from failures quickly: fast detection
Detection can happen beforehand, during an incident, and afterwards. The main beforehand means are load testing and failure drills; the main during-incident means are monitoring and alerting; the main afterwards means is data analysis.
5.1 Full-link load testing in production
Is your system suitable for full-link load testing? In general, it applies in the following scenarios:
① The call chain is long, has many links, and the service dependencies are complex, so full-link load testing can locate problems faster and more accurately.
② There is complete monitoring and alerting, so the test can be stopped at any time if problems arise.
③ There are obvious business peaks and troughs; even if problems occur during the trough, the impact on users is relatively small.
The main purposes of full-link load testing are:
① Understand the processing capacity of the whole system.
② Find performance bottlenecks.
③ Verify that the rate limiting, degradation, circuit breaking, and alerting mechanisms behave as expected, and use the resulting data to adjust their thresholds and other settings.
④ Verify whether the system behaves as expected at the business peak after a release.
⑤ Verify that the system's dependencies are as expected.
A simple way to implement full-link load testing:
① Collect online log data and replay it as traffic; to keep it isolated from real data, some fields need to be offset.
② Color the test data; middleware can be used to obtain and propagate the traffic labels (see the sketch after this list).
③ Shadow tables can be used to isolate the test traffic's data, but watch the disk space; if less than 70% of the disk remains free, isolate the traffic some other way.
④ External calls may need to be mocked. One implementation is to have the mock service generate random delays that follow the response time distribution of the real external calls.
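An illustrative sketch (not Meituan-Dianping's actual implementation) of steps ② and ③: a per-request flag marks coloured load-test traffic, and writes from coloured traffic are routed to shadow tables. In practice the flag would be propagated across services through RPC or message queue middleware rather than set by hand.

    public final class StressTestContext {
        private static final ThreadLocal<Boolean> COLOURED = ThreadLocal.withInitial(() -> false);

        public static void mark(boolean isStressTraffic) { COLOURED.set(isStressTraffic); }
        public static boolean isStressTraffic()          { return COLOURED.get(); }
        public static void clear()                       { COLOURED.remove(); }

        // Route writes from coloured traffic to a shadow table,
        // e.g. "t_order" becomes "t_order_shadow". Table names are illustrative.
        public static String resolveTable(String table) {
            return isStressTraffic() ? table + "_shadow" : table;
        }
    }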
As for the load testing tool, the core transactions side uses pTest, developed in-house at Meituan-Dianping.
6. Recover from failures quickly: fast localization
Localization requires reliable data. "Reliable" means closely related to the problem at hand; irrelevant data creates blind spots and gets in the way. So for logging, we need a concise logging specification. In addition, system monitoring, business monitoring, component monitoring, and real-time analysis and diagnosis tools are all effective aids for localization.
7. Recover from failures quickly: fast resolution
Resolving a problem presupposes that it has been detected and located, and the speed of resolution also depends on whether the fix is automated, semi-automated, or manual. Core transactions aims to build a highly available system, and our slogan is: "Do not reinvent the wheel; use good wheels." We built an integrated platform whose mission is: "Focus on the high availability of core transactions, and do it better, faster, and more efficiently."
Meituan-Dianping has a great many systems and platforms that can be used for detection, localization, and handling, but if every step requires opening or logging in to a different system, the speed of resolution suffers. So we integrate them, letting problems be solved in one stop. An example of the desired effect is shown below:
Tools Introduction
Hystrix
Hystrix implements the circuit breaker pattern to monitor failures. When the circuit breaker detects that calls to an interface have repeatedly been left waiting for a long time, it applies a fast-fail strategy and returns an error response directly, preventing blocking. Here we focus on Hystrix's thread pool resource isolation and semaphore resource isolation.
Thread Pool Resource Isolation
Advantages
A thread pool completely isolates third-party code, and the request thread can be released quickly.
When a failed dependency becomes healthy again, the thread pool clears and recovers immediately, rather than requiring a long recovery period.
It can fully simulate asynchronous calls, which makes asynchronous programming convenient.
Disadvantages
The main drawback of thread pools is the added CPU overhead: executing each command involves queuing (a SynchronousQueue is used by default to avoid queuing), scheduling, and context switching.
It adds complexity for code that relies on thread state, such as ThreadLocal, which must be passed along and cleaned up manually. (Within Netflix, the overhead of thread isolation is considered small enough not to cause significant cost or performance impact.)
Semaphore Resource Isolation
Developers can use Hystrix to limit the maximum concurrency of calls to a dependency. This is essentially a rate limiting strategy: every call to the dependency checks whether the semaphore limit has been reached and is rejected if it has.
Advantages
No new threads are created to execute commands, which reduces context switching.
Disadvantages
The circuit breaker cannot be configured; every call will still try to acquire the semaphore.
Comparing thread pool resource isolation and semaphore resource isolation
With thread isolation, the command is run on a separate thread unrelated to the main (request) thread; with semaphore isolation, the operation runs on the same thread as the request.
Semaphore isolation can also limit concurrent access and prevent blocking from spreading; the biggest difference from thread isolation is that the thread executing the dependent code is still the request thread.
Thread pool isolation suits third-party applications or interfaces and isolation under heavy concurrency; semaphore isolation suits internal applications or middleware, and scenarios where the concurrency is not very large.
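A minimal sketch of switching one Hystrix command from the default thread pool isolation to semaphore isolation, for the kind of fast in-process work described above; the command body and the limit of 50 concurrent requests are illustrative assumptions.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;

    public class LocalCacheCommand extends HystrixCommand<String> {
        public LocalCacheCommand() {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("LocalCache"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationStrategy(
                                    HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE)
                            .withExecutionIsolationSemaphoreMaxConcurrentRequests(50)));
        }

        @Override
        protected String run() {
            return "cached-value";   // fast, in-process work is what suits semaphore isolation
        }
    }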
Rhino
Rhino is a stability protection component developed and maintained by Meituan-Dianping's infrastructure team. It provides fault simulation, degradation drills, service circuit breaking, service rate limiting, and other features. Compared with Hystrix:
It has built-in instrumentation through CAT (Meituan-Dianping's open-source monitoring system; see our earlier blog post "An in-depth analysis of the open-source distributed monitoring system CAT"), which makes alerting on service exceptions convenient.
It integrates with the configuration center and supports dynamic parameter changes, such as forcing a circuit open or modifying the failure rate threshold.