According to Martin Fowler, a pioneer of the microservice architecture, distributed transactions should be avoided as much as possible in microservice architectures. In some domains, however, distributed transactions are simply unavoidable.
In engineering practice, the discussion of distributed transactions focuses mainly on strong-consistency and eventual-consistency solutions.
Typical solutions include:
- Two-phase commit (2PC).
- The eBay event queue solution.
- TCC compensation mode.
- Eventual consistency of cached data.
Consistency Theory
The purpose of distributed transactions is to ensure data consistency across database shards. Cross-database transactions face various uncontrollable problems, such as the permanent downtime of individual nodes, so ACID guarantees like those of single-node transactions cannot be expected.
In addition, the well-known CAP theorem tells us that in a distributed system, data consistency, system availability, and partition tolerance must be weighed against one another.
The two-phase commit protocol (2PC) is a classic solution for implementing distributed transactions, but 2PC scales poorly and is expensive to apply in a distributed architecture. Dan Pritchett, an architect at eBay, therefore proposed the BASE theory to solve data consistency problems in large-scale distributed systems.
BASE tells us that system scalability can be gained by giving up strong consistency at every moment.
01. CAP Theory
In a distributed system, consistency, availability, and partition tolerance cannot all be satisfied at the same time; at most two of the three can be met, and partition tolerance is indispensable.
- Consistency: whether the data on multiple nodes is strongly consistent in a distributed environment.
- Availability: the distributed service is always available; when a user sends a request, the service can return a result within a bounded time.
- Partition tolerance: the system continues to operate correctly even when the network is partitioned.
Example: Cassandra, Dynamo, and similar systems prefer AP by default and weaken C; HBase, MongoDB, and similar systems prefer CP by default and weaken A.
02. BASE Theory
Core Ideas:
- Basically Available: when a distributed system fails, a partial loss of availability is allowed so that core functions remain available.
- Soft State: a distributed system is allowed to have intermediate states that do not affect its overall availability.
- Eventual Consistency: all replicas in the distributed system become consistent after some period of time.
Consistency Model
Data Consistency models can be divided into the following three types:
- Strong consistency: after an update succeeds, the data in all replicas is consistent at any time; this is generally achieved with synchronous replication.
- Weak consistency: after an update succeeds, the system does not promise that the latest written value can be read immediately, nor how long it will take before it can be read.
- Eventual consistency: a form of weak consistency. After an update succeeds, the system does not promise to return the latest written value immediately, but it guarantees that the value of the last update is eventually returned.
The Quorum NRW model (N replicas, W write acknowledgments, R read acknowledgments) can be used to analyze whether a distributed system's data is strongly, weakly, or eventually consistent.
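For intuition, here is a minimal sketch of the quorum rule (the helper function and its name are illustrative, not from any particular system): if the write quorum W and read quorum R must overlap (W + R > N), every read is guaranteed to see the latest write; otherwise only weak or eventual consistency holds.

```python
def consistency_level(n: int, w: int, r: int) -> str:
    """Classify a replicated store by its quorum parameters.

    n: total number of replicas
    w: replicas that must acknowledge a write
    r: replicas that must be consulted on a read
    """
    if w + r > n:
        # The read and write quorums always overlap, so every read
        # reaches at least one replica holding the latest write.
        return "strong consistency"
    # Quorums can miss each other; reads may return stale values
    # until replication converges.
    return "weak/eventual consistency"

print(consistency_level(3, 2, 2))  # strong consistency
print(consistency_level(3, 1, 1))  # weak/eventual consistency
```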
Distributed Transaction Solutions
01. 2PC Solution - Strong Consistency
The core principle of 2PC is to commit in stages and to log the status of each phase. After a component crashes and restarts, the log can be used to recover the phase the transaction commit was in and to retry from that point.
For example, after the coordinator restarts, it can use the log to determine whether the commit was in the prepare or the prepare-all state. If it was in prepare, some nodes may not have prepared successfully, or all nodes prepared successfully but the commit had not yet been issued; after recovering this state, the coordinator sends rollback to all nodes.
If it was in the prepare-all state, the coordinator must issue commit to all nodes, and the database nodes must guarantee that commit is idempotent.
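The recovery logic described above can be sketched as follows. This is a minimal illustration, assuming hypothetical participant objects with prepare/commit/rollback methods and a durable log object; it is not a production 2PC implementation.

```python
from enum import Enum

class Phase(Enum):
    PREPARE = "prepare"          # prepare sent, not all replies received
    PREPARE_ALL = "prepare_all"  # every participant acknowledged prepare
    DONE = "done"

def two_phase_commit(participants, log) -> bool:
    """Coordinator side of 2PC; log.write is assumed to be durable."""
    log.write(Phase.PREPARE)
    if all(p.prepare() for p in participants):
        log.write(Phase.PREPARE_ALL)
        for p in participants:
            p.commit()           # participants must make commit idempotent
        log.write(Phase.DONE)
        return True
    for p in participants:
        p.rollback()
    return False

def recover(participants, log) -> None:
    """Replay the last logged phase after a coordinator restart."""
    if log.last() == Phase.PREPARE:
        # Some participants may not have prepared: roll everyone back.
        for p in participants:
            p.rollback()
    elif log.last() == Phase.PREPARE_ALL:
        # Everyone prepared: the only safe action is to re-issue commit;
        # idempotence makes the retry harmless.
        for p in participants:
            p.commit()
```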
Three problems with the 2PC solution:
- Synchronization blocking.
- Data inconsistency.
- Single point of failure.
The upgraded 3PC solution aims to solve these problems with two major improvements:
- A timeout mechanism is added.
- A preparation (pre-commit) phase is inserted between the two phases.
However, three-phase commit still has defects. To completely avoid data inconsistency at the protocol level, consensus algorithms such as Paxos or Raft can be used.
02. eBay Event Queue Solution - Eventual Consistency
Dan Pritchett, an architect at eBay, described a solution to eBay's distributed-system consistency problems in "BASE: An ACID Alternative," a paper that explains the principles of BASE.
Its core idea is to execute tasks that require distributed processing asynchronously via messages or logs. The messages or logs can be stored in local files, databases, or message queues, and they are retried upon failure according to business rules. This requires the interface of each service to be idempotent.
The scenario involves a user table and a transaction table. The user table stores user information along with total sales and total purchase amounts; the transaction table stores the serial number, buyer, seller, and amount of each transaction. When a trade occurs, a record must be inserted into the transaction table and the amounts in the user table must be updated.
The solution proposed in the paper is to insert the transaction-table record and enqueue the user-table update message within a single local transaction. To avoid consuming a user-table update message more than once, an operation record table, updates_applied, is added to record information about the transactions that have already been applied.
The core of this solution is the retried, idempotent execution of the second stage: retry upon failure. This compensation mechanism is the key to ensuring eventual consistency of the system.
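The pattern can be sketched as follows, assuming a single SQLite database for brevity (in the paper the tables may live on different shards, which is exactly why the queue is needed); the table and column names are illustrative, except for updates_applied, which comes from the paper.

```python
import sqlite3

def record_trade(db: sqlite3.Connection, trade_id: str,
                 seller: str, buyer: str, amount: int) -> None:
    """Stage 1: insert the trade and enqueue the user-table update
    inside ONE local transaction, so they succeed or fail together."""
    with db:  # sqlite3 connection as context manager = local transaction
        db.execute(
            "INSERT INTO transactions (trade_id, seller, buyer, amount) "
            "VALUES (?, ?, ?, ?)", (trade_id, seller, buyer, amount))
        db.execute(
            "INSERT INTO message_queue (trade_id, seller, buyer, amount) "
            "VALUES (?, ?, ?, ?)", (trade_id, seller, buyer, amount))

def apply_user_update(db: sqlite3.Connection, trade_id: str,
                      seller: str, buyer: str, amount: int) -> None:
    """Stage 2: consume a queued message idempotently; on failure the
    message stays in the queue and the consumer retries later."""
    with db:
        if db.execute("SELECT 1 FROM updates_applied WHERE trade_id = ?",
                      (trade_id,)).fetchone():
            return  # already applied; a redelivered message is skipped
        db.execute("UPDATE users SET total_sold = total_sold + ? "
                   "WHERE name = ?", (amount, seller))
        db.execute("UPDATE users SET total_bought = total_bought + ? "
                   "WHERE name = ?", (amount, buyer))
        db.execute("INSERT INTO updates_applied (trade_id) VALUES (?)",
                   (trade_id,))
        db.execute("DELETE FROM message_queue WHERE trade_id = ?",
                   (trade_id,))
```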
03. TCC (Try-Confirm-Cancel) Compensation Mode - Eventual Consistency
Consider a microservice system composed of services A, B, C, and D, where service A must call services B, C, and D in sequence to complete an operation.
If service A's call to service D fails, then to keep the data consistent across the whole system, the invoke operations already performed on services B and C must be rolled back by executing the corresponding reverse (revert) operations. Once the rollback succeeds, the data of the entire microservice system is consistent again.
Three key elements of the implementation (a sketch follows the list):
- The service call chain must be recorded.
- Each service must provide a set of operations with the opposite business logic for compensation, and the rollback operations must be idempotent.
- Different rollback policies must be executed depending on the cause of failure.
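A minimal sketch of the compensation flow for the A → B → C → D chain above (the invoke/revert interface is hypothetical; a real TCC framework also persists the call chain so that compensation survives crashes):

```python
def run_with_compensation(steps):
    """Execute (invoke, revert) pairs in order; on failure, run the
    reverts of the completed steps in reverse order.

    Each revert must be idempotent so that retried rollbacks stay safe.
    """
    completed = []
    try:
        for invoke, revert in steps:
            invoke()                 # forward operation
            completed.append(revert)
    except Exception:
        for revert in reversed(completed):
            revert()                 # compensating operation
        raise

# Hypothetical usage for service A calling B, C, D in sequence:
# run_with_compensation([(service_b.invoke, service_b.revert),
#                        (service_c.invoke, service_c.revert),
#                        (service_d.invoke, service_d.revert)])
```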
Two difficulties:
- The compensation mode is simple to implement but hard to generalize, especially the recording of service call chains, because business parameters and business logic vary widely from case to case.
- Many business constraints make it impossible for a service to provide a safe rollback operation.
04. Eventual Consistency of Cached Data
In business systems, a cache (Redis or Memcached) is usually placed in front of the database as a buffer for reads, so that read I/O does not hit the database directly.
Take a product details page as an example. If the seller modifies the product information and writes it back to the database, the details page may still display outdated data obtained from the cache, leading to inconsistency between the cache and the database.
To resolve the inconsistency between cache and database in this scenario, there are two common approaches (sketched in code after the list):
- Set an expiration time on cached data. When a cache entry expires, the business system fetches the data from the database and puts the new value into the cache. The expiration time is the maximum window of inconsistency the system tolerates before reaching eventual consistency.
- Clear the cache after updating the database. After the database is updated, the corresponding cache entry is deleted, so the next read of the product details fetches fresh data from the database and repopulates the cache.
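Both approaches fit the cache-aside pattern, sketched here assuming the redis-py client; load_product_from_db and update_product_in_db are hypothetical database helpers.

```python
import json
import redis  # assumes the redis-py client is installed

r = redis.Redis()
TTL_SECONDS = 60  # the longest staleness the business tolerates

def get_product(product_id: str) -> dict:
    """Read path: serve from the cache, fall back to the database."""
    cached = r.get(f"product:{product_id}")
    if cached is not None:
        return json.loads(cached)
    product = load_product_from_db(product_id)   # hypothetical helper
    # Approach 1: the expiration time bounds the inconsistency window.
    r.set(f"product:{product_id}", json.dumps(product), ex=TTL_SECONDS)
    return product

def update_product(product_id: str, fields: dict) -> None:
    """Write path: update the database first, then invalidate the cache."""
    update_product_in_db(product_id, fields)     # hypothetical helper
    # Approach 2: delete rather than rewrite the entry, so the next
    # read repopulates the cache from the fresh database value.
    r.delete(f"product:{product_id}")
```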
Recommendations
When facing data consistency problems, first determine, from the perspective of business requirements, how acceptable each of the three consistency models is, and then choose a solution based on the specific scenario.
From the application's perspective, real-world scenarios requiring distributed transactions are often unavoidable, and 2PC is a reasonable choice when no other solution is available.
For e-commerce and financial businesses such as shopping and transfers, the biggest problem with 2PC implemented at the middleware layer is that it is invisible to the business. Once force majeure or an unexpected consistency violation occurs, such as a data node going down permanently, it is difficult for the business to compensate based on 2PC logs alone. In financial scenarios, data consistency is fundamental, and the business needs control over its data.
We therefore recommend a distributed transaction model such as TCC, or a flexible transaction framework based on message queues. Both are implemented at the business layer, where developers have sufficient control, and they can be built on SOA frameworks such as Dubbo or Spring Cloud.