Data consistency is an important concern when building business systems. In the past we relied on the database to guarantee it, but achieving data consistency in a microservices architecture and a distributed environment is very challenging. I have recently been studying distributed systems. There are many approaches to distributed transactions, and studying them has prompted a lot of thinking. The approach I want to talk about today is the saga.
Original address: Data consistency solution in MicroServices scenario
PPT address: Saga Distributed transaction Solution and Practice
Incubator-servicecomb-saga address: Incubator-servicecomb-saga
servicecomb-saga-csharp (Servicecomb-saga Netcore SDK) address: Servicecomb-saga-csharp
Some explanations have been added to the original text to make it easier to understand.
Data consistency in monolithic applications
Let me start with an example that is often used abroad. Suppose a large enterprise owns an airline, a car rental company, and a hotel chain. It offers customers a one-stop travel planning service: the customer only needs to provide a destination, and the enterprise books the flight, rents the car, and reserves the hotel. From the business point of view, an itinerary counts as successful only if all three bookings are completed. Otherwise the trip cannot go ahead.
Meeting this requirement in our monolithic application is very simple: put the three service requests into the same database transaction, and the database guarantees that they either all succeed or are all rolled back.
Once the three services went online, both the company and its customers were very satisfied.
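The single-transaction approach described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module. The `bookings` table, the `book_trip` function, and the `fail_at` parameter are assumptions made up for this example, not part of the original system.

```python
import sqlite3


def book_trip(conn, customer, fail_at=None):
    """Book flight, car and hotel inside one local database transaction.

    fail_at is a hypothetical switch that simulates one booking failing.
    """
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception escapes it.
        with conn:
            for service in ("flight", "car", "hotel"):
                if service == fail_at:
                    raise RuntimeError(f"{service} booking failed")
                conn.execute(
                    "INSERT INTO bookings (customer, service) VALUES (?, ?)",
                    (customer, service),
                )
        return True
    except RuntimeError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (customer TEXT, service TEXT)")
```

If the hotel booking fails, the flight and car rows inserted earlier in the same transaction are rolled back with it, so no partial itinerary is ever stored.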
Data consistency in microservice scenarios
As time went on, the enterprise's travel planning service became very successful and the number of users soared. The affiliated airline, car rental company, and hotel chain introduced more services to meet customer needs, and both our application and our development team grew steadily larger. Today the monolithic application has become so complex that no one knows how the whole thing works. Worse, launching a new feature now requires all development teams to work together day and night for weeks. Watching its market share decline, the company has become increasingly dissatisfied with the R&D department.
After several rounds of discussion, management finally decided to split the huge monolithic application into four services: flight booking, car rental, hotel reservation, and payment. Each service uses its own database, and the services communicate over HTTP. The team responsible for each service releases at its own pace according to market demand. This brings a new challenge: how do we enforce the business rule that an itinerary is successful only if the first three bookings are all completed? The services now have their own boundaries and do not even use the same kind of database, so relying on a single database to guarantee data consistency is no longer feasible.
Sagas
After some searching, I found the paper "Sagas", published in 1987 by Hector Garcia-Molina and Kenneth Salem.
A saga is a long-lived transaction (Long Lived Transaction, LLT) that can be decomposed into a set of sub-transactions that can be interleaved with other transactions. Each sub-transaction is a real transaction that maintains database consistency (LLT = T1 + T2 + T3 + ... + Tn). Each local transaction Tx has a corresponding compensating transaction Cx.
In the business scenario of our large enterprise, a trip planning transaction is a saga that contains four sub-transactions: flight booking, car rental, hotel booking, and payment.
Following the definition above, when the sub-transactions T1, T2, ..., Tn of a saga have corresponding compensating transactions C1, C2, ..., Cn-1, the saga system guarantees [1] that one of the following sequences is executed:
- the sub-transaction sequence T1, T2, ..., Tn (the best case), or
- the sequence T1, T2, ..., Tj, Cj, ..., C2, C1, where 0 < j < n.
In other words, with the transactions and compensations defined above, the saga guarantees that the following business rules are met:
- Either all bookings succeed, or if any one of them fails, the bookings already made are canceled.
- If the payment fails at the last step, all bookings are canceled as well. These cancellations are the so-called compensations.
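The guarantee above can be illustrated with a minimal sketch of a sequential saga runner. The `run_saga` function, the step tuples, and the `ok`/`declined` stand-ins are hypothetical names used only for illustration. A real coordinator would also persist its progress and call remote services.

```python
def run_saga(steps):
    """Run (name, transaction, compensation) steps in order.

    Returns a trace of what happened. If a transaction raises, the
    compensations of all completed transactions run in reverse order,
    which yields the sequence T1, ..., Tj, Cj, ..., C1.
    """
    trace = []
    completed = []
    for name, transaction, compensation in steps:
        try:
            transaction()
        except Exception:
            trace.append(f"{name}:failed")
            for done_name, done_compensation in reversed(completed):
                done_compensation()
                trace.append(f"{done_name}:compensated")
            return trace
        trace.append(f"{name}:done")
        completed.append((name, compensation))
    return trace


def ok():
    """Stand-in for a remote call that succeeds."""


def declined():
    """Stand-in for a payment that fails."""
    raise RuntimeError("card declined")


trip = [
    ("flight", ok, ok),
    ("car", ok, ok),
    ("hotel", ok, ok),
    ("payment", declined, ok),
]
```

Calling `run_saga(trip)` completes the three bookings, fails at payment, and then cancels hotel, car, and flight in that order.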
How the saga recovers
Two types of Saga recovery methods are described in the original paper:
- Backward recovery: if any sub-transaction fails, compensate all completed sub-transactions.
- Forward recovery: retry the failed sub-transaction, assuming that every sub-transaction will eventually succeed.
Clearly, forward recovery does not require compensating transactions. If the sub-transactions in your business always succeed (eventually), or compensating transactions are difficult or impossible to define, forward recovery is the better fit.
In theory, compensating transactions never fail. In the distributed world, however, every call has three possible outcomes: success, failure, and timeout (in which case the call may or may not have succeeded). Servers can go down, networks can fail, and even a data center can lose power. What can we do in such extreme situations? The last resort is a fallback measure such as manual intervention.
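Forward recovery can be sketched as a simple retry loop. The function name, the attempt limit, and the `FlakyBooking` stand-in are assumptions for illustration. The sketch also assumes the sub-transaction is safe to repeat.

```python
import time


def run_with_forward_recovery(transaction, max_attempts=5, delay_seconds=0.0):
    """Retry a sub-transaction until it succeeds or attempts run out.

    Returns (result, attempts_used). Re-raises the last error when the
    attempt limit is reached so the caller can fall back to other
    measures, such as manual intervention.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction(), attempt
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


class FlakyBooking:
    """Hypothetical service that times out a fixed number of times."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def book(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TimeoutError("no response")
        return "booked"
```

No compensation is defined anywhere in this sketch, which is exactly the trade-off of forward recovery: it only works if the sub-transaction does eventually succeed.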
Additional notes: ACID and Saga
Sagas provide only ACD guarantees:
- Atomicity: implemented through the saga coordinator
- Consistency: local transactions + saga log
- Isolation: not guaranteed by sagas
- Durability: provided by the saga log
Many readers will ask about the missing isolation. Without it, the following problems can occur:
- Two sagas manipulating the same resource at the same time can leave the data semantically inconsistent.
- Two sagas operating on the same order at the same time can overwrite each other's changes (lost update).
- Two sagas accessing the same debit account at the same time may not see each other's refunds (dirty read).
- Within one saga, a value read before and after another transaction's modification can differ (non-repeatable read).
How should we deal with these isolation problems?
Some corresponding solutions are given below:
- Isolation is essentially concurrency control: it prevents concurrent transactions from operating on the same resource and producing confused results.
- Add logical locks at the application level.
- Isolate funds at the business level by pre-freezing the amount involved.
- Read the current state during the business operation so that the latest updates are seen in time.
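The pre-freezing idea in the list above can be sketched as follows. The `Account` class and its method names are hypothetical. The point is only that funds reserved by one saga are invisible to other sagas until that saga confirms or releases them.

```python
class Account:
    """Toy account that isolates in-flight sagas by freezing funds."""

    def __init__(self, balance):
        self.balance = balance
        self.frozen = {}  # saga_id -> amount reserved by that saga

    def available(self):
        # Frozen amounts cannot be spent by any other saga.
        return self.balance - sum(self.frozen.values())

    def freeze(self, saga_id, amount):
        if amount > self.available():
            raise ValueError("insufficient available funds")
        self.frozen[saga_id] = amount

    def confirm(self, saga_id):
        # The saga succeeded: turn the reservation into a real debit.
        self.balance -= self.frozen.pop(saga_id)

    def release(self, saga_id):
        # The saga was compensated: give the reserved amount back.
        self.frozen.pop(saga_id, None)
```

This acts as a logical lock at the application level: a second saga fails fast on `freeze` instead of reading a balance that may still be rolled back.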
Conditions for using a saga
The saga looks promising for our needs. But can every long-lived transaction be handled this way? There are some restrictions:
- A saga allows only two levels of nesting: the top-level saga and simple sub-transactions [1].
- At the outer level, full atomicity is not provided. In other words, a saga may see partial results of other sagas [1].
- Each sub-transaction should be an independent atomic action [2].
In our business scenario, flight booking, car rental, hotel booking, and payment are naturally independent actions, and each one can be made atomic by the database of the corresponding service.
We do not need atomicity at the level of the travel planning service. One user may book the last ticket and then have the booking canceled because of insufficient credit card balance. Meanwhile, another user may at first see that no tickets are left, and then, once the first booking is canceled and the last ticket is released, grab that last seat and complete the itinerary.
There are also points to consider regarding compensation:
The compensating transaction undoes the actions of transaction Ti from a semantic point of view, but it does not necessarily return the database to the state it was in when Ti began executing. (For example, if a transaction triggers a missile launch, the operation may not be undoable.)
But that is not a problem for our business. In fact, even hard-to-undo actions can often be compensated. For example, a transaction that sends an email can be compensated by sending another email explaining the problem.
We now have a way to solve the data consistency problem with a saga. It lets us either execute all transactions successfully or, when any transaction fails, compensate the transactions that have already succeeded. Although a saga does not provide full ACID guarantees, it is suitable for the many scenarios in which eventual consistency is enough. So how do we design a saga system?
Saga Log
The saga guarantees that all sub-transactions are either completed or compensated, but the saga system itself may crash. When it crashes, a saga may be in one of the following states:
- The saga received a transaction request but has not started yet. No microservice state has been modified by its sub-transactions, so we do not need to do anything.
- Some sub-transactions have been completed. After restarting, the saga must resume from the last completed sub-transaction.
- A sub-transaction has started but has not completed. The remote service may have completed the transaction, the transaction may have failed, or the request may have timed out, so the saga can only re-send the sub-transaction that was never confirmed. This means sub-transactions must be idempotent.
- A sub-transaction failed and its compensating transaction has not started yet. After restarting, the saga must execute the corresponding compensating transaction.
- A compensating transaction has started but has not completed. The solution is the same as in the previous case, which means compensating transactions must also be idempotent.
- All sub-transactions or compensating transactions have been completed. As in the first case, nothing needs to be done.
To recover from the states above, we must track every step of the sub-transactions and compensating transactions. We decided to do this with events, saving the following events in a persistent store called the saga log:
- Saga started event: saves the entire saga request, which includes multiple transaction/compensation requests
- Transaction started event: saves the corresponding transaction request
- Transaction ended event: saves the corresponding transaction request and its response
- Transaction aborted event: saves the corresponding transaction request and the reason for the failure
- Transaction compensated event: saves the corresponding compensation request and its response
- Saga ended event: marks the end of the saga request, and no content needs to be saved
By persisting these events in the saga log, we can recover a saga from any of the states above.
Since the saga only needs to persist events, and the event content is stored as JSON, the saga log is very flexible to implement. A database (SQL or NoSQL), a persistent message queue, or even ordinary files can serve as the event store, although some of these help the saga recover more quickly than others.
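To make the recovery logic concrete, here is a minimal sketch that reads a saga log stored as JSON lines and decides what a restarted saga should do. The event type strings, the `tx` field, and the `recovery_plan` function are hypothetical names chosen for this example. One assumption beyond the text: after an abort, a sub-transaction that started but was never confirmed is also compensated, because the remote service may have completed it.

```python
import json

NOTHING_TO_DO = {"retry": [], "compensate": []}


def recovery_plan(log_lines):
    """Decide what a restarted saga must do, given its JSON event log."""
    events = [json.loads(line) for line in log_lines]
    if not events or events[-1]["type"] == "SagaEnded":
        return dict(NOTHING_TO_DO)

    started = []          # sub-transactions in the order they started
    ended = set()         # confirmed successful
    aborted = set()       # confirmed failed
    compensated = set()   # compensation confirmed
    for event in events:
        kind, name = event["type"], event.get("tx")
        if kind == "TransactionStarted":
            started.append(name)
        elif kind == "TransactionEnded":
            ended.add(name)
        elif kind == "TransactionAborted":
            aborted.add(name)
        elif kind == "TransactionCompensated":
            compensated.add(name)

    if aborted:
        # Backward recovery, newest first. Failed sub-transactions need
        # no compensation; everything else that started does.
        todo = [
            tx for tx in reversed(started)
            if tx not in aborted and tx not in compensated
        ]
        return {"retry": [], "compensate": todo}

    # No failure recorded: re-send sub-transactions never confirmed.
    return {"retry": [tx for tx in started if tx not in ended],
            "compensate": []}
```

Because the plan may re-send a request that the remote service already handled, this only works if both sub-transactions and compensations are idempotent, as noted above.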
Data structure of the saga request
In our business scenario, flight booking, car rental, and hotel booking do not depend on one another and can be processed in parallel. For our customers, however, it is friendlier to charge only once, after all bookings have succeeded. The relationships among the four services' transactions can therefore be represented as a graph: the three bookings run in parallel, and payment follows once all of them have completed.
A directed acyclic graph is therefore a suitable data structure for the trip planning request. The root of the graph is the saga start task, and the leaf is the saga end task.
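The graph described above can be sketched with Python's standard graphlib module (Python 3.9 or later). The task names and the `execution_waves` helper are illustrative assumptions, not the request format used by any particular saga implementation.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
trip_saga = {
    "saga_start": set(),
    "flight": {"saga_start"},
    "car": {"saga_start"},
    "hotel": {"saga_start"},
    "payment": {"flight", "car", "hotel"},
    "saga_end": {"payment"},
}


def execution_waves(graph):
    """Group tasks into waves; tasks within one wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For `trip_saga` this yields four waves: the start task, the three bookings together, the payment, and the end task.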
Parallel Saga
As mentioned above, flight booking, car rental, and hotel booking can be processed in parallel. But this creates another problem: what if the flight booking fails while the car rental is still being processed? We cannot simply wait for the car rental service to respond, because we do not know how long that will take.
The best approach is to send the car rental request again and obtain a response, so that we can then proceed with compensation. But if the car rental service never responds, we may need to fall back on other measures, such as manual intervention.
A booking request that timed out may still reach the car rental service eventually, after the service has already processed the same booking and its cancellation request.
Therefore, the service implementation must ensure that once a compensation request has been executed, the corresponding transaction request has no effect if it is received again. Caitie McCaffrey calls this a commutative compensating request in her talk "Distributed Sagas: A Protocol for Coordinating Microservices".
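A commutative compensating request can be sketched as below. The `CarRentalService` class and its in-memory state are hypothetical. The sketch only shows the rule that a cancellation wins regardless of the order in which the booking and the cancellation arrive.

```python
class CarRentalService:
    """Toy service whose book and cancel requests commute."""

    def __init__(self):
        self.state = {}  # request_id -> "booked" or "canceled"

    def book(self, request_id):
        # A request that was already canceled stays canceled, even if
        # the booking message arrives late or is re-sent.
        if self.state.get(request_id) == "canceled":
            return "rejected"
        self.state[request_id] = "booked"
        return "booked"

    def cancel(self, request_id):
        # Record the cancellation even if no booking has been seen yet,
        # so that a late booking can be recognized and ignored.
        self.state[request_id] = "canceled"
        return "canceled"
```

Both operations are also idempotent here: repeating `book` or `cancel` with the same request ID leaves the final state unchanged.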
Distributed Saga architecture
The distributed saga borrows ideas from Zipkin: Omega is a probe-like agent that reports saga events, and Alpha is the process manager of the saga, that is, the coordinator.
- Alpha acts as the coordinator. It is mainly responsible for persisting transaction events and coordinating the state of the sub-transactions so that they end up consistent with the state of the global transaction.
- Omega is an agent embedded in each microservice. It intercepts network requests, reports transaction events to Alpha, and, in exceptional cases, performs compensation according to the instructions issued by Alpha.
Next, let's look at Omega's internal implementation.
Omega is an agent embedded in the microservice. When the service receives a request, Omega intercepts it, extracts the global transaction ID from the request as its own global transaction ID (the saga event ID), and extracts the local transaction ID as its parent transaction ID. In the pre-processing phase, Alpha records the event that marks the start of the transaction, and in the post-processing phase, Alpha records the event that marks its end. Each successful sub-transaction therefore has a start event and an end event in one-to-one correspondence.
Let's see how they communicate.
Inter-service communication also works much as it does in Zipkin. On the service provider side, Omega intercepts the request and extracts the transaction-related IDs to obtain the transaction context. On the service consumer side, Omega injects the transaction-related IDs into the outgoing request to pass the transaction context along. Through this cooperation between service providers and service consumers, sub-transactions are linked together into a complete global transaction.
With the help of Zipkin, the whole group of transactions can be organized into a chain structure.
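The extract and inject steps can be sketched as two small functions over request headers. The header names, the `TxContext` class, and the use of random UUIDs are all assumptions for illustration. They are not the actual names or wire format used by Omega.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

GLOBAL_TX_HEADER = "X-Global-Tx-Id"  # hypothetical header name
LOCAL_TX_HEADER = "X-Local-Tx-Id"    # hypothetical header name


@dataclass(frozen=True)
class TxContext:
    global_tx_id: str
    local_tx_id: str
    parent_tx_id: Optional[str] = None


def extract(headers):
    """Provider side: build the context for a new sub-transaction.

    The caller's local transaction ID becomes our parent ID. A request
    without a global ID starts a new global transaction.
    """
    global_tx_id = headers.get(GLOBAL_TX_HEADER) or str(uuid.uuid4())
    parent_tx_id = headers.get(LOCAL_TX_HEADER)
    return TxContext(global_tx_id, str(uuid.uuid4()), parent_tx_id)


def inject(context, headers):
    """Consumer side: copy the current context into an outgoing request."""
    outgoing = dict(headers)
    outgoing[GLOBAL_TX_HEADER] = context.global_tx_id
    outgoing[LOCAL_TX_HEADER] = context.local_tx_id
    return outgoing
```

Chaining `extract` and `inject` across services keeps one global ID for the whole saga while each hop records its caller as the parent, which is what links the sub-transactions together.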
Saga processing flow
Saga processing requires every participating sub-transaction to provide both a transaction function and a compensation function. Based on how the transactions execute, the saga coordinator Alpha decides whether to retry or to recover backward and sends the corresponding instruction to Omega.
Success scenario
In the success scenario, every transaction start event has a corresponding end event.
Exception scenario
In the exception scenario, Omega reports an aborted event to Alpha, and Alpha then sends compensation instructions to the other sub-transactions of the global transaction that have already completed, ensuring that eventually all sub-transactions either succeed or are rolled back.
Timeout scenario (adjustment required)
In the timeout scenario, Alpha's periodic scanner detects the event that has timed out, and the global transaction to which the timed-out transaction belongs is aborted as well.
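A periodic scan of this kind can be sketched as a pure function over recorded events. The event fields, the type strings, and the `find_timed_out` function are hypothetical and do not reflect Alpha's actual schema. Timestamps are plain numbers in seconds.

```python
def find_timed_out(events, now, timeout_seconds):
    """Return the global transaction IDs that have a sub-transaction
    started more than timeout_seconds ago and still unfinished."""
    open_since = {}  # (global_tx_id, local_tx_id) -> start timestamp
    for event in events:
        key = (event["global_tx_id"], event["local_tx_id"])
        if event["type"] == "TransactionStarted":
            open_since[key] = event["timestamp"]
        elif event["type"] in ("TransactionEnded", "TransactionAborted"):
            open_since.pop(key, None)
    return sorted({
        global_tx_id
        for (global_tx_id, _), started_at in open_since.items()
        if now - started_at > timeout_seconds
    })
```

A scheduler would call this at a fixed interval and abort each returned global transaction, which then triggers compensation as in the exception scenario.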
That concludes the introduction to the overall architecture of Incubator-servicecomb-saga. I really like the idea behind it, so two friends (known as Water Brother and Old Du) and I did something interesting: we implemented the Omega client. The GitHub repository is Servicecomb-saga-csharp, and it currently implements the three scenarios above.
In the next article, we will walk through a practical example of the .NET Core implementation to help everyone understand what a saga is.