In-depth analysis: transaction processing in distributed systems, classic problems and models (reprint)
Summary: Distributed systems must balance data integrity, consistency, and performance. This article introduces the main technical models for distributed data consistency, including master-slave, master-master, 2PC/3PC, the classic Two Generals Problem, Paxos, and Dynamo's NWR and vector clock models.
Editor's note: Every business wants highly available data services, but to make data highly available you must store redundant copies, which means writing the same data several times. Writing multiple copies raises consistency problems, and consistency problems in turn raise performance problems; it can feel like a loop with no way out. Data consistency is about what happens when multiple users access a database at the same time: if their transactions touch the same data concurrently, four anomalies can occur: lost updates, uncommitted dependencies (dirty reads), inconsistent analysis (non-repeatable reads), and phantom reads. This article gives a systematic introduction to the various technical models for handling distributed data consistency. The author's original text follows:
When you use a single server to provide data services in production, you typically run into two problems:
- One server's performance is not enough to serve all network requests.
- If the server goes down, the service becomes unavailable or data is lost.
Faced with these problems, we have to scale out: add more machines to share the load and eliminate the single point of failure. In general, we extend data services in two ways:
- Data partitioning: split the data into chunks and place them on different servers (for example: UID % 16, consistent hashing, etc.).
- Data mirroring: keep the full data set synchronized on all servers, each of which provides the same service (a small routing sketch follows this list).
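To make the partitioning idea above concrete, here is a minimal sketch (in Python, with made-up server names) of routing a key by UID % 16 and by a toy consistent-hash ring. It is only an illustration under those assumptions, not how any particular system implements routing.

```python
import bisect
import hashlib

def shard_by_uid(uid: int, num_shards: int = 16) -> int:
    """Simple modulo sharding: UID % 16."""
    return uid % num_shards

class ConsistentHashRing:
    """Toy consistent-hash ring: each server owns many points on a hash ring."""
    def __init__(self, servers, replicas=100):
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(replicas):
                h = int(hashlib.md5(f"{server}#{i}".encode()).hexdigest(), 16)
                self._ring.append((h, server))
        self._ring.sort()

    def server_for(self, key: str) -> str:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)  # first point >= h, wrapping
        return self._ring[idx][1]

ring = ConsistentHashRing(["s1", "s2", "s3"])
print(shard_by_uid(12345), ring.server_for("user:12345"))
```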
The first approach does not solve the data-loss problem: when a single server fails, some data is lost. Therefore, high availability of data services can only be achieved the second way, by storing the data redundantly (the industry generally considers three copies a safe number of backups, as in Hadoop and Dynamo). However, the more machines you add, the more complex the data service becomes, especially for transactions that span servers, that is, data consistency across servers. This is a hard problem. Let us use the most classic use case, "transfer money from account A to account B", to illustrate. Anyone familiar with RDBMS transactions knows that transferring from account A to account B takes six operations:
- Read the balance of account A;
- Subtract the amount from account A;
- Write the result back to account A;
- Read the balance of account B;
- Add the amount to account B;
- Write the result back to account B.
For data consistency, these six operations must either all succeed or all fail, and while they are in progress any other access to accounts A and B must be locked. Locking means excluding other reads and writes; otherwise we get dirty data. That is a transaction. However, once multiple machines are involved, things become complicated:
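To ground the discussion, here is a minimal single-machine sketch of the six steps above executed as one atomic unit. The in-memory accounts, the lock, and the function name are purely illustrative assumptions.

```python
import threading

balances = {"A": 100, "B": 50}
lock = threading.Lock()  # stands in for the locks held for the whole transaction

def transfer(src: str, dst: str, amount: int) -> None:
    with lock:                      # exclude other reads/writes on A and B
        a = balances[src]           # 1. read balance of A
        a -= amount                 # 2. subtract
        if a < 0:
            raise ValueError("insufficient funds")  # abort before anything is written
        balances[src] = a           # 3. write back A
        b = balances[dst]           # 4. read balance of B
        b += amount                 # 5. add
        balances[dst] = b           # 6. write back B
    # the lock is released only after all six steps have run

transfer("A", "B", 30)
print(balances)  # {'A': 70, 'B': 80}
```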
- In the data partitioning scenario: what if the data for accounts A and B is not on the same server? Then we need a cross-machine transaction: if the deduction from A succeeds but the credit to B fails, we have to roll back the operation on A. This is much harder to implement across different machines.
- In the data mirroring scenario: the transfer between accounts A and B can be done on a single machine, but remember that multiple machines hold copies of accounts A and B. What if there are two concurrent debits of account A (one remitting to B, one to C), and the two operations happen on two different servers? In other words, with data mirroring, how do we ensure that writes to the same data on different servers stay consistent and do not conflict?
At the same time, we have to consider performance. If performance is ignored, making transactions work is not hard: just let the system run slower. Besides performance, we also have to consider availability, that is, when one machine goes away, data is not lost and the service can continue to be provided by other machines. So, we need to weigh the following concerns:
- Disaster tolerance: data is not lost, and failed nodes can be failed over.
- Data consistency: transaction processing.
- Performance: throughput and response time.
As mentioned earlier, the only way to avoid losing data is data redundancy; even with data partitioning, each partition still needs redundancy. These are data replicas: when a node loses data, it can be read back from a replica. Data replication is the only means by which a distributed system can handle data-loss failures. Therefore, in this article we only discuss data consistency and performance in the presence of data redundancy. Simply put:
- To make data highly available, you have to write multiple copies of it.
- Writing multiple copies creates data consistency problems.
- Data consistency problems create performance problems.
This is software development: push one problem down and another pops up.
Consistency model
Speaking of data consistency, there are roughly three types (of course, if you subdivide further there are many consistency models, such as sequential consistency, FIFO consistency, session consistency, monotonic-read consistency, and monotonic-write consistency, but to keep this article simple and readable I will only discuss the following three):
- Weak consistency: after you write a new value, a read may or may not see it on a given replica. Examples: some cache systems; systems where other players' data in an online game has nothing to do with you; systems like VoIP; or the Baidu search engine.
- Eventual consistency: after you write a new value, you may not be able to read it right away, but after some time window you are guaranteed to eventually read it. Examples: DNS, e-mail, Amazon S3, and the Google search engine.
- Strong consistency: once new data is written, the new value can be read at any moment from any replica. Examples: file systems, RDBMSs, and Azure Table are strongly consistent.
From these three consistency models we can see that weak and eventual consistency are generally implemented with asynchronous replication, while strong consistency is generally implemented with synchronous replication. Asynchronous replication usually means better performance but also more complex state control; synchronous replication means simplicity but also lower performance. Let's move on step by step and look at the techniques:
Master-slave
The first is the master-slave architecture. In this architecture, the slave is generally a backup of the master. Such a system is typically designed as follows:
- Read and write requests are all handled by the master.
- After a write is applied on the master, the master synchronizes it to the slave.
Synchronization from master to slave can be asynchronous or synchronous; it can be pushed by the master or pulled by the slave. Usually the slave pulls periodically, so this is eventual consistency. The problem with this design is that if the master crashes within a pull interval, the data written in that time slice is lost. If you do not want to lose that data, the slave can only serve reads until the master recovers.
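Here is a minimal sketch of the periodic slave pull just described, with illustrative Master/Slave classes and method names (not any real replication API); it shows why writes inside the pull window are lost if the master dies.

```python
import threading
import time

class Master:
    def __init__(self):
        self.log = []            # append-only list of (seq, key, value)

    def write(self, key, value):
        self.log.append((len(self.log), key, value))

    def changes_since(self, seq):
        return self.log[seq:]    # everything the slave has not seen yet

class Slave:
    def __init__(self, master, interval=1.0):
        self.master, self.interval = master, interval
        self.data, self.applied = {}, 0

    def pull_forever(self):
        while True:
            for seq, key, value in self.master.changes_since(self.applied):
                self.data[key] = value
                self.applied = seq + 1
            time.sleep(self.interval)  # writes in this window are lost if the master dies

master = Master()
slave = Slave(master)
threading.Thread(target=slave.pull_forever, daemon=True).start()
master.write("x", 1)
time.sleep(1.5)
print(slave.data)   # eventually {'x': 1}
```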
Of course, if you can tolerate losing that data, you can let the slave immediately take over the master's work (for compute-only nodes that hold no data, there is no consistency or data-loss problem, and master-slave alone solves the single point of failure). Master-slave can also be made strongly consistent. For example: on a write, the master first writes its own copy; once that succeeds it writes to the slave; only when both succeed does it return success, and the whole process is synchronous. If writing to the slave fails, there are two options: one is to mark the slave unavailable, report the error, and continue serving (the slave re-synchronizes with the master after it recovers; you can have multiple slaves, so losing one still leaves a backup, as in the three-copy rule mentioned earlier); the other is to roll back the master's own write and return failure. (Note: we generally do not write the slave first, because if the master's own write then fails we would have to roll back the slave, and if that rollback fails the data has to be fixed by hand.) You can see how complex it is to make master-slave strongly consistent.
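Here is a minimal sketch of the synchronous, strongly consistent master-slave write just described; the replica objects, their put()/get() methods, and the error-handling policy are assumptions for illustration only (it shows option B, rolling back the master).

```python
class ReplicaUnavailable(Exception):
    pass

def write_strongly(master, slaves, key, value):
    """Write the master first, then each slave; succeed only if all succeed."""
    old = master.get(key)            # keep the old value so we can roll back
    master.put(key, value)           # 1. master writes its own copy first
    for slave in slaves:
        try:
            slave.put(key, value)    # 2. synchronously replicate to the slave
        except ReplicaUnavailable:
            # Option A: mark the slave bad and keep serving (re-sync later), or
            # Option B: roll back the master and report failure to the caller.
            master.put(key, old)     # here we take option B
            raise
    return "ok"
```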
Master-master
Master-master, also known as multi-master, refers to a system with two or more masters, each of which provides read and write service. This model is an enhanced version of master-slave: data is generally synchronized between the masters asynchronously, so it is eventually consistent. The advantage of master-master is that when one master fails, the other masters can still serve reads and writes; like master-slave, any data that has not yet been replicated to the other masters when a master fails is lost. Many databases support master-master replication.
In addition, if more than one master modifies the same data, the nightmare of this model appears: it is hard to merge the conflicts between the copies. Look at the design of Dynamo's vector clock (the version number of the data plus who modified it) and you will see it is not that simple, and Dynamo hands conflicting data to the user to resolve, just like an SVN source conflict: a conflict on the same line of code can only be resolved by the developer. (Dynamo's vector clock is discussed later in this article.)
Two/Three-Phase Commit
This protocol is abbreviated 2PC, two-phase commit. In a distributed system, each node knows whether its own operation succeeded or failed, but cannot know whether the operations on other nodes succeeded or failed. When a transaction spans multiple nodes, in order to preserve the ACID properties of the transaction, a component acting as a coordinator must be introduced to gather the operation results of all nodes (called participants) and ultimately tell those nodes whether to actually commit their results (for example, writing the updated data to disk). The two-phase commit algorithm runs as follows (a small coordinator sketch follows the steps):
Phase one:
- The coordinator asks all participant nodes whether they can perform the commit operation.
- Each participant begins preparing to execute the transaction: for example, locking resources, reserving resources, writing undo/redo logs, and so on.
- Each participant responds to the coordinator: if its preparation succeeded, it answers "can commit"; otherwise it answers "refuse to commit".
Phase two:
- If all participants answered "can commit", the coordinator sends a "commit" command to all participants. The participants complete the commit, release all their resources, and respond "done"; the coordinator ends the global transaction once it has collected a "done" response from every node.
- If any participant answered "refuse to commit", the coordinator sends a "rollback" command to all participants; they release all their resources, respond "rollback complete", and the coordinator cancels the global transaction once it has collected a "rollback complete" response from every node.
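As a rough illustration of the steps above, here is a minimal two-phase commit coordinator sketch; the participant objects and their prepare()/commit()/rollback() methods are illustrative assumptions, not a real library API.

```python
def two_phase_commit(participants, txn) -> bool:
    # Phase one: ask everyone to prepare (the vote).
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn))   # lock resources, write undo/redo log
        except Exception:
            votes.append(False)

    # Phase two: decide based on the votes.
    if all(votes):
        for p in participants:
            p.commit(txn)                  # make the change durable, release locks
        return True
    else:
        for p in participants:
            p.rollback(txn)                # undo the prepared work, release locks
        return False
```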
You can see that 2PC is an algorithm in which phase one takes a vote and phase two makes the decision, and that 2PC is a strong-consistency algorithm. The strong-consistency strategy discussed earlier for master-slave is somewhat similar to 2PC, but 2PC is more conservative: try first, commit afterwards. 2PC is used quite widely. In some system designs there is a chain of calls, say A, B, C, and D, each step allocating resources or rewriting data; for example, placing an order in an online shop triggers a whole series of back-end steps. If you do it step by step and one step cannot be completed, then every step before it has to run a reverse operation to reclaim the resources it allocated, which is complicated. Many workflow systems now borrow the 2PC algorithm, using a try-and-confirm process to ensure the whole flow completes successfully. As a popular example, a Western church wedding has exactly this sequence:
- The priest asks the groom and the bride in turn: do you take ... in sickness and in health, until death ...
- Only when both the groom and the bride answer "I do" (locking a lifetime of resources) does the priest declare: I now pronounce you ... (transaction committed).
What a classic two-phase commit transaction. You can also see some of its problems: a) one is that it is a synchronous, blocking operation, which inevitably hurts performance; b) another major problem is timeouts, for example:
- If, in phase one, a participant never receives the inquiry, or the participant's response never reaches the coordinator, then the coordinator must handle the timeout; once it times out, it can treat the vote as a failure and retry.
- If, in phase two, the commit command has been issued but a participant never receives it, or the participant's commit/rollback acknowledgement never comes back, then once the participant's response times out the coordinator can either retry, or mark that participant as a problem node and evict it from the cluster, which keeps the serving nodes data-consistent.
- The nasty case is in phase two: if a participant never receives the coordinator's commit/rollback command, it is stuck in an "unknown state" and has no idea what to do. For example, suppose all participants have finished their phase-one replies (maybe all yes, maybe all no, maybe partly yes and partly no) and the coordinator then crashes: every node is left not knowing what to do (and asking the other participants does not help). For consistency, the participants either wait indefinitely for the coordinator to recover, or the phase-one yes/no vote has to be re-run.
The biggest problem with two-phase commit is that third point: if a participant finishes phase one but never receives the decision in phase two, the data node is left "at a loss", and that state blocks the entire transaction. In other words, the coordinator is critical to completing the transaction, and the coordinator's availability is the key. Hence three-phase commit was introduced. The description of three-phase commit on Wikipedia is as follows: it splits the first phase of 2PC into two steps: first ask, then lock resources, and only then actually commit. Three-phase commit works as follows:
The core idea of three-phase commit is not to lock resources at the asking step; only once everyone has agreed does the locking of resources begin.
In theory, if all nodes return success in the first phase, there is reason to believe that a subsequent commit is very likely to succeed. This reduces the probability that a participant cohort ends up in an unknown state: once a participant receives the PreCommit, it knows that everyone has agreed to the change. That point is important. Here is the state transition diagram for 3PC (the dotted edges labeled F and T are Failure and Timeout; the states are Q for Query, A for Abort, W for Wait, P for PreCommit, and C for Commit):
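Since the original state diagram is not reproduced in this reprint, the sketch below only walks a coordinator through the three phases (CanCommit, PreCommit, DoCommit). The participant objects and method names are illustrative assumptions, and the timeout handling that actually distinguishes 3PC from 2PC is left out.

```python
def three_phase_commit(participants, txn) -> bool:
    # Phase 1 (CanCommit / Query): ask only; no resources are locked yet.
    if not all(p.can_commit(txn) for p in participants):
        for p in participants:
            p.abort(txn)
        return False

    # Phase 2 (PreCommit / Wait): everyone said yes, so lock resources and
    # prepare. A participant that reaches this state knows all others voted yes.
    for p in participants:
        p.pre_commit(txn)

    # Phase 3 (DoCommit): actually commit and release resources.
    for p in participants:
        p.do_commit(txn)
    return True
```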
In practice, three-phase commit is a rather complex affair, very hard to implement correctly, and it still has problems.
Reading this far, I believe you have many questions. You must be walking through the various failure scenarios of 2PC/3PC, and you will find that timeouts are extremely hard to handle, because a network timeout often leaves you with nothing you can do: you simply do not know whether the other side performed the operation or not. So the state machine you designed so carefully becomes an ornament because of timeouts.
A network call has three possible outcomes: 1) success, 2) failure, 3) timeout, and the third is the nightmare, especially when you have to maintain state.
Two Generals Problem
The Two Generals Problem is a thought experiment: two armies, each led by a general, are preparing to attack a fortified city. The two armies are camped near the city, each occupying a hill, with a valley between them; the only way for the two generals to communicate is to send messengers across the valley. Unfortunately, the valley is held by the city's defenders, and any messenger sent through it may be captured. Note that although the two generals have agreed to attack the city, they had not agreed on the time of the attack before taking up their positions. The attack succeeds only if both armies strike at the same time, so the generals must communicate to agree on a time and commit to attacking at that moment. If only one general attacks, the result is a disastrous defeat. The thought experiment asks how the generals can pull this off. Here is one way to think it through:
- The first general sends a message: "Let's attack at 9 o'clock in the morning." But once the messenger is dispatched, the first general has no idea whether he made it across the valley. Any such uncertainty makes the first general hesitate to attack, because if the second general does not attack at the same moment, the city's garrison will repel his army and destroy it.
- Knowing this, the second general needs to send a confirmation: "I received your message and will attack at 9." But what if the messenger carrying the confirmation is captured? So now the second general hesitates over whether his confirmation arrived.
- It seems, then, that the first general has to send yet another confirmation: "I received your confirmation." But what if that messenger is caught too?
- And by the same logic, doesn't the second general then need to send a "confirming that you received my confirmation" message?
You quickly see that no matter how many rounds of confirmation are exchanged, there is no way for the two generals to be sufficiently confident that their messengers have not been captured by the enemy.
This problem has no solution. The Two Generals Problem and the proof of its unsolvability were first published in 1975 by E. A. Akkoyunlu, K. Ekanadham, and R. V. Huber in "Some Constraints and Trade-offs in the Design of Network Communications", in a paragraph on page 73 describing communication between two groups of gangsters. In 1978 it was named the Two Generals Paradox in Jim Gray's "Notes on Data Base Operating Systems" (starting at page 465). That reference is widely cited as the origin of the definition of the Two Generals Problem and the proof of its unsolvability.
The thought experiment is meant to illustrate the challenge of coordinating an action over an unreliable communication channel.
In engineering, the practical approach to the Two Generals Problem is to accept the unreliability of the communication channel rather than try to eliminate it, and to reduce the uncertainty to an acceptable level. For example, the first general could dispatch 100 messengers and expect that it is very unlikely all of them will be captured; in that case he attacks regardless of whether the second general attacks or whether any reply arrives. Alternatively, the first general can send a stream of messages and the second general can acknowledge each one; the more of these messages get through, the more confident the two generals feel. As the proof shows, however, neither of them can be certain that the attack is coordinated: no algorithm (for example, attack upon receiving more than four messages) can guarantee that one side does not end up attacking alone. The first general can also number each message, 1, 2, ..., up to n. This lets the second general gauge how reliable the channel is and send back an appropriate number of acknowledgements to make sure the final message gets through. But if the channel is in fact reliable, a single message suffices and the rest add nothing: the last message is just as likely to be lost as the first.
The Two Generals Problem can be extended into the even nastier Byzantine Generals Problem. The story goes like this: Byzantium, located in present-day Istanbul, Turkey, was the capital of the Eastern Roman Empire. Because the Byzantine Empire was so vast, for defensive purposes its armies were stationed far apart, and generals could only communicate by messenger. In wartime, all the generals of the Byzantine army had to reach a consensus on whether there was a chance of victory before attacking the enemy's camp. But the army might contain traitors and enemy spies, and these traitorous generals would disrupt or sway the decision-making process. The question of how the remaining loyal generals can reach agreement, knowing that some members are rebels, without being misled by the traitors, is the Byzantine Generals Problem.
Paxos Algorithm
Wikipedia describes the various Paxos algorithms in great detail; go and have a look if you are curious.
The problem Paxos solves is how a distributed system, in which all of the anomalies above can occur, agrees on a value, guaranteeing that no matter which anomaly happens, the consistency of the decision is never broken. A typical scenario: in a distributed database system, if each node starts from the same initial state and executes the same sequence of operations, the nodes end up in the same state. To ensure that every node executes the same command sequence, a "consensus algorithm" must be run for each instruction so that every node sees the same instruction. A general-purpose consensus algorithm applies in many scenarios and is an important problem in distributed computing. Research on consensus algorithms has not stopped since the 1980s.
Note: the Paxos algorithm is a message-passing-based consistency algorithm proposed in 1990 by Leslie Lamport (the "La" in LaTeX, now at Microsoft Research). Because the algorithm is hard to understand, it attracted no attention at first, so much so that Lamport republished it eight years later, in 1998, in ACM Transactions on Computer Systems. Even then Paxos went unnoticed; in 2001 Lamport, feeling that his peers could not appreciate his sense of humor, restated it in a more accessible way. Clearly Lamport has a special attachment to Paxos. In recent years the widespread use of Paxos has proven its central place among distributed consistency algorithms: Google's three famous papers of 2006 opened the "cloud" era, and among them the Chubby lock service uses Paxos as the consistency algorithm within a Chubby cell, after which Paxos's popularity took off. (Lamport himself describes on his blog the nine-year saga of getting this old algorithm published.)
Note: all of Amazon's AWS cloud services are built on an ALF (Async Lock Framework) framework, which uses the Paxos algorithm. When I was at Amazon and watched the internal sharing video, the designer said in the internal principles talk that he had referred to the ZooKeeper approach, but implemented the algorithm in a different way that was easier to read than ZooKeeper's.
In short, the purpose of Paxos is to let the nodes of an entire cluster agree on a change to a value. The Paxos algorithm is essentially a democratic election: the decision of the majority becomes the decision of the whole cluster. Any node can propose a change to a piece of data; whether the proposal passes depends on whether more than half of the nodes in the cluster agree (this is why the Paxos algorithm wants an odd number of nodes in the cluster).
The algorithm has two phases (suppose there are three nodes: A, B, C):
Phase one: Prepare
The node proposing a change sends a prepare request to all nodes A, B, C. Note that Paxos uses a sequence number (think of it as a proposal number, which keeps increasing and is unique, i.e. A and B can never use the same proposal number); this proposal number is sent along with the modification request, and in the prepare phase any node rejects a request whose number is smaller than the current proposal number it has seen. So when node A asks all nodes to accept a change, it must attach a proposal number, and the newer the proposal, the larger the number.
If the proposal number n received by a node is larger than any it has seen from other nodes, the node responds Yes (together with the most recent proposal it has accepted) and promises not to accept any proposal numbered less than n. In this way, during the prepare phase a node is always committed to the newest proposal.
Optimization: during the prepare process above, if a node finds that a proposal with a higher number already exists, it should inform the proposer so the proposer can abandon this proposal.
Phase two: Accept
If proposer A receives Yes from more than half of the nodes, it sends an accept request to all nodes (again carrying the proposal number n); if it does not get more than half, it returns failure.
When a node receives the accept request: if n is the largest proposal number it has seen, it modifies the value; if it has already promised a larger proposal, it rejects the modification.
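The two phases can be summarized by a minimal single-value Paxos acceptor sketch; the message formats and field names below are illustrative, not those of any particular implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised so far
        self.accepted_n = -1       # proposal number of the accepted value
        self.accepted_value = None

    def on_prepare(self, n):
        """Phase one: promise not to accept anything numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # reply Yes, along with the most recently accepted proposal (if any)
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, value):
        """Phase two: accept only if no higher-numbered proposal was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("rejected", self.promised_n)
```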
You can see that this looks like an optimized version of two-phase commit. In fact, 2PC/3PC can be viewed as flawed versions of a distributed consensus algorithm; Mike Burrows, the author of Google Chubby, has said that there is only one consensus algorithm in the world, namely Paxos, and all other algorithms are defective.
You can also see that even when different nodes propose modifications to the same value and the receivers get them out of order, no problem arises.
For worked examples, see the "Paxos examples" section of the Chinese Wikipedia article; I will not repeat them here. You can work out the exceptional cases of the Paxos algorithm yourself; you will find that as long as more than half of the nodes survive, there is basically no problem.
One more thing: since Lamport published Paxos in 1998, improvements to Paxos have never stopped, the most notable being Fast Paxos, published in 2005. Whatever the improvement, the focus remains the trade-off between message latency and performance/throughput. To distinguish the two concepts, the original is called Classic Paxos and the newer one Fast Paxos.
Summary
The comparison table referenced below comes from a talk by Ryan Barrett, co-founder of Google App Engine, at Google I/O 2009 (the table itself is not reproduced in this reprint):
Earlier we said that to make data highly available, you have to write multiple redundant copies; writing multiple copies raises consistency problems, and consistency problems in turn raise performance problems. From the table we can see that essentially no technique turns all of the items green: this is the famous CAP theorem: consistency, availability, and partition tolerance, of which you can have at most two.
NWR model
Finally, I would like to mention Amazon Dynamo's NWR model. The NWR model hands the CAP choice to the user, letting the user decide which two of the three CAP properties they want.
In the NWR model, N stands for N replicas of the data, W means a write succeeds only after at least W replicas are written, and R means a read consults at least R replicas. The configuration must satisfy W + R > N. Because W + R > N, it follows that R > N - W. What does that mean? The number of replicas you read must be greater than the total number of replicas minus the number of replicas guaranteed to hold a successful write.
In other words, every read sees at least one copy of the latest version, so you never read only stale data. When we need a write-heavy environment, we can configure W = 1; with N = 3 this forces R = 3. Then a write to any single node counts as success, but a read must fetch the data from all nodes. If we want efficient reads, we can configure W = N and R = 1: reading from any single node counts as success, but a write only counts as success once all three nodes have been written.
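Here is a minimal sketch of NWR-style quorum reads and writes, assuming an in-memory replica list and a simple (version, value) scheme; real systems like Dynamo are of course far more involved.

```python
N, W, R = 3, 2, 2                       # configuration must satisfy W + R > N
replicas = [dict() for _ in range(N)]   # each replica maps key -> (version, value)

def write(key, value, version):
    """Succeed once W replicas have acknowledged the write; others may lag."""
    acks = 0
    for rep in replicas:
        rep[key] = (version, value)
        acks += 1
        if acks >= W:
            return True                 # remaining replicas are left stale here
    return False

def read(key):
    """Read any R replicas; W + R > N guarantees overlap with the write set."""
    answers = [rep[key] for rep in replicas[-R:] if key in rep]
    return max(answers) if answers else None   # highest version wins

write("x", "hello", version=1)
print(read("x"))   # (1, 'hello') - the quorum overlap returns the newest version
```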
Some NWR configurations produce dirty data, because NWR is clearly not a strongly consistent protocol in the way Paxos is: successive reads and writes may not land on the same set of nodes, so some nodes may hold data that is not the latest version even though the latest operations were performed elsewhere.
Therefore, Amazon Dynamo introduces data versioning. That is, if you read a piece of data at version v1 and, after finishing your computation, try to write it back, but find that the data's version has meanwhile been updated to v2, the server rejects your write. Versioning behaves like an "optimistic lock".
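A minimal sketch of this optimistic-lock style version check, with an illustrative in-memory store and exception type:

```python
class ConflictError(Exception):
    pass

store = {"balance": (1, 100)}   # key -> (version, value)

def read(key):
    return store[key]           # returns (version, value)

def write_if_unchanged(key, expected_version, new_value):
    current_version, _ = store[key]
    if current_version != expected_version:
        raise ConflictError("data was updated to a newer version, write rejected")
    store[key] = (current_version + 1, new_value)

v, value = read("balance")                   # read at version v1
write_if_unchanged("balance", v, value - 30) # succeeds: still v1
# a second writer that also read v1 would now be rejected with ConflictError
```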
However, in a distributed NWR setting, the version itself becomes a nightmare. For example, suppose we configure N = 3 and W = 1. Node A accepts a write and bumps the value from version v1 to v2, but has not yet synchronized it to node B (replication is asynchronous, as expected with W = 1: writing one copy counts as success), so node B is still at v1. Now node B receives a write request. In principle it should refuse, but it does not know that another node has already moved to v2; and it cannot refuse anyway, because with W = 1 writing a single node counts as success. The result is a serious version conflict.
Amazon's Dynamo cleverly sidesteps the version-conflict problem: conflicting versions are handed to the user to resolve.
So Dynamo introduced the vector clock design. A vector clock has each node record its own version information; that is, for the same piece of data it records two things: 1) who updated me, and 2) what my version number is.
Next, let's look at an action sequence:
- A write request is handled for the first time by node A. Node A attaches version information (A,1). Call this data D1 (A,1). Then another request for the same key is again handled by A, producing D2 (A,2). D2 can simply overwrite D1; no conflict arises.
- Now assume D2 has propagated to all nodes (B and C). The data B and C receive was not generated by a client but copied to them by another node, so they do not add new version information; the data held by B and C is still D2 (A,2). The data and version information on A, B, and C are therefore identical.
- If a new write request reaches node B, node B produces data D3 (A,2; B,1), meaning: data D has been updated three times in total, twice on A and once on B. Isn't that just like a source-control change log?
- Suppose D3 has not propagated to C, and a request is then handled by C; the data on node C becomes D4 (A,2; C,1).
- Now comes the best part: suppose a read request arrives. Remember that our W=1 implies R=N=3, so R reads from all three nodes, and it will read back three versions:
- From node A: D2 (A,2)
- From node B: D3 (A,2; B,1)
- From node C: D4 (A,2; C,1)
- At this point we can determine that D2 is an old version (it is already contained in both D3 and D4) and can be discarded.
- But D3 and D4 are an obvious version conflict. Dynamo then hands the conflict to the caller to resolve, just like source-code version management (see the small vector clock sketch below).
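Here is a minimal vector clock sketch for the D2/D3/D4 example above; the dict-based clock representation and helper names are illustrative assumptions, not Dynamo's actual data structures.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has (a >= b component-wise)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock descends from the other: a real version conflict."""
    return not descends(a, b) and not descends(b, a)

D2 = {"A": 2}            # written twice by node A
D3 = {"A": 2, "B": 1}    # then once more on node B
D4 = {"A": 2, "C": 1}    # independently once more on node C

print(descends(D3, D2))   # True  -> D2 is an old version, safe to discard
print(descends(D4, D2))   # True  -> D2 is also contained in D4
print(concurrent(D3, D4)) # True  -> conflict, handed back to the caller
```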
Clearly, Dynamo configured this way chooses A and P from the CAP triad.