Source: http://coolshell.cn/articles/10910.html
When we use a single server to provide data services in production, we typically run into two problems:
1) The performance of a single server is not enough to serve all of the network requests.
2) We are always afraid the server will go down, making the service unavailable or losing data.
So we have to scale out our servers: add more machines to share the load and remove the single point of failure. In general, we extend our data services in two ways:
1) Data partitioning: split the data across different servers (for example, by UID % 16, consistent hashing, etc.).
2) Data mirroring: let all servers hold the same data and provide equivalent service.
The first approach by itself does not solve the data-loss problem: when a single server fails, part of the data is lost. So high availability of a data service can only be achieved the second way, by storing the data redundantly (the industry generally considers three copies a safe number of backups, as in Hadoop and Dynamo). But adding more machines makes our data service more complex, especially around transactions that span servers, that is, data consistency across servers. This is a hard problem. Let's illustrate with the most classic use case, "transfer money from account A to account B". Anyone familiar with RDBMS transactions knows that moving money from A to B takes six operations:
- Read the balance of account A.
- Subtract the amount from account A.
- Write the result back to account A.
- Read the balance of account B.
- Add the amount to account B.
- Write the result back to account B.
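On a single node, an RDBMS transaction wraps these six steps so that they succeed or fail as a unit. Here is a minimal sketch in Python with SQLite; the accounts table and its columns are made up for illustration:

```python
import sqlite3

def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` from account `src` to `dst`; the six steps commit or roll back together."""
    with conn:  # sqlite3 commits on success, rolls back on any exception
        row = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if row is None or row[0] < amount:
            raise ValueError("insufficient balance")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

# Usage sketch:
# conn = sqlite3.connect("bank.db")
# transfer(conn, "A", "B", 100)
```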
For consistency, these six operations must either all complete successfully or all fail, and while they are in progress any other access to accounts A and B must be locked out. Locking means excluding other reads and writes; otherwise we get dirty data. That is what a transaction is. Now, once we add more machines, this gets complicated:
1) Under data partitioning: what if the data for accounts A and B is not on the same server? Then we need a cross-machine transaction. In other words, if debiting A succeeds but crediting B fails, we have to roll back the operation on A. Across machines this becomes much more complex.
2) Under data mirroring: the transfer between A and B can be done on a single machine, but remember that multiple machines hold copies of accounts A and B. What if two concurrent transfers out of account A (one to B and one to C) land on two different servers? In other words, with data mirroring, how do we keep writes to the same data on different servers consistent and free of conflicts?
At the same time we have to consider performance. If performance did not matter, transactions would not be hard to guarantee; the system would just run slower. Besides performance we also have to consider availability: if a machine goes down, no data is lost and the other machines can keep providing the service. So we need to weigh the following:
1) Fault tolerance: no data loss, node failover.
2) Data consistency: transaction processing.
3) Performance: throughput and response time.
As mentioned earlier, the only way to avoid losing data is data redundancy; even with data partitioning, each partition still needs redundant copies. These are data replicas: when a node loses data, it can be read back from a replica. Data replication is the only way a distributed system can cope with data loss. So, to keep things simple, the rest of this article only considers data consistency and performance under data redundancy. In short:
1) To make the data highly available, you have to write multiple copies of it.
2) Writing multiple copies creates data-consistency problems.
3) Data-consistency problems create performance problems.
That is software development: push down one problem and another pops up.
Consistency model
Speaking of data consistency, there are three basic types (of course, if you subdivide further there are many more consistency models, such as sequential consistency, FIFO consistency, session consistency, single-read consistency, and single-write consistency; to keep this article easy to read, I only discuss the following three):
1) Weak consistency: after you write a new value, a read may or may not see it on a replica. Examples: some cache systems; other players' data in an online game that has nothing to do with you; VoIP; or a search engine like Baidu.
2) Eventual consistency: after you write a new value, you may not be able to read it right away, but after some time window you are guaranteed to eventually read it. Examples: DNS, e-mail, Amazon S3, the Google search engine.
3) Strong consistency: once new data is written, the new value can be read at any time from any replica. Examples: file systems, RDBMSs, and Azure Table are strongly consistent.
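To make the middle case concrete, here is a toy sketch of eventual consistency: a primary that replicates each write to a replica asynchronously after a delay, so a read from the replica may be stale for a while and then converges. The class and the lag value are invented for illustration:

```python
import threading, time

class LaggyReplica:
    """Toy model of eventual consistency: the replica applies writes after a delay."""
    def __init__(self, lag: float = 0.5):
        self.primary = {}
        self.replica = {}
        self.lag = lag

    def write(self, key, value):
        self.primary[key] = value
        # replicate asynchronously, `lag` seconds later
        threading.Timer(self.lag, self.replica.__setitem__, (key, value)).start()

    def read_from_replica(self, key):
        return self.replica.get(key)

store = LaggyReplica(lag=0.5)
store.write("x", 1)
print(store.read_from_replica("x"))  # likely None: the write has not arrived yet
time.sleep(1)
print(store.read_from_replica("x"))  # 1: the replica has converged
```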
Looking at these three models: weak and eventual consistency generally go with asynchronous replication, while strong consistency generally means synchronous replication. Asynchronous usually means better performance, but it also means more complex state management; synchronous means simplicity, but also a performance hit. OK, let's go through the techniques one by one:
Master-slave
First is the master-slave structure. In this architecture, the slave is generally a backup of the master. Such a system is usually designed as follows:
1) Read and write requests are all handled by the master.
2) After a write is applied on the master, the master synchronizes it to the slave.
Synchronization from master to slave can be asynchronous or synchronous, pushed by the master or pulled by the slave. Usually the slave pulls periodically, which gives eventual consistency. The problem with this design is that if the master crashes within a pull interval, the data in that time slice is lost. If you cannot tolerate losing it, the slave can only serve reads until the master recovers.
Of course, if you can tolerate the data loss, you can let the slave take over as master immediately (for compute-only nodes there is no data consistency or data loss problem, so master-slave solves the single point of failure nicely). Master-slave can also be made strongly consistent: when we write to the master, the master first writes locally and, only after that succeeds, writes to the slave; only when both succeed do we return success, and the whole process is synchronous. If writing to the slave fails, there are two options: one is to mark the slave unavailable, report the error, and keep serving (the slave re-syncs from the master after it recovers; and with multiple slaves, losing one still leaves a backup, which is why we said earlier to keep three copies); the other is to roll back the master's own write and return failure. (Note: generally we do not write the slave first, because if the master's own write then fails we would have to roll back the slave, and if that rollback fails we would have to fix the data by hand.) You can see how complex master-slave becomes once you demand strong consistency.
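A minimal sketch of that synchronous write path, with rollback when the slave write fails; the Node/Master classes and their in-memory store are invented for illustration:

```python
class Node:
    def __init__(self):
        self.store = {}

    def write(self, key, value) -> bool:
        self.store[key] = value
        return True

class Master(Node):
    def __init__(self, slave: Node):
        super().__init__()
        self.slave = slave

    def write_strong(self, key, value) -> bool:
        """Synchronous replication: report success only if master and slave both succeed."""
        old = self.store.get(key)
        if not super().write(key, value):
            return False
        if not self.slave.write(key, value):
            # second option from the text: roll back our own write and report failure
            if old is None:
                self.store.pop(key, None)
            else:
                self.store[key] = old
            return False
        return True

slave = Node()
master = Master(slave)
print(master.write_strong("x", 42), slave.store)  # True {'x': 42}
```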
Master-master
Master-master, also called multi-master, means the system has two or more masters, each providing read and write service. This model is an enhanced version of master-slave; data synchronization between the masters is generally asynchronous, so it is eventually consistent. The advantage of master-master is that if one master goes down, another master can still serve reads and writes; like master-slave, data that has not yet been replicated to the other master is lost when a master dies. Many databases support master-master replication.
In addition, if several masters modify the same data, the nightmare of this model appears: merging conflicting data is not easy. Look at the design of Dynamo's vector clock (which records the version of the data and who modified it) and you will see it is not that simple; what's more, Dynamo hands data conflicts to the user, just like source conflicts in SVN: when the same line of code conflicts, only the developer can resolve it. (Dynamo's vector clocks are discussed later in this article.)
Two/Three-Phase Commit
This protocol is abbreviated 2PC, short for two-phase commit. In a distributed system, each node knows whether its own operation succeeded or failed, but it has no way of knowing whether the operations on other nodes succeeded or failed. When a transaction spans multiple nodes, in order to preserve the ACID properties of the transaction, we need to introduce a component that acts as the coordinator: it collects the operation results of all the nodes (called participants) and ultimately tells them whether to commit their results (for example, write the updated data to disk). The two-phase commit algorithm goes like this:
Phase one:
- The coordinator asks every participant node whether it can perform the commit.
- Each participant prepares to execute the transaction: for example, locking resources, reserving resources, writing undo/redo logs, and so on.
- Each participant responds to the coordinator: if its preparation succeeded, it answers "ready to commit"; otherwise it answers "refuse to commit".
Phase two:
- If all participants answered "ready to commit", the coordinator sends a "commit" command to all participants. The participants complete the commit, release all resources, and respond "done"; once the coordinator has collected a "done" from every node, it ends the global transaction.
- If any participant answered "refuse to commit", the coordinator sends a "rollback" to all participants; they release all resources, roll back, and respond "rollback done", and once the coordinator has collected the "rollback done" responses it cancels the global transaction.
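A minimal sketch of the coordinator's two phases; the Participant interface below is invented for illustration, and a real implementation also needs persistent logging and the timeout handling discussed further down:

```python
from typing import List, Protocol

class Participant(Protocol):
    def prepare(self, txn_id: str) -> bool: ...   # lock resources, write undo/redo log
    def commit(self, txn_id: str) -> None: ...
    def rollback(self, txn_id: str) -> None: ...

def two_phase_commit(txn_id: str, participants: List[Participant]) -> bool:
    # Phase one: collect votes
    votes = [p.prepare(txn_id) for p in participants]
    # Phase two: make and broadcast the decision
    if all(votes):
        for p in participants:
            p.commit(txn_id)
        return True
    for p in participants:
        p.rollback(txn_id)
    return False
```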
We can see that 2PC votes in the first phase and decides in the second, and that 2PC is a strong-consistency algorithm. Earlier we discussed the strong-consistency strategy for master-slave, which is somewhat similar to 2PC, except that 2PC is more conservative: try first, then commit. 2PC is used a lot. In some system designs a request goes through a chain of calls, say A, B, C, and D, each of which allocates some resources or modifies some data; for example, placing an order while shopping online triggers a series of back-end steps. If we just run them one after another, then whenever a later step cannot complete, every earlier step has to run a compensating operation to release the resources it allocated, which gets complicated. Many workflow systems therefore borrow 2PC's try-then-confirm process to make sure the whole flow completes successfully or not at all. As a popular analogy, a Western church wedding has this scene:
1) The priest asks the groom and the bride separately: do you take... in sickness and in health... (the inquiry phase).
2) Once both the groom and the bride have answered "I do" (locking up a lifetime of resources), the priest says: I now pronounce you... (the commit).
What a classic two-phase commit transaction. We can also see some problems with it: a) it is a synchronous, blocking operation, which inevitably hurts performance badly; b) the other big problem is timeouts. For example:
1) If in the first phase a participant never receives the inquiry, or a participant's reply never reaches the coordinator, the coordinator has to handle the timeout; once it times out, it can treat the attempt as failed and retry.
2) If in the second phase, after the commit is sent, some participant does not receive it, or a participant's acknowledgement after committing or rolling back never comes back, then once that participant's response times out the coordinator either retries or marks it as a problem node and evicts it from the cluster, so that the remaining serving nodes stay consistent.
3) The worst case is in the second phase: if a participant never receives the coordinator's commit/rollback instruction, it is stuck in an "unknown state" and has no idea what to do. For example, suppose all participants have finished replying in the first phase (maybe all Yes, maybe all No, maybe partly Yes and partly No) and the coordinator crashes at that moment. Then none of the participants know what to do, and asking the other participants does not help. For consistency, they can only block waiting for the coordinator, or the first-phase yes/no command has to be re-sent.
The biggest problem with two-phase commit is point 3: if, after finishing the first phase, a participant never receives the decision in the second phase, that data node is left "at a loss", and this state blocks the whole transaction. In other words, the coordinator is critical to completing the transaction, and the coordinator's availability is the key. This is why three-phase commit was introduced. The three-phase commit described on Wikipedia splits the first phase of two-phase commit in two: first ask, then lock resources, and only then actually commit. Three-phase commit looks like this:
The core idea of three-phase commit is not to lock resources when asking; locking only starts once everyone has agreed.
In theory, if all nodes return success in the first phase, there is good reason to believe the commit will succeed with high probability, which reduces the chance that a participant ends up not knowing the outcome: once a participant has received the PreCommit, it knows that everyone has agreed to the change. That is the important point. Take a look at the 3PC state transition diagram (note the dashed lines in the figure: the F/T labels mean failure or timeout, and the states are Q = Query, A = Abort, W = Wait, P = PreCommit, C = Commit).
From the dashed (F/T, i.e. failure or timeout) transitions in the diagram we can see that if a node is in the P (PreCommit) state when a failure or timeout happens, three-phase commit is better off than two-phase commit: it can go ahead and move to the C (Commit) state, whereas two-phase commit is stuck.
In practice, though, three-phase commit is quite complex, hard to implement, and has problems of its own.
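For comparison with the 2PC sketch above, here is a minimal sketch of the three phases; the participant methods (can_commit, pre_commit, do_commit, abort) are invented names, and the timeout behaviour that makes 3PC interesting is omitted:

```python
def three_phase_commit(txn_id: str, participants) -> bool:
    # Phase 1: CanCommit - only ask, no resources are locked yet
    if not all(p.can_commit(txn_id) for p in participants):
        for p in participants:
            p.abort(txn_id)
        return False
    # Phase 2: PreCommit - now lock resources and write undo/redo logs
    if not all(p.pre_commit(txn_id) for p in participants):
        for p in participants:
            p.abort(txn_id)
        return False
    # Phase 3: DoCommit - every participant has seen PreCommit, so after a
    # coordinator failure or timeout it can still decide to commit on its own
    for p in participants:
        p.do_commit(txn_id)
    return True
```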
By this point I believe you have plenty of questions. As you think through the various failure scenarios of 2PC/3PC, you will find that timeouts are very hard to deal with, because a network timeout often leaves you with nothing you can do: you do not know whether the other side performed the operation or not. A carefully designed state machine can be rendered useless by timeouts.
A call over the network has three outcomes: 1) success, 2) failure, 3) timeout, and the third is an absolute nightmare, especially when you need to maintain state.
Two Generals Problem
The Two Generals Problem is a thought experiment. Two armies, each led by a general, are preparing to attack a fortified city. The armies are camped near the city, each on its own hill, with a valley between them, and the only way the two generals can communicate is by sending messengers through the valley. Unfortunately, the valley is occupied by the city's defenders, and any messenger sent through it may be captured. Note that while the two generals agree on attacking the city, they had not agreed on a time of attack before taking up their positions. The two generals must have their armies attack the city at the same time to succeed, so they must communicate to settle on an attack time and commit to attacking at that moment. If only one general attacks, the result is a disastrous defeat. The thought experiment asks how they can pull this off. Here is one way to reason about it:
1) The first general sends the message "let's attack at 9 a.m." But once the messenger is dispatched, the first general has no idea whether he made it across the valley. Any uncertainty makes the first general hesitate to attack, because if the second general cannot attack at the same time, the city's garrison will repel his army and destroy it.
2) Knowing this, the second general needs to send back a confirmation: "I received your message and will attack at 9." But what if the messenger carrying the confirmation is captured? Now the second general hesitates over whether his confirmation arrived.
3) So it seems the first general has to send yet another confirmation: "I received your confirmation." But what if that messenger is caught?
4) And so the second general would need to send a confirmation of the confirmation of the confirmation...
You quickly see that no matter how many rounds of confirmation are sent, there is no way for the two generals to be confident enough that their messengers were not captured by the enemy.
This problem has no solution. The Two Generals Problem and the proof of its unsolvability were first published by E. A. Akkoyunlu, K. Ekanadham, and R. V. Huber in 1975 in "Some Constraints and Trade-offs in the Design of Network Communications", where it is described in a passage on page 73 about communication between two gangs of gangsters. In 1978 it was named the Two Generals Paradox in Jim Gray's "Notes on Data Base Operating Systems" (starting at page 465), which is widely cited as the source of the problem's definition and its impossibility proof.
The experiment is meant to illustrate the challenge of coordinating an action by communicating over an unreliable link.
In engineering, the practical answer to the Two Generals Problem is to adopt a scheme that tolerates the unreliability of the channel rather than trying to eliminate it, merely reducing the uncertainty to an acceptable level. For example, the first general could send 100 messengers and assume it is unlikely that all of them are captured; in that case he attacks regardless of whether the second general attacks or any reply comes back. Alternatively, the first general can send a stream of messages and the second general can acknowledge each of them; the more get through, the more confident both generals feel. As the proof shows, however, neither of them can be certain the attack is coordinated, and no rule (for example, attack only after receiving more than four messages) can guarantee that it never happens that only one side attacks. The first general can also number each message 1, 2, ... up to n; this lets the second general gauge how reliable the channel is and send back the appropriate number of acknowledgements to make sure the last message gets through. But if the channel were reliable, a single message would do and the rest would add nothing, and the last message is just as likely to be lost as the first.
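As a toy illustration of "reduce the uncertainty rather than eliminate it", here is a small Monte Carlo estimate of the chance that at least one of n messengers gets through; the 50% capture probability is of course made up:

```python
import random

def at_least_one_arrives(n_messengers: int, p_capture: float,
                         trials: int = 100_000) -> float:
    """Estimate the probability that at least one messenger survives the valley."""
    ok = sum(
        any(random.random() > p_capture for _ in range(n_messengers))
        for _ in range(trials)
    )
    return ok / trials

# With a 50% capture rate, 1 messenger gets through about half the time,
# 10 messengers almost always (analytically 1 - 0.5**10), but never with certainty.
print(at_least_one_arrives(1, 0.5))
print(at_least_one_arrives(10, 0.5))
```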
The Two Generals Problem can be extended to the even nastier Byzantine Generals Problem. Its story goes like this: Byzantium, on the site of present-day Istanbul in Turkey, was the capital of the Eastern Roman Empire. Because the Byzantine Empire was vast, its armies were stationed far apart for defense, and generals could only communicate by messenger. In wartime, all the generals of the Byzantine army had to reach a consensus on whether there was a chance of victory before attacking the enemy camp. However, the army might contain traitors and enemy spies, and these traitorous generals would disrupt or sway the decision-making. The question is: given that it is known some members are traitors, how do the remaining loyal generals reach agreement without being misled by them? That is the Byzantine Generals Problem.
Paxos algorithm
Wikipedia describes the Paxos family of algorithms in great detail; you can go take a look.
The problem the Paxos algorithm solves is how a distributed system, in which all the anomalies described above can occur, can agree on a value, ensuring that no matter what happens, the consistency of the decision is never broken. A typical scenario: in a distributed database system, if the initial state of every node is consistent and every node executes the same sequence of operations, then they all end up in a consistent state. To make sure every node executes the same command sequence, a "consensus algorithm" has to be run for each instruction, ensuring that every node sees the same instruction. A general-purpose consensus algorithm applies in many scenarios and is an important problem in distributed computing; research on consensus algorithms has not stopped since the 1980s.
Notes: the Paxos algorithm was devised by Leslie Lamport (the "La" in LaTeX, now at Microsoft Research), who proposed it in 1990 as a message-passing-based consensus algorithm. Because the algorithm was hard to understand, it drew little attention at first, so Lamport published it again eight years later, in 1998, in ACM Transactions on Computer Systems ("The Part-Time Parliament"). Even then Paxos received little attention, and in 2001 Lamport, feeling that his peers could not take his sense of humor, restated it in a more approachable form ("Paxos Made Simple"). You can tell Lamport has a special attachment to Paxos. The broad adoption of Paxos in recent years has proved its central place among distributed consensus algorithms. Google's three famous papers of 2006 launched the "cloud" era, and among them the Chubby lock service uses Paxos as the consensus algorithm within a Chubby cell; Paxos's popularity took off from there. (Lamport himself wrote on his blog about the nine years he spent getting the paper published.)
Note: all of Amazon AWS's cloud services are built on an ALF (Async Lock Framework) framework, which uses the Paxos algorithm. When I was at Amazon and watched the internal sharing video, the designer said in the internal principles talk that he had referred to ZooKeeper's approach, but implemented the algorithm in a different way that was easier to read than ZooKeeper's.
In short, the purpose of Paxos is to let the nodes of a cluster agree on a change to a value. The Paxos algorithm is essentially a democratic election: the decision of the majority becomes the decision of the whole cluster. Any node can propose a change to some piece of data, and whether the proposal passes depends on whether more than half of the nodes in the cluster agree to it (which is why the Paxos algorithm calls for an odd number of nodes in the cluster).
The algorithm has two phases (suppose there are three nodes: A, B, and C):
Phase I: the Prepare phase
The proposer sends a prepare request for the modification to all nodes A, B, and C. Note that in Paxos every request carries a sequence number (think of it as a proposal number) that is monotonically increasing and unique (that is, A and B can never use the same proposal number), and this number is sent along with the modification request. In the prepare phase, any node rejects a request whose proposal number is smaller than the current one it holds. So when node A asks all the nodes to accept a change, it must bring a proposal number, and the newer the request, the larger the number.
If a receiving node finds that the proposal number n is larger than any it has seen, it responds Yes (together with the latest proposal it has accepted, if any) and promises not to accept any proposal numbered less than n. In this way, during the prepare phase a node always commits itself to the newest proposal.
Optimization: in the prepare process above, if a node finds that a higher-numbered proposal already exists, it should also notify the proposer, as a hint to abandon its current proposal.
Phase II: the Accept phase
If proposer A receives Yes from more than half of the nodes, it sends an accept request to all the nodes (again carrying proposal number n); if it does not get more than half, it returns failure.
When a node receives an accept request, if n is the largest number it has seen it applies the change; if it has already seen a larger proposal number, the node rejects the change.
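Here is a minimal sketch of one round of single-decree Paxos between in-process objects; the class and method names are invented, and a real implementation would persist the acceptor state and run over an unreliable network:

```python
from typing import List, Optional, Tuple

class Acceptor:
    def __init__(self):
        self.promised_n = -1                              # highest proposal number promised
        self.accepted: Optional[Tuple[int, str]] = None   # (n, value) accepted so far

    def prepare(self, n: int):
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted    # Yes, plus the latest accepted proposal if any
        return False, None                # reject proposals numbered <= promised_n

    def accept(self, n: int, value: str) -> bool:
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False

def propose(n: int, value: str, acceptors: List[Acceptor]) -> Optional[str]:
    # Phase I: prepare
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) <= len(acceptors) // 2:
        return None                       # no majority of promises, give up this round
    # If any acceptor already accepted a value, we must keep the highest-numbered one
    already = [acc for acc in promises if acc is not None]
    if already:
        value = max(already)[1]
    # Phase II: accept
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor(), Acceptor(), Acceptor()]
print(propose(1, "x=42", acceptors))      # 'x=42' once a majority has accepted it
```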
We can see that this looks like an optimized version of two-phase commit. In fact, 2PC/3PC can be seen as broken versions of a distributed consensus algorithm; Google Chubby's author Mike Burrows said that there is only one consensus algorithm in the world, namely Paxos, and that all other algorithms are defective versions of it.
We can also see that even if proposals to modify the same value reach different nodes out of order, no problem arises.
For worked examples, have a look at the "Paxos examples" section of the Chinese Wikipedia article; I will not repeat them here. You can also work through the failure cases of the Paxos algorithm yourself; you will find that basically nothing goes wrong as long as more than half of the nodes stay alive.
One more thing: since Lamport published Paxos in 1998, improvements to Paxos have never stopped, the most notable being Fast Paxos, published in 2005. Whatever the improvement, the focus remains the trade-off between message delay and performance/throughput. To distinguish the two concepts, the original is called Classic Paxos and the newer one Fast Paxos.
Summary, from Google App Engine co-founder Ryan Barrett's 2009 Google I/O talk "Transactions Across Datacenters" (video: http://www.youtube.com/watch?v=srOgpXECblk):
Earlier we said that to make data highly available we have to write redundant copies; writing multiple copies causes consistency problems, and consistency problems cause performance problems. From the comparison table in that talk we can see that basically no scheme gets every cell green. This is the famous CAP theorem: consistency, availability, partition tolerance, and you can only have two of them.
NWR model
Finally, I want to mention Amazon Dynamo's NWR model. The NWR model hands the CAP choice to the user: it lets you pick which two of the three CAP properties you want.
In the NWR model, N is the number of replicas, W means a write must succeed on at least W replicas, and R means a read must consult at least R replicas. The configuration must satisfy W + R > N. Because W + R > N, we have R > N - W. What does that mean? The number of replicas you read must exceed the total number of replicas minus the number of replicas a write is guaranteed to reach.
In other words, every read sees at least one copy of the latest version, so you never read only stale data. When we need a write-heavy environment we can set W = 1; with N = 3 that forces R = 3: a write to any single node counts as success, but a read must consult all three nodes. If we want efficient reads, we can set W = N and R = 1: a read from any single node counts as success, but a write only succeeds when all three nodes have been written.
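A small sketch of why W + R > N lets a read see the latest version, treating each replica as a (version, value) pair; the helper name and the sample data are made up:

```python
def quorum_read(replicas, r):
    """Toy quorum read: with W + R > N, at least one of the R replicas read
    must hold the latest successfully written version."""
    sampled = replicas[:r]                  # a real system picks R live replicas
    return max(sampled, key=lambda pair: pair[0])

# N = 3, W = 1, R = 3: writes are cheap (any one node), reads touch all nodes.
replicas = [(2, "new"), (1, "old"), (1, "old")]   # only one node has version 2 so far
print(quorum_read(replicas, r=3))           # (2, 'new'): the read still sees the latest

# Sanity check on configurations: W + R must exceed N.
for n, w, r in [(3, 1, 3), (3, 3, 1), (3, 2, 2), (3, 1, 1)]:
    print(n, w, r, "ok" if w + r > n else "may read only stale replicas")
```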
Some NWR configurations can produce dirty data, because NWR is obviously not a strong-consistency protocol in the way Paxos is: successive reads and writes may not land on the same nodes, so some nodes may hold data that is not the latest version even though the latest write has already completed elsewhere.
For this reason Amazon Dynamo introduces data versioning. That is, if you read data at version v1 and, after finishing your computation, try to write the result back, only to find that the data's version has already advanced to v2, the server rejects your write. Versioning works like an "optimistic lock".
However, in a distributed system with the NWR model, versions bring their own nightmare: version conflicts. For example, suppose we set N = 3 and W = 1. Node A accepts a write and bumps the version from v1 to v2, but has not yet synchronized it to node B (replication is asynchronous, and W = 1 means writing one copy already counts as success), so node B is still at v1. Now node B receives a write request; by rights it ought to refuse, but it has no idea that other nodes have already advanced to v2, and on the other hand it cannot refuse, because W = 1 means a write to a single node succeeds. The result is a serious version conflict.
Amazon Dynamo cleverly sidesteps the version-conflict problem: conflicting versions are handed to the user to resolve.
To do this, Dynamo introduced the vector clock design. The idea is that every node records its own version information; for the same piece of data, two things are recorded: 1) who updated me, and 2) what my version number is.
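Here is a minimal sketch of such a clock as a mapping from node name to update counter; the class and method names are invented, and Dynamo's real implementation also truncates old entries:

```python
class VectorClock:
    def __init__(self, counts=None):
        self.counts = dict(counts or {})   # node name -> number of updates by that node

    def increment(self, node: str) -> "VectorClock":
        c = dict(self.counts)
        c[node] = c.get(node, 0) + 1
        return VectorClock(c)

    def descends(self, other: "VectorClock") -> bool:
        """True if this clock has seen everything `other` has (other is an ancestor)."""
        return all(self.counts.get(n, 0) >= v for n, v in other.counts.items())

    def conflicts(self, other: "VectorClock") -> bool:
        """Neither clock descends from the other: concurrent updates, the user must merge."""
        return not self.descends(other) and not other.descends(self)

    def __repr__(self):
        return str(self.counts)
```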
Next, let's walk through a sequence of operations:
1) A write request is handled for the first time by node A. Node A attaches version information (A,1); call the data at this point D1 (A,1). Then another write for the same key is handled by A again, producing D2 (A,2). D2 can simply overwrite D1; no conflict arises.
2) Now suppose D2 is propagated to all nodes (B and C). The data B and C receive did not come from a client, it was replicated to them, so they do not generate new version information; the data held by B and C is still D2 (A,2). At this point the data and its version number are the same on A, B, and C.
3) If a new write request now goes to node B, node B produces D3 (A,2; B,1), meaning: data D is at its third update overall, twice by A and once by B. Isn't that just like a commit log in version control?
4) If D3 has not yet propagated to C and another request is handled by C, the data on node C becomes D4 (A,2; C,1).
5) Now for the best part: suppose a read request arrives. Remember that W = 1, so R = N = 3, and the read consults all three nodes. It gets back three versions:
- From node A: D2 (A,2)
- From node B: D3 (A,2; B,1)
- From node C: D4 (A,2; C,1)
6) At this point it can be determined that D2 is an old version (it is contained in both D3 and D4) and can be discarded.
7) But D3 and D4 are an obvious version conflict, so the conflict is handed back to the caller to resolve, just like in source-code version control.
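Replaying this sequence with the VectorClock sketch from above:

```python
d2 = VectorClock({"A": 2})
d3 = d2.increment("B")                   # write handled by B: {'A': 2, 'B': 1}
d4 = d2.increment("C")                   # concurrent write handled by C: {'A': 2, 'C': 1}

print(d3.descends(d2), d4.descends(d2))  # True True  -> D2 is an ancestor, discard it
print(d3.conflicts(d4))                  # True       -> D3 vs D4 conflict, hand it to the caller
```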
Clearly, Dynamo as configured above chooses A and P from CAP.
I strongly recommend reading the paper "Dynamo: Amazon's Highly Available Key-value Store"; if the English is painful, you can read the Chinese translation (translator unknown).
(End of full text)