When we use a single server to provide data services in production, we usually run into the following two problems:
1) A single server does not have enough performance or capacity to serve all of the network requests.
2) We always worry that this single server will go down, making the service unavailable or losing data.
So we have to scale out: add more machines to share the load and to remove the single point of failure. We usually scale a data service in one of two ways:
1) Data partitioning: the data is split across different servers (for example: uid % 16, consistent hashing, and so on).
2) Data mirroring: every server holds the same data and provides equivalent service.
The first approach by itself does not solve the problem of data loss: when a single server fails, some of the data is gone. Therefore, high availability of a data service can only be achieved by the second approach, redundant storage of the data (the industry generally considers three replicas a safe number, as in Hadoop and Dynamo). However, adding more machines complicates our data service, especially cross-server transactions, that is, data consistency across servers. This is a hard problem. Let us use the most classic use case, "transfer money from account A to account B", to explain. Anyone familiar with RDBMS transactions knows that transferring from account A to account B takes six operations:
1) Read the balance of account A.
2) Subtract the amount from account A's balance.
3) Write the result back to account A.
4) Read the balance of account B.
5) Add the amount to account B's balance.
6) Write the result back to account B.
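As a concrete illustration, here is a minimal sketch of these six steps wrapped in a single local transaction (Python with sqlite3; the `accounts` table, its columns, and the connection setup are assumptions made purely for this example), so that they either all take effect or none of them do:

```python
import sqlite3

# Assumed setup for the example: a local database with an accounts(id, balance) table.
conn = sqlite3.connect("bank.db", isolation_level=None)   # we issue BEGIN/COMMIT ourselves

def transfer(conn, from_id, to_id, amount):
    """Run the six steps as one atomic unit: all succeed or all are rolled back."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN IMMEDIATE")      # take a write lock so other writers are excluded
        (a_balance,) = cur.execute(
            "SELECT balance FROM accounts WHERE id = ?", (from_id,)).fetchone()  # 1) read A
        a_balance -= amount                                                       # 2) subtract
        cur.execute("UPDATE accounts SET balance = ? WHERE id = ?",
                    (a_balance, from_id))                                         # 3) write A back
        (b_balance,) = cur.execute(
            "SELECT balance FROM accounts WHERE id = ?", (to_id,)).fetchone()     # 4) read B
        b_balance += amount                                                        # 5) add
        cur.execute("UPDATE accounts SET balance = ? WHERE id = ?",
                    (b_balance, to_id))                                            # 6) write B back
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")             # undo any partial work so no dirty data is left behind
        raise
```

The rollback in the except branch is what prevents the "A was debited but B was never credited" situation, as long as everything lives on one server.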
To keep the data consistent, these six operations must either all succeed or all fail, and while they are in progress all other access to accounts A and B must be locked. Locking means excluding other read and write operations; otherwise we get dirty data. That is what a transaction is. Once we add more machines, however, things get complicated:
1) In the data partitioning scheme: what if the data of accounts A and B is not on the same server? Then we need a cross-machine transaction. In other words, if the deduction from A succeeds but the addition to B fails, we have to roll back the operation on A, and doing that across machines is much more complicated.
2) In the data mirroring scheme: the transfer between accounts A and B can be done on one machine, but do not forget that multiple machines hold copies of accounts A and B. If there are two concurrent transfers out of account A (one to B and one to C), and the two operations land on two different servers, what then? In other words, with data mirroring, how do we keep writes to the same data consistent across different servers and make sure the data does not conflict?
At the same time, we have to consider performance. If we ignored performance, guaranteeing transactions would not be hard; the system would just be slow. Besides performance, we also have to consider availability: when one machine dies, data must not be lost and the other machines must be able to keep serving. So we need to focus on the following situations:
1) Disaster recovery: data is not lost; failed nodes can fail over.
2) Data consistency: transaction processing.
3) Performance: throughput and response time.
As mentioned earlier, the only way to avoid losing data is data redundancy; even with data partitioning, each partition still needs redundancy. These are data replicas: when a node loses its data, it can be read back from a replica. Data replicas are the only means a distributed system has of coping with data loss. So, for simplicity, in this article we only discuss data consistency and performance under the assumption of data redundancy. Briefly:
1) To make data highly available, you have to write multiple copies of it.
2) Writing multiple copies leads to data consistency problems.
3) Data consistency problems lead to performance problems.
This is software development: you push the gourd down and the ladle pops up; solving one problem creates the next.
Consistency Models
Speaking of data consistency, for simplicity there are three types (of course there are many more consistency models, such as sequential consistency, FIFO consistency, session consistency, single-read consistency, and single-write consistency, but to keep this article readable I will only describe the following three):
1) Weak consistency: after you write a new value, a read may or may not see it on the data copies. For example: some cache systems; data in online games belonging to other players that has nothing to do with yours; VOIP systems; or the Baidu search engine.
2) Eventual consistency: after you write a new value, reads may not see it immediately, but after some time window it is guaranteed to eventually become readable. For example: DNS, e-mail, Amazon S3, and the Google search engine are systems like this.
3) Strong consistency: once new data is written, the new value can be read at any moment from any copy. For example: file systems, RDBMSes, and Azure Table are all strongly consistent.
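To make the difference more tangible, here is a toy sketch (not how any real system is implemented; the class and its delay parameter are invented purely for illustration) showing why a read can return the old value under eventual consistency but never under strong consistency:

```python
import threading
import time

class Replicas:
    """Toy model of one value stored on three copies."""

    def __init__(self):
        self.copies = [0, 0, 0]

    def write_eventual(self, value, delay=0.5):
        # Acknowledge after updating one copy; the others catch up in the background,
        # so a read issued in the meantime may still see the old value.
        self.copies[0] = value
        def propagate():
            time.sleep(delay)
            self.copies[1] = value
            self.copies[2] = value
        threading.Thread(target=propagate, daemon=True).start()

    def write_strong(self, value):
        # Acknowledge only after every copy holds the new value,
        # so any later read from any copy sees it.
        for i in range(len(self.copies)):
            self.copies[i] = value

    def read(self, index):
        return self.copies[index]

r = Replicas()
r.write_eventual(42)
print(r.read(2))   # may still print 0 during the propagation window
r.write_strong(99)
print(r.read(2))   # always prints 99
```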
Looking at these three consistency models, we can see that weak and eventual consistency are generally asynchronous and redundant, while strong consistency is generally synchronous and redundant. Asynchronous usually means better performance, but also more complicated state control; synchronous means simplicity, but also lower performance. Now let us look at the techniques step by step:
Master-Slave
The first is the Master-Slave structure. In this configuration, the Slave is usually a backup of the Master. Such a system is generally designed as follows:
1) Read and write requests all go to the Master.
2) After a write request is applied on the Master, the Master synchronizes it to the Slave.
Synchronization from Master to Slave can be asynchronous or synchronous, and can be pushed by the Master or pulled by the Slave. Usually the Slave pulls periodically, which gives eventual consistency. The problem with this design is that if the Master crashes within a pull interval, the writes of that time slice are lost. If you do not want to lose data, the Slave can only serve in read-only mode until the Master recovers.
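A minimal sketch of this pull-based design (the class and method names are assumptions for illustration, not any particular database's API): the Master keeps an append-only log of writes, and the Slave replays it on each periodic pull, which is exactly why anything written after the last pull is lost if the Master dies:

```python
import time

class Master:
    """Serves all reads and writes, and keeps an append-only log for Slaves to pull."""

    def __init__(self):
        self.data = {}
        self.log = []                      # list of (key, value) writes, in order

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def fetch_log(self, from_offset):
        return self.log[from_offset:]

class Slave:
    """A read-only backup that periodically pulls and replays the Master's log."""

    def __init__(self, master, interval=1.0):
        self.master = master
        self.data = {}
        self.offset = 0                    # how much of the Master's log we have applied
        self.interval = interval

    def pull_once(self):
        for key, value in self.master.fetch_log(self.offset):
            self.data[key] = value
            self.offset += 1

    def run_forever(self):
        while True:                        # anything the Master accepts between two pulls
            self.pull_once()               # is lost if the Master crashes before the next one
            time.sleep(self.interval)
```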
Of course, if you can tolerate losing that data, you can immediately let the Slave take over the Master's work (for nodes that are only responsible for computation and hold no data, there is no consistency or data-loss problem, and Master-Slave neatly solves the single point of failure). Master-Slave can also be made strongly consistent: when we write, the Master writes its own copy first, and only after that succeeds does it write to the Slave; only when both succeed does it return success, making the whole process synchronous. If writing the Slave fails, there are two options. One is to mark the Slave as unavailable, report the error, and keep serving (the Slave re-synchronizes from the Master when it recovers; you can also run multiple Slaves, so losing one still leaves a backup, like the three replicas mentioned earlier). The other is to roll back the Master and return a write failure. (Note: we generally do not write the Slave first, because if writing the Master then fails we would have to roll back the Slave, and if that rollback also fails we would have to fix the data by hand.) You can see how complex Master-Slave becomes once you require consistency.
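A sketch of the synchronous, strongly consistent variant described above (again, the names are illustrative, not a real product's API): the Master writes itself first, then the Slave, and the two failure-handling choices from the paragraph above appear in the except branch:

```python
class SyncSlave:
    def __init__(self):
        self.data = {}

    def replicate(self, key, value):
        self.data[key] = value             # may raise if the Slave is down or unreachable

class SyncMaster:
    def __init__(self, slave):
        self.data = {}
        self.slave = slave
        self.slave_available = True

    def write(self, key, value):
        old_value = self.data.get(key)
        self.data[key] = value                         # write the Master's own copy first
        if self.slave_available:
            try:
                self.slave.replicate(key, value)       # then write the Slave, synchronously
            except Exception:
                # Choice 1: mark the Slave unavailable and keep serving; it must
                # re-sync from the Master before it can be used again.
                self.slave_available = False
                # Choice 2 (instead): roll back the local write and report failure:
                #     self.data[key] = old_value
                #     raise
        return "ok"                                    # acknowledged only after both copies are handled
```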
Master-Master
Master-Master, also called Multi-Master, means a system has two or more Masters, each providing read and write service. This model is an enhanced version of Master-Slave: data synchronization between the Masters is generally done asynchronously, so it is eventually consistent. The advantage of Master-Master is that when one Master goes down, the other Masters can still serve reads and writes normally. Like Master-Slave, data that has not yet been copied to the other Masters is lost. Many databases support a Master-Master replication mechanism.
In addition, if multiple Masters modify the same data, the model's nightmare appears: merging conflicting data, which is not an easy task. Just look at Dynamo's vector clock design (recording the version and modifier of each piece of data) to see that it is not simple; Dynamo even pushes data conflicts back to the user to resolve, just like conflicts in our SVN source code, which only the developer can deal with. (Dynamo's vector clocks will be discussed later in this article.)
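Here is a rough sketch of the vector-clock idea (not Dynamo's actual code; the function names are invented for illustration): each version of a value carries a counter per node, and comparing the counters tells you whether one version descends from the other or whether the two are concurrent and must be merged by the caller:

```python
def vc_increment(clock, node):
    """Return a copy of `clock` with `node`'s counter bumped after a local write."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def vc_compare(a, b):
    """Compare two vector clocks.
    Returns 'descends' if a >= b, 'precedes' if a <= b, otherwise 'conflict'."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge:
        return "descends"
    if b_ge:
        return "precedes"
    return "conflict"       # concurrent updates: the application has to merge them

# Example: two Masters update the same value concurrently.
v1 = vc_increment({}, "master1")      # {'master1': 1}
v2 = vc_increment({}, "master2")      # {'master2': 1}
print(vc_compare(v1, v2))             # 'conflict' -> needs a merge, like an SVN conflict
```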
Two / Three Phase Commit
This protocol is commonly abbreviated 2PC, short for two-phase commit. In a distributed system, each node knows whether its own operation succeeded or failed, but it cannot know whether the operations of the other nodes succeeded or failed. When a transaction spans multiple nodes, in order to preserve the ACID properties of the transaction, we need to introduce a component acting as a coordinator, which controls the outcome of all the nodes (called participants) and finally decides whether these nodes should actually commit their results (for example, write the updated data to disk). The two-phase commit algorithm works as follows:
First phase:
The coordinator asks all participant nodes whether they can perform the commit operation. Each participant begins preparing to execute the transaction: for example, locking resources, reserving resources, and writing undo/redo logs. The participants respond to the coordinator: if the preparation succeeds, they respond "can commit"; otherwise they respond "refuse to commit".
Second phase:
If all participants respond "can commit", the coordinator sends a "commit" command to all of them. Each participant completes the commit, releases all its resources, and then responds "done"; the coordinator concludes the global transaction after collecting the "done" responses from every node. If any participant responds "refuse to commit", the coordinator sends a "roll back" command to all participants; each one rolls back, releases all its resources, and responds "rollback done", and the coordinator cancels the global transaction after collecting these responses.
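Putting the two phases together, here is a minimal sketch of the coordinator's control flow (the prepare/commit/rollback participant interface is an assumption for illustration, not any specific product's API):

```python
def two_phase_commit(participants):
    """Phase 1: collect votes. Phase 2: commit everywhere or roll back everywhere."""
    # Phase 1: every participant locks resources, writes undo/redo logs, and votes.
    votes = [p.prepare() for p in participants]        # True = "can commit", False = "refuse"

    # Phase 2: a single global decision, applied by all participants.
    if all(votes):
        for p in participants:
            p.commit()                                  # make the change durable, release resources
        return "committed"
    else:
        for p in participants:
            p.rollback()                                # undo the prepared work, release resources
        return "aborted"
```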
Plainly speaking, 2PC is: vote in the first phase, carry out the decision in the second. You can also see that 2PC is a strong consistency algorithm. The strongly consistent Master-Slave strategy we discussed in the previous section is somewhat similar to 2PC, except that 2PC is more conservative: it tries first before committing. 2PC is used quite a lot. In some system designs there is a chain of calls, such as A -> B -> C -> D, where each step allocates some resources or rewrites some data. For example, a B2C online shopping order triggers a series of background processes that all need to complete. If we just execute them step by step and one step cannot be done, the resources allocated by all the previous steps have to be reversed and released, which makes the operation complicated. Many workflows now borrow from the 2PC idea and use a try -> confirm flow to ensure the whole process completes successfully. To take a popular example, Western church weddings have this scene:
1) The priest asks the groom and the bride separately: are you willing ... in sickness and in health, till death do you part ... (the inquiry phase)
2) When both the bridegroom and the bride answer that they are willing (locking their lifelong resources), the priest says: I now pronounce you ... (the commit phase)
This is a classic two-phase commit transaction. Beyond that, we can also see some of its problems: A) it is a synchronous, blocking operation, which is bound to hurt performance badly; B) another major problem is timeouts, for example:
1) If, in the first phase, a participant never receives the inquiry, or the participant's reply never reaches the coordinator, then the coordinator has to handle the timeout; once it times out, it can treat the attempt as a failure, or it can retry. (A sketch of one way to handle these phase-one timeouts follows this list.)
2) If, in the second phase, the commit command is sent but some participant does not receive it, or the participant's acknowledgement never comes back, then once the participant's response times out the coordinator can either retry or mark that participant as a problem node and remove it from the cluster, which keeps the remaining service nodes consistent.
3) The bad case is when, in the second phase, a participant cannot receive the coordinator's commit/rollback instruction. The participant is then stuck in an "unknown state" and has no idea what to do. For example, suppose all participants have completed their first-phase replies (maybe all yes, maybe all no, maybe partly yes and partly no) and the coordinator then crashes. None of the nodes knows what to do (and asking the other participants does not settle it). To stay consistent, they can only block, waiting for the coordinator to recover and reissue the yes/no decision of the first phase, or die waiting.
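As one hedged sketch of the first timeout case above (the two-second limit and the "treat silence as a refusal" policy are example choices, not the only possibilities):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def collect_votes(participants, timeout_seconds=2.0):
    """Ask every participant to prepare, treating a missing reply as a 'no' vote.
    This resolves phase-one timeouts; phase-two timeouts (a participant that never
    hears the decision) still have to be handled by retries or by removing the
    node from the cluster, as described above."""
    votes = []
    with ThreadPoolExecutor(max_workers=len(participants)) as pool:
        futures = [pool.submit(p.prepare) for p in participants]
        for future in futures:
            try:
                votes.append(future.result(timeout=timeout_seconds))
            except TimeoutError:
                votes.append(False)        # no answer in time: assume "refuse to commit"
            except Exception:
                votes.append(False)        # a failed participant also counts as a refusal
    return all(votes)
```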
If the first phase has completed but a participant never receives the decision of the second phase, that data node enters an "at a loss" state, which blocks the whole transaction. In other words, the coordinator is critical to completing the transaction, and the coordinator's availability is the key. Because of this, three-phase commit was introduced. The description of three-phase commit on Wikipedia is that it splits the first phase of 2PC into two steps: first ask, then lock resources, and finally commit. Three-phase commit works as follows:
The core idea of three-phase commit is that no resources are locked when asking; resources are locked only after everyone has agreed.
In theory, if all nodes return success in the first phase, there is good reason to believe the commit will probably succeed. This reduces the probability of participants (cohorts) ending up in an unknown state. In other words, once a participant receives PreCommit, it knows that everyone has in fact agreed to the change. This point is very important. Let us look at the state transition diagram of 3PC (note the dashed lines in the figure, where F and T stand for Failure or Timeout, and the states are: q - Query, a - Abort, w - Wait, p - PreCommit, c - Commit):
From the state diagram above, looking at the dashed lines (those F and T are Failure or Timeout), we can see what happens if a node is in the P state (PreCommit) when an F/T problem occurs: the benefit of three-phase commit is that it can still move the state straight to the C state (Commit), whereas two-phase commit would be at a loss.
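Here is a rough, deliberately simplified sketch of the three phases, using invented can_commit/pre_commit/do_commit names to mirror the description above; the point is that nothing is locked until every participant has already said yes:

```python
def three_phase_commit(participants):
    """Sketch of the 3PC flow: CanCommit (nothing locked), PreCommit (lock), DoCommit."""
    # Phase 1: just ask; nothing is locked yet, so aborting here is cheap.
    if not all(p.can_commit() for p in participants):
        for p in participants:
            p.abort()
        return "aborted"

    # Phase 2: everyone agreed, so lock resources and prepare.
    # A participant that reaches this state knows all the others voted yes,
    # which is what lets it default to commit on a later timeout.
    if not all(p.pre_commit() for p in participants):
        for p in participants:
            p.abort()
        return "aborted"

    # Phase 3: actually commit and release resources.
    for p in participants:
        p.do_commit()
    return "committed"
```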
In fact, three-phase commit is a very complicated thing and quite difficult to implement, and it still has problems of its own.
Reading this far, I believe you have many, many questions. You must be thinking through the various failure scenarios of 2PC/3PC, and you will find that Timeout is very hard to handle, because in many cases a network timeout leaves you unable to do anything: you do not know whether the other side performed the operation or not. So a carefully designed state machine can be reduced to mere decoration by Timeout.
A network service call has three possible outcomes: 1) success, 2) failure, 3) timeout. The third one is the absolute nightmare, especially when you need to maintain state.
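A small sketch of why the third outcome is so painful (plain sockets here, purely for illustration): on success or an explicit failure you know what happened, but on a timeout you cannot tell whether the remote side performed the operation:

```python
import socket

def call_remote(host, port, payload, timeout_seconds=2.0):
    """Return one of the three outcomes of a network call."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as sock:
            sock.sendall(payload)
            reply = sock.recv(4096)
            return ("success", reply)      # the remote side definitely processed the request
    except socket.timeout:
        # The request may or may not have been executed remotely:
        # retrying blindly risks doing it twice, giving up risks doing it zero times.
        return ("timeout", None)
    except OSError as err:
        return ("failure", err)            # the remote side definitely did not complete it
```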
Two Generals Problem
Two Gen