Recently I studied Paxos, an important algorithm in the distributed systems field, and here I summarize the key points. My own level is limited, so errors are inevitable; please correct me. This article does not cover the basic theory of Paxos; for the basics, refer to the learning materials listed below.
1 paxos illustration
The figure summarizes the original Paxos algorithm, mainly following <paxos made simple>. There is no leader and no optimization, and proposers, acceptors, and learners are drawn as separate roles. This is only to make the relationships between the three easier to sort out; in an actual Paxos implementation the three roles are usually combined onto the same set of processes.
Phase 1 Note:
1. When a proposer sends a phase 1a message to a majority of acceptors, the "majority" may be all of the acceptors or just any set of more than half of them.
2. Phase 1 may qualify multiple proposers. After several of them enter phase 2, new proposers may still be entering phase 1, so the phases of different proposers can overlap: one proposer may be in phase 1 while another is in phase 2, and the two can affect each other because they talk to the same acceptors. The hand-off between phase 1 and phase 2 is shown by the dotted line in the figure: when a proposer in phase 2 has had its (num, value) accepted by an acceptor, a proposer still in phase 1 that contacts the same acceptor must discard its own value and adopt the phase-2 proposer's value (the code sketch after the phase 2 Q&A below makes this concrete).
Phase 2 Note:
Q: If multiple proposers have already sent phase 2a messages, must the eventually chosen value come from one of them?
A: No. New proposers can join phase 1 at any time. Suppose X (n = 3) and Y (n = 5) are two proposers in phase 2 that have already sent their phase 2a messages, but a very slow network delays X's packets, and only one of Y's packets has reached a single acceptor while the rest are still in flight. Now a new proposer Z enters phase 1 and sends a larger n (say 6). Because 6 is greater than the acceptors' current max of 5, Z quickly qualifies for phase 2 and submits its own value to the acceptors. After that, even if the delayed messages of X and Y arrive, it is too late: the acceptors' max has been raised by Z's phase 1 message, so the phase 2a messages of X and Y are rejected. Of course, X and Y still have a good chance of beating Z: as long as Y's phase 2a messages reach a majority of the acceptors before Z's phase 1a message does, that majority will already hold Y's value and report it in their phase 1b replies (the dotted-line path in the figure), forcing Z to give up its own value and propose Y's value instead.
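To make this concrete, here is a toy re-enactment of the X/Y/Z race, a minimal sketch in Python with all names invented for illustration. Each acceptor keeps only the highest number it has promised and the last (n, value) it accepted, which is enough to see why late phase 2a messages are rejected and when a new proposer must adopt an earlier value.

```python
# Toy re-enactment of the X(3) / Y(5) / Z(6) race described above.
# Each acceptor only remembers the highest number it promised and the last
# (n, value) it accepted; all names here are invented for illustration.

def make_acceptor():
    return {"max": 0, "accepted": None}

def on_prepare(acc, n):                        # phase 1a handler
    if n > acc["max"]:
        acc["max"] = n
        return ("promise", acc["accepted"])    # report any previously accepted value
    return ("reject", None)

def on_accept(acc, n, value):                  # phase 2a handler
    if n >= acc["max"]:
        acc["max"] = n
        acc["accepted"] = (n, value)
        return "accepted"
    return "rejected"

acceptors = [make_acceptor() for _ in range(5)]

# X(3) and Y(5) already passed phase 1; their 2a messages are stuck in the network,
# except that one of Y's 2a messages reaches a single acceptor (number 4).
print(on_accept(acceptors[4], 5, "Y"))         # accepted

# Z joins late, runs phase 1 with n = 6 on a majority that happens to miss acceptor 4.
print([on_prepare(a, 6) for a in acceptors[:3]])   # all promise; none reports a value

# The delayed 2a messages of X and Y finally arrive at that majority: too late,
# 3 and 5 are now below the promised max of 6.
print(on_accept(acceptors[0], 3, "X"))         # rejected
print(on_accept(acceptors[1], 5, "Y"))         # rejected

# Since Z's phase 1b replies carried no accepted value, Z may propose its own value.
# Had Y's 2a landed on a majority before Z's prepare, those acceptors would have
# reported (5, "Y") in their promises, forcing Z to adopt Y's value: the dotted
# line in the figure.
```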
Phase 3 Note:
1. Learn messages may arrive out of order, so a learner must apply them in ascending order of ID (that is, the num of phase 3 in the figure); see the learner sketch after the next note.
2. For the learning in phases 2 and 3 there are several options for message delivery: a simple many-to-many scheme, where M acceptors and N learners establish M x N message links and then send messages; or many-to-one, where all acceptors send messages to one distinguished learner, which then broadcasts to the other learners. See <paxos made simple> section 2.3.
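As an illustration of note 1, here is a minimal learner sketch in Python (names invented for illustration): it buffers out-of-order learn messages and applies them strictly in ascending instance-id order.

```python
# Minimal learner sketch: learn messages may arrive in any order,
# but values are applied strictly in ascending instance-id order.

class Learner:
    def __init__(self):
        self.pending = {}        # instance id -> chosen value, waiting to be applied
        self.next_id = 1         # next instance id we are allowed to apply
        self.log = []            # values applied so far, in order

    def on_learn(self, instance_id, value):
        """Called whenever a chosen (id, value) arrives, possibly out of order."""
        self.pending.setdefault(instance_id, value)
        # Apply as many consecutive instances as we now can.
        while self.next_id in self.pending:
            self.log.append(self.pending.pop(self.next_id))
            self.next_id += 1

learner = Learner()
learner.on_learn(2, "set y=2")   # arrives early: buffered, not applied yet
learner.on_learn(1, "set x=1")   # fills the gap: both 1 and 2 are applied now
print(learner.log)               # ['set x=1', 'set y=2']
```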
2. Learning Materials
I think reading the papers is probably the best way to learn Paxos. What if you cannot understand one? Switch to another. The following four papers discuss the principles and application scenarios of Paxos from different angles; although their content overlaps, they are also highly complementary. Read these four papers carefully and you will have a fairly comprehensive understanding of Paxos. I think <the part-time Parliament>, the paper that first published the Paxos algorithm, can be skipped, or read as a historical document after you have fully understood Paxos.
1 <paxos made simple> (original / Chinese translation)
This is the simplified version of <the part-time Parliament> that Lamport wrote in 2001. I find this paper easy to understand, and the Chinese translation is of high quality, with the translator adding comments that help comprehension. Here Lamport emphasizes the two-phase execution of Paxos and does not treat the final phase 3 as an independent phase (he introduces it in the separate section 2.3); other documents, however, describe Paxos as a three-phase process, with the third phase being the learning phase. I personally find three phases easier to understand, so the figure is drawn with three phases.
2 <The chubby lock service for loosely-coupled distributed systems> (original / Chinese translation)
The author of this paper is Mike Burrows, who famously claimed that "there is only one consensus algorithm in the world, and that is Paxos". He is also a pioneer of Paxos inside Google, and this Chubby paper is the first reference cited by Google's <paxos made live>. This engineering-practice paper mainly introduces the application of Paxos in engineering. One of the two major uses of Paxos is distributed locks, and Google used Paxos to build its internal distributed lock service, Chubby. The paper focuses on three topics: 1. what a distributed lock is; 2. how Paxos can be used as a distributed lock; 3. how to solve the various difficulties that come up in engineering practice. From this paper we can also see how GFS and Bigtable work with Chubby.
3 <paxos made live> (original / Chinese translation)
This paper is a summary of Google's internal use of Paxos. It is more retrospective than the Chubby paper and also introduces the Paxos algorithm itself. It was written after the Chubby paper, and the first entry in its references is the Chubby paper. It is a good choice as the first paper to read after learning the principles of Paxos, because Paxos described from an engineer's perspective is more easily understood by engineers (pay attention whenever you see engineering-perspective wording; it may surprise you). For example, when explaining why it is so difficult to implement a Paxos system, they write: "although the Paxos algorithm itself can be described in a single page of pseudo-code, our complete implementation contains thousands of lines of C++ code. The expansion is not simply because we used C++ instead of pseudo-code, nor because of a verbose coding style; converting the algorithm into a practical, production-level system required introducing many features and optimizations (some published research results, some not)." And Paxos's abstract goal of "reaching agreement on a value" is described concretely as agreeing, in their system, on which record comes next in the log. This kind of writing is really refreshing!
4 <consensus on transaction commit> (original / Chinese translation)
A paper co-authored by two Turing Award winners must not be missed. The two authors each introduce their own creation, 2PC and Paxos respectively, and then propose Paxos Commit, a commit algorithm that crosses 2PC with Paxos. Through a performance comparison of the normal and failure cases, they conclude that two-phase commit is essentially the same as Paxos Commit with only one coordinator. This is a good paper for learning both 2PC and Paxos.
5 <Distributed lock service> (link)
This paper is much slighter than the previous four; it is a thesis by Chinese university students. The authors' team basically implemented a simplified version of Chubby. It is interesting because it comes from practice, and it is worth a read.
3 Paxos instances and Multi-Paxos
A Paxos instance is one complete run of Paxos, that is, a full execution of phases 1 through 3 (when no optimizations are applied). Submitting a value to Paxos starts one Paxos instance for that value. Real systems use Paxos as a building block to reach consistency on a whole series of values, for example distributing a log file, entry by entry, to a backup environment made of multiple machines; distributing each log entry can be one Paxos instance. Running multiple Paxos instances back to back is Multi-Paxos. Multi-Paxos has several optimizations; see <paxos made live> sections 4.2 and 5.2, and the sketch below.
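A small sketch of the "one instance per log entry" idea, assuming a hypothetical run_paxos_instance function that stands in for a full phase 1-3 execution:

```python
# Multi-Paxos as "one Paxos instance per log entry" (sketch).
# run_paxos_instance is a stand-in for a full phase 1-3 execution; here it just
# pretends the submitted value was chosen, so only the loop structure is visible.

def run_paxos_instance(instance_id, value, acceptors):
    # In a real system this would run phases 1-3 against the acceptors and
    # return whatever value was actually chosen (not necessarily `value`).
    return value

def replicate_log(log_entries, acceptors):
    chosen_log = []
    for instance_id, entry in enumerate(log_entries, start=1):
        chosen = run_paxos_instance(instance_id, entry, acceptors)
        chosen_log.append((instance_id, chosen))
    return chosen_log

acceptors = ["A", "B", "C", "D", "E"]            # five replicas, for example
print(replicate_log(["append r1", "append r2", "append r3"], acceptors))
```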
4. Select n
A proposer needs to pick sequence numbers that keep increasing and never collide with other proposers'. There is a very intuitive way to do this: in a system with n replicas, first assign each replica r a unique identifier i_r between 0 and n-1. Replica r then chooses as its sequence number the smallest s that is larger than every sequence number it has seen and satisfies s mod n = i_r. For example, in a five-replica Paxos system, replica 1 can draw from the sequence number queue 0, 5, 10, 15, ... and replica 2 from the queue 1, 6, 11, 16, ...; whenever a proposer needs a higher sequence number, it simply takes the next one from its own queue. This guarantees that each proposer picks a number larger than any it has seen and that numbers never collide across proposers. See <paxos made live> section 4.1, and the sketch below.
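A minimal sketch of this numbering scheme (my own illustration, not code from the paper):

```python
# Sequence-number selection for an n-replica system: the replica with identifier
# replica_id picks the smallest s greater than every number it has seen such that
# s mod n_replicas == replica_id, so different replicas can never collide.

def next_proposal_number(replica_id, n_replicas, highest_seen):
    s = highest_seen + 1
    while s % n_replicas != replica_id:
        s += 1
    return s

n = 5
print(next_proposal_number(0, n, 0))    # 5  (identifier 0's queue: 0, 5, 10, 15, ...)
print(next_proposal_number(1, n, 7))    # 11 (identifier 1's queue: 1, 6, 11, 16, ...)
```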
5 paxos Optimization
The optimizations below assume a slightly modified version of the original Paxos: all proposals are initiated by a fixed leader. Electing a leader as the sole initiator of proposals prevents livelock. Livelock occurs when proposers keep submitting phase 1a requests with ever higher numbers, so that no proposer ever gets to complete phase 2a. For more on the leader, see <paxos made simple> section 2.4.
1. The leader can send phase 2a messages to only a majority of the acceptors, reducing the number of messages in the normal case. As long as the leader receives phase 2b messages from a majority of the acceptors, it knows that v has been chosen; if not, it sends phase 2a messages to additional acceptors (a sketch of this follows the list). For more on this approach, see <consensus on transaction commit> section 4.1.
2. Multiple Paxos instances can be chained together to reduce the number of messages. If the coordinator (or leader) does not change across instances, the phase 1a/1b messages can be skipped. To benefit from this optimization, a Multi-Paxos design keeps the elected leader in place for a long time (this is essentially the basic idea of Multi-Paxos; more intuitively, leader election is not what maintains consistency, it mainly exists to cope with contention and failures, so in many implementations the leader-election step is simplified). For more on this approach, see <paxos made simple> section 4.2.
3. Acceptors can broadcast the chosen v directly to the learners instead of sending it to the leader first and having the leader forward it. This removes the message delay of phase 3 at the cost of extra messages: like the leader, once a learner has received phase 2b messages from a majority of the acceptors, it knows the chosen value. For more on this approach, see <consensus on transaction commit> section 4.1.
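A sketch of optimization 1 under these assumptions (a stable leader, and a hypothetical send_accept helper that here simply simulates two unreachable acceptors):

```python
# Sketch of optimization 1: with a stable leader, phase 2a goes to a bare majority
# first; extra acceptors are contacted only if too few phase 2b acks come back.
# send_accept is a stand-in for the real RPC; here it pretends acc3 and acc4 are down.

def send_accept(acceptor, n, value):
    return acceptor not in ("acc3", "acc4")

def leader_phase2(n, value, acceptors):
    majority = len(acceptors) // 2 + 1
    first_wave, rest = acceptors[:majority], acceptors[majority:]
    acked = sum(1 for a in first_wave if send_accept(a, n, value))  # normal case ends here
    for a in rest:                        # fallback: widen the set one acceptor at a time
        if acked >= majority:
            break
        if send_accept(a, n, value):
            acked += 1
    return acked >= majority              # True means v has been chosen

print(leader_phase2(7, "v", ["acc1", "acc2", "acc3", "acc4", "acc5"]))   # True
```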
6 paxos usage
I think the description in <large-scale distributed storage system> is fairly accurate: Paxos has two main uses:
1. Implementing a global lock service, or a naming and configuration service, such as Google Chubby and Apache ZooKeeper.
2. Replicating user data across multiple data centers, such as Google Megastore and Google Spanner.
The original Paxos example above already shows how data is replicated to multiple backup machines (use 2). For use 1, read the Chubby paper. First, understand that a Paxos distributed lock is an advisory lock, whereas a single-machine mutex is a mandatory lock: users are expected to actively check the lock's state before touching the resource, and the lock is physically separated from the resource it protects. Creating such a lock, like creating a mutex, yields an ID-like identity, but the ID of a distributed lock is not a meaningless token such as "123abc"; it is written in the classier UNIX file-path style, so "123abc" becomes "/123/abc". This makes the ID very intuitive, because the ID itself can express which project and group the lock belongs to and what it is used for: a name like "/projectx/groupA/app" is far more readable than "123abc". The path also gives locks a hierarchical parent-child relationship and makes it easy to add more features later. Note that this representation has essentially nothing to do with a UNIX file system; the "/" could be replaced by another symbol such as "|", for example "|projectx|groupA|app". Writing it with slashes is purely to match programmers' intuition and reduce the learning cost (the Chubby paper says that because the Chubby namespace structure is similar to the file system... it also reduces the difficulty of training Chubby users). This prettier lock-ID representation should really be called a namespace, similar to a programming-language namespace, in which a class, variable, or method is finally located through a chain of hierarchical names.
Now that we understand the namespace, how does Paxos act as a distributed lock? The basic purpose of Paxos is to get multiple machines to agree on a specific value. Suppose there is a Paxos environment of five backup machines; call it 5paxos. An external project xapp uses 5paxos as its distributed lock. When xapp wants to lock its own service http://www.xxx.com/make-id, it first sends the value lock together with the requested path "/project/xapp/make-ID" to 5paxos. 5paxos then reaches agreement on the deterministic value for "/project/xapp/make-ID" (lock or unlock); the currently agreed value is lock. Only when xapp later submits unlock does 5paxos reach agreement on the new value unlock. Because the lock is advisory, any service in xapp that uses the make-ID interface must actively check whether the state of "/project/xapp/make-ID" stored in 5paxos is lock or unlock before using it, and must honor the gentleman's agreement not to touch the service while it is locked; only then does the lock actually take effect. The biggest advantage of such a distributed lock is high availability: even if two of the five backup machines in 5paxos fail, the lock keeps working normally. A minimal sketch of this flow follows.
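The sketch below illustrates the advisory-lock flow just described. FivePaxosClient and its methods are invented for illustration: it records the agreed state per path in a dict instead of actually running consensus across five replicas, and a real Chubby-style service would make the check-and-set atomic and expose a much richer API.

```python
# Sketch of the advisory-lock flow described above. FivePaxosClient is a made-up
# stand-in for a Chubby-like service backed by a 5-replica Paxos group.

class FivePaxosClient:
    def __init__(self):
        self.state = {}                              # path -> "lock" or "unlock"

    def propose(self, path, value):
        """Pretend the 5-replica group reached consensus on (path, value)."""
        self.state[path] = value

    def read(self, path):
        return self.state.get(path, "unlock")

def use_make_id_service(paxos, path):
    # Advisory locking: every client must check the lock state before using the
    # resource and honor the "do not touch it while locked" agreement.
    # (A real service would make this check-and-set a single atomic operation.)
    if paxos.read(path) == "lock":
        return "someone else holds the lock; backing off"
    paxos.propose(path, "lock")                      # take the lock
    try:
        return "calling http://www.xxx.com/make-id under the lock"
    finally:
        paxos.propose(path, "unlock")                # release the lock

paxos = FivePaxosClient()
print(use_make_id_service(paxos, "/project/xapp/make-ID"))
```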
Paxos learning Summary