Rethinking the Paxos algorithm

Source: Internet
Author: User

  

1. Background

While studying distributed systems, I tried on and off for a long time to understand the Paxos algorithm. Searching turned up material such as translations of Paxos Made Simple, all reposting, translating, and annotating one another. Skimming them, I ran into strange words such as "proposal", "election", "senator", and "resolution". I could not help wondering what on earth these were, and what they had to do with distributed fault tolerance. As a simple code monkey who has lived in a socialist country for a long time, I was thoroughly thrown by this vocabulary. I gritted my teeth and kept reading, only to find that exactly where I had doubts the text would suddenly say "obviously". There was nowhere to take my complaints.

It gets better. Other authors pull out the Five Tiger Generals of the Three Kingdoms period as an example and spin a story out of them, thousands of words long, enough to cramp your scrolling hand, while your brain has to map those characters back onto the terms above. Even Li Bai (Taibai), the Immortal of Wine, would be dizzy by then.

I had assumed that a formal, rigorous description would not be this painful. Yet after reading the whole paper I still did not know what it was trying to prove, what goal it served, or what problem it solved. Since the paper is not easy to understand, there are bound to be experts writing all kinds of notes on it, so I searched again with keywords such as "thoroughly understand" and "in-depth analysis", and sure enough found an article: Paxos algorithm in-depth analysis.

After reading that article I almost understood. Going back to Paxos Made Simple and adding some thinking of my own, I finally stopped being led around in circles.

However, when the masters skip over what they find obvious, the confusion this causes beginners is handed down from generation to generation. At that thought I could not contain myself and vowed to change the situation, hence this article. If you finish it and still do not understand what Paxos is or what it can do, I will retire from the trade.

In fact, the Paxos algorithm itself is not complicated, and getting the reader to agree with that is one theme of this article. But what is the reasoning behind this not-so-complicated idea, and how did it take shape? That is the other theme. There are a lot of words here, but I hope they make for relaxed and enjoyable reading.

2. I think, therefore I am

In fact, the whole approach of Paxos Made Simple is to start from the goal the algorithm must reach, work backward step by step to the constraints an implementation has to satisfy, and finally summarize the process in terms of the actions of each stage and their constraints.

Working backward is like playing a thought game: you can let your imagination run and push ahead without worrying about gaps in the logic. Usually you will find that the current action lacks some necessary prior agreement. At that point the game resets, you strengthen the agreement, and you start again, repeating until you reach what you expected. Yes, even after you drop your principles you can pick them up again.

Since Paxos Made Simple leaves out details, this article adds my own understanding and concentrates on retelling that process. Still, being led by the nose without knowing why is deeply uncomfortable. I suggest you think each step through yourself before consulting other people's reasoning, then compare, until you either agree wholeheartedly or come up with a strong rebuttal.

3. The purpose of the algorithm

The purpose of the algorithm is to achieve read-write consistency in a distributed system that tolerates faults. "Consistency" is clearly the core requirement. Put simply, the distributed system should behave like a single node that serializes reads and writes: if writing X succeeds, reads return X; if writing Y succeeds, reads return Y; it never happens that some nodes read X while others read Y. On the other hand, the system must tolerate faults. For this algorithm specifically, as long as a majority of the nodes survive, the system works normally.

Fault tolerance means high availability. High availability is usually achieved through redundant replicas, and redundancy means consistency is no longer automatic; a single copy is naturally consistent, but it is unavailable once it fails. Can you have both high availability and consistency, your cake and eat it too? The answer is yes. There is a theorem that describes this problem more fully, the CAP theorem, which has of course confused countless young people. Perhaps this paragraph is flawed, but the related discussion is left for another time and is not pursued here.

More specifically, from an implementation point of view, the goal is as follows:

The implementation uses a fixed number of servers, each ready to accept client requests. When multiple clients each send their own requested value (Value_i) to one of the servers, every server eventually replies to its client with the same value (Chosen_value), and this value is one of the Value_i. As long as a majority of the servers are alive, you can ask any server again at any time and learn this value (Chosen_value).

As you can now tell, this article is about the basic Paxos algorithm, which means that once a write completes, later writes can only write the same value; in effect you can write only once. As for supporting multiple writes (updating to other values), this algorithm is of course the foundation. Given limited space, and since nobody is paying me, whether Multi-Paxos gets covered in this article depends entirely on my mood.

If you know the CAP theorem, you know that consistency and availability (CA) cannot both be kept once partition tolerance (P) is required: a server that returns a normal response must be connected to the other servers, otherwise it is treated as a server that has not survived.

The Paxos algorithm, which solves the problem above, lets a group of servers reach final agreement by passing messages. Messages may be lost, duplicated, or reordered, but the case where a message is tampered with is not considered.

Why may we assume, in a non-Byzantine model, that message content is not tampered with? In reality, messages can not only be altered deliberately in transit and in storage; as chip processes shrink further at the nanometer scale, hardware also becomes more sensitive to its environment, which leads to bit flips and other hard-to-detect accidents.

If message tampering has to be considered, that is the subject of another article, the Byzantine generals problem. In other words, a stable and secure message channel and the consistency algorithm discussed here are independent concerns that can be described separately.

Here is a brief word on how message corruption can be handled anyway. First, ignore human interference: assume the algorithm runs on a relatively safe intranet. For corruption that is not man-made, a checksum is the most economical and effective solution. TCP does this, but a checksum only lowers the error probability and cannot eliminate errors 100%. It is said that Google once had bits flip under TCP, and that since then their RPC protocol on top of TCP has carried an extra application-layer verification mechanism.
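As a rough illustration of what an application-layer check on top of the transport could look like, here is a minimal Python sketch. The function names frame and unframe and the 4-byte CRC32 trailer are assumptions made for this example; they do not describe Google's RPC protocol or any real wire format.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload (hypothetical format)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    payload = message[:-4]
    (crc,) = struct.unpack(">I", message[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


msg = frame(b"accept 13 Value1")
print(unframe(msg))

# Flip one bit of the first byte to imitate corruption in transit.
corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]
try:
    unframe(corrupted)
except ValueError as err:
    print("rejected:", err)
```

Like the TCP checksum, this only lowers the chance of an undetected error; it does not protect against deliberate tampering.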

4. Working backward

A client sends a request to write a value. There may be m clients at the same time: client A asks to write Value1, client B asks to write Value2, ...

What happens when a server node (hereafter simply "node") receives a request? How do we choose one value out of these values (note the term "choose"; it is very important), so that once a value is chosen, then at any later moment, as long as more than half of the nodes survive (fault tolerance), a write request succeeds only if it writes the same value (write consistency), writes of any other value are rejected, and any read request reads that value (read consistency)? This chosen value has a special name: chosen_value.

With only one node this is easy. There are countless ways to serialize requests; pick any one, then use a simple first-come, first-chosen policy. Suppose client A's request arrives first: the node chooses Value1, rejects every later value that is not Value1, and Value1 becomes chosen_value. Once chosen_value exists, at any later moment a client's read request reads Value1, and a write request succeeds only if it writes Value1.
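The single-node case can be sketched in a few lines of Python. The class name SingleNodeRegister is made up for this illustration, and the sketch assumes requests have already been serialized and that None is never a legal value.

```python
class SingleNodeRegister:
    """Write-once register on one node: the first write wins."""

    def __init__(self):
        self.chosen_value = None  # nothing chosen yet

    def write(self, value):
        # The first request to arrive fixes chosen_value; a later write
        # succeeds only if it carries the same value.
        if self.chosen_value is None:
            self.chosen_value = value
        return self.chosen_value == value

    def read(self):
        return self.chosen_value


node = SingleNodeRegister()
print(node.write("Value1"))  # True: client A arrives first
print(node.write("Value2"))  # False: a different value is rejected
print(node.write("Value1"))  # True: the same value is fine
print(node.read())           # Value1
```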

However, once that single node fails it can no longer serve requests, and the data may even be lost permanently, so the "fault tolerance" requirement is not met. Therefore there must be multiple nodes.

The only way to make data fault-tolerant is to keep multiple copies. It is simple and effective, and Paxos is no exception.

So the question is: how many nodes need to choose chosen_value (that is, on how many nodes must chosen_value exist)?

We know the purpose of the Paxos algorithm is to keep providing read and write service when only a majority of the nodes are alive. So suppose the cluster (the set of nodes) is running for the first time, and just over half of the nodes are present while all the others are out of contact, each still a blank sheet of paper. At this point m clients start issuing write requests simultaneously, the magic Paxos algorithm that we do not yet understand starts running among the nodes, and after a while the algorithm terminates, the dust settles, and chosen_value is born.

As the saying goes, no coincidence, no story: right after chosen_value is born the lost nodes come back like ghosts, while fate plays a joke on the majority that had been alive. They all crash except one (to be precise, 1 is left in a cluster with an odd number of nodes, 2 in a cluster with an even number). The revived blank nodes plus the one (or two) surviving nodes that ran the Paxos algorithm now form another majority, so read and write service should be unaffected. Clearly the blank nodes cannot shoulder any of the burden of consistency, so the surviving node must hold a trace of chosen_value. Since the survivor is an arbitrary member of the original majority, every node in that majority must be prepared to carry this burden. That is, chosen_value must satisfy the following two constraints:

A. Once chosen_value is chosen (as you will see later, no node needs to recognize that chosen_value has been chosen; it is enough that the objective conditions for its birth hold), there must exist a set containing a majority of the nodes in which every node holds a clue from which chosen_value can be found. (Fault tolerance)
B. There is one and only one chosen_value. (Consistency)
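The "1 survivor for odd clusters, 2 for even clusters" figure in the thought experiment above is plain arithmetic on majority sizes. The Python sketch below checks it; the function name survivors_needed is invented for this example.

```python
def survivors_needed(n):
    """How many nodes of the first majority the second majority must contain."""
    quorum = n // 2 + 1     # smallest majority of n nodes
    blank = n - quorum      # nodes that missed the first round entirely
    return quorum - blank   # a new majority cannot be built from blank nodes alone


for n in (3, 4, 5, 6):
    print(n, survivors_needed(n))
# prints: 3 1, 4 2, 5 1, 6 2 (one pair per line)
```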

How can a majority of the nodes make this choice for the same value, while the remaining nodes either make no choice or choose the same value?

This is hard, because choosing means "certifying" the value as chosen_value. Having more than half of the nodes certify it at the same instant is impossible. If they certify one by one, then should the node issuing the certification requests crash together with the nodes that have already certified, the majority that survives afterwards cannot learn the chosen_value that had been certified, and the monster of inconsistency appears again.

Change the angle. Suppose we can guarantee this: while the nodes that have "certified" a value number less than a majority (this is an objective fact; no node needs to poll and count), different nodes are allowed to certify different values; but from the moment the nodes certifying one and the same value reach exactly a majority (again an objective fact: certifications happen one by one, and at some point a majority is objectively reached), only that same value can be certified from then on. Then even if the node that issued the certification requests and some of the nodes that certified crash at that very moment (certainly not all of them, because the certifying nodes are more than half; if all of them crashed, the rest would be less than half and Paxos would rightly stop working), it does not matter. Whether or not anyone acknowledges it, this value is chosen_value; it has left enough evidence on more than half of the nodes.

This kind of certification deserves a new term: "approval".

This is so important that it is worth restating the conclusion from another angle. Compare two cases: in one, a single node approves (certifies) a value; in the other, a majority of nodes approve a value and then crash until only one is left. In both cases the surviving nodes now include only one that has approved the value, so what is the difference? There are essential differences:

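First, a majority guarantees that, as long as a majority survives, at least one node knows the truth of what was approved at the time, so there is a chance to keep the data consistent. In the former case, once the only approving node crashes, nobody knows that value any more. Second, in the latter case a majority did once exist. This objective fact matters a great deal and goes straight to the heart of Paxos. Read on to see exactly how this objective fact is used.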

Approving a value does not mean that value is destined to become chosen_value. If a node that has approved one value refused ever to approve another, we could end up with the objective fact that no value can be approved by a majority of nodes (different values having been approved by different nodes).

It follows easily that a node must be able to approve more than one value, and that a node needs a policy for deciding which values to approve.

By the time the value a node proposed has been approved, that node may already have gone off to herd sheep, leaving its value abandoned. For the algorithm to keep going, nodes must be allowed to propose values again, and an obvious policy is to approve the newest value. To tell which is newer and which is older, each value must carry something like a timestamp, and an easy scheme is to give each value a number. A request for approval is called an accept request from here on, and what an accept request asks a node to approve is called a "proposal". So a proposal contains {number, value}. A node that approves accept requests plays the role called acceptor.

Tip: a proposal is a {number, value} pair. The number uniquely identifies the proposal, and different proposals may carry the same value.
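As a small illustration, the {number, value} pair can be modeled as an immutable record. The class name Proposal and its field names are choices made for this sketch, not something fixed by the algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    number: int  # unique identifier; a higher number means a newer proposal
    value: str   # payload; different proposals may carry the same value


p13 = Proposal(13, "Value1")
p250 = Proposal(250, "Value1")

# Same value, but still two distinct proposals.
print(p13 == p250)  # False

# "Newest" is decided by number alone.
latest = max([p13, p250], key=lambda p: p.number)
print(latest.number)  # 250
```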

Once all nodes approve the newest value, two different values can each be approved by a majority: Value1 is first sent to a majority of nodes and approved, and then Value2 (a newer value) comes along and is clearly approved by a majority as well.

How do we resolve this contradiction? An easy thought: if we can make sure Value1 and Value2 are equal, consistency is untouched. That is, once it becomes an objective fact that some proposal has been approved by a majority of the nodes, the value of every later proposal must equal its value.

How do we do that? Read on.

The node that puts forward a proposal is obviously the node that received the client request. How can two different nodes be guaranteed to put forward proposals with the same value? Agreement means coordination, so the nodes that put forward proposals must coordinate somehow.

Later proposals need to keep the same value only after some proposal has objectively been approved by a majority. So suppose that before putting forward a proposal, a node first sends a request to a majority of the nodes, called a prepare request, asking each one what its most recent (highest-numbered) approved proposal is. It seems we can see the dawn.

Consider the moment a proposal is put forward. If more than half of the nodes have already approved some proposal (take a God's-eye view; this is an objective fact), then the replies to the prepare request from a majority are bound to include it (the pigeonhole principle: the majority of acceptors that receive the prepare request and the majority of acceptors that approved the accept request must intersect), and we should take its value as the value of the new proposal. But which one is it? How can we tell? We have received a pile of approved proposals, each of the form {number, value}: which number carries the value to select? Which proposal is the one approved by a majority?

In the other case, if at the moment the proposal is put forward no proposal has been approved by a majority (keep the God's-eye view), the replies to the prepare request from a majority may still contain a pile of approved proposals of the form {number, value}. Which one should we choose then?
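The pigeonhole claim, that any two majorities of the same cluster share at least one node, can be checked by brute force on a small cluster. The helper name majorities and the five node names below are invented for this sketch.

```python
from itertools import combinations


def majorities(nodes):
    """Yield every subset of nodes that contains more than half of them."""
    quorum = len(nodes) // 2 + 1
    for size in range(quorum, len(nodes) + 1):
        for combo in combinations(nodes, size):
            yield set(combo)


nodes = ["n1", "n2", "n3", "n4", "n5"]

# Every pair of majorities must have a non-empty intersection.
all_overlap = all(a & b for a in majorities(nodes) for b in majorities(nodes))
print(all_overlap)  # True
```

The reason is simply that two sets of more than n/2 nodes each cannot both fit into n nodes without overlapping.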

Wait, I want to be quiet, and don't ask me who is quiet.

Since the approved proposals that come back may or may not include one approved by a majority, we need a rule for picking one proposal out of the pile: either a proposal that has already been approved by a majority, or one that may be approved by a majority in the future. Faced with {number, value} pairs, the only thing to go on is the number, so the rule is clear: pick the one with the largest number.

One might ask: why not query all the nodes, and if one proposal appears in more than half of the replies, take that one? But misfortune strikes without warning and any node may crash at any time; if even one of the nodes in the approving majority crashes, the "more than half" test can no longer be satisfied.

From the majority of replies to the prepare request, we select the value of the highest-numbered approved proposal (or the value from the client's request if nothing has been approved). The new proposal must also have a number; numbers must increase and must never be duplicated across nodes. A very simple solution is to give each node its own disjoint set of numbers and have each node increment independently.
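Both rules in the paragraph above can be sketched in Python. The numbering scheme shown (node i of n uses the numbers i, i + n, i + 2n, ...) is just one possible way to keep the nodes' number sets disjoint, and the function names next_number and pick_value are invented for this example.

```python
def next_number(last_number, node_id, node_count):
    """Smallest number above last_number of the form k * node_count + node_id."""
    k = (last_number - node_id) // node_count + 1
    return k * node_count + node_id


def pick_value(replies, client_value):
    """Choose the value for a new proposal from a majority of prepare replies.

    Each reply is either None (that acceptor has approved nothing) or a
    (number, value) tuple for its latest approved proposal.
    """
    approved = [r for r in replies if r is not None]
    if not approved:
        return client_value  # free to use our own value
    return max(approved, key=lambda p: p[0])[1]  # value of the highest number


print(next_number(0, 1, 3))  # 1
print(next_number(5, 1, 3))  # 7
print(pick_value([None, (13, "Value1"), (7, "Value0")], "mine"))  # Value1
print(pick_value([None, None], "mine"))  # mine
```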

So is the problem solved? How do we ensure that only the highest-numbered proposal is approved by a majority, and proposals with other numbers are not? Imagine this: a node prepares a proposal, finds that no proposal has been approved, and so picks its own value Value1, with proposal number 13. Then another node prepares a proposal, asks around with prepare, still finds nothing approved, and so also picks its own value Value2, with proposal number 250. Now proposal {13, Value1} is approved by a majority of nodes and chosen_value is born. But since nodes approve the newest proposal, proposal {250, Value2} can obviously be approved by a majority too. Holy cow! Once again two values are approved by a majority!

Where is the problem? When proposals 13 and 250 were put forward no proposal had been approved yet, but afterwards both could be approved. That is unacceptable, so the rules must be strengthened: only one proposal may be approved by a majority, and by the rule above that is clearly 250 (the highest-numbered proposal so far, as a God's-eye view shows). Therefore the prepare request for proposal 250 should already plant the outcome that proposal 13 (in fact any proposal numbered below 250) cannot be approved by a majority. That is, on a prepare request an acceptor not only returns its latest approved proposal but also makes a promise: it will no longer approve any proposal numbered lower than the number in the prepare request.

Because the prepare request also goes to a majority, and proposal 13 also has to be sent to a majority to be approved, the two majorities must intersect. The nodes in the intersection, bound by their promise, will not approve proposal 13. So from the moment proposal 250 is prepared, proposal 13 is doomed never to be approved by a majority. The inconsistency is strangled in the cradle, and this is the subtlest part of the whole algorithm.
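The 13 versus 250 scenario can be replayed with a stripped-down acceptor that tracks only its promise. This is a sketch under simplifying assumptions: three acceptors, calls instead of messages, and no record of approved values. The class and variable names are made up for the example.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0  # highest number seen in a prepare request

    def prepare(self, number):
        # Promise not to approve anything numbered below `number`.
        self.promised = max(self.promised, number)

    def accept(self, number):
        # Approve only if doing so does not break a promise.
        return number >= self.promised


a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()

for acc in (a1, a2):   # proposal 13 prepares with one majority
    acc.prepare(13)
for acc in (a2, a3):   # proposal 250 prepares with another; a2 is the overlap
    acc.prepare(250)

votes_13 = sum(acc.accept(13) for acc in (a1, a2))
votes_250 = sum(acc.accept(250) for acc in (a2, a3))
print(votes_13, votes_250)  # 1 2
```

With three acceptors a majority is 2, so proposal 13 falls short because a2 keeps its promise, while proposal 250 goes through.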

In simple terms, the algorithm proceeds as follows:

Stage 1.
The node that receives a client request takes on the role called proposer (it is about to put forward a proposal). It picks a new proposal number and sends a prepare request (carrying the number of the upcoming proposal) to the acceptors, asking for their latest approved proposals.

On receiving the prepare request, an acceptor returns its latest approved proposal (number + value) and promises to no longer approve any proposal numbered lower than the number in the prepare request.

Stage 2.
Once the proposer has received replies from a majority of acceptors, it can start Stage 2.

The proposer takes the value of the highest-numbered proposal among all Stage 1 replies (or, if there is none, the value in the client request) as the value of its new proposal, and sends an accept request to the acceptors (including itself), asking whether the proposal can be approved.

On receiving the accept request, an acceptor first checks the number it has promised. If the proposal's number is lower than the promised number, it refuses to approve the proposal; otherwise it approves. (The earlier constraint of approving only newer proposals is already covered by the promise. If a lower-numbered (earlier) proposal shows up, a higher-numbered (newer) one has already been born, and the promise made at its birth ensures the lower one cannot be approved by a majority, so even if the lower-numbered proposal is approved here it has no effect on the overall outcome.)

What remains is for the proposer to check the acceptors' replies: if a majority reply that the proposal was approved, the proposer can be sure that the value of this proposal is chosen_value.

Even if the proposer crashes before receiving a majority of replies, the acceptors that received its request still dutifully approve the proposal, and once approval objectively reaches a majority, the proposer's own crash no longer matters, because we can run another round of this process at any time and it is bound to re-learn the approved value. This is where the strong consistency of Paxos lies: nodes may crash freely, as long as a majority lives.
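The two stages can be put together in a compact, single-process Python sketch. It makes strong simplifying assumptions that are not part of the article: requests are plain method calls, nothing is lost or reordered, state is not persisted, and proposal numbers start at 1. The names Acceptor, on_prepare, on_accept and propose are invented for the example.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest number promised so far
        self.accepted = None   # latest approved (number, value), or None

    def on_prepare(self, number):
        if number <= self.promised:
            return None        # a newer prepare was already promised: no reply
        self.promised = number
        return ("promise", self.accepted)

    def on_accept(self, number, value):
        if number < self.promised:
            return False       # approving would break a promise
        self.promised = number
        self.accepted = (number, value)
        return True


def propose(reachable, cluster_size, number, client_value):
    """Run one round; return the chosen value, or None if the round failed."""
    quorum = cluster_size // 2 + 1
    # Stage 1: prepare and collect promises.
    replies = [r for r in (a.on_prepare(number) for a in reachable) if r is not None]
    if len(replies) < quorum:
        return None
    approved = [acc for _, acc in replies if acc is not None]
    value = max(approved, key=lambda p: p[0])[1] if approved else client_value
    # Stage 2: ask the acceptors to approve {number, value}.
    votes = sum(a.on_accept(number, value) for a in reachable)
    return value if votes >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 3, 1, "Value1"))      # Value1
# Acceptor 0 is now unreachable; a later round still re-learns Value1.
print(propose(acceptors[1:], 3, 2, "Value2"))  # Value1
```

The second call shows the point of the last paragraph: a later round with a different client value still ends up with the value that a majority had already approved.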

5. Concluding remarks

In fact there is also a learner process. It imitates a proposer by preparing a proposal; if approved proposals come back, it issues an accept request, and if it receives a majority of replies, chosen_value has been born and is the value of the proposal it issued. This value can then be broadcast widely, this round of basic Paxos can be declared over, and the next round of updates can be accepted. The transition from basic Paxos to Multi-Paxos goes much the same way. (This paragraph is purely a product of "thinking too much and reading too little"; I have not studied it carefully but wanted to say it anyway.)

Beyond that, a real Paxos implementation has many more points to consider and optimize. For example, as discussed earlier, if higher-numbered proposals keep appearing, it is possible that no proposal is ever approved by a majority of nodes, so the Paxos algorithm does not terminate. Having read this far, I think you are able to study the rest on your own.

Finally, this article covers only the places most likely to cause confusion; please check it against the other literature yourself and sift the true from the false. It is meant as a catalyst, and will inevitably draw a smile from the experts ~.
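The non-termination case can be seen in a toy Python replay of two proposers that keep outbidding each other. This is only a sketch: the three acceptors are reduced to a list of promised numbers, and the helper names prepare and accept are invented here.

```python
promised = [0, 0, 0]  # highest prepare number each of three acceptors has promised


def prepare(number):
    for i in range(len(promised)):
        promised[i] = max(promised[i], number)


def accept(number):
    votes = sum(number >= p for p in promised)
    return votes > len(promised) // 2  # True if a majority approves


log = []
a, b = 1, 2
prepare(a)                 # proposer A prepares 1
for _ in range(3):
    prepare(b)             # B prepares a higher number before A's accept arrives
    log.append(accept(a))  # so A's accept is refused
    a = b + 1
    prepare(a)             # A retries with an even higher number
    log.append(accept(b))  # which in turn kills B's accept
    b = a + 1

print(log)  # [False, False, False, False, False, False]
```

Safety is never violated in this loop; what is missing is progress.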
