Paxos Algorithm for Distributed Systems


Paxos is an algorithm for reaching consensus reliably and consistently over a completely unreliable network. It allows a group of not-so-reliable processors (servers) to agree safely on a single value when certain conditions are met, and ensures that the group remains consistent when those conditions are not met.

What is consensus?

Specifically: in a distributed system, communication over the network may be interrupted. The probability is low, but no network is 100% perfect, so it is hard for computers that depend on network communication to reach consensus. Suppose computers X, Y and Z plan to attack the human world on Monday. Their plan is to attack only if all of them can attack together, with no machine left behind. Yet on the critical question of exactly when to start the attack, they often go wrong.

A basic problem is that each machine has its own proposal for the attack time. Computer X may suggest 08:00, because that is Monday morning and people have just come back from a weekend of partying, whereas computer Z thinks 13:00 is better. There are plenty of reasons on each side, and it is very hard to get these computers to agree on a single time, which leaves the humans a chance to fight back.

In another scenario, the three computers are located in different parts of the world and communicate over cables or other unreliable network equipment. If X proposes 08:00, it must confirm that the proposal reaches Y and Z and that both are still alive, so that it does not end up fighting alone.

The problem is that we cannot tell whether a computer's delay is caused by slow performance or by a complete crash.

So how does X know that the other two computers are available? Suppose X has to wait a long time for a response when talking to the other two. It cannot be sure whether they are still alive. Perhaps the message was merely delayed and the next exchange will go through without delay; perhaps the next delay will be a little longer, or a little shorter. Because of this randomness, X cannot determine what is causing the delay at Y and Z, whether it is a CPU-intensive task or a deadlock. Of course, some naïve designers would say: as long as we have performance monitoring in place, when the delay exceeds a certain threshold we can intervene manually and tell X the exact situation. But a distributed system that depends on human intervention is not a truly automated system. It is outside the scope of this discussion, and such a system wears people out.
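The ambiguity can be made concrete with a minimal Python sketch. The `Peer` class and `probe` function are hypothetical names introduced only for illustration, and the delays are simulated numbers rather than real network calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    """A simulated remote computer (hypothetical, for illustration only)."""
    name: str
    reply_delay: Optional[float]  # seconds until it replies; None means it crashed


def probe(peer: Peer, timeout: float) -> str:
    """Return what the caller can observe after waiting `timeout` seconds."""
    if peer.reply_delay is not None and peer.reply_delay <= timeout:
        return "replied"
    # A crashed peer and a merely slow peer look identical from here.
    return "no reply"


slow_y = Peer("Y", reply_delay=5.0)   # alive, but busy with a CPU-heavy task
dead_z = Peer("Z", reply_delay=None)  # crashed, will never answer

observations = [probe(slow_y, timeout=1.0), probe(dead_z, timeout=1.0)]
print(observations)
```

With a one-second timeout both probes give the same observation, so X has no basis for treating Y and Z differently. Raising the timeout only moves the threshold; it does not remove the ambiguity.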

Since X cannot be sure whether Y or Z is available, X can only agree on an attack time with one of them, and there is no plan in which all three machines are certain to join the battle. Note that any of X, Y and Z may experience delays, so no communication among the three is reliable. For example, X sends a message to Z, and Z sends a confirmation of receipt back to X, but at that moment X suddenly crashes. How long should Z wait to learn whether X received the confirmation? Should it wait for X to confirm the confirmation? That leads to an infinite loop of acknowledgments, which cannot solve the problem that communication between them is randomly unreliable.

The crux is that, because communication among the three machines cannot be guaranteed to be 100% reliable, they can never be certain that they have agreed on anything.

The following distributed auction example illustrates the problem and the rationale for Paxos.

In a traditional auction the highest bidder wins. The bidders are in the same room and can see each other's bids directly. In a distributed auction, the bidders are separated into different places, so we can use the contact between bidders to simulate communication between distributed computers.

Suppose each bidder takes part from home, sending bids by post, so that no bidder learns of another's bid unless the postal service delivers it. If mail delivery goes wrong, with letters arriving slowly or not at all, the whole auction cannot proceed.

The Paxos Approach to Solving Consensus

Paxos is an algorithm for solving the consensus problem. Implementations of Paxos sit at the heart of some world-class software, such as Cassandra, Google's Spanner database, and the distributed lock service Chubby. A system governed by Paxos is essentially concerned with the state of a value and with tracking it, with the goal of building distributed systems that are both highly available and strongly consistent.

A Paxos write requires two round trips: prepare/promise and propose/accept.

In the first round trip, the proposer (leader) sends a prepare message to all other servers; the round succeeds if a majority of servers reply with a promise. In the second round trip, the proposer sends the formal proposal (propose) to all servers; the write succeeds if a majority of servers reply that they have accepted it.

To describe this two-phase process in detail, let us first define some terms. A process is one of the computers in the system; people also use words such as replica or node for this. A client is a computer that is not a member of the system but asks the system what the current value is, or asks the system to take on a new value.

Paxos as built here is a small fragment of a distributed database: it only implements the process of writing exactly one new thing to the system. The processes governed by an instance of Paxos will either fail or know nothing at all, or else a majority of them will know the same value, which is consensus. Paxos does not really tell us how to use it to build a database or anything like that. It is only the procedure governing the communication between independent nodes as they carry out a decision about one new value. In other words, Paxos stores a single value, exactly once: after it has been set the first time, it cannot be changed.

Paxos Read Operation

The essence of Paxos lies in the write operation. The read operation is described before the write operation because reading is how we observe the key sign of Paxos, namely whether a majority of servers has reached consensus.

To read a value from a Paxos system, the client asks every process for the value it currently stores and then takes the value reported by a majority of the process servers. If too few nodes respond, or too few of them agree, the read fails. In other words, the client asks the nodes what their values are, the nodes return their values to the client, and once the client has received the same value from a majority of the nodes, the read succeeds and the client keeps the result.
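The read rule can be sketched in a few lines of Python. This is a minimal illustration, not a production client: `majority_read` is a hypothetical helper, and the replies are passed in as a plain list instead of being collected over a network.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_read(replies: Sequence[Optional[Hashable]],
                  cluster_size: int) -> Optional[Hashable]:
    """Return the value reported by a majority of the cluster, or None on failure.

    `replies` holds one entry per node that answered; nodes that did not
    answer are simply absent. A reply of None means "no value stored yet".
    """
    quorum = cluster_size // 2 + 1
    counts = Counter(r for r in replies if r is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    # The read succeeds only if a single value is held by a majority of
    # the whole cluster, not merely of the nodes that happened to answer.
    return value if votes >= quorum else None


print(majority_read(["08:00", "08:00", "13:00"], cluster_size=3))
print(majority_read(["08:00", "13:00"], cluster_size=3))
```

The first call succeeds because two of three nodes agree; the second fails because no value reaches the quorum of two.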

This is somewhat odd compared with a single-node system (only one server). In both cases the client needs to observe the state of the system, but in a single-node system, such as one MySQL or memcached server, the client simply goes to the one canonical place where the state is stored. In simple Paxos the client observes the state in the same way, but because there is no canonical place where the state is stored, it has to ask all the members, so that it can be sure the value reported by any one node is in fact the value held by the majority.

If the client asked only a single node, that process might be out of date and return the wrong value. A process can be out of date for many reasons: an unreliable network may have lost messages that should have been delivered to it, it may have recovered from a stale state, or the algorithm may still be running and the process simply has not received the message yet. In real Paxos implementations there is no need to read from every node each time; there are better ways to read, but they are extensions of the original Paxos algorithm.

Paxos Write Operation

Now let us look at what Paxos does for our cluster of processes when a client asks for a new value to be written. The following procedure writes only one value. Eventually we can use this procedure as a primitive that allows many values to be set one after another, but the basic Paxos algorithm governs the writing of just one new value, and is then simply repeated.

First, a client of the Paxos-governed system asks for a new value to be written. (In the original demo the client is drawn as a red circle and the other processes as blue circles.) Paxos guarantees that clients can send their write requests to any member of the Paxos cluster, so in the demo the client picks one of the processes at random. This property is important and ingenious: it means there is no single point of failure, so our Paxos-governed system can stay online and available even when any one node goes down or otherwise becomes unavailable. If we designated one particular node as "the proposer" or "the master", the entire system would grind to a halt whenever that node crashed.

When a write request is received, the Paxos process that received it "proposes" the new value to the system. A proposal is a formal concept in Paxos: a proposal to a Paxos-governed system may succeed or fail, and it is a required step for ensuring that consensus is maintained. The proposal is sent to the whole system by means of a prepare message from the process the client contacted to all the other processes.

Serial Number

The prepare message carries the proposed value together with what is called a serial number (also known as a sequence number). The serial number is generated by the proposing process and declares that the receiving processes should prepare to accept a proposal with that serial number. The serial number is the key: it distinguishes older proposals from newer ones. If two processes are both trying to set a value, Paxos says that the later proposal should take precedence, and the serial number lets the processes tell which proposal is the later one and therefore which value is the most recent.

The receiving processes can now perform a critical check: is the serial number on this incoming prepare message the highest I have ever seen? If it is, then fine, I am prepared to accept the incoming value and can disregard any others I heard of before. In the original demo, the client proposes a new value to some process every so often, and that process sends prepare messages to the other processes. Those processes then notice that each new proposal carries a higher serial number than the old ones, and let go of the older proposals.
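The check made by a receiving process can be sketched as follows. The `Prepare` and `Receiver` names are hypothetical, chosen for this illustration, and the sketch covers only the "highest serial number seen so far" test described above, not the rest of the protocol.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Prepare:
    serial: int   # serial number chosen by the proposing process
    value: Any    # the value being proposed


class Receiver:
    """A process on the receiving end of prepare messages."""

    def __init__(self) -> None:
        self.pending: Optional[Prepare] = None  # newest proposal seen so far

    def on_prepare(self, msg: Prepare) -> bool:
        """Return True if the process is now prepared to accept `msg`."""
        if self.pending is not None and msg.serial <= self.pending.serial:
            return False      # stale: an equal or newer proposal is already known
        self.pending = msg    # let go of the older proposal
        return True


r = Receiver()
print(r.on_prepare(Prepare(serial=5, value="08:00")))  # newest so far
print(r.on_prepare(Prepare(serial=3, value="13:00")))  # older, ignored
print(r.on_prepare(Prepare(serial=7, value="21:00")))  # newer, replaces serial 5
```

Comparing with `<=` instead of `<` also turns away a second proposal that reuses a serial number, which is one reason serial numbers have to be unique.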

This is the ordering at work (the prepare message is sent first). Without it, the members of the Paxos system could not tell which proposal they can confidently prepare to accept.

We could imagine a different consensus algorithm that skips this step of first sending a message to ask the other processes whether the value being set is the most recent one. Although simpler, such an algorithm would no longer meet the safety requirement of a consensus algorithm. Suppose two processes propose different values at exactly the same time. Nature often conspires against us: the packets may arrive in such a way that each proposer convinces half of the processes to accept its value, which may or may not be the right one. The system then ends up in a stalemate, with two groups of equal size holding different values, so that there is no majority value at all.

This stalemate is avoided by the first Paxos message exchange. Thanks to the serial numbers, one of the conflicting proposals has a lower serial number than the other, so the proposal with the higher serial number is unambiguously the one most processes accept: either they receive the higher serial number first and then reject the lower one when it arrives, or they receive the lower one first and replace it with the higher one. In this way Paxos uses serial numbers to control the ordering of events across the whole system. To summarize the exchange: the client first sends its request to a Paxos process, which issues a prepare message, and each Paxos process checks the serial number of the incoming proposal to decide whether it is prepared to accept the new value. All processes resolve conflicts in this way, and consensus is reached.
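The stalemate and its resolution can be illustrated with a toy simulation. Everything here is an assumption made for the example: four processes, two proposals with made-up values, and hand-picked arrival orders that split the cluster evenly.

```python
from collections import Counter

# Proposals are (serial_number, value) pairs; the values are made up.
A = (1, "08:00")
B = (2, "13:00")

# Each inner list is the order in which one process receives the proposals.
arrival_orders = [[A, B], [A, B], [B, A], [B, A]]


def first_wins(arrivals):
    """Naive rule with no serial numbers: keep whichever proposal arrives first."""
    return arrivals[0][1]


def highest_serial_wins(arrivals):
    """Paxos-style rule: a higher serial number replaces a lower one."""
    return max(arrivals)[1]  # tuples compare by serial number first


naive = Counter(first_wins(order) for order in arrival_orders)
ordered = Counter(highest_serial_wins(order) for order in arrival_orders)

print(dict(naive))    # an even split, so no value has a majority
print(dict(ordered))  # every process settles on the same proposal
```

Under the naive rule the outcome depends on packet timing, while under the serial-number rule every process reaches the same answer regardless of arrival order.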

Note: it is important to ensure that no two proposals use the same serial number. Each serial number must belong to exactly one proposal so that any two proposals can be ordered by comparing them. When implementing Paxos, globally unique, increasing serial numbers are often derived from the precise system time combined with the node's number in the cluster, so that they grow over time and never repeat.
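One possible way to build such serial numbers is sketched below. It is an assumption-laden illustration, not a prescribed scheme: the `SerialNumberGenerator` class is hypothetical, each node is assumed to have a fixed ID in the range 0 to cluster_size - 1, and the clock is read with Python's standard `time.time_ns()`.

```python
import time


class SerialNumberGenerator:
    """Produce serial numbers that are unique per node and always increasing."""

    def __init__(self, node_id: int, cluster_size: int) -> None:
        if not 0 <= node_id < cluster_size:
            raise ValueError("node_id must be in range(cluster_size)")
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.last_tick = 0

    def next(self) -> int:
        # Follow the clock when it moves forward; otherwise step by one, so
        # the result still increases if the clock stalls or jumps backwards.
        tick = max(time.time_ns(), self.last_tick + 1)
        self.last_tick = tick
        # Numbers from different nodes differ modulo cluster_size, so two
        # nodes can never produce the same serial number.
        return tick * self.cluster_size + self.node_id


gen = SerialNumberGenerator(node_id=1, cluster_size=3)
first, second = gen.next(), gen.next()
print(first < second)
```

Uniqueness here comes from the node ID, not from the clock; the clock only makes numbers issued later tend to be larger across nodes.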


Paxos Phase I: Prepare/Promise

The first phase of Paxos is prepare/promise. In the prepare stage, the proposed value is sent to every target node.

When a proposal reaches a target node, that process checks whether the proposal's serial number is the highest it has ever seen. If it is, the process sends back a promise not to accept any proposal with a lower serial number. This promise is a message from the promising process to the process that submitted the new value. The proposing process counts how many other processes have sent back promises and judges whether they form a majority. If a majority of processes have promised to accept this proposal (or one with a higher serial number), the proposing process knows it has the floor, because it has majority support, and the next step of the algorithm can proceed. If the nodes replying with a promise do not form a majority, consensus cannot be reached: the proposal is abandoned and the write operation requested by the client fails.

To decide whether a proposal has received enough promises, the proposer simply counts the promise messages it receives and compares the count with the number of node servers in the system. "Enough" means that a majority of the process servers (N/2 + 1, with N/2 rounded down) have replied with a promise within some period of time. If fewer than a majority return a promise, the proposal has not been approved by the majority. In that case the read algorithm described earlier could never find a majority either, so consensus cannot be reached and the proposal is abandoned. Other faults, including network partitions, can also prevent a majority from forming.
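The majority arithmetic is small enough to check directly. The helper names `quorum_size` and `has_majority` are hypothetical, introduced only for this sketch.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority: floor(N / 2) + 1."""
    return cluster_size // 2 + 1


def has_majority(promises_received: int, cluster_size: int) -> bool:
    """True if the promises counted so far form a majority of the cluster."""
    return promises_received >= quorum_size(cluster_size)


for n in (3, 4, 5):
    print(n, quorum_size(n))

# Any two majorities share at least one node, because together they
# contain more than N members.
print(all(2 * quorum_size(n) > n for n in range(1, 100)))
```

The overlap property is what makes majorities useful: two conflicting proposals cannot both gather a majority without at least one node seeing both.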

Paxos Phase II: Propose/Accept

When the prepare/promise stage is completed, it enters the propose/accept stage.

Once the proposer has obtained promises from a majority of the other process servers, it asks the promising processes to accept the new value they promised to accept. This is the "confirm commit" phase. If there are no conflicting proposals, failures, or partitions, the new proposal is accepted by all the other nodes, and the Paxos process is complete.
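Putting the two phases together, a single write attempt might look like the sketch below. It rests on several assumptions: `Acceptor` and `write` are hypothetical names, the processes live in one Python program and exchange direct method calls instead of network messages, and serial numbers are plain non-negative integers. It also includes one rule of standard Paxos that the description above does not spell out: a promise reports any value the process has already accepted, and the proposer must re-propose that value instead of its own.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1         # highest serial number promised so far
    accepted_serial: int = -1  # serial of the accepted proposal, -1 if none
    accepted_value: Any = None
    alive: bool = True         # False simulates a crashed process

    def on_prepare(self, serial: int) -> Optional[Tuple[int, Any]]:
        """Phase I: promise if `serial` is the highest seen so far."""
        if not self.alive or serial <= self.promised:
            return None
        self.promised = serial
        return (self.accepted_serial, self.accepted_value)

    def on_accept(self, serial: int, value: Any) -> bool:
        """Phase II: accept unless a higher serial number was promised since."""
        if not self.alive or serial < self.promised:
            return False
        self.promised = serial
        self.accepted_serial = serial
        self.accepted_value = value
        return True


def write(acceptors: List[Acceptor], serial: int, value: Any) -> Optional[Any]:
    """Run one write attempt; return the chosen value, or None if it failed."""
    quorum = len(acceptors) // 2 + 1

    replies = (a.on_prepare(serial) for a in acceptors)
    promises = [p for p in replies if p is not None]
    if len(promises) < quorum:
        return None  # not enough promises: the proposal is abandoned

    # If some process already accepted a value, keep the one with the
    # highest serial number instead of overwriting it with our own.
    prev_serial, prev_value = max(promises, key=lambda p: p[0])
    if prev_serial >= 0:
        value = prev_value

    accepts = sum(a.on_accept(serial, value) for a in acceptors)
    return value if accepts >= quorum else None


cluster = [Acceptor() for _ in range(3)]
print(write(cluster, serial=1, value="08:00"))
print(write(cluster, serial=2, value="13:00"))
```

The second call returns the value chosen by the first one, matching the earlier point that a Paxos instance stores a single value exactly once.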

The original demo for this phase is the same as the one for the promise phase with one step added at the end: the proposer sends the new value to the promising process servers so that they accept it.

The accepting processes may also fail. If, after replying with a promise, enough servers fail before receiving the accept message, Paxos is left in a bad state: some process servers have accepted the new value but not all. This is the inconsistency described earlier for the read operation: a client that tries to read the agreed value from a majority of the node servers finds that some of them report different values, which causes the read to fail. Paxos nevertheless remains consistent, because it never allows a write to take effect without consensus. In practice, the write is usually retried in this situation until a majority of nodes finally accepts it.

Summary

The Paxos algorithm ensures that write operations in a distributed system proceed safely and that a majority of the system always holds a consistent state, so clients never get to observe inconsistencies. Paxos is therefore characterized by consistency > availability.

Vector clocks are another mechanism used for replication. Their characteristic is availability > consistency, but when a conflict occurs they cannot resolve it by themselves as Paxos does; human intervention, in the form of hand-written resolution code, is needed.
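To show what a conflict looks like with vector clocks, here is a minimal comparison sketch. The `compare` function and the node names are hypothetical; a clock is assumed to be a mapping from node name to a counter, with missing entries treated as zero.

```python
from typing import Dict

Clock = Dict[str, int]


def compare(a: Clock, b: Clock) -> str:
    """Classify two vector clocks as equal, before, after, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither version descends from the other: a conflict


print(compare({"x": 1}, {"x": 2}))
print(compare({"x": 2, "y": 1}, {"x": 1, "y": 2}))
```

The "concurrent" result only detects the conflict. Deciding which version wins, or how to merge them, is left to application code.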

The Paxos algorithm was proposed by Leslie Lamport. Vector clocks were developed by Colin Fidge and Friedemann Mattern, building on Lamport's logical clocks.

Reprinted from Http://www.jdon.com/artichect/paxos.html
