From distributed consistency to consensus mechanisms (I): the Paxos algorithm


Starting from the CAP theorem of distributed systems, this article discusses distributed consistency, the consensus problem of blockchains, and their solutions.

A blockchain is, first of all, a large-scale distributed system. Its consensus problem is related to the consistency problem of distributed systems, but the two differ in an important way.
In ordinary engineering development, the system is assumed to contain faulty (fail-stop) nodes but no malicious (corrupt) nodes. A blockchain, especially a public chain, lands in the physical world and involves human nature and conflicts of interest, so problems of trust and malicious attacks are unavoidable.

Distributed consistency is handled as a consensus problem under node-failure conditions (messages may be lost or duplicated, but not corrupted), and the main solutions are the Paxos algorithm and the Raft algorithm derived from it.

I. The challenges of distributed systems

There is a classic CAP theorem about distributed systems.

The core idea of the CAP theorem is that any network-based data-sharing system can satisfy at most two of the following three properties: data consistency (Consistency), availability (Availability), and partition tolerance (Partition tolerance).

    • Consistency
      Consistency means that "all nodes see the same data at the same time": once an update operation succeeds and returns to the client, all nodes hold exactly the same data at that moment, i.e. every node has the latest version of the data.

    • Availability

Availability means that "reads and writes always succeed": the service stays available and responds within a normal time.
For an available distributed system, every non-faulty node must respond to every request; that is, any algorithm the system uses must eventually terminate. When partition tolerance is also required, this is a strong definition: even under severe network failures, every request must terminate.

    • Partition tolerance

Tolerance can also be translated as fault tolerance. Partition tolerance specifically means that "the system continues to operate despite arbitrary message loss or failure of part of the system": the system tolerates network partitions even when the partitions cannot reach each other. Partition tolerance is closely related to scalability; concretely, when a node or network partition fails, the system can still provide service that satisfies consistency and availability.

The way to increase partition tolerance is to replicate a data item to multiple nodes; after a partition occurs, the item may still be available in each region, so partition tolerance improves. But replicating data to multiple nodes raises a consistency problem: the copies on different nodes may diverge. To guarantee consistency, every write operation must wait until all nodes have written successfully, and that waiting creates an availability problem.

Suppose client 1 sends an instruction to the server to update the value of x, and client 2 then reads the value from the server. In the single-node case, that is, without a network partition, or with a simple transaction mechanism, we can guarantee that client 2 always reads the most recent value, so there is no consistency problem.

If another server is added to the system, a write may succeed on server 1 but fail on server 2, in which case client 1 and client 2 will read inconsistent values of x. To keep the value of x consistent, a write that cannot reach both servers must fail on both at the same time, which reduces the system's availability.
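
To make the trade-off concrete, here is a toy sketch (all names are illustrative; this is not a real storage system): requiring a write to succeed on both replicas preserves consistency but sacrifices availability, while letting the write succeed on one replica preserves availability but lets readers diverge.

```python
# Toy illustration of the CAP trade-off (hypothetical names, not a real system).
class Replica:
    def __init__(self):
        self.x = 0
        self.alive = True

    def write(self, value):
        if not self.alive:
            raise ConnectionError("replica unreachable")
        self.x = value

server1, server2 = Replica(), Replica()
server2.alive = False  # simulate a partition that isolates server 2

# Choosing consistency: the write must succeed everywhere or fail everywhere.
try:
    for replica in (server1, server2):
        replica.write(1)
except ConnectionError:
    server1.x = 0  # roll back: data stays consistent, but the write is unavailable

# Choosing availability: write wherever possible; readers now disagree.
server1.write(1)
print(server1.x, server2.x)  # 1 0 -> client 1 and client 2 read different values
```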

It follows that a distributed system cannot satisfy all three properties of the CAP theorem, consistency, availability, and partition tolerance, at the same time.

In a typical distributed system, multiple copies (replicas) of the data are kept to ensure high availability. Since network partitions are a fact of life, in practice one can only choose between availability and consistency. The CAP theorem describes the absolute case; in engineering, availability and consistency are not entirely antagonistic, and the focus is usually on how to improve availability while maintaining relative consistency.

II. Data consistency models

In the vast majority of Internet scenarios, strong consistency is sacrificed in exchange for high availability; such systems often only require "eventual consistency", as long as the time to converge is within a range acceptable to users.

Consistency can be viewed from two different perspectives, the server side and the client side, i.e. internal consistency and external consistency.
Without a global clock, absolute internal consistency is meaningless; in general, the consistency we discuss is external consistency. External consistency mainly concerns which value concurrent accesses observe after the data has been updated.

Strong consistency:
Once an update operation completes, any subsequent access by any process or thread returns the newly updated value. This is the friendliest model for users: whatever a user wrote last time is guaranteed to be read next time. By the CAP theorem, implementing it requires sacrificing availability.

Weak consistency:
The system does not guarantee that subsequent processes or threads will read the most recently updated value. The period between a successful write and the moment users can reliably read the updated data is called the "inconsistency window". After data has been written successfully, the system promises neither that the latest value can be read immediately nor how long it will take before it can be.

Eventual consistency:
A special case of weak consistency. The system guarantees that, in the absence of further updates, reads eventually return the value of the last update operation. In the absence of failures, the size of the inconsistency window is determined mainly by communication delay, system load, and the number of replicas.

The eventual consistency model can be subdivided into more models according to the guarantees it provides, including causal consistency and read-your-writes consistency.

III. Two-phase and three-phase commit

In a distributed system, nodes are physically independent of each other and communicate and coordinate over the network.
On a single node, for example in a relational database, the transaction mechanism guarantees that data operations satisfy ACID.
But mutually independent nodes cannot know precisely how transactions are executing on other nodes, so in theory two machines cannot reach a consistent state by themselves.

To keep data consistent across multiple machines in a distributed deployment, you must ensure that every data write either executes on all nodes or executes on none of them.
However, while a machine is executing its local transaction, it cannot know the results of the local transactions on other machines, so a node does not know whether the transaction should commit or roll back.

Implementing a distributed transaction therefore requires the current node to know the execution state of the tasks on the other nodes. The common solution is to introduce a "coordinator" component that uniformly schedules execution on all distributed nodes. The notable protocols are the Two-Phase Commit protocol (2PC) and the Three-Phase Commit protocol (3PC).

1. Two-phase commit protocol

The two phases are the commit-request (voting) phase and the commit phase.

    • Request phase
      During the request phase, the coordinator notifies the transaction participants to prepare to commit or cancel the transaction, and then the voting process begins.
      During voting, each participant informs the coordinator of its own decision: agree (the participant's local work executed successfully) or cancel (the local work failed).

    • Commit phase
      In this phase, the coordinator makes a decision based on the voting result of the first phase: commit or cancel.
      If and only if all participants agree to commit the transaction does the coordinator notify all participants to commit it; otherwise the coordinator notifies all participants to cancel it. Participants perform the corresponding action after receiving the coordinator's message. A minimal sketch of the coordinator's logic follows this list.
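
The following sketch shows the coordinator's two loops, one per phase (the participant methods prepare/commit/rollback are illustrative assumptions, not any real library's API):

```python
# Minimal two-phase commit coordinator (illustrative sketch, not production code).
def two_phase_commit(tx, participants):
    # Phase 1: request/voting phase -- ask every participant to prepare.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(tx))  # True = agree, False = cancel
        except ConnectionError:
            votes.append(False)          # an unreachable participant counts as cancel

    # Phase 2: commit phase -- commit only if *all* participants agreed.
    if all(votes):
        for p in participants:
            p.commit(tx)
        return "committed"
    for p in participants:
        p.rollback(tx)
    return "rolled back"
```

Note that the coordinator sits between the two loops as a single point of failure: if it crashes after collecting the votes, participants that have already voted "agree" keep holding their locks and block, which is exactly the problem described below.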

As you can see, the two-phase commit protocol has obvious problems:

    • Synchronous blocking
      During execution, all participating nodes hold their transactional resources exclusively; while they occupy public resources, third-party accesses to those resources are blocked.

    • Single point of failure
      Once the coordinator fails, the participants block indefinitely.

    • Data inconsistency
      In the second phase, suppose the coordinator sends a commit notification, but because of a network problem only a subset of the participants receive it and commit; the remaining participants, never notified, stay blocked. The data is then inconsistent.

2. Three-phase commit protocol

The three phases are CanCommit, PreCommit, and DoCommit.

Three-phase commit improves on two-phase commit in two ways:

    • It introduces a timeout mechanism on both sides. In 2PC only the coordinator has a timeout; 3PC adds a timeout mechanism to both the coordinator and the participants (see the sketch after this list).
    • It inserts a preparation phase between the original two phases, ensuring that the states of the participating nodes are consistent before the final commit phase.
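
As a sketch of the participant-side timeout (a simplification of the real protocol; the names and the 5-second value are illustrative): after a participant has acknowledged PreCommit, it knows that every participant voted yes, so if the final DoCommit message is lost it can commit on its own instead of blocking forever.

```python
# Sketch of a 3PC participant waiting for DoCommit (simplified illustration).
import queue

def await_do_commit(inbox: queue.Queue, timeout_seconds: float = 5.0) -> str:
    try:
        message = inbox.get(timeout=timeout_seconds)
    except queue.Empty:
        # Timeout: the coordinator (or the network) failed after PreCommit.
        # Every participant already voted yes, so commit instead of blocking.
        return "commit"
    return "commit" if message == "DoCommit" else "abort"
```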

IV. The Paxos algorithm

Neither two-phase nor three-phase commit solves the distributed consistency problem well; that had to wait until the Paxos algorithm was proposed. The Paxos protocol was first proposed by Leslie Lamport in 1990 and has since become the most widely used distributed consistency algorithm.

Mike Burrows, the author of Google's Chubby, said that there is only one consistency algorithm in the world, namely Paxos, and that all other algorithms are defective.

1. Node roles

In the Paxos protocol, there are three types of nodes:

    • Proposer: proposer

There can be multiple proposers, and a proposer proposes motions (values). A "value" can be any operation in practice, such as "modify the value of some variable to some value" or "set the current primary to some node"; the Paxos protocol abstracts these operations uniformly as values.
Different proposers can propose different or even contradictory values, for example one proposer proposes "set variable X to 1" while another proposes "set variable X to 2", but for a single round of the Paxos process, at most one value is approved.

    • Acceptor: approver

There are N acceptors. A value proposed by a proposer must obtain the approval of more than half (N/2 + 1) of the acceptors before it passes. Acceptors are completely peer-equal and independent of one another.

    • Learner: learner

A learner learns the approved value. "Learning" here means reading each acceptor's acceptance result: if a value has been accepted by more than half of the acceptors, the learner learns that value.

This is similar to a quorum (parliamentary) mechanism: a value needs the approval of W = N/2 + 1 acceptors, so a learner must read at least N/2 + 1 acceptors, and at most all N acceptors, to learn a value that has passed.
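
For example, with N = 5 acceptors the write quorum is W = 5/2 + 1 = 3, and any read of 3 acceptors must overlap any successful write in at least one acceptor. A quick check of the arithmetic (plain Python, purely illustrative):

```python
# Any write quorum and read quorum of size N//2 + 1 must intersect.
N = 5
W = N // 2 + 1    # 3 acceptors must approve a value for it to pass
R = N // 2 + 1    # a learner reads at least 3 acceptors
assert W + R > N  # 3 + 3 > 5, so the two sets share at least one acceptor
```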

2. Constraints

The three roles above are only a logical division; in practice, one node can play all three roles. Some articles also add a client role, which raises the issue but does not actually participate in the election process.

The proposer and the acceptor are the core roles of the algorithm. Paxos describes how, in a system composed of several proposers and acceptors, the nodes agree on one of the various values put forward by the proposers, while the learner merely "learns" the proposal that is eventually approved.

The Paxos protocol process must also satisfy several constraints:

    • An acceptor must accept the first proposal it receives;
    • If a proposal with value V is accepted by a majority of the acceptors, then every subsequently accepted proposal must also contain value V (V can be understood as the content of the proposal; a proposal consists of a value and a proposal number);
    • If a certain value is approved by one round of the Paxos protocol, subsequent rounds of Paxos can only approve that same value.

Each round of the Paxos protocol is divided into a preparation phase and an approval phase, and the proposers and the acceptors each follow their own procedure in both.

The interaction between proposers and acceptors consists of four types of messages, which correspond to the four steps of the algorithm's two phases:

    • Phase 1
      a) The proposer sends a Prepare message to more than half of the acceptors in the network
      b) Under normal circumstances, an acceptor replies with a Promise message
    • Phase 2
      a) When enough acceptors have replied with Promise messages, the proposer sends an Accept message
      b) Under normal circumstances, an acceptor replies with an Accepted message

3. Election process

    • Phase 1: Prepare phase

The proposer generates a globally unique, monotonically increasing ProposalID and sends a Prepare request to every machine in the Paxos cluster; the request carries no value, only the number N, i.e. the ProposalID.

When an acceptor receives a Prepare request, it checks whether the received ProposalID N is greater than the N of every proposal it has responded to before.
If so, it:
(1) persists N locally, remembered below as Max_N;
(2) replies to the request, attaching the accepted proposal with the largest N so far (if it has not accepted any proposal yet, the returned value is empty);
(3) promises never to accept any proposal whose number is smaller than Max_N.

If not, it does not reply, or replies with an error.
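
A minimal sketch of this acceptor logic (assuming for simplicity that "persisting" is an in-memory assignment; a real implementation must write Max_N and the accepted proposal to stable storage):

```python
# Sketch of an acceptor's Prepare handling (illustrative names).
class Acceptor:
    def __init__(self):
        self.max_n = -1             # highest ProposalID promised so far (Max_N)
        self.accepted_n = -1        # ProposalID of the last accepted proposal
        self.accepted_value = None  # value of the last accepted proposal

    def on_prepare(self, n):
        if n > self.max_n:
            self.max_n = n  # the promise: never accept anything below n
            # Reply with the highest-numbered proposal accepted so far, if any.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.max_n, None)  # or simply do not reply
```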

    • Phase 2: approval phase

P2a: the proposer sends Accept.
After a while, the proposer has collected a number of Prepare replies; there are several cases:
(1) The number of replies exceeds half the number of acceptors and every reply's value is empty: the proposer issues an Accept request carrying the value it chose itself.
(2) The number of replies exceeds half the number of acceptors and some replies carry a value: the proposer issues an Accept request carrying the value from the reply with the largest ProposalID (as its own proposal content).
(3) The number of replies is at most half the number of acceptors: the proposer generates a larger ProposalID and returns to step P1a.

P2b: the acceptor answers Accept.
After receiving an Accept request with number N, the acceptor checks:
(1) if N >= the Max_N it has recorded (typically they are equal), it replies that the commit succeeded, and persists N and the value;
(2) if N < its Max_N, it does not reply, or replies that the commit failed.

P2c: the proposer counts the votes.
After a while, the proposer has collected a number of "commit succeeded" Accept replies; there are several cases:
(1) The number of replies exceeds half the number of acceptors: the value has been committed successfully, and the proposer can broadcast to all proposers and learners, notifying them of the committed value.
(2) The number of replies is at most half the number of acceptors: the proposer generates a larger ProposalID and returns to step P1a.
(3) A "commit failed" reply is received: the proposer generates a larger ProposalID and returns to step P1a.
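
Putting P1a through P2c together, one proposer round might look like the following sketch, reusing the Acceptor class from above (quorum counting is simplified, messaging is modeled as direct calls, and retrying with a larger ProposalID is left to the caller):

```python
# P2b: the acceptor's Accept handling, extending the Acceptor sketch above.
def acceptor_on_accept(acceptor, n, value):
    if n >= acceptor.max_n:  # typically n == max_n
        acceptor.accepted_n, acceptor.accepted_value = n, value  # persist both
        return True          # reply: commit succeeded
    return False             # reply: commit failed (or no reply at all)

# P1a/P2a/P2c: one round of a single proposer (illustrative sketch).
def propose(acceptors, n, my_value):
    majority = len(acceptors) // 2 + 1

    # P1a: send Prepare(n) and collect Promise replies.
    promises = [r for a in acceptors
                if (r := a.on_prepare(n))[0] == "promise"]
    if len(promises) < majority:
        return None  # case (3): retry with a larger ProposalID

    # P2a: adopt the already-accepted value with the largest ProposalID,
    # if any; otherwise propose our own value.
    accepted = [(an, av) for _, an, av in promises if av is not None]
    value = max(accepted, key=lambda t: t[0])[1] if accepted else my_value

    # P2c: count "commit succeeded" replies.
    acks = sum(1 for a in acceptors if acceptor_on_accept(a, n, value))
    return value if acks >= majority else None  # None -> retry with larger n

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, my_value="set x = 1"))  # -> set x = 1
```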

4. Related discussions

The core ideas of the Paxos algorithm:
(1) It introduces multiple acceptors; a single acceptor would suffer the same single-point problem as the coordinator in 2PC. This avoids a single point of failure.
(2) A proposer preempts temporary access with a larger ProposalID; compared with 2PC, this prevents the crash of one proposer from blocking the whole system.
(3) It guarantees that for a given N, only one proposer can reach the second phase; proposers run sequentially in order of increasing ProposalID.
(4) It guarantees that a proposer with a newer ProposalID inherits the value of an earlier successful commit; the values carried as ProposalID increases form an inheritance relationship.

Why can Paxos keep running while fewer than half of the acceptors have failed?
(1) If no final value has been determined when up to half of the acceptors fail, all proposers can still compete to propose, and some proposal will eventually be committed successfully. After that, more than half of the acceptors will have committed that value.
(2) If a final value has already been determined when up to half of the acceptors fail, any later proposal must first run the Prepare phase, where it obtains the committed value and must commit with it, so the value can no longer be modified.

How is a unique proposal number produced?
"Paxos Made Simple" mentions that all proposers should draw their numbers from disjoint sets. For example, if the system has 5 proposers, each can be assigned an identity j (0 to 4), and the number a proposer uses for its i-th proposal is 5*i + j (i counts how many times this proposer has made a motion).

Three of Lamport's papers related to Paxos are recommended:
The Part-Time Parliament
Paxos Made Simple
Fast Paxos

2PC/3PC and the Paxos protocol are classic distributed protocols; once you understand them, learning other distributed protocols becomes much simpler.

