This paper describes the PaxosLease algorithm, a distributed algorithm for lease negotiation. PaxosLease is based on the Paxos algorithm, but requires neither disk writes nor clock synchronization. PaxosLease is used for master lease negotiation in Keyspace, an open-source distributed replicated key-value store.
1. Introduction
In concurrent programming, locks are the basic primitive that processes use to synchronize access to shared resources. In a system where locks are granted without an expiration time (and without a supervising process), a holder that fails before releasing its lock can cause other processes to block indefinitely.
In highly available systems, it is desirable to avoid a single point of failure that blocks the entire system. In addition, recovering by restarting a distributed system is far more difficult than restarting a multithreaded program. Therefore, in distributed systems, leases replace locks to avoid starvation. A lease is a lock with an expiration time. If the lease holder fails or is disconnected from the other nodes, its lease expires automatically and another node can acquire it.
We make the following basic assumptions: the system consists of a set of proposers (the nodes that request leases) and a set of acceptors (the nodes that receive and answer those requests), each running its own algorithm; there are no Byzantine failures, that is, nodes do not cheat (and have not been compromised) by deviating from their algorithm. The number of acceptors is fixed.
A simple majority-voting algorithm can solve the distributed lease problem correctly, in the sense that at no time is the lease held by more than one node. However, this simple algorithm frequently blocks when there are multiple proposers, so a more sophisticated solution is required.
The naïve majority algorithm works as follows: the proposer starts a local timer with a timeout of T seconds and then sends a lease request carrying T to the acceptors. On receiving the request, an acceptor starts a timer of T seconds and then sends an accept message to the proposer. When its timer expires, the acceptor clears its state. If an acceptor receives a request while its state is not empty, it either does not respond or sends a reject message. To ensure that only one proposer can hold the lease at any time, a proposer must receive accept messages from a majority of acceptors; it then holds the lease until its local timer expires.
As noted above, with multiple proposers it is possible (and quite likely) that no proposer obtains a majority, so the proposers block each other. For example, take three proposers 1, 2, 3 and three acceptors A, B, C. If A accepts the request of 1, B accepts 2, and C accepts 3, then no proposer receives accepts from a majority. The system must wait for the timers to expire and the acceptors to clear their state before the proposers retry, and it is quite likely to block again.
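The split vote described above can be reproduced with a small simulation. This is a minimal sketch, not code from the paper: `NaiveAcceptor`, `run_round`, and the explicit `now` argument (used in place of real timers) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class NaiveAcceptor:
    """Acceptor of the naive scheme: grants the first request, rejects the rest."""
    owner: Optional[str] = None
    expires_at: float = 0.0

    def on_request(self, proposer_id: str, t: float, now: float) -> bool:
        # The acceptor clears its state once its T-second timer has expired.
        if self.owner is not None and now >= self.expires_at:
            self.owner = None
        if self.owner is not None:
            return False  # state not empty: reject (or simply stay silent)
        self.owner = proposer_id
        self.expires_at = now + t
        return True


def run_round(arrival_order: List[List[str]], t: float,
              now: float) -> Tuple[Dict[str, int], List[str]]:
    """arrival_order[i] lists proposer IDs in the order they reach acceptor i."""
    acceptors = [NaiveAcceptor() for _ in arrival_order]
    grants: Dict[str, int] = {}
    for acceptor, order in zip(acceptors, arrival_order):
        for proposer_id in order:
            if acceptor.on_request(proposer_id, t, now):
                grants[proposer_id] = grants.get(proposer_id, 0) + 1
    majority = len(acceptors) // 2 + 1
    winners = [p for p, n in grants.items() if n >= majority]
    return grants, winners


# The split from the example: A hears 1 first, B hears 2 first, C hears 3 first.
grants, winners = run_round(
    [["1", "2", "3"], ["2", "3", "1"], ["3", "1", "2"]], t=10.0, now=0.0)
print(grants, winners)  # prints {'1': 1, '2': 1, '3': 1} []
```

Every proposer ends up with one accept and nobody reaches the majority of two, so all three must wait for the timers to expire before retrying.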
The solution described in this paper follows the Paxos [1] approach and introduces prepare and propose phases, which avoids this type of blocking altogether [*]. Paxos solves the replicated state machine problem: each node holds a local copy of a state machine, and the nodes want to agree on the next state transition. Paxos is a majority-based algorithm, meaning that it can make progress as long as a majority of nodes are up and can communicate with each other. One Paxos instance agrees on a single state transition, so in practice multiple Paxos instances are run in succession to agree on a sequence of state transitions [3]. In Paxos, an acceptor must write its state to disk before sending a response. This ensures that once a value (state transition) is chosen it stays chosen; in other words, all state machines go through the same sequence of state transitions regardless of failures.
Unlike earlier Paxos-based distributed lease algorithms such as FaTLease [5], PaxosLease does not assume that the nodes' local clocks are synchronized (nor does it require global synchronization). In addition, FaTLease runs successive Paxos instances to negotiate leases, whereas PaxosLease exploits the temporary nature of leases to avoid this complexity entirely, which makes it a simpler and more elegant algorithm.
PaxosLease is a natural specialization of Paxos. As in Paxos, the number of nodes is assumed to be fixed (and the node identifiers are globally known). PaxosLease handles a special replicated state machine of the following form:
Figure 1: PaxosLease distributed state machine
To obtain the lease, a PaxosLease proposer proposes the value "node i holds the lease"; after the lease expires, the state automatically returns to "no node holds the lease". A proposer can also extend its lease by proposing "node i holds the lease" again before the previous lease expires, or (optionally) release the lease before it expires.
Like Paxos, PaxosLease handles essentially all the relevant failure conditions: nodes stopping and restarting, network partitions, message loss, message delay, and out-of-order delivery.
2. Definitions
A PaxosLease cell consists of proposers and acceptors. We assume that there are N acceptors and any number of proposers. In practice a node often plays both roles, but this is an implementation matter that does not affect the discussion here.
Proposers send prepare request and propose request messages to acceptors, and acceptors reply with prepare response and propose response messages. These messages have the following structure:
prepare request = ballot number
prepare response = ballot number, answer, accepted proposal
propose request = ballot number, proposal
propose response = ballot number, answer
A proposal consists of a ballot number and a lease. A lease consists of a proposer ID (the node that wants to become the lease owner) and a time interval T.
An acceptor stores the following state:
highest promised: the highest ballot number it has promised; the acceptor ignores messages whose ballot number is lower than this value
accepted proposal: the last proposal it accepted (ballot number and lease)
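The message and state definitions above can be written down as plain data types. The following is one possible sketch; the class and field names, the use of integers for ballot numbers and proposer IDs, and the initial value 0 for the highest promised ballot number are assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    proposer_id: int   # node that wants to become the lease owner
    duration: float    # time interval T, in seconds


@dataclass(frozen=True)
class Proposal:
    ballot_number: int
    lease: Lease


@dataclass(frozen=True)
class PrepareRequest:
    ballot_number: int


@dataclass(frozen=True)
class PrepareResponse:
    ballot_number: int
    accepted: bool                               # the "answer" field
    accepted_proposal: Optional[Proposal] = None  # None stands for "empty"


@dataclass(frozen=True)
class ProposeRequest:
    ballot_number: int
    proposal: Proposal


@dataclass(frozen=True)
class ProposeResponse:
    ballot_number: int
    accepted: bool


@dataclass
class AcceptorState:
    highest_promised: int = 0                     # assumes ballots are positive
    accepted_proposal: Optional[Proposal] = None
```

The messages are frozen because they are values in transit, while `AcceptorState` is mutable because the acceptor updates it as requests arrive.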
There is a globally known maximum lease time M. The lease interval T requested by a proposer always satisfies T < M.
Ballot numbers are globally unique, and the ballot numbers generated by each proposer increase monotonically. In practice, this can be achieved by composing the ballot number from a proposer ID field, a restart counter, and a counter of the requests issued (which must be able to handle the worst case). The restart counter is incremented and written to stable storage every time the proposer starts.
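One way to compose such a ballot number is to pack the three fields into a single integer. The sketch below is illustrative only: the field widths, the bit layout (restart counter in the highest bits, proposer ID in the lowest), and the names `make_ballot` and `BallotGenerator` are assumptions, and persisting the restart counter is left to the caller.

```python
ID_BITS = 8        # assumed width of the proposer ID field
REQUEST_BITS = 32  # assumed width of the request counter


def make_ballot(restart_counter: int, request_counter: int, proposer_id: int) -> int:
    """Pack (restart counter, request counter, proposer ID) into one integer."""
    if restart_counter < 0:
        raise ValueError("restart_counter must be non-negative")
    if not 0 <= request_counter < (1 << REQUEST_BITS):
        raise ValueError("request_counter out of range")
    if not 0 <= proposer_id < (1 << ID_BITS):
        raise ValueError("proposer_id out of range")
    return ((restart_counter << (REQUEST_BITS + ID_BITS))
            | (request_counter << ID_BITS)
            | proposer_id)


class BallotGenerator:
    """Yields increasing ballot numbers for one proposer during one run."""

    def __init__(self, proposer_id: int, restart_counter: int):
        # restart_counter is assumed to have been incremented and written to
        # stable storage by the caller before this object is created.
        self.proposer_id = proposer_id
        self.restart_counter = restart_counter
        self.request_counter = 0

    def next_ballot(self) -> int:
        self.request_counter += 1
        return make_ballot(self.restart_counter, self.request_counter,
                           self.proposer_id)
```

With this layout, two proposers can never generate the same ballot number because the low bits differ, and a restarted proposer always generates higher numbers than it did before the restart.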
PaxosLease guarantees the lease invariant: at any given point in time, at most one proposer holds the lease.
3. Basic Algorithm
This section describes the basic algorithm from the point of view of the proposer and of the acceptor. The proposer sends prepare and propose requests; the acceptor replies with prepare and propose responses. If everything goes well, obtaining the lease takes two rounds of communication.
A proposer wants to obtain the lease for T < M seconds. It generates a ballot number [request.ballotNumber] and sends a prepare request to a majority of acceptors.
On receiving a prepare request, the acceptor checks whether the ballot number [request.ballotNumber] is lower than the highest ballot number it has promised, [state.highestPromised]. If it is lower, the acceptor can discard the message or send a prepare response with a reject answer. If it is equal or higher, the acceptor constructs a prepare response with an accept answer that carries its currently accepted proposal [state.acceptedProposal], which may be empty. The acceptor sets its highest promised ballot number [state.highestPromised] to the ballot number of the request [request.ballotNumber] and sends the prepare response back to the proposer.
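The acceptor's handling of a prepare request can be sketched as follows. This is a minimal illustration under assumptions: responses are modelled as plain tuples `(ballot_number, accepted, accepted_proposal)`, an accepted proposal as a tuple `(ballot_number, proposer_id, T)`, and a rejection is returned explicitly rather than dropped.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

AcceptedProposal = Tuple[int, int, float]  # (ballot number, proposer ID, T)


@dataclass
class AcceptorState:
    highest_promised: int = 0
    accepted_proposal: Optional[AcceptedProposal] = None


def on_prepare_request(
    state: AcceptorState, ballot_number: int
) -> Tuple[int, bool, Optional[AcceptedProposal]]:
    """Return a prepare response (ballot number, answer, accepted proposal)."""
    if ballot_number < state.highest_promised:
        # Lower ballot: reject (an implementation may also drop the message).
        return (ballot_number, False, None)
    # Equal or higher ballot: promise it and report the accepted proposal,
    # which is None when the acceptor's state is empty.
    state.highest_promised = ballot_number
    return (ballot_number, True, state.accepted_proposal)
```

Note that the promise only raises `highest_promised`; the accepted proposal is reported but left untouched at this stage.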
The proposer examines the prepare responses from the acceptors. If a majority of acceptors respond with an empty accepted proposal, they are open to a new proposal, and the proposer can propose itself as the lease owner for T seconds. The proposer starts a timer with a timeout of T seconds and sends propose requests containing the ballot number and the lease (its own proposer ID and T).
Proposer::OnPrepareResponse()
{
  if (response.ballotNumber != state.ballotNumber)
    return // some other proposal
  if (response.acceptedProposal == empty)
    numOpen++
  if (numOpen < majority)
    return
  state.timeout = T
  SetTimeout(state.timeout)
  request.type = ProposeRequest
  request.ballotNumber = state.ballotNumber
  request.proposal.proposerID = self.proposerID
  request.proposal.timeout = state.timeout
  Broadcast(request)
}

Proposer::OnTimeout()
{
  state.ballotNumber = empty // set in Proposer::Propose()
  state.leaseOwner = false // set in Proposer::OnProposeResponse()
}
On receiving a propose request, the acceptor checks whether the ballot number [request.ballotNumber] is lower than the highest ballot number it has promised, [state.highestPromised]. If it is lower, the acceptor can discard the message or send a propose response with a reject answer. If it is equal or higher, the acceptor accepts the proposal: it starts a timer with a timeout of T seconds and sets its accepted proposal to the received proposal (discarding any previously accepted proposal). The acceptor then constructs a propose response with an accept answer and the ballot number [request.ballotNumber]. When the timer expires, the acceptor resets its accepted proposal to empty. The acceptor never resets its highest promised ballot number, except when it restarts.
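A sketch of the acceptor's propose handling follows. To keep it deterministic, the timer is modelled as an expiry timestamp checked against an explicit `now` argument instead of a real timer; that modelling choice, the tuple-shaped messages, and the class and method names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    highest_promised: int = 0
    accepted_proposal: Optional[Tuple[int, int, float]] = None  # (ballot, ID, T)
    expires_at: float = 0.0

    def expire(self, now: float) -> None:
        """Timer expiry: forget the accepted proposal, keep the promise."""
        if self.accepted_proposal is not None and now >= self.expires_at:
            self.accepted_proposal = None

    def on_propose_request(self, ballot_number: int, proposer_id: int,
                           duration: float, now: float) -> Tuple[int, bool]:
        """Return a propose response (ballot number, answer)."""
        self.expire(now)
        if ballot_number < self.highest_promised:
            return (ballot_number, False)  # reject, or drop the message
        # Accept: overwrite any earlier proposal and start the T-second timer
        # before the response is sent.
        self.accepted_proposal = (ballot_number, proposer_id, duration)
        self.expires_at = now + duration
        return (ballot_number, True)
```

For example, an acceptor that promised ballot 5 rejects a propose request with ballot 4, accepts one with ballot 5, and returns to an empty accepted proposal once `duration` seconds have passed, while `highest_promised` stays at 5.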
The proposer examines the propose responses. If a majority of acceptors reply with an accept answer, the proposer holds the lease until its local timer (the one it started before sending the propose requests) expires. The moment it receives the response that completes the majority is the moment it acquires the lease, and it can switch its internal state to "I hold the lease".
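The proposer side of this last step might look like the sketch below. The class layout, the `acceptor_id` argument used to ignore duplicate responses, and the explicit `now` timestamps are assumptions made for illustration.

```python
from typing import Optional, Set


class Proposer:
    def __init__(self, proposer_id: int, num_acceptors: int):
        self.proposer_id = proposer_id
        self.majority = num_acceptors // 2 + 1
        self.ballot_number: Optional[int] = None
        self.timer_expires_at = 0.0
        self.accepted_by: Set[str] = set()
        self.lease_owner = False

    def start_propose(self, ballot_number: int, duration: float, now: float) -> None:
        """Start the local T-second timer before propose requests go out."""
        self.ballot_number = ballot_number
        self.timer_expires_at = now + duration
        self.accepted_by = set()
        self.lease_owner = False

    def on_propose_response(self, acceptor_id: str, ballot_number: int,
                            accepted: bool, now: float) -> None:
        if ballot_number != self.ballot_number or not accepted:
            return  # response to some other proposal, or a rejection
        if now >= self.timer_expires_at:
            return  # local timer already expired: too late to take the lease
        self.accepted_by.add(acceptor_id)
        if len(self.accepted_by) >= self.majority:
            self.lease_owner = True  # "I hold the lease"

    def holds_lease(self, now: float) -> bool:
        # The lease ends when the timer started in start_propose expires.
        return self.lease_owner and now < self.timer_expires_at
```

The lease is measured from the moment the proposer started its own timer, not from the moment the majority was reached, so time spent waiting for responses shortens the usable lease.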
As the reader can see, acceptors do not write their state to stable storage; when restarted, a node starts in a blank state. To ensure that a restarting node does not break the lease invariant, it waits M seconds before rejoining the network. M is the globally known maximum lease time, and all nodes know that proposers always request a lease length of T < M seconds.
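The restart rule can be expressed as a small guard that a node consults before handling or sending any protocol message. This is only a sketch; `RestartGuard` and its method name are hypothetical, and the timestamps are passed in explicitly (a real node would more likely read a monotonic clock such as `time.monotonic()`).

```python
class RestartGuard:
    """Keeps a freshly restarted node out of the protocol for M seconds."""

    def __init__(self, started_at: float, max_lease_time: float):
        # Any lease the node may have voted for before the restart lasts
        # less than M seconds, so it has expired by started_at + M.
        self.ready_at = started_at + max_lease_time

    def may_participate(self, now: float) -> bool:
        return now >= self.ready_at
```

For example, with M = 60, a node restarted at time 100 stays silent until time 160; by then any proposal it accepted and forgot has timed out everywhere.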
It is important that the lease timeout is a time interval (relative time) rather than an absolute time, which means that only the proposer that acquired the lease knows that it holds the lease. The proposer cannot tell the other nodes that it has acquired the lease (as with the learn message of classical Paxos), because the other nodes cannot know how much time elapsed while the learn message was in transit. Therefore, only the proposer that acquired the lease knows that it owns the lease; all other nodes know only that they do not hold it. In other words, each proposer is in one of two states regarding the lease: "I do not hold the lease and do not know who does" and "I hold the lease". Of course, the outcome can still be sent to other nodes as a hint, which more advanced applications may exploit, but such uses are beyond the scope of this paper.
A proposer may fail to receive accepting responses from a majority, either when handling the prepare responses or when handling the propose responses (steps 3 and 5). In that case it can sleep for a while and then restart the algorithm from the first step with a higher ballot number.
4. Proof of the Lease Invariant
Let us start with an intuitive sense of why PaxosLease works. Figure 2 illustrates it: the proposer starts its timer before sending its propose requests, so the acceptors can only start their timers some time later, and each acceptor starts its timer before sending its propose response. Therefore, if a majority of acceptors have stored the proposal and started their timers, no other proposer can obtain the lease before the first proposer's timer expires, because the acceptors' timers expire later still. Hence no two proposers will simultaneously consider themselves the lease holder.
Figure 2: Timeline of a proposer acquiring the lease
More formally, PaxosLease guarantees the following: if proposer i issues a proposal with ballot number b and lease length T, starts its timer at time t_start, and receives accept responses from a majority of acceptors, then no other proposer can receive accept responses from a majority until t_end = t_start + T.
Proof: Assume proposer p has obtained the lease with ballot number b. It received prepare responses with an accept answer and an empty accepted proposal from a majority of acceptors, started its timer at time t_start, and at time t_acquire received propose responses with an accept answer from a majority of acceptors, so it holds the lease until t_end = t_start + T. Let A1 be the majority of acceptors that replied to p's prepare request with an empty prepare response, and let A2 be the majority of acceptors that accepted p's proposal and sent propose responses with an accept answer.
Part 1: Between t_acquire and t_end, no other proposer q can hold the lease with a ballot number b' < b. To hold the lease, q must have had its proposal accepted by a majority of acceptors A'2. Let a be an acceptor that is in both A'2 and A1. Because b' < b, a must have accepted q's proposal first and sent its prepare response to p afterwards; had it promised b first, it would have rejected the lower ballot b'. But a sent p an empty prepare response, so its state must have been empty, which means its timer had already expired. Since q started its own timer before a started its timer, q's timer had expired as well, so q had already lost its lease. The leases of p and q do not overlap.
Part 2: Between t_acquire and t_end, no other proposer q can obtain the lease with a ballot number b' > b. To hold the lease, q must receive empty prepare responses from a majority of acceptors A'1. Let a be an acceptor that is in both A'1 and A2. Because b < b', a must have accepted p's proposal first and sent its prepare response to q afterwards; had it promised b' first, it would have rejected p's propose request. But since a accepted p's proposal, it can send q an empty prepare response only if its state is empty again, which means its timer has expired. Since p started its timer before a started its timer, p's timer has expired as well, so p has already lost its lease. The leases of p and q do not overlap.
5. Liveness
Paxos-style algorithms such as PaxosLease are susceptible to dynamic deadlock: two proposers can keep generating ever higher ballot numbers and sending prepare requests, the acceptors keep raising their highest promised ballot number, and as a result neither proposer gets its proposal accepted. In practice, this is avoided by having proposers wait a short period of time before restarting the algorithm.
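A retry loop with a waiting period could be sketched as follows. The text only says that proposers wait a short time; randomizing the delay and doubling its upper bound on each attempt are assumptions added here so that two competing proposers are unlikely to retry in lockstep. `try_acquire` is a hypothetical callback that runs one prepare/propose round and reports whether the lease was obtained.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 1.0,
                  rng=random) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, upper)


def acquire_with_retry(try_acquire: Callable[[int], bool],
                       max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep,
                       rng=random) -> bool:
    """Run try_acquire(attempt) until it succeeds or attempts run out."""
    for attempt in range(max_attempts):
        # Each attempt is expected to use a fresh, higher ballot number.
        if try_acquire(attempt):
            return True
        sleep(backoff_delay(attempt, rng=rng))
    return False
```

The `sleep` and `rng` parameters exist only so that the loop can be exercised without real waiting, for instance by passing `sleep=delays.append` to record the chosen delays.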
One of the main advantages of Paxos-style algorithms is that there is no static deadlock of the kind described above for the naïve majority algorithm. There is no static deadlock because a proposer can overwrite an acceptor's state, while the algorithm still guarantees that a proposal accepted by a majority is not overwritten before it expires.
6. Extending Leases
In some cases it is important that a proposer, once it holds the lease, can keep holding it beyond the original lease time. A typical scenario is a distributed system in which the lease identifies the master node, and the same node is expected to remain master for a long time.
To accommodate this requirement, the proposer's algorithm needs to be modified. In step 3 (handling the prepare responses), if a majority respond with either an empty proposal or the proposer's own existing proposal (that is, its lease has not yet expired), it can propose itself as the lease holder again. This allows the proposer to extend its lease by O(T). The acceptor's algorithm needs no modification.
7. Releasing Leases
In the algorithm as described so far, the proposer's lease expires automatically after a certain amount of time. In some cases it is important to release the lease as soon as possible so that other nodes can acquire it. A typical example is distributed processing, where a worker acquires the lease on a resource, performs its operations, and then wants to release the lease as soon as possible so that other workers can acquire it.
To accommodate this requirement, the proposer can send a dedicated release message to the acceptors, containing the ballot number of the lease it wants to release. Before sending the release message, the proposer switches its internal state from "I hold the lease" to "I do not hold the lease". When an acceptor receives a release message, it checks whether the ballot number matches that of its accepted proposal. If it does, the acceptor clears its state; otherwise it does nothing. The proposer can also send the release message to other proposers as a hint that the lease is available.
8. Leases for Multiple Resources
The algorithm as described negotiates a lease for a single resource R. In practice, nodes deal with many resources, such as the leases used in distributed processing. PaxosLease can run a separate instance for each resource, with the messages and the proposer and acceptor state of each instance tagged with the resource identifier. With a node acting as both proposer and acceptor, each PaxosLease instance consumes no more than about 100 bytes, so a node with 1 GB of memory can handle on the order of ten million resource leases (1 GB / 100 bytes). Since PaxosLease additionally requires neither disk writes nor clock synchronization, it can be used in many scenarios that require fine-grained locking.
9. Implementation
Scalien's distributed replicated key-value store Keyspace [†] uses PaxosLease for master lease negotiation. Keyspace serves as a reference implementation of PaxosLease and contains many practical optimizations. Because it is released under the open-source AGPL license [6], interested readers are free to study the Keyspace implementation. Source code and binaries can be downloaded from http://scalien.com [‡].
10. Lineage
Leslie Lamport invented the Paxos algorithm in 1990, but it was only published in 1998. That paper, "The Part-Time Parliament" [1], proved too obscure for many readers, which led to a second paper, "Paxos Made Simple" [2]. Paxos solves the distributed consensus problem by introducing two phases, prepare and propose, and by having acceptors write their state to stable storage before responding to messages. Multiple rounds of Paxos can be run in sequence to coordinate the state transitions of a replicated state machine.
The papers "Paxos Made Live - An Engineering Perspective" [3] and "The Chubby Lock Service for Loosely-Coupled Distributed Systems" [4] describe Google's internal distributed stack built on Paxos, and made Paxos popular. In Google's Chubby, multiple rounds of Paxos are executed in sequence to reach agreement on the next write operation in a replicated database, which is another way of thinking about the replicated state machine.
The algorithm described in "FaTLease: Scalable Fault-Tolerant Lease Negotiation with Paxos" [5] solves the same problem as PaxosLease, but its structure is more complex because it mimics the multi-round Paxos described in the Google papers rather than using the simple acceptor state timeout of PaxosLease. In addition, FaTLease requires nodes to synchronize their clocks, which makes it less attractive for real-world use. PaxosLease was inspired by FaTLease and addresses these shortcomings.
References
[1] L. Lamport, The Part-Time Parliament, ACM Transactions on Computer Systems 16, 2 (May 1998), 133-169.
[2] L. Lamport, Paxos Made Simple, ACM SIGACT News 32, 4 (Dec. 2001), 18-25.
[3] T. Chandra, R. Griesemer, J. Redstone, Paxos Made Live - An Engineering Perspective, PODC '07: 26th ACM Symposium on Principles of Distributed Computing.
[4] M. Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems, OSDI '06: Seventh Symposium on Operating System Design and Implementation.
[5] F. Hupfeld et al., FaTLease: Scalable Fault-Tolerant Lease Negotiation with Paxos, HPDC '08, June 23-27, 2008, Boston, Massachusetts, USA.
[6] AGPL License. http://www.fsf.org/licensing/licenses/agpl-3.0.html
Notes
[*] Another workaround is to let the system block but introduce a "revoke" mechanism that allows a proposer to withdraw its request so that some other proposer can obtain the lease.
[†] Scalien's code is hosted on GitHub at https://github.com/scalien
[‡] This site no longer has any content; the Keyspace source code can be downloaded from https://github.com/scalien/keyspace.