Peer structure and quorum mechanism: "Designing Data-Intensive Applications" reading notes 8

Source: Internet
Author: User
Tags: cassandra

The previous article covered many leader-related algorithms. You may have wondered: are kings and nobles born to rule? Since a leader brings so much trouble, why not simply use a peer model in which every replica is equal? This article discusses the quorum mechanism, which implements this kind of democracy across multiple replicas, and how it addresses the problems mentioned earlier.

1. No-leader mechanism

Some data storage systems abandon the leader mechanism and allow any replica to accept a user's write directly (for example Amazon's Dynamo and Facebook's Cassandra; although Facebook eventually gave up Cassandra in favor of HBase, Uber's strong involvement allowed Cassandra to shine in the open source community). Each node that receives a write request from a client becomes the coordinator node for that request, and the coordinator node does not enforce a specific write order. This design difference has a far-reaching impact on how the database is used and on its data model.

Reading and writing multiple replicas

How does the no-leader mechanism do without the leader role? The answer is simple: read from and write to multiple replicas. Let's look at an example:

Suppose the data system keeps three replicas. User 1234 sends the write to all three storage nodes in parallel. Two nodes accept the write, but the third node is offline, so the write to that replica fails. Two of the three replicas therefore confirm the write: once user 1234 receives two OK responses, the user considers the write successful and ignores the one failed replica. (Of course, the failed write is not simply ignored forever: a repair mechanism later fills in the missing data on that replica.)

Now suppose user 2345 starts reading the newly written data. Because the write failed on one node, user 2345 might get a stale value in response. To solve this problem, a user reading from the data system does not send the request to just one replica but sends read requests to multiple replica nodes in parallel. The user may get different responses from different nodes, that is, the latest value from one node and a stale value from another. Version numbers are used to determine which value is newer.
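
The write and read paths described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Replica` class, the `online` flag, and the functions `quorum_write` and `quorum_read` are hypothetical names, not the API of Dynamo or Cassandra, and the requests are issued one after another rather than truly in parallel.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[int, Optional[str]]  # (version number, value)


@dataclass
class Replica:
    """Hypothetical in-memory replica; `online` simulates node availability."""
    name: str
    online: bool = True
    data: Dict[str, Versioned] = field(default_factory=dict)

    def write(self, key: str, version: int, value: str) -> bool:
        if not self.online:
            return False  # the write to this replica fails
        if version > self.data.get(key, (0, None))[0]:
            self.data[key] = (version, value)
        return True

    def read(self, key: str) -> Optional[Versioned]:
        if not self.online:
            return None  # no response from this replica
        return self.data.get(key, (0, None))


def quorum_write(replicas: List[Replica], key: str, version: int,
                 value: str, w: int) -> bool:
    """Send the write to every replica; succeed once at least w acknowledge."""
    acks = sum(rep.write(key, version, value) for rep in replicas)
    return acks >= w


def quorum_read(replicas: List[Replica], key: str, r: int) -> Versioned:
    """Query every replica and return the response with the highest version."""
    responses = [resp for resp in (rep.read(key) for rep in replicas)
                 if resp is not None]
    if len(responses) < r:
        raise RuntimeError("read quorum not reached")
    return max(responses, key=lambda resp: resp[0])


nodes = [Replica("a"), Replica("b"), Replica("c")]
quorum_write(nodes, "x", 1, "old", w=2)
nodes[2].online = False                   # one node is offline during the write
ok = quorum_write(nodes, "x", 2, "new", w=2)
nodes[2].online = True                    # it comes back holding the old value
latest = quorum_read(nodes, "x", r=2)     # the version number picks the newer value
```

Here node "c" still stores version 1 after it comes back, but the reader compares version numbers across the responses and returns version 2.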

Replica repair

Because the no-leader mechanism can leave many stale values in the data system, a node needs a way to bring its own copy up to date. This process is called replica repair, and it is how the no-leader mechanism achieves eventual consistency. There are usually two ways to do this:

    • Read repair
      When a user reads from multiple nodes in parallel, some of the responses may contain stale values. Having discovered which nodes hold stale values, the user can actively write the newer value back to those nodes. This method is called read repair.

    • Anti-entropy process (entropy is actually a physics concept)
      Each data storage node runs a background process that compares its own copy with the copies on other nodes and proactively repairs its copy when it finds a stale value. Unlike a leader's sequential write log, the anti-entropy process does not replicate writes in any particular order, and there can be a significant delay before data is copied.
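
Both repair paths can be illustrated with a small Python sketch. It assumes each replica is just a dictionary mapping a key to a (version, value) pair with versions starting at 1; the function names `read_with_repair` and `anti_entropy` are made up for this example.

```python
from typing import Dict, List, Optional, Tuple

Store = Dict[str, Tuple[int, str]]  # key -> (version number, value)


def read_with_repair(replicas: List[Store], key: str) -> Optional[Tuple[int, str]]:
    """Read `key` from every replica, return the newest value, and write it
    back to any replica that returned an older version (or nothing)."""
    responses = [rep[key] for rep in replicas if key in rep]
    if not responses:
        return None
    newest = max(responses, key=lambda resp: resp[0])
    for rep in replicas:
        if rep.get(key, (0, ""))[0] < newest[0]:
            rep[key] = newest  # the reader repairs the stale replica
    return newest


def anti_entropy(local: Store, peer: Store) -> None:
    """One background pass: copy every entry for which the peer is newer.
    Entries are visited in dictionary order, not in original write order."""
    for key, (version, value) in peer.items():
        if local.get(key, (0, ""))[0] < version:
            local[key] = (version, value)


a = {"x": (2, "new")}
b = {"x": (2, "new")}
c = {"x": (1, "old")}              # this replica missed the latest write
read_with_repair([a, b, c], "x")   # afterwards c also holds (2, "new")
```

Read repair only fixes keys that are actually read, which is why a background pass such as `anti_entropy` is still useful for rarely read data.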

2. Quorum mechanism

In the example above, the write succeeded on two of the three replicas, and we considered the write operation successful. But what if the write had succeeded on only one of the three replicas? Would the write operation still count as successful?

The answer is no. The reason is a simple application of the pigeonhole principle; I will not give a mathematical proof here, but interested readers can work it out themselves.
Assume there are n replicas, each write must be confirmed as successful by w nodes, and each read queries r nodes (in the example above, n=3, w=2, r=2). As long as w + r > n, the set of nodes written and the set of nodes read must share at least one node, so a read is bound to see the most recently written data. This is what we call the quorum mechanism: every read and every write must reach a quorum.
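
The pigeonhole argument is short: if the w written nodes and the r read nodes had no node in common, they would occupy w + r distinct nodes, which is impossible when w + r > n. For small n the claim can also be checked by brute force. The sketch below enumerates every possible write set and read set; the function name `quorums_always_overlap` is made up for this example.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every set of w nodes shares a node with every set of r nodes."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)          # an empty intersection is falsy
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


print(quorums_always_overlap(3, 2, 2))  # True:  w + r = 4 > 3
print(quorums_always_overlap(3, 1, 2))  # False: w + r = 3, not greater than 3
```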

The values n, w, and r are usually configurable, so you can adjust them to your needs. A common choice is to make n an odd number (usually 3 or 5) and set w = r = (n + 1)/2. If w < n, writes can still be processed when up to n - w nodes are unavailable. Similarly, if r < n, reads can still be processed when up to n - r nodes are unavailable.

    • n=3, w=2, r=2: one unavailable node can be tolerated.
    • n=5, w=3, r=3: two unavailable nodes can be tolerated.
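
The arithmetic behind these two cases is small enough to write down. The helper names `majority_quorum` and `tolerated_failures` below are hypothetical, and the formula (n + 1)/2 is only meant for odd n, as in the text.

```python
from typing import Dict


def majority_quorum(n: int) -> int:
    """w = r = (n + 1) / 2 for an odd number of replicas n."""
    return (n + 1) // 2


def tolerated_failures(n: int, w: int, r: int) -> Dict[str, int]:
    """How many unavailable nodes still leave enough nodes for a quorum."""
    return {"writes": n - w, "reads": n - r, "both": min(n - w, n - r)}


for n in (3, 5):
    q = majority_quorum(n)
    print(n, q, tolerated_failures(n, q, q))
```

With n=3 this gives w = r = 2 and one tolerated failure; with n=5 it gives w = r = 3 and two tolerated failures, matching the list above.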

High availability with hinted handoff

The quorum mechanism implements the eventual consistency model, but some extreme situations are not handled well in terms of availability. For example, during a network disruption many nodes may still be working correctly, yet the client cannot reach enough of the n nodes on which the replicas should be stored. Fewer than w or r nodes respond, the quorum is not reached, and the read or write fails. Database designers therefore face a trade-off: can some mechanism achieve better availability?

In this case we can use hinted handoff. How is it implemented? Writes and reads still require w and r successful responses, but writes are no longer forced onto the n designated nodes. (This involves consistent hashing and data distribution; if that is unfamiliar, set it aside for now, as I plan to cover it in a later article.) As an analogy, if you lock yourself out of your home, you may knock on a neighbor's door and ask to stay on their couch for a while; once you find your key, you go home. In the same way, another node can temporarily hold a copy that belongs on a designated node, and once the network outage is repaired, it transfers that copy to the home node.
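
A minimal sketch of this idea in Python follows. Everything in it is an assumption made for illustration: the `Node` class, the `hints` list, and the functions `sloppy_write` and `hand_off` are not taken from any real database, and a real system would choose stand-in nodes through its data distribution scheme rather than from a fixed list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    reachable: bool = True
    data: Dict[str, str] = field(default_factory=dict)
    # Writes held on behalf of other nodes: (home node name, key, value).
    hints: List[Tuple[str, str, str]] = field(default_factory=list)


def sloppy_write(key: str, value: str, home_nodes: List[Node],
                 fallback_nodes: List[Node], w: int) -> bool:
    """Write to the home nodes; for each unreachable home node, park the
    write on a reachable stand-in together with a hint naming the home node."""
    acks = 0
    spares = [node for node in fallback_nodes if node.reachable]
    for home in home_nodes:
        if home.reachable:
            home.data[key] = value
            acks += 1
        elif spares:
            stand_in = spares.pop(0)
            stand_in.hints.append((home.name, key, value))
            acks += 1
    return acks >= w


def hand_off(stand_in: Node, nodes_by_name: Dict[str, Node]) -> None:
    """Deliver parked writes to home nodes that are reachable again."""
    remaining = []
    for home_name, key, value in stand_in.hints:
        home = nodes_by_name[home_name]
        if home.reachable:
            home.data[key] = value
        else:
            remaining.append((home_name, key, value))
    stand_in.hints = remaining
```

For example, with home nodes a, b, c where b and c are unreachable and w=2, the write fails without a stand-in but succeeds when a fourth node d accepts the copy meant for b; calling `hand_off` on d after b recovers moves the value to b and clears the hint.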

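This model keeps the quorum mechanism's requirement of w and r successful responses while greatly improving the availability of the system, and it is widely used by no-leader data systems. The price is that, until the handoff completes, a read from r of the designated nodes may not see a value that is temporarily held elsewhere.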

3. Write conflicts and the quorum mechanism

By design, the quorum mechanism allows concurrent reads and writes and tolerates network outages and latency spikes. But this inevitably brings consistency problems, so let's look at the following example:

There are two clients, A and B, both writing key X in a data storage system with three replicas. Node 1 receives the write from A but never receives the write from B because of a network outage. Node 2 first receives the write from A and then the write from B. Node 3 first receives the write from B and then the write from A. Node 2 thinks the final value of X is B's, while the other nodes think the final value is A's.

In such a scenario, deciding the result of a quorum write becomes a big problem. The ideas for resolving it are similar to those we mentioned earlier:
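
The divergence in this example is easy to reproduce. The sketch below assumes each node naively keeps whatever write arrived last; the helper `apply_in_arrival_order` and the `arrivals` table are made up for this illustration.

```python
from typing import Dict, List, Optional


def apply_in_arrival_order(writes: List[str]) -> Optional[str]:
    """Value a node ends up with if it simply applies writes as they arrive."""
    value = None
    for write in writes:
        value = write
    return value


arrivals: Dict[str, List[str]] = {
    "node1": ["A"],        # the write from B never arrives
    "node2": ["A", "B"],
    "node3": ["B", "A"],
}
final = {node: apply_in_arrival_order(ws) for node, ws in arrivals.items()}
print(final)  # {'node1': 'A', 'node2': 'B', 'node3': 'A'}
```

The three replicas accepted the same writes where they could, yet they disagree on the final value, which is why some conflict resolution rule is needed.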

    • Last write wins
      We can attach a timestamp to each write, pick the value with the largest timestamp as the most recent, and discard any value written with an earlier timestamp. This conflict resolution algorithm is called last write wins. It requires every write to be idempotent; otherwise writes will be lost. So how can we make sure that writes which depend on earlier values are not lost?

    • Merging "happens-before" relationships
      For any two operations A and B there are three possibilities: A happened before B, B happened before A, or A and B are concurrent. What we need is an algorithm that tells us whether two operations are concurrent. If one operation happened before another, the later operation should overwrite the earlier one; if the operations are concurrent, there is a conflict to resolve. How do we capture and merge "happens-before" relationships? The server node can maintain a version number, increment it on every write, and store the new version number together with the written value.
      • Client
        When a client reads a key, the server node returns all values that have not been overwritten, along with the latest version number. When a client writes a key, it must include the version number from its previous read and must merge all the values it received in that read.
      • Server
        When the server receives a write with a particular version number, it can overwrite all values with that version number or lower, because it knows they have been merged into the new value, but it must keep all values with a higher version number.
    • Version vector

Merging "happens-before" relationships as described uses a single version number to capture dependencies between operations, but that is not sufficient when multiple replicas accept writes concurrently. Instead, we need a version number per replica as well as per key. Each replica increments its own version number when processing a write and also keeps track of the version numbers it has seen from the other replicas. This information indicates which values to overwrite and which values to keep as siblings. The collection of version numbers from all replicas is called a version vector. Version vectors are sent from the data nodes to the client along with the values that are read, which lets us distinguish between overwrites and concurrent writes.
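
The three approaches in the list above can be sketched side by side in Python. All names here (`last_write_wins`, `VersionedKey`, `compare`) are hypothetical and chosen for this example; the sketch models a single key, leaves the merging of sibling values to the client, and represents a version vector as a dictionary from replica name to counter.

```python
from typing import Dict, List, Tuple


def last_write_wins(writes: List[Tuple[float, str]]) -> str:
    """Keep the value with the largest timestamp; all other writes are dropped."""
    return max(writes, key=lambda write: write[0])[1]


class VersionedKey:
    """One key on one server, with a single version number and sibling values."""

    def __init__(self) -> None:
        self.version = 0
        self.siblings: Dict[int, str] = {}  # version -> value not yet overwritten

    def read(self) -> Tuple[int, List[str]]:
        return self.version, list(self.siblings.values())

    def write(self, value: str, based_on: int = 0) -> int:
        """`based_on` is the version number the client got from its last read."""
        self.version += 1
        # Values at or below `based_on` were merged by the client: overwrite them.
        # Values with a higher version are concurrent and must be kept.
        self.siblings = {v: val for v, val in self.siblings.items() if v > based_on}
        self.siblings[self.version] = value
        return self.version


def compare(a: Dict[str, int], b: Dict[str, int]) -> str:
    """Compare two version vectors (replica name -> version number)."""
    replicas = set(a) | set(b)
    a_le_b = all(a.get(rep, 0) <= b.get(rep, 0) for rep in replicas)
    b_le_a = all(b.get(rep, 0) <= a.get(rep, 0) for rep in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b, so b may overwrite a
    if b_le_a:
        return "after"       # b happened before a, so a may overwrite b
    return "concurrent"      # neither saw the other: keep both as siblings
```

For instance, two clients that each write without having read first leave two siblings on a `VersionedKey`; a later write with `based_on=1` replaces only the first of them. For version vectors, `compare({"r1": 3}, {"r2": 1})` reports "concurrent" because each vector contains an update the other has not seen.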

4. Summary

At this point we have finished summarizing replication mechanisms in distributed systems: from the leader-follower mechanism, to the multi-leader mechanism, and finally to the no-leader mechanism, covering the implementation details as well as the advantages and disadvantages of each. I hope you got something out of reading it.
