Source: http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/ ("Distributed Algorithms in NoSQL Databases")
Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and other capabilities. It sounds like a big umbrella, and it is.
Although it can hardly be said that the NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency.
Scalability is a major problem that NoSQL needs to solve.
NoSQL does not bring fundamental innovations to distributed data processing, but it brings a wave of distributed pragmatism: practical trials of many combinations of protocols and algorithms.
Data Consistency
It is well known and fairly obvious that in geographically distributed systems or other environments with probable network partitions or delays it is not generally possible to maintain high availability without sacrificing consistency, because isolated parts of the database have to operate independently in case of network partition. This fact is often referred to as the CAP theorem. However, consistency is a very expensive thing in distributed systems, so it can be traded not only for availability; it is often involved in multiple tradeoffs. To study these tradeoffs, we first note that consistency issues in distributed systems are induced by replication and the spatial separation of coupled data, so we have to start with the goals and desired properties of replication:
According to the CAP theorem, such systems often sacrifice consistency in exchange for availability, because it is very difficult to ensure data consistency in distributed systems. The author's point is that availability is not the only thing consistency can be traded against; the relevant replication properties are as follows.
- Availability. Isolated parts of the database can serve read/write requests in case of network partition.
- Read/write latency. Read/write requests are processed with minimal latency.
- Read/write scalability. Read/write load can be balanced across multiple nodes.
- Fault-tolerance. The ability to serve read/write requests does not depend on the availability of any particular node.
- Data persistence. Node failures within certain limits do not cause data loss.
Consistency
Consistency is a much more complicated property than the previous ones, so we have to discuss different options in detail. It is beyond the scope of this article to go deeply into theoretical consistency and concurrency models, so we use a very lean framework of simple properties.
Read-write consistency. From the read-write perspective, the basic goal of a database is to minimize the replica convergence time (how long it takes to propagate an update to all replicas) and guarantee eventual consistency. Besides these weak guarantees, one can be interested in stronger consistency properties:
- Read-after-write consistency. The effect of a write operation on data item X will always be seen by a successive read operation on X.
- Read-after-read consistency. If some client reads the value of a data item X, any successive read operation on X will always return that same or a more recent value.
Read consistency is easy to understand: it takes time to propagate an update to all replicas, so we want to minimize this convergence time and guarantee eventual consistency.
During the convergence window, however, reads against different replicas may return different values. Read-after-write and read-after-read consistency rule out the most surprising of these anomalies.
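The convergence window described above can be sketched in a few lines of Python (a toy model; all class and variable names are hypothetical): an update lands on one replica, and a read against another replica before propagation violates read-after-write consistency.

```python
class Replica:
    """A toy replica: key -> (version, value)."""
    def __init__(self):
        self.store = {}

    def write(self, key, value, version):
        self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))[1]

replica_a, replica_b = Replica(), Replica()

# The write lands on replica_a; propagation to replica_b is asynchronous.
replica_a.write("x", "new", version=1)

# A client reading replica_b inside the convergence window sees stale data,
# so read-after-write consistency is violated until the replicas converge.
stale = replica_b.read("x")

# Asynchronous propagation eventually delivers the update (eventual consistency).
replica_b.write("x", "new", version=1)
fresh = replica_b.read("x")
```

Minimizing the time between the `stale` and `fresh` reads is exactly the replica convergence goal the article states.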
- Write-write consistency
Write-write conflicts appear in case of database partition, so a database should either handle these conflicts somehow or guarantee that concurrent writes will not be processed by different partitions. From this perspective, a database can offer different consistency models:
- Atomic writes. If a database provides an API where a write request can only be an independent atomic assignment of a value, one possible way to avoid write-write conflicts is to pick the "most recent" version of each entity. This guarantees that all nodes will end up with the same version of the data, irrespective of the order of updates, which can be affected by network failures and delays. The data version can be specified by a timestamp or an application-specific metric. This approach is used, for example, in Cassandra.
- Atomic read-modify-write. Applications often do a read-modify-write sequence instead of independent atomic writes. If two clients read the same version of data, modify it, and write back concurrently, the latest update will silently override the first one in the atomic writes model. This behavior can be semantically inappropriate (for example, if both clients add a value to a list). A database can offer at least two solutions:
- Conflict prevention. Read-modify-write can be thought of as a special case of transaction, so distributed locking or consensus protocols like Paxos [20, 21] are both a solution. This is a generic technique that can support both atomic read-modify-write semantics and arbitrary isolated transactions. An alternative approach is to prevent distributed concurrent writes entirely and route all writes of a particular data item to a single node (global master or shard master). To prevent conflicts, a database must sacrifice availability in case of network partitioning and stop all but one partition. This approach is used in many systems with strong consistency guarantees (e.g. most RDBMSs, HBase, MongoDB).
- Conflict detection. A database tracks concurrent conflicting updates and either rolls back one of the conflicting updates or preserves both versions for resolution on the client side. Concurrent updates are typically tracked by using vector clocks [19] (which can be thought of as a generalization of optimistic locking) or by preserving an entire version history. This approach is used in systems like Riak, Voldemort, and CouchDB.
Write-write consistency is more complex: in the case of concurrent writes, different write operations may conflict and overwrite each other.
The first case is simple: independent write operations. When writing, you do not need to care about the previous value; you simply overwrite the state. The only problem to consider is ordering: we must make sure the "most recent" version wins. However, because of network failures and delays, updates may arrive in any order, so the winner is usually decided by a timestamp or an application-specific version metric.
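The timestamp-based "most recent wins" rule above can be sketched as a last-write-wins merge (an illustrative sketch; function and variable names are hypothetical). Because the merge is deterministic, replicas converge to the same value regardless of delivery order:

```python
def lww_merge(current, incoming):
    """Last-write-wins: current/incoming are (version, value) pairs.
    Keep the pair with the higher version (timestamp or app-specific metric)."""
    return incoming if incoming[0] > current[0] else current

# The same updates arrive at two replicas in different orders...
updates = [(2, "b"), (1, "a"), (3, "c")]

replica1 = (0, None)
for u in updates:
    replica1 = lww_merge(replica1, u)

replica2 = (0, None)
for u in reversed(updates):
    replica2 = lww_merge(replica2, u)

# ...but both converge on the highest-versioned write.
assert replica1 == replica2 == (3, "c")
```

Note the caveat implied by the text: if versions come from wall-clock timestamps, clock skew between nodes can crown the "wrong" write as most recent.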
The second case is more complicated. If read-modify-write is not coordinated, two clients may read the same value, modify it, and write back concurrently, so one update silently overwrites the other. This is a write-write conflict.
There are two solutions:
- Prevention in advance, favoring consistency: use distributed locks or the Paxos protocol, or let a master node serialize all concurrent writes, as HBase and MongoDB do.
- Detection after the fact, favoring availability: let each write produce its own version, and when versions conflict, resolve them on the client side, as Dynamo and CouchDB do.
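The detection approach relies on vector clocks, mentioned in the article. A minimal comparison routine (an illustrative sketch, not any particular database's implementation) shows how concurrent writes are distinguished from sequential ones: each write carries a clock `{node_id: counter}`, and if neither clock dominates the other component-wise, the writes were concurrent and both versions must be preserved.

```python
def dominates(a, b):
    """True if clock a is >= clock b in every component."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def compare(a, b):
    """Classify two vector clocks: newer, equal, or concurrent (conflict)."""
    if a == b:
        return "equal"
    if dominates(a, b):
        return "a_newer"
    if dominates(b, a):
        return "b_newer"
    return "concurrent"  # conflict: keep both versions for client-side resolution

# A sequential update dominates its predecessor...
assert compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}) == "a_newer"
# ...while updates made independently on two partitioned nodes conflict.
assert compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}) == "concurrent"
```

This is why vector clocks can be seen as a generalization of optimistic locking: a dominated clock is a stale read detected at write time.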
Now let's take a closer look at commonly used replication techniques and classify them in accordance with the described properties. The first figure below depicts logical relationships between different techniques and their coordinates in the system of the consistency-scalability-availability-latency tradeoffs. The second figure illustrates each technique in detail.
The attributes to be considered in distributed system design are given above: consistency, scalability, availability, and latency. Different designs are tradeoffs between these attributes. The first figure below shows the specific tradeoffs, while the second depicts each design in detail.
The replication factor is 4. It is assumed that the read/write coordinator can be either an external client or a proxy node within the database.
Let's go through all these techniques moving from weak to strong consistency guarantees:
- (A, anti-entropy) The weakest consistency guarantees are provided by the following strategy: the writer updates any arbitrarily selected replica, and the reader reads any replica, seeing old data until the new version is propagated via a background anti-entropy protocol (more on anti-entropy protocols in the next section). The main properties of this approach are:
- High propagation latency makes it quite impractical for data synchronization, so it is typically used only as an auxiliary background process that detects and repairs unplanned inconsistencies. However, databases like Cassandra use anti-entropy as a primary way to propagate information about database topology and other metadata.
- Consistency guarantees are poor: write-write conflicts and read-write discrepancies are very probable even in the absence of failures.
- Superior availability and robustness against network partitions. This schema provides good performance because individual updates are replaced by asynchronous batch processing.
- Persistence guarantees are weak, because new data are initially stored on a single replica.
- (B) An obvious improvement of the previous schema is to send an update to all (available) replicas asynchronously as soon as the update request hits any replica. It can be considered a kind of targeted anti-entropy.
- In comparison with pure anti-entropy, this greatly improves consistency with a relatively small performance penalty. However, formal consistency and persistence guarantees remain the same.
- If some replica is temporarily unavailable due to network failures or node failure/replacement, updates should eventually be delivered to it by the anti-entropy process.
- (C) In the previous schema, failures can be handled better using the hinted handoff technique [8]. Updates that are intended for unavailable nodes are recorded on the coordinator or any other node, with a hint that they should be delivered to a certain node as soon as it becomes available. This improves persistence guarantees and replica convergence time.
- (D, read-one-write-one) Since the carrier of hinted handoffs can fail before deferred updates are propagated, it makes sense to enforce consistency by so-called read repairs: each read (or a randomly selected subset of reads) triggers an asynchronous process that requests a digest (a kind of signature/hash) of the requested data from all replicas and reconciles inconsistencies if detected.
We use the term read-one-write-one for the combination of techniques A, B, C, and D. They do not provide strict consistency guarantees, but are efficient enough to be used in practice as a self-contained approach.
A, B, C, and D together can be called read-one-write-one: they trade consistency for availability, R/W latency, and scalability.
A is the most extreme: it guarantees only the weakest consistency and obtains the highest availability. It updates just one arbitrary replica and then relies entirely on anti-entropy to disseminate the update.
B reduces propagation latency: the update is sent to all replicas asynchronously, which improves propagation efficiency. The cost is a slight reduction in R/W latency and scalability, since the coordinator needs to know and contact the locations of all replicas.
In B the update is sent to all replicas asynchronously but is not guaranteed to succeed; if a replica fails, it simply catches up later through anti-entropy, so availability is not sacrificed.
C adds the hinted handoff technique, which improves synchronization with failed nodes and thus improves persistence and replica convergence time.
With hinted handoff, the update is temporarily stored on the coordinator or any other node, which watches the failed node; once the node recovers, the update is automatically delivered to it. "Hinted" conveys that the handoff is transparent to the client, which makes it a rather curious name.
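The handoff flow just described can be sketched in Python (a toy model; all class names and the `replay_hints` method are hypothetical): the coordinator stores updates destined for a down node together with a hint, and replays them when the node is back.

```python
class Node:
    def __init__(self, alive=True):
        self.alive = alive
        self.store = {}

class Coordinator:
    def __init__(self):
        self.hints = []  # deferred updates: [(target_node, key, value)]

    def write(self, replicas, key, value):
        for node in replicas:
            if node.alive:
                node.store[key] = value
            else:
                # Target unreachable: keep the update with a hint
                # naming the node it is intended for.
                self.hints.append((node, key, value))

    def replay_hints(self):
        """Deliver hinted updates to nodes that have recovered."""
        remaining = []
        for node, key, value in self.hints:
            if node.alive:
                node.store[key] = value
            else:
                remaining.append((node, key, value))
        self.hints = remaining

a, b = Node(), Node(alive=False)
coord = Coordinator()
coord.write([a, b], "x", 1)   # b is down: its update is hinted, client is unaffected
b.alive = True                # b recovers
coord.replay_hints()          # the hinted update finally reaches b
```

As the article notes, this improves persistence only up to a point: if the hint carrier itself fails before replay, read repair (technique D) is the backstop.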
D: when reading any replica, request digests from the other replicas and reconcile any inconsistency found. This greatly improves read consistency, but at some cost to availability and R/W latency.
- (E, read-quorum-write-quorum) The strategies above are heuristic enhancements that decrease replica convergence time. To provide guarantees beyond eventual consistency, one has to sacrifice availability and guarantee an overlap between read and write sets. A common generalization is to write synchronously to W replicas instead of one and touch R replicas during reading.
- First, this allows one to manage persistence guarantees by setting W > 1.
- Second, this improves consistency if R + W > N, because the synchronously written set will overlap with the set that is contacted during reading (in the figure above W = 2, R = 3, N = 4), so the reader will touch at least one fresh replica and select it as the result. This guarantees consistency if read and write requests are issued sequentially (e.g. by one client: read-your-writes consistency), but does not guarantee global read-after-read consistency. Consider the example in the figure below to see why reads can be inconsistent. In this example R = 2, W = 2, N = 3; however, writing two replicas is not transactional, so clients can fetch both old and new values until the write completes:
- Different values of R and W allow one to trade write latency and persistence against read latency, and vice versa.
- Concurrent writers can write to disjoint quorums if W <= N/2. Setting W > N/2 guarantees immediate conflict detection in the atomic read-modify-write-with-rollbacks model.
- Strictly speaking, this schema is not tolerant to network partitions, although it tolerates failures of separate nodes. In practice, heuristics like sloppy quorum [8] can be used to sacrifice the consistency provided by a standard quorum schema in favor of availability in certain scenarios.
A "sloppy quorum" counts writes to substitute nodes (delivered later via hinted handoff) toward the write quorum, so the quorum can still be met when some of the proper nodes have failed.
- (F, read-all-write-quorum) The problem with read-after-read consistency can be alleviated by contacting all replicas during reading (the reader can fetch data or check digests). This ensures that a new version of the data becomes visible to readers as soon as it appears on at least one node. Network partitions, of course, can lead to violation of this guarantee.
E and F can be called read-quorum-write-quorum. This design offers a good balance in the consistency/availability tradeoff and ensures eventual consistency under high availability; Amazon Dynamo uses this scheme.
As long as R + W > N, a read operation is guaranteed to touch at least one up-to-date replica. By adjusting R and W, one can trade write latency against read latency.
Ensuring W > N/2 lets you immediately detect a write conflict and roll back, but whether to eliminate conflicts at write time is a policy choice. Dynamo aims to be always writable and does not use this policy: conflicts are handed to the client for resolution at read time.
To tolerate network partitions, the sloppy quorum technique can be used.
Of course, while a write is still in progress, a read may not see the new data. To close this gap, you can use read-all-write-quorum.
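The two quorum conditions discussed above can be written down as executable checks (a sketch; the function names are hypothetical): R + W > N makes the read set overlap the latest write set, and W > N/2 makes any two write quorums overlap, so a write-write conflict is detected immediately.

```python
def read_overlaps_write(n, r, w):
    """True if every read quorum intersects every write quorum (R + W > N)."""
    return r + w > n

def writes_always_overlap(n, w):
    """True if two concurrent write quorums must share a node (W > N/2)."""
    return w > n / 2

# N=3, R=2, W=2: reads touch at least one fresh replica,
# and concurrent writers cannot succeed on disjoint node sets.
assert read_overlaps_write(3, 2, 2)
assert writes_always_overlap(3, 2)

# N=4, R=1, W=2: fast reads, but a read may miss the latest write,
# and two writers can each complete on disjoint pairs of nodes.
assert not read_overlaps_write(4, 1, 2)
assert not writes_always_overlap(4, 2)
```

Tuning R and W along these two inequalities is exactly the latency/persistence/consistency tradeoff the article describes.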
- (G, master-slave) The techniques above are often used to provide either the atomic-writes or the read-modify-write-with-conflict-detection consistency level. To achieve the conflict-prevention level, one has to use some kind of centralization or locking. The simplest strategy is master-slave asynchronous replication: all writes for a particular data item are routed to a central node that executes write operations sequentially. This makes the master a bottleneck, so it becomes crucial to partition data into independent shards to be scalable.
- (H, transactional read-quorum-write-quorum and read-one-write-all) The quorum approach can also be reinforced by transactional techniques to prevent write-write conflicts. A well-known approach is the two-phase commit protocol. However, two-phase commit is not perfectly reliable because coordinator failures can cause resource blocking. The Paxos commit protocol [20, 21] is a more reliable alternative, but at the price of a performance penalty. A small step forward and we end up with the read-one-write-all approach, where writes update all replicas in a transactional fashion. This approach provides strong fault-tolerant consistency, but at the price of performance and availability.
A through F give priority to high availability and support at most eventual consistency; G and H are strongly consistent solutions.
The simplest strongly consistent method is master-slave: the master serializes writes to avoid conflicts, as in HBase and MongoDB. Of course, the master then becomes a single point of failure.
For decentralized concurrent writes without a master node, strong consistency basically requires the two-phase commit protocol; to cope with coordinator failures, the Paxos commit protocol (which supports leader election) can be used.
One can also use the read-one-write-all or quorum approach.
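The two-phase commit protocol mentioned above can be sketched as follows (a bare-bones illustration; class and function names are hypothetical): the coordinator first asks every replica to prepare and vote, and applies the write everywhere only if all vote yes. The sketch also makes the weakness visible: if the coordinator dies between the phases, prepared replicas stay blocked, which is what Paxos commit addresses.

```python
class Replica2PC:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.value = None      # committed value
        self.prepared = None   # value staged during phase 1

    def prepare(self, value):
        """Phase 1: stage the value and vote yes/no."""
        if self.healthy:
            self.prepared = value
        return self.healthy

    def commit(self):
        """Phase 2 (success): promote the staged value."""
        self.value, self.prepared = self.prepared, None

    def abort(self):
        """Phase 2 (failure): discard the staged value."""
        self.prepared = None

def two_phase_commit(replicas, value):
    if all(r.prepare(value) for r in replicas):  # phase 1: voting
        for r in replicas:
            r.commit()                           # phase 2: commit everywhere
        return True
    for r in replicas:
        r.abort()                                # phase 2: abort everywhere
    return False

nodes = [Replica2PC(), Replica2PC(), Replica2PC()]
assert two_phase_commit(nodes, "v1")        # all vote yes -> committed on all
nodes[1].healthy = False
assert not two_phase_commit(nodes, "v2")    # one no-vote -> aborted, "v1" survives
```

This is the read-one-write-all flavor: every replica participates in the transaction, giving strong consistency at the cost of availability whenever any replica is down.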
Conclusion:
The author's treatment of consistency is quite thorough.
Read-write consistency is basically solved through the quorum approach, e.g. read-quorum-write-quorum. It is a quorum rather than "all" because of the R/W latency and availability tradeoff.
Write-write consistency arises mainly from read-modify-write, and is handled either by favoring availability (conflict detection) or by favoring consistency (conflict prevention).