In-depth analysis of distributed algorithms for NoSQL Databases
System scalability is the main force driving the development of NoSQL, covering distributed coordination, failover, resource management, and many other capabilities. Framed this way, NoSQL sounds like a big basket into which anything can be thrown. Although the NoSQL movement has not brought fundamental technological changes to distributed data processing, it has triggered an overwhelming amount of research and practice around various protocols and algorithms. It is through these efforts that some effective approaches to building databases have gradually been distilled. In this article, I will systematically describe the distributed features of NoSQL databases.
Next, we will study a number of distributed strategies, such as replication and failure detection. These strategies are divided into three parts:
- Data consistency. NoSQL systems must balance consistency and fault tolerance against the performance, low latency, and high availability demanded of distributed systems. Data consistency is generally a non-negotiable requirement, so this section focuses on data replication and data recovery.
- Data placement. A database product should be able to cope with different data distributions, cluster topologies, and hardware configurations. In this section, we will discuss how to distribute and rebalance data so that failures can be handled in a timely manner while ensuring durability, efficient querying, and balanced use of cluster resources such as memory and disk space.
- Peer-to-peer systems. Techniques such as leader election are used in many database products to achieve fault tolerance and strong data consistency. However, even decentralized databases (with no central node) need to track their global state and detect failures and topology changes. This section describes several techniques for keeping the system in a consistent state.
Data Consistency
As we all know, distributed systems frequently encounter network partitions and high latency. When a partition occurs, the isolated part becomes unavailable, so high availability cannot be maintained without sacrificing consistency. This fact is commonly referred to as the CAP theorem. However, consistency is very expensive in distributed systems, so concessions are often made on it, not only in favor of availability but also as part of a variety of other trade-offs. To study these trade-offs, we note that consistency problems in distributed systems arise from data partitioning and replication, so we start by examining the characteristics of replication:
- Availability. Under a network partition, the surviving parts of the system can still serve read/write requests.
- Read/write latency. Read/write requests can be processed in a short time.
- Read/write scalability. The read/write pressure can be evenly shared by multiple nodes.
- Fault Tolerance. The processing of read/write requests does not depend on any specific node.
- Data persistence. Node failures, within certain limits, do not cause data loss.
- Consistency. Consistency is far more subtle than the preceding properties, so we need to examine it from several different angles. We will not go deep into consistency theory or concurrency models, as that is beyond the scope of this article; instead I will use a simplified model with a few basic properties.
- Read/write consistency. From the read/write perspective, the basic goal of the database is to make the replica convergence time (the time it takes for an update to propagate to all replicas) as short as possible, guaranteeing eventual consistency. Beyond this weak guarantee, there are some stronger properties:
- Read-your-writes consistency. A write to data item X is always visible to subsequent reads of X.
- Monotonic read consistency. After a read of data item X, subsequent reads of X return the same value or a newer one.
- Write consistency. Partitioned databases frequently encounter write conflicts. The database should either handle such conflicts or guarantee that concurrent writes to the same item are not processed by different partitions. In this respect, databases provide several different consistency models:
- Atomic writes. If the database API exposes writes only as atomic single-value assignments, one way to avoid write conflicts is to determine the "latest version" of each value. This guarantees that all nodes end up with the same version regardless of the order in which updates arrive; network failures and latency often cause updates to reach different nodes in different orders. The data version can be expressed as a timestamp or a user-specified value. Cassandra uses this approach (a small last-write-wins sketch appears after this list).
- Atomic read-modify-write. Applications sometimes need to perform a read-modify-write sequence rather than an independent atomic write. If two clients read the same version of the data, modify it, and write the modified data back under the atomic-write model, the later update silently overwrites the earlier one. This behavior is incorrect in some cases (for example, when both clients append new values to the same list). Databases offer at least two solutions:
- Conflict prevention. A read-modify-write sequence can be viewed as a special case of a transaction, so consensus protocols such as distributed locking and Paxos can solve this problem. This technique supports both atomic read-modify-write semantics and transactions at arbitrary isolation levels. An alternative is to avoid distributed concurrent writes altogether and route all writes for a particular data item to a single node (either a global master or a per-partition master). To prevent conflicts, the database must then sacrifice availability under network partition. This approach is used in many systems that provide strong consistency guarantees (for example, most relational databases, HBase, and MongoDB).
- Conflict detection. The database tracks concurrent conflicting updates and either rolls one of them back or keeps both versions and returns them to the client for resolution. Concurrent updates are usually tracked with vector clocks (a form of optimistic locking) or by maintaining the full version history. This approach is used by Riak, Voldemort, and CouchDB (a vector-clock sketch also follows this list).
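To make the atomic-write model concrete, here is a minimal last-write-wins sketch in Python. The class and function names are illustrative assumptions; this shows the general technique, not Cassandra's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float  # supplied by the client or a coordinator

def resolve_lww(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Keep the 'latest version', regardless of arrival order."""
    return a if a.timestamp >= b.timestamp else b

# Two replicas receive the same pair of writes in different orders,
# yet both converge on the value with the highest timestamp.
w1 = VersionedValue("red", timestamp=100.0)
w2 = VersionedValue("blue", timestamp=105.0)
assert resolve_lww(resolve_lww(w1, w2), w1).value == "blue"
assert resolve_lww(resolve_lww(w2, w1), w2).value == "blue"
```

And a minimal vector-clock comparison, again with illustrative names, showing how a store in the conflict-detection camp can tell a superseded update from two genuinely concurrent ones that must be handed back to the client. It follows the general Dynamo-style scheme, not any specific product's code.

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (mappings node_id -> update counter)."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and not b_le_a:
        return "a happened before b"   # b supersedes a
    if b_le_a and not a_le_b:
        return "b happened before a"   # a supersedes b
    if a_le_b and b_le_a:
        return "equal"
    return "conflict"                  # concurrent: keep both versions for the client

# Nodes x and y updated the same item independently -> a conflict to resolve.
print(compare({"x": 2, "y": 1}, {"x": 1, "y": 2}))  # conflict
print(compare({"x": 1}, {"x": 2, "y": 1}))          # a happened before b
```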
Now let's take a closer look at common replication techniques and classify them according to the characteristics described above. The first figure depicts the logical relationships between the techniques and their trade-offs in terms of consistency, scalability, availability, and latency. The second figure shows each technique in detail.
The replication factor is 4. The read/write coordinator can be an external client or an internal proxy node.
We will now walk through all of these techniques, from weakest to strongest in terms of consistency:
(A, anti-entropy) The weakest consistency level follows this policy: a write updates any available node, and a read returns stale data whenever the new value has not yet reached the node being read through the background anti-entropy protocol (the anti-entropy protocol is described in detail in the next section). The main features of this approach are:
- High propagation latency makes it of limited use for data synchronization on its own; it is typically used only as an auxiliary mechanism to detect and repair unplanned inconsistencies. Cassandra, for example, uses an anti-entropy protocol to propagate database topology and other metadata between nodes.
- Poor consistency guarantees: write conflicts and read/write inconsistencies can occur even in the absence of failures.
- High availability and robustness under network partition. Asynchronous batching replaces update-by-update propagation, which improves performance.
- Weak durability guarantees, because new data initially exists on only a single replica.
(B) An improvement on the previous model: when any node receives an update request, it asynchronously forwards the update to all available replicas. This can be regarded as directed anti-entropy.
- Compared with pure anti-entropy, this method greatly improves consistency at only a small performance cost. However, the formal consistency and durability guarantees remain unchanged.
- If some replicas are unavailable because of network faults or node failures, the update will eventually reach them through the anti-entropy propagation process.
(C) The previous model can handle failed writes to individual nodes better by using the hinted handoff technique. Updates intended for a failed node are recorded on a surrogate node, along with a hint that they should be delivered as soon as the target node becomes available again. This improves consistency and reduces the replication convergence time.
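A toy illustration of hinted handoff (Python, with hypothetical names and in-memory nodes standing in for real replicas): a proxy node parks updates destined for a failed replica and replays them once that replica is reachable again.

```python
class Node:
    """An in-memory stand-in for a replica node."""
    def __init__(self, name):
        self.name, self.up = name, True
        self.store, self.hints = {}, []

def write(key, value, target, proxy):
    """Write to the target replica, or park a hint on a proxy node."""
    if target.up:
        target.store[key] = value
    else:
        proxy.hints.append((target, key, value))  # remember it for later

def deliver_hints(proxy):
    """Replay parked hints whose target nodes are reachable again."""
    remaining = []
    for target, key, value in proxy.hints:
        if target.up:
            target.store[key] = value
        else:
            remaining.append((target, key, value))
    proxy.hints = remaining

a, b = Node("a"), Node("b")
b.up = False
write("k", "v", target=b, proxy=a)  # b is down, so the hint lands on a
b.up = True
deliver_hints(a)
print(b.store)                      # {'k': 'v'}: the update reached b eventually
```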
(D, read-one write-one) Because the node holding the hinted handoff may itself fail before passing the update on, consistency in this case must be ensured through so-called read repair. Each read operation starts an asynchronous process that requests a digest of the data (such as a signature or hash) from all nodes storing it; if the digests returned by the nodes disagree, the data versions on those nodes are reconciled. We use the name read-one write-one for the combination of techniques A, B, C, and D: they provide no strict consistency guarantees, but they are self-contained enough to be used in practice.
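Below is a minimal read-repair sketch (Python; the Replica class and its in-memory store are assumptions of this illustration): the read itself is served by a single replica, digests are compared across all replicas, and any stale replica is brought up to date with the newest version found.

```python
import hashlib

class Replica:
    """In-memory replica; store maps key -> (value, timestamp)."""
    def __init__(self):
        self.store = {}

    def digest(self, key):
        value, ts = self.store.get(key, (None, 0))
        return hashlib.sha1(f"{value}:{ts}".encode()).hexdigest()

def read_with_repair(key, replicas):
    """Serve the read from one replica, then reconcile the rest."""
    answer = replicas[0].store.get(key, (None, 0))  # the actual "read one"
    digests = {r.digest(key) for r in replicas}     # asynchronous in real systems
    if len(digests) > 1:                            # replicas disagree
        newest = max((r.store.get(key, (None, 0)) for r in replicas),
                     key=lambda vt: vt[1])
        for r in replicas:                          # push the newest version out
            r.store[key] = newest
    return answer[0]

r1, r2 = Replica(), Replica()
r1.store["k"] = ("old", 1)
r2.store["k"] = ("new", 2)
read_with_repair("k", [r1, r2])      # this particular read still returns "old"...
print(r1.store["k"], r2.store["k"])  # ...but both replicas now hold ('new', 2)
```

Note that this particular read may still return the old value; as stated above, techniques A through D do not provide strict consistency guarantees, they only drive the replicas toward convergence.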
(E, read-quorum write-quorum) The strategies above are heuristic enhancements aimed at reducing the replica convergence time. To provide guarantees beyond eventual consistency, availability must be sacrificed to guarantee a certain overlap between the read and write sets. The usual practice is to write W replicas at the same time instead of one, and to read R replicas at the same time (a minimal quorum-coordinator sketch follows this list).
- First, durability can be managed by configuring the number of replicas written synchronously, W > 1.
- Second, when R + W > N, the set of nodes written and the set of nodes read must overlap, so at least one of the replicas read is up to date (W = 2, R = 3, N = 4 in the figure above). This guarantees consistency when read and write requests are issued sequentially (read-your-writes consistency for a single user), but not global read-after-read consistency. As shown in the figure below, with R = 2, W = 2, N = 3, because the replicas are not updated transactionally, a read issued while the update is still in flight may see two old values or a mix of old and new values:
- For a given read-latency requirement, choosing different values of R and W trades off write latency against durability, and vice versa.
- If W <= N/2, two concurrent writes may land on disjoint sets of nodes (for example, write A on the first N/2 nodes and write B on the last N/2). Setting W > N/2 guarantees that conflicts are detected promptly when the atomic read-modify-write with rollback model is used.
- Strictly speaking, although this scheme tolerates the failure of individual nodes, its fault tolerance under network partition is poor. In practice, the "sloppy quorum" approach is often used, improving availability in some scenarios at the cost of consistency.
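The quorum-coordinator sketch promised above (Python, illustrative names, in-memory replicas): the assertion R + W > N forces every read set to overlap every write set, so a read always sees at least one replica that holds the latest write.

```python
import random
import time

class QuorumCoordinator:
    """Illustrative R/W quorum coordinator over N in-memory replicas."""

    def __init__(self, n, r, w):
        assert r + w > n, "R + W must exceed N so read and write sets overlap"
        self.replicas = [dict() for _ in range(n)]  # key -> (value, timestamp)
        self.r, self.w = r, w

    def write(self, key, value):
        ts = time.time()
        # Synchronously update W replicas; the others catch up asynchronously.
        for replica in random.sample(self.replicas, self.w):
            replica[key] = (value, ts)

    def read(self, key):
        # Query R replicas and return the version with the highest timestamp.
        versions = [rep.get(key, (None, 0.0))
                    for rep in random.sample(self.replicas, self.r)]
        return max(versions, key=lambda vt: vt[1])[0]

db = QuorumCoordinator(n=4, r=3, w=2)
db.write("color", "blue")
print(db.read("color"))  # "blue": any 3 replicas include at least 1 of the 2 written
```

Lowering W below the quorum threshold would trade this guarantee for lower write latency, which is exactly the R/W tuning discussed above.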
(F) Read consistency problems can be alleviated by contacting all replicas during a read (reading the data or checking digests). This ensures that new data present on at least one node will be seen by the reader. Under a network partition, however, this guarantee no longer holds.
(G, master-slave) The techniques above are often used to provide consistency at the atomic-write or read-modify-write-with-conflict-detection level. To achieve the conflict-prevention level, some form of centralization or locking must be used. The simplest strategy is master-slave asynchronous replication: write operations for a particular data item are routed to a central node and executed sequentially. The master node then becomes a bottleneck, so the data must be split into independent shards (each shard with its own master) to provide scalability.
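A tiny sketch of the per-shard master idea (Python, hypothetical names): hashing the key selects a single master, so all writes to a given item are serialized on one node, which prevents conflicts at the cost of that node's availability.

```python
import hashlib

class ShardedMasters:
    """Route every write for a key to that key's single master node."""
    def __init__(self, masters):
        self.masters = masters  # one in-memory store per shard master

    def master_for(self, key):
        shard = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.masters)
        return self.masters[shard]

    def write(self, key, value):
        # All writes to the same key land on the same master, so they are
        # applied sequentially there and write conflicts are prevented.
        self.master_for(key)[key] = value

cluster = ShardedMasters([{}, {}, {}])
cluster.write("user:1", "Alice")
cluster.write("user:1", "Alicia")  # routed to the same master as the first write
print(cluster.master_for("user:1")["user:1"])  # Alicia
```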
(H, transactional read-quorum write-quorum and read-one write-all) Transaction control techniques can be used to avoid write conflicts when updating multiple replicas. The common method is the two-phase commit protocol. However, two-phase commit is not completely reliable, because a failed coordinator can leave resources blocked. The Paxos commit protocol is a more reliable option, at some cost in performance. A small further step from there is read-one write-all, which wraps the updates of all replicas in a single transaction; it provides strongly fault-tolerant consistency but gives up some performance and availability.
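A minimal two-phase commit sketch (Python, illustrative names, no real networking or crash recovery): the coordinator first collects prepare votes from every replica and then either commits everywhere or aborts everywhere. The blocking problem mentioned above arises when the coordinator fails between the two phases, leaving participants holding staged data they cannot release.

```python
class Participant:
    """A replica taking part in a two-phase commit (no crash handling here)."""
    def __init__(self):
        self.store, self.staged = {}, {}

    def prepare(self, key, value):
        self.staged[key] = value   # stage the write and vote "yes"
        return True                # a real participant could vote "no"

    def commit(self, key):
        self.store[key] = self.staged.pop(key)

    def abort(self, key):
        self.staged.pop(key, None)

def two_phase_commit(key, value, participants):
    # Phase 1: the coordinator collects votes from every replica.
    votes = [p.prepare(key, value) for p in participants]
    # Phase 2: commit only if all voted yes, otherwise roll everything back.
    if all(votes):
        for p in participants:
            p.commit(key)
        return True
    for p in participants:
        p.abort(key)
    return False

replicas = [Participant() for _ in range(3)]
print(two_phase_commit("balance", 42, replicas))  # True: all replicas hold the value
```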
Several trade-offs in the above analysis deserve emphasis:
- Consistency vs. availability. The strict trade-off is given by the CAP theorem. Under a network partition, the database must either keep the data centralized or accept the risk of data loss.
- Consistency vs. scalability. Even leaving aside the fact that read/write consistency requirements reduce the scalability of replica sets, only the atomic-write model can handle write conflicts in a relatively scalable way. The atomic read-modify-write model avoids conflicts by placing short-lived global locks on the data. This shows that dependencies between data items or operations, even over a very small scope or a short time, harm scalability. It is therefore very important to design the data model carefully and to store data in separate shards for the sake of scalability.
- Consistency vs. latency. As mentioned above, when a database needs to provide strong consistency or durability, it should lean toward reading and writing all replicas. These guarantees, however, are clearly at odds with request latency, so quorum-based techniques are a more acceptable middle ground.
- Failover vs. consistency/scalability/latency. Interestingly, the tension between fault tolerance and consistency, scalability, and latency is not severe. By giving up a reasonable amount of performance and consistency, a cluster can tolerate the failure of a number of its nodes. This compromise is evident in the difference between two-phase commit and the Paxos protocol. Another example of this compromise is adding specific consistency guarantees, such as read-your-writes using sticky sessions, which increases the complexity of failover.
Anti-Entropy Protocols and Gossip Algorithms
Let's start with the following scenarios:
There are many nodes, and each piece of data has replicas on several of them. Each node can process update requests independently, and each node periodically synchronizes its state with other nodes, so that after some time all replicas become consistent. How is this synchronization performed? When does synchronization start? How is the synchronization partner chosen? How is the data exchanged? We assume that when two nodes synchronize, the newer version of the data always overwrites the older one, or that both versions are kept for resolution at the application layer.
This problem arises in scenarios such as data consistency maintenance and cluster state synchronization (for example, propagating cluster membership information). Introducing a coordinator that monitors the database and schedules synchronization could solve it, but decentralized databases provide better fault tolerance. The main approach to decentralization is to use a carefully designed epidemic protocol, which is relatively simple yet provides good convergence time and can tolerate the failure of any node as well as network partitions. Although there are many kinds of epidemic algorithms, we focus only on anti-entropy protocols, because they are the ones NoSQL databases use.
Anti-entropy protocols assume that synchronization is executed on a fixed schedule: each node periodically picks another node, either at random or according to some rule, and exchanges data with it to eliminate differences. There are three styles of anti-entropy protocol: push, pull, and push-pull. The push protocol simply selects a random peer and sends it the local data state. Pushing all of the data is obviously wasteful in real applications, so nodes generally work in the way shown in the figure.
Node A, as the synchronization initiator, prepares a digest containing fingerprints of its data. After receiving the digest, node B compares it with its local data and returns a digest of the differences to A. Finally, A sends the corresponding updates to B, and B applies them. The pull and push-pull protocols work similarly, as the figure shows.
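A simplified push-pull round might look like the following (Python; representing each node's state as a plain dictionary of key to (value, timestamp) pairs is an assumption of this sketch): digests keep the exchange cheap, and the newer timestamp wins in both directions.

```python
import hashlib

def digest(store):
    """Fingerprint each key so whole values need not be exchanged."""
    return {k: hashlib.sha1(repr(v).encode()).hexdigest() for k, v in store.items()}

def push_pull_round(a, b):
    """One push-pull anti-entropy round between two key -> (value, ts) stores."""
    summary_a, summary_b = digest(a), digest(b)
    # B reports which of A's keys differ from (or are missing in) its own data.
    differing = [k for k, h in summary_a.items() if summary_b.get(k) != h]
    for k in differing:
        if b.get(k, (None, 0))[1] < a[k][1]:
            b[k] = a[k]            # push: A's version is newer
        elif a[k][1] < b[k][1]:
            a[k] = b[k]            # pull: B's version is newer
    for k in set(b) - set(a):
        a[k] = b[k]                # pull keys that only B has

n1 = {"x": ("old", 1), "y": ("only-on-1", 2)}
n2 = {"x": ("new", 5), "z": ("only-on-2", 3)}
push_pull_round(n1, n2)
print(n1 == n2)  # True: both nodes now hold x=("new", 5), y, and z
```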
Anti-entropy protocols provide reasonable convergence time and scalability. The figure shows simulation results for propagating an update through a cluster of 100 nodes, where in each iteration every node contacts only one randomly selected peer.
We can see that the pull approach converges better than push, which can be proven theoretically. Push also suffers from a "convergence tail" problem: after many iterations, although almost all nodes have been reached, a few remain unaffected. The push-pull approach is more efficient than simple push or pull, so it is the one usually used in practice. Anti-entropy is scalable because the average convergence time grows as a logarithmic function of the cluster size.
Although these techniques appear simple, a good deal of research has examined the performance of anti-entropy protocols under different constraints. Some of it replaces random peer selection with more efficient schemes that exploit the network topology. Other work adjusts the transmission rate or uses more sophisticated rules to choose which data to synchronize when network bandwidth is limited. Digest computation also poses challenges, so databases often maintain a log of recent updates to make it easier.