In-depth analysis of NoSQL database distributed algorithms (with diagrams)

Source: Internet
Author: User
Tags: cassandra, redis, cluster

System scalability is the main driver behind the development of NoSQL, covering distributed system coordination, failover, resource management, and many other features. Framed this way, NoSQL sounds like a big basket into which anything can be thrown. Although the NoSQL movement has not brought fundamentally new techniques to distributed data processing, it has triggered an overwhelming amount of research and practice around various protocols and algorithms, and it is through these attempts that a number of effective database construction methods have gradually been distilled. In this article, I will try to describe the distributed features of NoSQL databases systematically.

Next, we will study a number of distributed strategies, such as replication and failure detection. These strategies are marked in italics and grouped into three parts:

1. Data consistency. NoSQL systems have to balance consistency and fault tolerance against the performance, low latency, and high availability required of distributed systems. Fundamentally these trade-offs revolve around data consistency, so this section focuses on data replication and data repair.
2. Data placement. A database product should be able to cope with different data distributions, cluster topologies, and hardware configurations. In this section we discuss how to distribute and rebalance data so that failures can be handled in time, and so that persistence, efficient querying, and balanced use of cluster resources (such as memory and disk space) are guaranteed.
3. Peer-to-peer systems. Techniques such as leader election are used in many database products to achieve fault tolerance and strong data consistency. However, even decentralized databases (those without a central coordinator) need to track their global state and detect failures and topology changes. This section describes several techniques for keeping the system in a consistent state.

Data Consistency

As we all know, distributed systems frequently face network partitions and latency. In such cases the isolated part becomes unavailable, so maintaining high availability without sacrificing consistency is impossible. This fact is commonly referred to as the "CAP theorem". However, consistency is very expensive in distributed systems, so we often have to make concessions on it, not only for availability but also for a variety of other trade-offs. To study these trade-offs, we note that the consistency problems of distributed systems are caused by data isolation and replication, so we start by studying the characteristics of replication:

  • Availability. When the network is partitioned, the surviving parts can still serve read/write requests.
  • Read/write latency. Read/write requests can be processed in a short time.
  • Read/write scalability. The read/write pressure can be evenly shared by multiple nodes.
  • Fault Tolerance. The processing of read/write requests does not depend on any specific node.
  • Data persistence. Node failures, within certain limits, do not cause data loss.
  • Consistency. Consistency is much more complex than the properties above, so we need to discuss several different points of view separately. We will not go deep into consistency theory or concurrency models, since that is beyond the scope of this article; instead we use a simplified framework with a few basic characteristics.
    • Read/write consistency. From the read/write perspective, the basic goal of the database is to keep the replica convergence time (the time it takes for an update to propagate to all replicas) as short as possible and to guarantee eventual consistency. Beyond this weak guarantee there are some stronger properties:
      • Read-after-write consistency. A write to data item X is always visible to subsequent reads of data item X.
      • Read-after-read consistency. After a read of data item X, subsequent reads of data item X return the same value or a newer one.
    • Write consistency. Partitioned databases often encounter write conflicts. The database should be able to handle such conflicts and guarantee that concurrent writes are not processed by different, disconnected partitions. Here the database can provide several different consistency models:
      • Atomic writes. If the database exposes an API in which a single write operation is an independent atomic assignment of a value, one way to avoid write conflicts is to pick the "latest version" of each data item. This guarantees that all nodes converge to the same version no matter in which order the updates arrive; network failures and latencies often cause nodes to apply updates in different orders. The data version can be expressed as a timestamp or as a user-specified value. Cassandra uses this approach.
      • Atomic read-modify-write. Applications sometimes need to perform a read-modify-write sequence rather than independent atomic writes. Suppose two clients read the same version of a data item, modify it, and write the modified data back; under the atomic-write model the later update simply overwrites the earlier one. This behavior is incorrect in some cases (for example, when both clients append a new value to the same list). The database can offer at least two solutions:
        • Conflict prevention. Read-modify-write can be treated as a transaction in a special case, so consensus protocols such as distributed locking or Paxos can solve the problem. This technique supports atomic read-modify-write semantics as well as transactions with arbitrary isolation levels. Another approach is to avoid distributed concurrent writes altogether and route all writes for a given data item to a single node (either a global master or a partition master). To prevent conflicts, the database must then sacrifice availability under network partitions. This approach is used in many systems with strong consistency guarantees (most relational databases, HBase, MongoDB, and others).
        • Conflict detection. The database tracks conflicting concurrent updates and either rolls one of them back or keeps both versions and hands them to the client for resolution. Concurrent updates are usually tracked with vector clocks (which can be seen as a generalization of optimistic locking) or by keeping a complete version history. Riak, Voldemort, and CouchDB take this approach. A minimal vector-clock sketch follows this list.
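
To make the conflict-detection idea concrete, here is a minimal vector-clock sketch in Java. The class, method names, and the use of string replica ids are assumptions made for illustration; real systems such as Riak or Voldemort have their own representations.

import java.util.HashMap;
import java.util.Map;

// Minimal vector-clock sketch: each replica increments its own entry on a
// local update, and two versions conflict when neither clock dominates the other.
class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // Record a local update on the given replica.
    void increment(String replicaId) {
        counters.merge(replicaId, 1L, Long::sum);
    }

    // True if this clock is >= the other clock for every replica entry.
    boolean dominates(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Concurrent (conflicting) versions: neither clock dominates the other.
    static boolean conflict(VectorClock a, VectorClock b) {
        return !a.dominates(b) && !b.dominates(a);
    }
}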

Now let's take a closer look at common replication techniques and classify them according to the characteristics described above. The first figure depicts the logical relationships between the techniques and the trade-offs each makes in terms of consistency, scalability, availability, and latency. The second figure shows each technique in detail.



The replication factor is 4. The read/write coordinator can be an external client or an internal proxy node.

We will walk through all of these techniques in order, from the weakest to the strongest consistency:

(A, anti-entropy) The weakest consistency, based on the following strategy: during a write, update any chosen node; if the new data has not yet reached the node being read through the background anti-entropy protocol, the read returns stale data. (The anti-entropy protocol is described in detail in the next section.) The main features of this approach are:

  • High propagation latency makes it of limited use for data synchronization on its own, so it is typically used only as an auxiliary mechanism to detect and repair unplanned inconsistencies. Cassandra uses an anti-entropy algorithm to propagate the database topology and other metadata between nodes.
  • Consistency guarantees are poor: write conflicts and read/write inconsistencies can occur even when nothing fails.
  • High availability and robustness under network partitions. Asynchronous batching replaces one-by-one updates, which improves performance.
  • Durability guarantees are weak, because new data initially exists on only a single replica.

(B) An improvement on the above approach: when any node receives an update request, the update is asynchronously sent to all available nodes. This can be seen as a form of targeted anti-entropy.

  • Compared with pure anti-entropy, this greatly improves consistency at the cost of only a small performance sacrifice; however, the formal consistency and durability guarantees remain the same.
  • If some nodes are unavailable at that moment because of network failures or crashes, the update will reach them later through the anti-entropy propagation process.

(C) In the previous scheme, hinted handoff can be used to handle failed writes to a node more gracefully. Updates intended for the failed node are recorded on an extra proxy node, with a hint that they should be delivered to the target node as soon as it becomes available again. This improves consistency and reduces the replication convergence time.
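
The following sketch illustrates the hinted-handoff idea described above. The Cluster interface, the Hint record, and all names are assumptions made for illustration; this is not the API of any particular database.

import java.util.ArrayDeque;
import java.util.Queue;

// When a replica is unreachable, the coordinator parks the update as a "hint"
// and replays it once the target node comes back online.
class HintedHandoff {
    record Hint(String targetNode, String key, String value) {}

    private final Queue<Hint> hints = new ArrayDeque<>();

    // Called when a write to targetNode fails; the update is stored as a hint.
    void storeHint(String targetNode, String key, String value) {
        hints.add(new Hint(targetNode, key, value));
    }

    // Called periodically; replays hints whose target node is reachable again.
    void replay(Cluster cluster) {
        for (int i = hints.size(); i > 0; i--) {
            Hint h = hints.poll();
            if (cluster.isAlive(h.targetNode())) {
                cluster.write(h.targetNode(), h.key(), h.value());
            } else {
                hints.add(h); // still down, keep the hint for the next round
            }
        }
    }

    // Assumed cluster interface, purely for illustration.
    interface Cluster {
        boolean isAlive(String node);
        void write(String node, String key, String value);
    }
}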

(D, read one, write one) Because the node responsible for hinted handoff may itself fail before the hinted updates are delivered, consistency must also be ensured through so-called read repair. Each read operation starts an asynchronous process that requests a digest of the data (such as a signature or hash) from all nodes storing that data; if the digests returned by the nodes disagree, the data versions on the nodes are reconciled. We use the name "read one, write one" for the combination of techniques A, B, C, and D: they do not provide strict consistency guarantees, but they can be used as a self-contained practical approach.
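
Below is a rough sketch of the read-repair flow just described. The Replica interface, the use of a timestamp to pick the newest version, and the synchronous repair loop are simplifying assumptions; a real system would normally perform the repair asynchronously.

import java.util.List;
import java.util.Objects;

// The coordinator asks every replica for a digest of the value and, if the
// digests disagree, pushes the newest version back to the replicas.
class ReadRepair {
    interface Replica {
        String digest(String key);                  // e.g. a hash of the stored version
        Versioned read(String key);                 // full value with its version/timestamp
        void write(String key, Versioned value);    // overwrite with a newer version
    }

    record Versioned(String value, long timestamp) {}

    static Versioned readWithRepair(String key, List<Replica> replicas) {
        // 1. Collect digests and check whether they all match.
        String first = replicas.get(0).digest(key);
        boolean consistent = replicas.stream()
                .allMatch(r -> Objects.equals(r.digest(key), first));

        // 2. Read the newest full version (here: highest timestamp wins).
        Versioned newest = replicas.stream()
                .map(r -> r.read(key))
                .max((a, b) -> Long.compare(a.timestamp(), b.timestamp()))
                .orElseThrow();

        // 3. If replicas disagreed, push the newest version back to them
        //    (done synchronously and to all replicas here, for simplicity).
        if (!consistent) {
            replicas.forEach(r -> r.write(key, newest));
        }
        return newest;
    }
}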

(E, read quorum, write quorum) The strategies above are heuristic enhancements aimed at reducing the replica convergence time. To guarantee stronger consistency, availability must be sacrificed to guarantee a certain amount of overlap between reads and writes. The usual practice is to write W replicas at the same time instead of one, and to read R replicas at the same time.

  • First, you can configure the number of write replicas to W > 1.
  • Second, because R + W > N, the sets of nodes written and read must overlap, so at least one of the replicas read is up to date (in the figure above, W = 2, R = 3, N = 4). This guarantees read-after-write consistency as long as reads and writes are issued sequentially (read-your-writes consistency for a single user), but it does not guarantee global read-after-read consistency. As shown in the figure below with R = 2, W = 2, N = 3, because the updates of the two replicas are not transactional, a read performed while an update is incomplete may see either two old values or a mixture of old and new values:


  • For a given read-latency requirement, tuning the values of R and W trades write latency against durability, and vice versa.
  • If W <= N/2, concurrent writes may land on disjoint sets of nodes (for example, write A goes to the first N/2 nodes and write B to the last N/2). Setting W > N/2 ensures that conflicts are detected promptly under the atomic read-modify-write model with rollbacks.
  • Strictly speaking, although this scheme can tolerate the failure of individual nodes, its fault tolerance against network partitions is poor. In practice, a "sloppy quorum" approach is often used to improve availability in some scenarios at the cost of consistency. A small configuration sketch of the quorum rules follows this list.
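
The quorum rules above can be captured in a few lines. The sketch below simply encodes the R + W > N and W > N/2 checks discussed in this list; the class and method names are made up for illustration.

// Quorum configuration sketch for the overlap rules discussed above.
class QuorumConfig {
    final int n;  // total number of replicas
    final int w;  // replicas that must acknowledge a write
    final int r;  // replicas that must answer a read

    QuorumConfig(int n, int w, int r) {
        this.n = n;
        this.w = w;
        this.r = r;
    }

    // Every read set intersects every write set, so a read sees the latest
    // acknowledged write (assuming requests are executed sequentially).
    boolean readYourWrites() {
        return r + w > n;
    }

    // Two concurrent write sets must intersect, so conflicting writes can be
    // detected rather than landing on disjoint replica groups.
    boolean detectsWriteConflicts() {
        return w > n / 2;
    }
}

// Example: N = 4, W = 2, R = 3 (as in the figure above) satisfies R + W > N,
// but W = 2 does not exceed N / 2, so two concurrent writes may still land on
// disjoint replica pairs.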

(F) The read-consistency problem can be alleviated by accessing all replicas (reading the data or checking digests) during a read. This guarantees that new data present on at least one node will be seen by the reader; under a network partition, however, this guarantee no longer holds.

(G, master/slave) The techniques above are often used to provide read-modify-write durability at the atomic-write or conflict-detection level. To reach the conflict-prevention level, a centralized manager or locking must be used. The simplest strategy is master-slave asynchronous replication: write operations on a given data item are routed to a central node and executed sequentially. The master node then becomes a bottleneck, so the data must be divided into independent shards (each shard with its own master) to provide scalability.

(H, transactional read quorum / write quorum and read one / write all) To update multiple replicas while avoiding write conflicts, transaction-control techniques can be used. The usual approach is a two-phase commit protocol; however, two-phase commit is not completely reliable, because a failed coordinator can leave resources blocked. The Paxos commit protocol is a more reliable option, at the cost of some performance. A further small step in the same direction is to read one replica and write all replicas: placing the updates of all replicas inside one transaction provides strong, fault-tolerant consistency, but loses some performance and availability.
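
A condensed two-phase-commit sketch for the "update all replicas in one transaction" idea follows. The Participant interface is an assumption, and the sketch omits the logging and timeout handling a real protocol needs, which is exactly where a failed coordinator can leave participants blocked.

import java.util.List;

// Coordinator-side view of two-phase commit across all replicas of a key.
class TwoPhaseCommit {
    interface Participant {
        boolean prepare(String key, String value); // vote yes/no, hold locks
        void commit(String key);
        void abort(String key);
    }

    static boolean write(String key, String value, List<Participant> replicas) {
        // Phase 1: ask every replica to prepare; any "no" vote aborts the transaction.
        boolean allPrepared = replicas.stream()
                .allMatch(p -> p.prepare(key, value));

        // Phase 2: commit everywhere or abort everywhere.
        if (allPrepared) {
            replicas.forEach(p -> p.commit(key));
        } else {
            replicas.forEach(p -> p.abort(key));
        }
        return allPrepared;
    }
}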

It is worth emphasizing the trade-offs that follow from the analysis above.
  • Consistency vs. availability. The strict trade-off is given by the CAP theorem: in the case of a network partition, the database must either make part of the data unavailable or accept the risk of data loss.
  • Consistency vs. scalability. Even though read/write consistency constrains the scalability of replica sets, only the atomic-write model can handle write conflicts in a relatively scalable way. The atomic read-modify-write model places temporary global locks on the data to avoid conflicts. This shows that dependencies between data items or operations, even over a very small scope or a short time, hurt scalability. Careful data-model design and storing data in separate shards are therefore very important for scalability.
  • Consistency vs. latency. As mentioned above, when a database needs to provide strong consistency or durability, it should lean toward reading and writing all replicas. However, consistency is clearly inversely related to request latency, so using only a quorum of replicas is usually the more acceptable approach.
  • Failover vs. consistency/scalability/latency. Interestingly, the tension between fault tolerance and consistency, scalability, and latency is not severe: by giving up some performance and consistency, a cluster can tolerate the failure of a certain number of nodes. This compromise is evident in the difference between two-phase commit and the Paxos protocol. Another example of this compromise is strengthening specific consistency guarantees, such as "read your own writes" through sticky sessions, which increases the complexity of failover.

Anti-entropy protocols and gossip (rumor propagation) algorithms

Let's start with the following scenario:

There are many nodes, and each piece of data has replicas on several of them. Each node can handle update requests independently, and each node regularly synchronizes its state with other nodes, so that after a while all replicas become consistent. How is the synchronization performed? When does synchronization start? How is the synchronization partner chosen? How is data exchanged? We assume that the two nodes always overwrite older data with the newer version, or that both versions are kept and handled at the application layer.

This problem occurs in scenarios such as data consistency maintenance and cluster state synchronization (for example, propagating cluster membership information). Although introducing a coordinator that monitors the database and issues a synchronization plan can solve this problem, a decentralized database provides better fault tolerance. The main approach to decentralization is to use a carefully designed epidemic protocol, which is relatively simple but provides good convergence time and can tolerate the failure of any node as well as network partitions. Although there are many kinds of epidemic algorithms, we focus only on the anti-entropy protocol, because it is the one NoSQL databases use.

The anti-entropy protocol assumes that synchronization is executed on a fixed schedule: each node periodically selects another node at random, or according to some rule, and exchanges data with it to resolve differences. There are three styles of anti-entropy protocol: push, pull, and hybrid. The principle of the push protocol is simply to select a random node and send it the local data state. Pushing all data out is obviously foolish in real applications, so nodes generally work in the way shown in the figure below.


Node A, as the synchronization initiator, prepares a data digest containing fingerprints of its data. After receiving the digest, node B compares it with its local data and returns a digest of the differences to A. Finally, A sends the updates to B and B applies them. The pull and hybrid protocols work similarly, as shown in the figure.
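
A minimal push-style anti-entropy round, following the digest exchange just described, might look like the sketch below. The per-key version numbers standing in for data fingerprints, and all class and field names, are assumptions for illustration; both stores live in one process here, so the three network messages are collapsed into local steps.

import java.util.HashMap;
import java.util.Map;

class AntiEntropy {
    // Key-value data held by one node, with a version per key as its "fingerprint".
    static class Store {
        final Map<String, Long> versions = new HashMap<>();
        final Map<String, String> values = new HashMap<>();
    }

    // One push round from node A to node B.
    static void pushRound(Store a, Store b) {
        // 1. A sends B a digest: just keys and version fingerprints, not values.
        Map<String, Long> digest = new HashMap<>(a.versions);

        // 2. B compares the digest with its local state and replies with the
        //    set of keys for which A appears to hold newer data.
        Map<String, Long> stale = new HashMap<>();
        digest.forEach((key, version) -> {
            Long local = b.versions.get(key);
            if (local == null || local < version) {
                stale.put(key, version);
            }
        });

        // 3. A ships the full values for the stale keys and B applies them.
        stale.keySet().forEach(key -> {
            b.values.put(key, a.values.get(key));
            b.versions.put(key, a.versions.get(key));
        });
    }
}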

The anti-entropy protocol provides adequate convergence time and scalability. The figure below shows simulation results for propagating an update in a cluster of 100 nodes; in each iteration, each node contacts only one randomly selected peer.


We can see that the pull style converges better than push, which can be proven theoretically. Push also suffers from a "convergence tail" problem: after many iterations, although almost all nodes have been reached, a small fraction remains unaffected. Compared with simple push or pull, the hybrid approach is more efficient, so it is usually used in practice. Anti-entropy is scalable because the average convergence time grows as a logarithmic function of cluster size.

Although these technologies seem simple, there are still many studies focusing on the performance of anti-entropy protocols under different constraints. One of them replaces random selection with a more effective structure using a network topology. Adjust the transmission rate or use advanced rules to select the data to be synchronized when the network bandwidth is limited. Abstract computing is also facing challenges. The database maintains a recently updated log to facilitate abstract computing.

Eventually Consistent Data Types

In the previous section we assumed that two nodes can always merge their data versions. However, resolving update conflicts is not easy, and making all replicas converge to a semantically correct value is surprisingly hard. A well-known example is that deleted entries in Amazon's Dynamo database could reappear.

Let's consider an example to illustrate the problem: a database maintains a logically global counter, and each node can increment or decrement it. Although each node can maintain its own value locally, these local counts cannot be combined simply by adding and subtracting them. Suppose there are three nodes A, B, and C, and each performs one increment. If A obtains the value from B and adds it to its local copy, then C obtains the value from B, and then C obtains the value from A, C ends up with 4, which is wrong. One way to solve this is to use a data structure similar to a vector clock that maintains a pair of counters for each node:

class Counter {
    static final int MAX_ID = 16;      // maximum number of nodes (assumed constant)

    int[] plus = new int[MAX_ID];      // increments observed per node
    int[] minus = new int[MAX_ID];     // decrements observed per node
    int nodeId;                        // index of the local node

    Counter(int nodeId) {
        this.nodeId = nodeId;
    }

    void increment() {
        plus[nodeId]++;
    }

    void decrement() {
        minus[nodeId]++;
    }

    // Current value = all increments minus all decrements, across all nodes.
    int get() {
        int sum = 0;
        for (int i = 0; i < MAX_ID; i++) {
            sum += plus[i] - minus[i];
        }
        return sum;
    }

    // Merging takes the per-node maximum, so applying the same state twice
    // or merging in any order converges to the same result.
    void merge(Counter other) {
        for (int i = 0; i < MAX_ID; i++) {
            plus[i] = Math.max(plus[i], other.plus[i]);
            minus[i] = Math.max(minus[i], other.minus[i]);
        }
    }
}
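
A quick usage sketch of the structure above, replaying the three-node scenario described earlier (it assumes the Counter class above with MAX_ID >= 3 and node ids 0, 1, and 2):

class CounterDemo {
    public static void main(String[] args) {
        Counter a = new Counter(0);
        Counter b = new Counter(1);
        Counter c = new Counter(2);
        a.increment();
        b.increment();
        c.increment();

        a.merge(b);   // A learns about B's increment
        c.merge(b);   // C learns about B's increment
        c.merge(a);   // A's state already contains B's increment, but it is not double-counted

        System.out.println(c.get());   // prints 3, not 4, because merge takes per-node maxima
    }
}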

Cassandra uses a similar approach for its counters. More complex eventually consistent data structures can also be designed based on state-based or operation-based replication theory. For example, a whole family of such structures has been described, including:

  • Counter (addition/subtraction)
  • Set (add and remove operations)
  • Graph (adding edges or vertices, removing edges or vertices)
  • List (insert at a position or remove from a position)

Eventually consistent data types are usually limited in functionality, and they also impose extra performance overhead.
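
As an illustration of the "set with add and remove operations" entry above, here is a sketch of a two-phase set in the same state-based style as the counter. It is a generic textbook construction, not the implementation of any particular database.

import java.util.HashSet;
import java.util.Set;

// Removals are recorded in a separate tombstone set, and merging two replicas
// is a simple union of both sets, which is commutative, associative, and idempotent.
class TwoPhaseSet<T> {
    private final Set<T> added = new HashSet<>();
    private final Set<T> removed = new HashSet<>();   // tombstones

    void add(T element) {
        added.add(element);
    }

    // An element can only be removed after it has been added, and it can never
    // be re-added afterwards: one of the semantic limitations mentioned above.
    void remove(T element) {
        if (added.contains(element)) {
            removed.add(element);
        }
    }

    boolean contains(T element) {
        return added.contains(element) && !removed.contains(element);
    }

    void merge(TwoPhaseSet<T> other) {
        added.addAll(other.added);
        removed.addAll(other.removed);
    }
}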

Data placement

This part focuses on algorithms that control data placement in distributed databases. These algorithms map data items to appropriate physical nodes, migrate data between nodes, and globally allocate resources such as memory.

Data rebalancing

We start with a simple protocol that provides seamless data migration between cluster nodes. This is needed in scenarios such as cluster resizing (adding new nodes), failover (some nodes going down), or rebalancing (data becoming unevenly distributed across nodes). Consider scenario (A) in the figure below: there are three nodes, and the data is distributed randomly among them (assuming the data is of key-value type).


If the database does not support internal data rebalancing, you have to deploy several database instances on each node, as shown in figure (B) above. This allows manual cluster expansion: stop one of the instances, transfer it to the new node, and start it there, as shown in (C). Although a database could track every record individually, for the sake of efficiency many systems, including MongoDB, Oracle Coherence, and the still-developing Redis Cluster, use automatic rebalancing based on sharding instead: the data is partitioned into shards, and each shard is treated as the smallest unit of migration. The number of shards is clearly larger than the number of nodes, so shards can be distributed evenly across the nodes. Seamless migration can then be achieved with a simple protocol that redirects clients between the node a shard is moving away from and the node it is moving to while the shard is being migrated. The figure below describes a state machine implementing the get(key) logic of a Redis Cluster.


Assume that each node knows the cluster topology and can map any key to its shard and any shard to a node. If a node determines that the requested key belongs to a local shard, it looks the key up locally (the upper box in the figure). If the node determines that the key belongs to another node X, it sends the client a permanent redirect command (the lower box). A permanent redirect means the client can cache the mapping between shards and nodes. If a shard migration is in progress, the source node and the destination node mark the shard accordingly, lock its data piece by piece, and start moving it. The source node still looks the key up locally first and, if it is not found because it has already been migrated, redirects the client to the destination node; a redirect of this kind is one-time and must not be cached. The destination node handles such redirected requests locally, but regular queries keep receiving permanent redirects until the migration is complete.
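
The sketch below encodes the state machine just described. It is loosely inspired by Redis Cluster's MOVED/ASK redirects but heavily simplified; all class and field names are assumptions.

import java.util.Map;

class ShardRouter {
    enum ShardState { SERVING, MIGRATING_OUT }

    interface Reply {}
    record Value(String value) implements Reply {}
    record MovedPermanently(String node) implements Reply {}  // cacheable by the client
    record RedirectOnce(String node) implements Reply {}      // one-time, not cacheable

    final String thisNode;
    final Map<Integer, String> shardOwner;      // shard id -> owning node
    final Map<Integer, ShardState> shardState;  // state of shards owned locally
    final Map<Integer, String> migrationTarget; // shard id -> destination node
    final Map<String, String> localData;        // keys currently stored on this node

    ShardRouter(String thisNode, Map<Integer, String> shardOwner,
                Map<Integer, ShardState> shardState,
                Map<Integer, String> migrationTarget,
                Map<String, String> localData) {
        this.thisNode = thisNode;
        this.shardOwner = shardOwner;
        this.shardState = shardState;
        this.migrationTarget = migrationTarget;
        this.localData = localData;
    }

    Reply get(String key) {
        int shard = key.hashCode() & 0x3FFF;    // e.g. 16384 hash slots
        String owner = shardOwner.get(shard);

        if (!thisNode.equals(owner)) {
            // The shard lives on another node: permanent, cacheable redirect.
            return new MovedPermanently(owner);
        }
        if (localData.containsKey(key)) {
            return new Value(localData.get(key));
        }
        if (shardState.get(shard) == ShardState.MIGRATING_OUT) {
            // Not found locally while the shard is being moved out: the key may
            // already live on the destination, so issue a one-time redirect.
            return new RedirectOnce(migrationTarget.get(shard));
        }
        return new Value(null); // the key simply does not exist
    }
}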

Data Partitioning and replication in a dynamic environment

Another issue of interest is how to map records to physical nodes. A direct approach is to use a table that records the mapping from key ranges to nodes, where each key range is owned by one node, or to use the key's hash value modulo the number of nodes as the node ID. However, the hash-modulo method does not work well when the cluster changes, because adding or removing a node reshuffles almost all of the data in the cluster, which makes replication and failure recovery hard to perform.
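
A quick way to see the reshuffling problem is to count how many keys change node when one node is added under plain hash-modulo placement. The node counts and key set below are arbitrary examples, not taken from any benchmark.

class ModuloPlacement {
    static int nodeFor(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    public static void main(String[] args) {
        int moved = 0, total = 10_000;
        for (int i = 0; i < total; i++) {
            String key = "key-" + i;
            if (nodeFor(key, 4) != nodeFor(key, 5)) {
                moved++;   // key lands on a different node after adding one node
            }
        }
        // Roughly 80% of the keys move when going from 4 to 5 nodes, whereas
        // consistent hashing would move only about 1/5 of them.
        System.out.println(moved + " of " + total + " keys changed node");
    }
}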

Many methods improve on this with respect to replication and failure recovery. The most famous is consistent hashing. There are already plenty of introductions to consistent hashing on the internet, so here I give only a basic overview, for the completeness of the article. The figure below depicts the basic principle of consistent hashing:


Consistent hashing is basically a key-to-node mapping structure: it maps keys (usually hashed) to physical nodes. The value space of the hashed keys is an ordered, fixed-length space of binary strings, so every key in this range is mapped to one of the three nodes A, B, and C in figure (A). For replication, the value space is closed into a ring and we move clockwise along the circle until all replicas are mapped to suitable nodes, as shown in figure (B). In other words, item Y is placed on node B because it falls within B's range; the first replica goes to C, the second to A, and so on.

The benefit of this structure is that adding or removing a node causes only the data in the adjacent regions to be rebalanced. As shown in figure (C), adding node D affects only data item X, not Y. Likewise, removing node B (or B failing) affects only the replicas of Y and X, not X itself. The downside is that the rebalancing burden falls entirely on the neighboring nodes, which have to move a large amount of data. The impact can be mitigated by mapping each node to several positions on the ring instead of one, as shown in figure (D). This is a compromise: it avoids concentrating the load on a few nodes during rebalancing, while keeping the total amount of data moved suitably low compared with modulo-based mapping.
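
A compact sketch of the ring just described, including virtual nodes as in case (D), might look as follows. The hash function, the token layout, and the class names are simplifying assumptions for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Node tokens are placed on a ring; a key is served by the first node clockwise
// from its hash, and replicas continue clockwise to the next distinct nodes.
class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>(); // token -> node

    // Mapping each node to several positions (virtual nodes) spreads the
    // rebalancing load, as in case (D) above.
    void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        ring.values().removeIf(n -> n.equals(node));
    }

    // Walk clockwise from the key's position, wrapping around the ring,
    // and collect the first `replicas` distinct nodes.
    List<String> nodesFor(String key, int replicas) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        int start = hash(key);
        List<String> clockwise = new ArrayList<>(ring.tailMap(start).values());
        clockwise.addAll(ring.headMap(start).values());
        for (String node : clockwise) {
            if (!result.contains(node)) {
                result.add(node);
                if (result.size() == replicas) break;
            }
        }
        return result;
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7FFFFFFF;  // keep tokens non-negative for simplicity
    }
}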

Maintaining a complete, consistent view of the hash ring is not easy in a very large cluster. It is not a problem for a relatively small database cluster, and it is interesting to study how data placement can be combined with network routing in peer-to-peer networks. A good example is the Chord algorithm, which trades the completeness of each node's view of the ring for lookup efficiency. The Chord algorithm also uses a ring to map keys to nodes, much like consistent hashing. The difference is that each node maintains a short list (a finger table) of nodes whose logical positions on the ring are exponentially spaced (see the figure below). This makes it possible to locate a key with a binary-search-like procedure in only a few network hops.


This figure shows a cluster of 16 nodes and depicts how node A looks up a key stored on node D. Part (A) depicts the route, and part (B) depicts the partial views of the ring held by nodes A, B, and C.
