Original: http://juliashine.com/distributed-algorithms-in-nosql-databases/
Distributed algorithms for NoSQL databases
Posted on November 9, 2012, in "also for rice beam", by Juliashine
This article is a translation of "Distributed Algorithms in NoSQL Databases".
Scalability is the main driver of the NoSQL movement, which covers distributed system coordination, failover, resource management, and many other capabilities. That makes NoSQL sound like a big basket into which anything can be thrown. Although the NoSQL movement has not brought fundamental technological changes to distributed data processing, it has triggered extensive research and practice on various protocols and algorithms. Through these attempts, a number of effective approaches to building databases have gradually emerged. In this article, I try to describe the distributed features of NoSQL databases in a systematic way.
Next we look at a number of distributed activities, such as replication and failure detection. These activities are marked in boldface and grouped into three sections:
It is well known that distributed systems often suffer network partitions or high latency, in which case the isolated part becomes unavailable, so high availability cannot be maintained without sacrificing consistency. This fact is usually referred to as the CAP theorem. However, consistency is very expensive in a distributed system, so concessions on it are often made not only for availability but for many other trade-offs as well. To study these trade-offs, we note that the consistency problems of distributed systems are caused by data partitioning and replication, so we start with the characteristics of replication:
Now let's take a closer look at the common replication techniques and classify them according to the characteristics described. The first diagram depicts the logical relationships between the techniques and their trade-offs in consistency, scalability, availability, and latency. The second diagram depicts each technique in detail.
The replication factor is 4. The read/write coordinator can be either an external client or an internal proxy node.
We will go through all the techniques in order of consistency, from weakest to strongest:
Some of the trade-offs in the above analysis deserve to be emphasized again:
Let's start with the following scenario:
There are many nodes, and each data item is replicated on several of them. Each node can handle update requests on its own, and each node periodically synchronizes with the other nodes so that all replicas become consistent over time. How does the synchronization process take place? When does synchronization start? How is the peer to synchronize with chosen? How is data exchanged? We assume that two nodes always overwrite the older version of the data with the newer one, or that both versions are preserved for resolution by the application layer.
This problem is common in scenarios such as data consistency maintenance and cluster state synchronization (for example, propagation of cluster membership information). Although introducing a coordinator that monitors the database and builds a synchronization plan can solve this problem, a decentralized database can provide better fault tolerance. The main decentralized approach is to use a well-designed epidemic protocol, which is relatively simple but provides good convergence time and can tolerate the failure of any node as well as network partitions. Although there are many classes of epidemic algorithms, we focus only on anti-entropy protocols because NoSQL databases use them.
Anti-entropy protocols assume that synchronization is performed on a fixed schedule: each node periodically selects another node, at random or according to some rule, and exchanges data with it to eliminate differences. There are three styles of anti-entropy protocol: push, pull, and push-pull. The idea of the push protocol is simply to select a random peer and send it the data state. Pushing all the data out is obviously wasteful in real applications, so nodes generally work as follows.
Node A, the synchronization initiator, prepares a digest that contains fingerprints of the data on A. Node B receives the digest, compares it with its local data, and returns to A a digest of the differences. Finally, A sends the updates to B, and B applies them. The pull and push-pull protocols work similarly.
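The exchange between A and B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each store maps a key to a hypothetical `(version, value)` pair, the fingerprint is a SHA-256 hash, and the newer version always wins. None of these details come from a particular database.

```python
import hashlib

def fingerprint(value):
    """Short, comparable stand-in for the full value."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()

def make_digest(store):
    """Run on A: summarize the store as key -> (version, fingerprint)."""
    return {key: (version, fingerprint(value))
            for key, (version, value) in store.items()}

def keys_needed(local_store, remote_digest):
    """Run on B: keys that are missing locally or older than on the peer."""
    needed = []
    for key, (remote_version, remote_fp) in remote_digest.items():
        if key not in local_store:
            needed.append(key)
            continue
        local_version, local_value = local_store[key]
        if fingerprint(local_value) != remote_fp and remote_version > local_version:
            needed.append(key)
    return sorted(needed)

def push_sync(a_store, b_store):
    """One push round: A sends a digest, B replies with the keys it needs,
    A sends those entries, and B overwrites its older versions."""
    needed = keys_needed(b_store, make_digest(a_store))
    for key in needed:
        b_store[key] = a_store[key]
    return needed

node_a = {"x": (2, "new"), "y": (1, "same"), "z": (1, "only on A")}
node_b = {"x": (1, "old"), "y": (1, "same")}
transferred = push_sync(node_a, node_b)
```

Only the entries for `x` and `z` cross the network here; `y` is recognized as identical from its fingerprint alone.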
Anti-entropy protocols provide good convergence time and scalability. The figure shows simulation results for propagating an update in a 100-node cluster. In each iteration, each node contacts only one randomly selected peer.
It can be seen that pull converges better than push, which can be proved theoretically. Push also suffers from a "convergence tail": after many iterations, although almost all nodes have been reached, a small fraction remain unaffected. The push-pull approach is more efficient than pure push or pure pull, so it is the one commonly used in practice. Anti-entropy is scalable because the average convergence time grows as a logarithmic function of the cluster size.
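A simulation of this kind is easy to reproduce. The sketch below is my own simplified model, not the simulation behind the figure: one node starts with the update, every node contacts one random peer per round, and all contacts in a round see the state from the start of that round. The function name and the fixed seed are assumptions made for the example.

```python
import random

def rounds_to_converge(n, mode, rng, max_rounds=1000):
    """Count rounds until every node has the update, or None if the cap is hit."""
    informed = [False] * n
    informed[0] = True
    for round_no in range(1, max_rounds + 1):
        snapshot = list(informed)  # state at the start of the round
        for node in range(n):
            peer = rng.randrange(n - 1)
            if peer >= node:
                peer += 1  # choose a peer other than the node itself
            if mode in ("push", "push-pull") and snapshot[node]:
                informed[peer] = True   # node pushes the update to its peer
            if mode in ("pull", "push-pull") and snapshot[peer]:
                informed[node] = True   # node pulls the update from its peer
        if all(informed):
            return round_no
    return None

rng = random.Random(42)
results = {mode: rounds_to_converge(100, mode, rng)
           for mode in ("push", "pull", "push-pull")}
print(results)
```

Running it for several cluster sizes and seeds is a quick way to see the logarithmic growth for yourself.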
Although these techniques look simple, many studies focus on the performance of anti-entropy protocols under different constraints. One of them replaces random peer selection with a more efficient scheme that exploits the network topology. Others adjust the transmission rate when network bandwidth is limited, or use advanced rules to select the data to synchronize. Digest computation is also challenging, so a database may maintain a log of recent updates to help compute digests.
Eventually consistent data types
In the previous section we assumed that two nodes can always merge their versions of the data. But resolving update conflicts is not easy, and it is surprisingly difficult to make all replicas converge to a semantically correct value. A well-known example is that deleted items can resurface in the Amazon Dynamo database.
Let's use an example to illustrate the problem: the database maintains a logically global counter, and each node can increment or decrement it. Although each node can maintain its own value locally, these local counts cannot be merged by simple addition and subtraction. Suppose there are three nodes A, B, and C, and each node performs one increment. If A obtains the value from B and adds it to its local copy, then C obtains the value from B, and then C obtains the value from A, C ends up with the value 4, which is wrong (the correct total is 3). The solution is to maintain a pair of counters for each node, using a data structure similar to a vector clock.
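One way to realize the per-node pair of counters is sketched below. The class name `PNCounter` and its methods are chosen for the example; the key assumption is that each node only ever changes its own entries, so two states can be merged by taking the per-node maximum.

```python
class PNCounter:
    """Counter that keeps, per node, the totals of increments and decrements."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.increments = {}  # node id -> total increments made by that node
        self.decrements = {}  # node id -> total decrements made by that node

    def increment(self, amount=1):
        self.increments[self.node_id] = self.increments.get(self.node_id, 0) + amount

    def decrement(self, amount=1):
        self.decrements[self.node_id] = self.decrements.get(self.node_id, 0) + amount

    def merge(self, other):
        """Take the per-node maximum, so merging the same state twice is harmless."""
        for mine, theirs in ((self.increments, other.increments),
                             (self.decrements, other.decrements)):
            for node_id, count in theirs.items():
                mine[node_id] = max(mine.get(node_id, 0), count)

    def value(self):
        return sum(self.increments.values()) - sum(self.decrements.values())

# Replay the scenario from the text: one increment per node, then three merges.
a, b, c = PNCounter("A"), PNCounter("B"), PNCounter("C")
for node in (a, b, c):
    node.increment()
a.merge(b)   # A obtains the value from B
c.merge(b)   # C obtains the value from B
c.merge(a)   # C obtains the value from A
```

With this structure, C sees B's increment twice but counts it once, so `c.value()` is 3 rather than 4.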
Cassandra implements counters in a similar way. State-based or operation-based replication theory can also be used to design more complex eventually consistent data structures. For example, one reference lists a series of such data structures, including:
The functionality of eventually consistent data types is usually limited, and they impose additional performance overhead.
Data placement
This section focuses on algorithms that control the placement of data in distributed databases. These algorithms are responsible for mapping data items to appropriate physical nodes, migrating data between nodes, and global provisioning of resources such as memory.
Data rebalancing
We again start with a simple protocol that provides seamless data migration between cluster nodes. Migration is often needed in scenarios such as cluster expansion (a new node joins), failover (a node goes down), or rebalancing (data is unevenly distributed across nodes). Consider the scenario depicted in figure A: there are three nodes, and the data is randomly distributed across them (assuming key-value data).
If the database does not support internal data rebalancing, database instances can be deployed on each node, as shown in figure B. Extending the cluster is then a manual operation: stop the database instance to be migrated, transfer it to the new node, and start it there, as shown in figure C. Although a database could track every single record, many systems, including MongoDB, Oracle Coherence, and the Redis Cluster under development, use automatic rebalancing instead. That is, for efficiency, the data is split into shards and each shard is the smallest unit of migration. Obviously, the number of shards should be larger than the number of nodes so that shards can be distributed evenly among them. Seamless migration can be achieved with a simple protocol that redirects clients between the exporting node and the importing node while a shard is being migrated. The figure describes a state machine that implements the get(key) logic in Redis Cluster.
Assume that each node knows the cluster topology and can map any key to its shard and any shard to a node. If a node determines that the requested key belongs to a local shard, it looks the key up locally (the upper box in the figure). If the node determines that the requested key belongs to another node X, it sends a permanent redirect to the client (the lower box). A permanent redirect means that the client may cache the shard-to-node mapping. If a shard migration is in progress, the exporting node and the importing node mark the shard accordingly and start moving its data, locking each piece of data as it is moved. The exporting node first looks the key up locally and, if it is not found, assumes the key has already been migrated and redirects the client to the importing node. This redirect is one-time and must not be cached. The importing node handles redirected requests locally, but regular queries are permanently redirected until the migration is complete.
Data sharding and replication in a dynamic environment
Another issue is how to map records to physical nodes. A straightforward approach is to use a table that records the mapping between key ranges and nodes, with each key range assigned to one node, or to use the hash of the key modulo the number of nodes as the node ID. However, hash-modulo mapping does not work well when the cluster changes, because adding or removing a node causes most of the data in the cluster to be reshuffled. This makes replication and failure recovery difficult.
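The cost of hash-modulo mapping is easy to measure. The sketch below assumes SHA-256 as the hash and 10,000 synthetic keys (both arbitrary choices for the example) and counts how many keys change nodes when a cluster grows from 3 to 4 nodes. With a uniform hash a key stays put only when its hash gives the same remainder modulo 3 and modulo 4, which happens for about one key in four.

```python
import hashlib

def node_for(key, node_count):
    """Hash-modulo placement: node index = hash(key) mod number of nodes."""
    h = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    return h % node_count

keys = [f"key-{i}" for i in range(10000)]
moved = sum(1 for k in keys if node_for(k, 3) != node_for(k, 4))
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.1%} of keys change nodes when going from 3 to 4 nodes")
```

So roughly three quarters of the data has to move, even though the new node should ideally receive only a quarter of it.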
There are many ways to improve on this from the standpoint of replication and failure recovery. The best known is consistent hashing. There are already many introductions to consistent hashing on the web, so I provide only a basic description here for the sake of completeness. The figure illustrates the basic principles of consistent hashing:
Consistent hashing is fundamentally a mapping structure for keys: it maps a key (usually its hash) to a physical node. The space of hashed key values is an ordered space of fixed-length binary strings, so each range of keys in this space is assigned to one of the three nodes A, B, and C in figure A. For replication, the value space is closed into a ring, and the ring is traversed clockwise until all replicas are mapped to nodes, as shown in figure B. In other words, Y is placed on node B because it falls within B's range, its first replica is placed on C, its second replica on A, and so on.
The benefit of this structure shows when a node is added or removed, because only the neighboring regions have to be rebalanced. As shown in figure C, adding node D affects only data item X and has no effect on Y. Similarly, removing node B (or a failure of B) affects only the replicas of Y and X, not X itself. However, as mentioned in the reference, the weakness of this approach is that the whole burden of rebalancing falls on neighboring nodes, which have to move large amounts of data. This effect can be mitigated by mapping each node to multiple ranges instead of one, as shown in figure D. This is a trade-off: it avoids concentrating the load during rebalancing, while keeping the total amount of data moved reasonably low compared with modulo-based mapping.
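A compact way to experiment with these properties is a ring kept as a sorted list of points. The sketch below is a minimal illustration, not the implementation of any particular database: the class `HashRing`, the choice of SHA-256, and the default of 64 points per node are all assumptions made for the example.

```python
import bisect
import hashlib

def ring_position(label):
    """Place a label on the ring using the first 8 bytes of its SHA-256 hash."""
    return int.from_bytes(hashlib.sha256(label.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes   # number of ranges (points) owned by each node
        self.points = []       # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.points, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.points = [p for p in self.points if p[1] != node]

    def nodes_for(self, key, replicas=1):
        """Walk clockwise from the key's position and collect distinct nodes."""
        start = bisect.bisect_left(self.points, (ring_position(key), ""))
        owners = []
        for offset in range(len(self.points)):
            node = self.points[(start + offset) % len(self.points)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == replicas:
                    break
        return owners

ring = HashRing(["A", "B", "C"])
keys = [f"key-{i}" for i in range(2000)]
before = {k: ring.nodes_for(k)[0] for k in keys}
ring.add_node("D")
after = {k: ring.nodes_for(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` ends up on the new node D, and no key moves between the old nodes. Because D owns many small ranges, it takes its share of data from all three existing nodes rather than from a single neighbor.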
It is not easy to maintain a complete and coherent view of the hash ring in a very large cluster. Although this is not a problem for relatively small database clusters, it is interesting to study how data placement can be combined with network routing in peer-to-peer networks. A good example is the Chord algorithm, which trades completeness of each node's view of the ring for lookup efficiency. Chord also uses the concept of a ring that maps keys to nodes, similar to consistent hashing. The difference is that each node maintains only a short list of peers whose offsets on the logical ring grow exponentially. This makes it possible to locate a key with a binary-search-like procedure in only a few network hops.
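The routing idea can be tried out with a small in-memory model. The sketch below makes several simplifying assumptions that are mine rather than the article's: an 8-bit identifier space, 16 randomly placed nodes, finger tables computed from global knowledge instead of being maintained by a join protocol, and no node failures.

```python
import bisect
import random

M = 8            # identifier bits; the ring has 2**M positions
RING = 2 ** M

def in_open(x, a, b):
    """True if x lies strictly between a and b going clockwise."""
    if a < b:
        return a < x < b
    return x > a or x < b

def in_half_open(x, a, b):
    """True if x lies in the clockwise interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

class ChordRing:
    def __init__(self, node_ids):
        self.nodes = sorted(node_ids)
        # Finger i of node n is the first node at or after n + 2**i.
        self.fingers = {
            n: [self.successor((n + 2 ** i) % RING) for i in range(M)]
            for n in self.nodes
        }

    def successor(self, ident):
        """First node at or after ident, wrapping around the ring."""
        i = bisect.bisect_left(self.nodes, ident)
        return self.nodes[i % len(self.nodes)]

    def lookup(self, start, key):
        """Route from `start` to the node responsible for `key`, counting hops."""
        node, hops = start, 0
        while not in_half_open(key, node, self.fingers[node][0]):
            # Jump to the farthest finger that still precedes the key.
            nxt = next((f for f in reversed(self.fingers[node])
                        if in_open(f, node, key)), node)
            if nxt == node:
                break
            node, hops = nxt, hops + 1
        return self.fingers[node][0], hops

rng = random.Random(7)
chord = ChordRing(rng.sample(range(RING), 16))
owner, hops = chord.lookup(chord.nodes[0], 200)
```

Each hop at least halves the remaining distance to the key's predecessor, which is why a lookup in this model never needs more than `M` hops.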
The figure shows a cluster of 16 nodes and how node A looks up a key that is stored on node D. Part (A) depicts the route, and part (B) depicts the partial views of the ring held by nodes A, B, and C. More information about data replication in decentralized systems is available in the reference.
Data sharding by multiple attributes
Consistent hashing is an effective data placement strategy when data is accessed only by primary key, but things become much more complex when queries involve multiple attributes. A simple approach (used by MongoDB) is to distribute data by primary key regardless of the other attributes. As a result, queries on the primary key can be routed to the appropriate node, but all other queries have to be processed by every node in the cluster. This imbalance in query efficiency leads to the following problem:
There is a dataset in which each item has several attributes and corresponding values. Is there a data distribution strategy such that a query constraining an arbitrary number of attributes is delivered to as few nodes as possible?
The HyperDex database provides a solution. The basic idea is to treat each attribute as an axis in a multidimensional space and to map regions of that space to physical nodes. A query corresponds to a hyperplane that intersects a set of contiguous regions of the space, so only those regions are relevant to the query. Let's look at an example from the reference:
Each data item is a user record with three attributes: first name, last name, and phone number. These attributes are treated as a three-dimensional space, and one feasible data distribution strategy is to map each octant to a physical node. A query such as "First Name = John" corresponds to a plane that passes through 4 octants, so only 4 nodes are involved in processing it. A query that constrains two attributes corresponds to a line that passes through two octants, so only 2 nodes are involved.
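The counting in this example can be checked with a toy model. The sketch below is not HyperDex's actual partitioning scheme: it assumes that each attribute value is hashed to a single bit, so that the three bits of a record select one of eight regions, each standing for one node. The attribute and function names are made up for the example.

```python
import hashlib
from itertools import product

ATTRIBUTES = ("first_name", "last_name", "phone")

def axis_bit(value):
    """Hash an attribute value onto one of the two halves of its axis."""
    return hashlib.sha256(value.encode("utf-8")).digest()[0] & 1

def region_of(record):
    """Region (one bit per axis) that stores a complete record."""
    return tuple(axis_bit(record[a]) for a in ATTRIBUTES)

def regions_for_query(constraints):
    """Regions that may hold records matching the given attribute constraints."""
    choices = [
        (axis_bit(constraints[a]),) if a in constraints else (0, 1)
        for a in ATTRIBUTES
    ]
    return list(product(*choices))

record = {"first_name": "John", "last_name": "Smith", "phone": "555-0100"}
one_attr = regions_for_query({"first_name": "John"})
two_attrs = regions_for_query({"first_name": "John", "last_name": "Smith"})
```

A query on one attribute touches 4 of the 8 regions, a query on two attributes touches 2, and a fully specified record maps to exactly one region.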
The problem with this approach is that the number of regions grows exponentially with the number of attributes. As a result, a query that constrains only a few attributes is projected onto many regions and therefore many servers. This problem can be mitigated to some extent by splitting a data item with many attributes into several sub-items with fewer attributes and mapping each sub-item to a separate subspace, instead of mapping the whole item to one multidimensional space:
This provides a better query-to-node mapping, but it complicates cluster coordination, because a single data item is now scattered across several independent subspaces, each with its own set of physical nodes, and updates must be handled transactionally. The reference has more introduction and implementation details for this technique.
Passivated replicas
Some applications have very strong random read requirements, which requires keeping all the data in memory. In this case, sharding the data and keeping a master and a slave replica of each shard typically requires at least twice as much memory, since each data item has one copy on the master node and one on the slave node. To be able to replace the master when it fails, the slave must have as much memory as the master. If the system can tolerate a temporary outage or performance degradation when a node fails, memory can be saved by keeping the replicas passivated on disk.
The following figure depicts 16 shards on 4 nodes; each shard has one copy in memory and one copy on disk:
The gray arrows highlight the replicas of the shards on node 2. Shards on the other nodes are replicated as well. The red arrows depict how the replicas are loaded into memory if node 2 fails. Because the replicas are evenly distributed within the cluster, only a small amount of reserved memory is needed to hold the replicas that are activated when a node fails. In the figure above, the cluster reserves only 1/3 of its memory to withstand the failure of a single node (the 4 shards of the failed node are spread over the 3 surviving nodes, each of which already holds 4 shards in memory). Note that activating a replica (loading it from disk into memory) takes some time, which causes a short performance degradation or a temporary outage for the part of the data that is being recovered.
System coordination
In this section we discuss two techniques related to system coordination. Distributed coordination is a very broad area that has been studied in depth for decades. This article covers only two techniques that have been put into practice. Distributed locking, consensus protocols, and other fundamental techniques are covered in many books and web resources; see references [17, 18, 21].
Failure detection
Failure detection is a fundamental function of any fault-tolerant distributed system. Practically all failure detection protocols are based on a heartbeat mechanism. The principle is simple: monitored components periodically send heartbeat messages to the monitoring process (or the monitoring process polls the monitored components), and a component that sends no heartbeat for a certain period is considered failed. Beyond this, real distributed systems have further functional requirements:
The so-called accrual failure detector can solve the first two problems; Cassandra uses a modified version of it. The basic workflow is as follows:
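As a rough illustration of the accrual idea, the sketch below keeps a sliding window of heartbeat intervals and turns the time since the last heartbeat into a suspicion level instead of a yes/no answer. It assumes, for simplicity, that heartbeat intervals follow an exponential distribution; the class and method names are hypothetical and this is not Cassandra's code.

```python
import math
from collections import deque

class AccrualFailureDetector:
    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat intervals
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record the arrival time of a heartbeat."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10 of the probability that a heartbeat
        is still to come after this much silence."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Assumed model: P(interval > elapsed) = exp(-elapsed / mean)
        return elapsed / (mean * math.log(10))

detector = AccrualFailureDetector()
for t in range(11):            # heartbeats arrive at t = 0, 1, ..., 10
    detector.heartbeat(float(t))
phi_soon = detector.phi(10.5)  # half an interval of silence
phi_late = detector.phi(30.0)  # twenty intervals of silence
```

The application then picks a threshold on `phi` that matches how aggressive it wants to be; a value of 8, for example, is not crossed by `phi_soon` but is crossed by `phi_late`.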
The scalability requirement can be met by organizing monitoring zones hierarchically and synchronizing the zones through a gossip protocol or a central fault-tolerant repository, which also prevents heartbeat messages from flooding the network. This is illustrated in the figure (six failure detectors form two zones and communicate with each other through a gossip protocol or a robust repository such as ZooKeeper):
Coordinator election
Coordinator election is an important technique for databases with strong consistency. First, it makes it possible to organize recovery from master failure in master-slave systems. Second, in the case of a network partition, it allows the minority part of the nodes to be disconnected in order to avoid write conflicts.
The Bully algorithm is a relatively simple coordinator election algorithm. MongoDB uses this algorithm to elect the primary in a replica set. The main idea of the Bully algorithm is that each member of the cluster can declare itself the coordinator and announce this to the other nodes. The other nodes can either accept this claim or reject it and enter the competition themselves. A node that is accepted by all other nodes becomes the coordinator. Nodes use some attribute to decide who should win. This attribute can be a static ID or an up-to-date metric such as the last transaction ID (the most up-to-date node wins).
The example shows the execution of the Bully algorithm, using a static ID as the metric so that the node with the larger ID wins:
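The election can also be sketched as a small Python function. This is a sequential simplification under stated assumptions: messages are modelled as set lookups rather than network calls, the function name and arguments are hypothetical, and a majority rule is included on the assumption that an election should only succeed when more than half of the cluster takes part.

```python
def bully_election(initiator, all_ids, alive):
    """Return the ID of the elected coordinator, or None without a majority."""
    if initiator not in alive:
        raise ValueError("the initiator must be alive")
    if 2 * len(alive) <= len(all_ids):
        return None  # not more than half of the cluster can take part
    candidate = initiator
    while True:
        # The candidate announces itself to every node with a larger ID.
        higher_alive = [n for n in all_ids if n > candidate and n in alive]
        if not higher_alive:
            return candidate           # nobody objected: the candidate wins
        candidate = min(higher_alive)  # an objecting node takes over the election

cluster = [1, 2, 3, 4, 5, 6]
winner = bully_election(2, cluster, alive={1, 2, 3, 4, 5})  # node 6 is down
no_quorum = bully_election(2, cluster, alive={1, 2, 3})     # only half is reachable
```

In the first call node 5 wins, because it is the largest reachable ID; in the second call no coordinator is elected.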
The coordinator election process counts the number of participating nodes and makes sure that more than half of the nodes in the cluster take part in the election. This guarantees that in the case of a network partition only one part of the cluster can elect a coordinator. Suppose the network is split into several parts that cannot reach each other: a coordinator can be elected only in the part that contains more than half of the nodes of the original cluster. If the cluster is split into parts none of which contains more than half of the nodes, no coordinator can be elected and, of course, the cluster cannot be expected to continue serving requests.
Resources