Original: http://juliashine.com/distributed-algorithms-in-nosql-databases/
Distributed Algorithms in NoSQL Databases
Posted on November 9, 2012, by Juliashine
This article is a translation of "Distributed Algorithms in NoSQL Databases".
Scalability is the main driver behind the NoSQL movement, and it covers distributed system coordination, failover, resource management, and many other capabilities. Put that way, NoSQL sounds like a big basket into which anything can be thrown. Although the NoSQL movement has not brought fundamentally new techniques to distributed data processing, it has triggered extensive research and practical experimentation with various protocols and algorithms. Out of these attempts, a set of effective building blocks for databases has gradually emerged. In this article, I try to give a more or less systematic description of the distributed features of NoSQL databases.
In the rest of the article we look at a number of distributed activities, such as replication and failure detection, grouped into three sections:
- Data consistency. NoSQL systems have to trade off consistency, fault tolerance, and performance in order to provide low latency and high availability. These tradeoffs mostly revolve around data consistency, so this section covers data replication and data repair.
- Data placement. A database should adapt to different data distributions, cluster topologies, and hardware configurations. In this section we discuss how to distribute and rebalance data so that failures are handled quickly, persistence guarantees are maintained, queries are efficient, and resources such as memory and disk space are used evenly across the cluster.
- System coordination. Coordination techniques such as leader election are used in many databases to implement fault tolerance and strong data consistency. However, even decentralized databases (those without a central node) have to track their global state and detect failures and topology changes. This section describes several techniques for keeping the system in a consistent state.
Data consistency
It is well known that distributed systems often experience network partitions or delays. During a partition the isolated parts of the system cannot reach each other, so it is not possible to maintain high availability without sacrificing consistency. This fact is usually referred to as the CAP theorem. However, consistency is a very expensive thing in a distributed system, so it is traded off not only against availability but against several other properties as well. To study these tradeoffs, we note that consistency problems in distributed systems are caused by data separation and replication, so we start with the desired properties of replication:
- Availability. In the case of a network partition, the isolated parts of the database can still serve read and write requests.
- Read/write latency. Read and write requests are processed with minimal delay.
- Read/write scalability. The read and write load can be spread evenly across multiple nodes.
- Fault tolerance. The ability to serve read and write requests does not depend on any particular node.
- Data persistence. Node failures within certain limits do not cause data loss.
- Consistency. Consistency is much more complex than the previous properties, and we need to discuss several different points of view in detail. We will not go deep into consistency theory and concurrency models, because that is beyond the scope of this article; I use only a lean framework built from a few simple properties.
  - Read-write consistency. From the read-write standpoint, the basic goal of a database is to keep the replica convergence time (the time it takes for an update to reach all replicas) as short as possible and to guarantee eventual consistency. Beyond this weak guarantee, there are some stronger consistency properties:
    - Read-after-write consistency. The effect of a write to data item X is always visible to subsequent reads of X.
    - Read-after-read consistency. After a read of data item X, subsequent reads of X return the same value or a newer one.
  - Write-write consistency. Write conflicts often occur when a database is partitioned. The database should either be able to handle such conflicts or guarantee that concurrent writes are not processed by different partitions. In this respect, a database can offer several different consistency models:
    - Atomic writes. If the database API only allows a write to be a single, independent atomic assignment, write conflicts can be avoided by picking the "latest version" of each data item. All nodes then end up with the same version regardless of the order in which updates arrive, which matters because network failures and delays often cause nodes to receive updates in different orders. The data version can be represented by a timestamp or a user-specified value. This is the approach used by Cassandra.
    - Atomic read-modify-write. Applications sometimes need a read-modify-write sequence rather than an independent atomic write. If two clients read the same version of the data, modify it, and write it back, then under the atomic write model the later update silently overwrites the earlier one. This behavior is incorrect in some cases (for example, when two clients each add a new value to the same list). A database can offer at least two solutions:
      - Conflict prevention. Read-modify-write can be viewed as a special case of a transaction, so distributed locking or a consensus protocol such as Paxos [20, 21] can solve the problem. This is a general technique that supports both atomic read-modify-write semantics and transactions with arbitrary isolation levels. Another approach is to avoid distributed concurrent writes altogether and route all writes to a particular data item to a single node (either a global master or a partition master). To prevent conflicts, the database must sacrifice availability in the case of a network partition. This approach is used in many systems that provide strong consistency guarantees (for example, most relational databases, HBase, and MongoDB).
      - Conflict detection. The database tracks conflicting concurrent updates and either rolls one of them back or keeps both versions for the client to resolve. Concurrent updates are typically tracked with vector clocks [19] (a form of optimistic locking) or by keeping a full version history. This approach is used by Riak, Voldemort, and CouchDB.
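To make conflict detection with vector clocks more concrete, here is a minimal sketch in Python. It is an illustration only, not the implementation used by Riak, Voldemort, or CouchDB; the names `compare` and `merge` and the dictionary representation of a clock are assumptions made for this example.

```python
from typing import Dict

# A vector clock maps a node id to the number of updates seen from that node.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b already includes everything in a
    if b_le_a:
        return "after"       # a already includes everything in b
    return "concurrent"      # neither dominates: a write conflict


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Clock of a version that has been reconciled from both a and b."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


if __name__ == "__main__":
    left = {"n1": 2}              # client 1 updated through node n1
    right = {"n1": 1, "n2": 1}    # client 2 updated the same base through n2
    print(compare(left, right))   # the two versions conflict
```

Two versions are in conflict exactly when `compare` reports `concurrent`. The database can then roll one of them back or hand both to the client, which writes the reconciled value with the merged clock.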
Now let's take a closer look at the common replication techniques and classify them according to the properties described above. The first figure depicts the logical relationships between the techniques and the tradeoffs among them in terms of consistency, scalability, availability, and latency. The second figure depicts each technique in detail.
The replication factor is 4. The read/write coordinator can be either an external client or an internal proxy node.
We will go through all the techniques in order of consistency, from weak to strong:
- (A, anti-entropy) The weakest consistency guarantees come from the following strategy. A write updates any one replica. A read goes to any replica and returns old data until the new version has been propagated to that replica by the background anti-entropy protocol. (Anti-entropy protocols are described in detail in the next section.) The main properties of this approach are:
  - The high propagation delay makes it impractical as the main means of data synchronization, so it is typically used only as an auxiliary process that detects and repairs unplanned inconsistencies. Cassandra, however, uses an anti-entropy algorithm to propagate the database topology and other metadata between nodes.
  - The consistency guarantees are weak: write conflicts and read-write inconsistencies occur even in the absence of failures.
  - Availability and robustness under network partitions are high. Asynchronous batch processing replaces one-by-one updates, which gives excellent performance.
  - The persistence guarantees are weak, because new data initially exists in only a single copy.
- (B) An improvement on the previous scheme is to send the update asynchronously to all available replicas as soon as any replica receives the update request. This can be regarded as targeted anti-entropy.
  - Compared with pure anti-entropy, this approach greatly improves consistency at a small cost in performance. However, the formal consistency and persistence guarantees remain unchanged.
  - If some replicas are unavailable at the time because of a network or node failure, the update eventually reaches them through the anti-entropy process.
- (C) In the previous scheme, node failures can be handled better with the hinted handoff technique [8]. Updates intended for a failed node are recorded on another node that acts as a proxy, together with a hint that they should be delivered to the target node as soon as it becomes available. This improves consistency and shortens replica convergence time.
- (D, Read One Write One) Because the node that holds the hinted handoff may itself fail before it delivers the update, consistency has to be enforced by so-called read repair. Each read triggers an asynchronous process that requests a digest of the data (such as a signature or hash) from all replicas that store it and reconciles the versions if the digests differ. We use the name Read One Write One for the combination of techniques A, B, C, and D. None of them provides strict consistency guarantees, but together they form a self-contained approach that can be used in practice.
- (E, Read Quorum Write Quorum) The strategies above are heuristic enhancements that reduce replica convergence time. To provide stronger guarantees, one has to sacrifice availability and guarantee an overlap between the read set and the write set. The common practice is to write W replicas synchronously instead of one and to read R replicas.
  - First, persistence can be controlled by configuring the number of written replicas W>1.
  - Second, when R+W>N the set of written replicas is bound to overlap with the set of read replicas, so at least one of the copies read is up to date (W=2, R=3, N=4 in the figure above). This guarantees consistency when read and write requests are issued sequentially (for example, read-after-write consistency for a single client), but it does not guarantee global read-after-read consistency. In the example shown in the following figure, R=2, W=2, N=3. Because updating the two replicas is not transactional, a reader may see two old values, or one old and one new value, until the write completes:
  - Different values of R and W trade write latency and persistence against read latency, and vice versa.
  - If W<=N/2, concurrent writers may write to disjoint sets of nodes (for example, writer A writes to the first N/2 nodes and writer B to the last N/2). Setting W>N/2 guarantees that conflicts are detected immediately in the atomic read-modify-write model with rollback.
  - Strictly speaking, this scheme tolerates failures of individual nodes but does not tolerate network partitions well. In practice, heuristics such as the "sloppy quorum" are often used to improve availability in some scenarios at the cost of consistency.
- (F, Read All Write Quorum) The read-after-read consistency problem can be mitigated by contacting all replicas on every read (either reading the data or checking digests). This guarantees that new data becomes visible to readers as soon as it has reached at least one node. In the case of a network partition, however, this guarantee does not hold.
- (G, master-slave) The techniques above are typically used to provide the atomic write or the read-modify-write with conflict detection consistency levels. To reach the conflict prevention level, some form of centralization or locking must be used. The simplest strategy is asynchronous master-slave replication. All writes to a particular data item are routed to a central node and executed there sequentially. This makes the master a bottleneck, so the data has to be partitioned into independent shards (each with its own master) to provide scalability.
- (H, Transactional Read Quorum Write Quorum and Read One Write All) Approaches that update multiple replicas can avoid write conflicts by using transactional techniques. A well-known approach is the two-phase commit protocol. However, two-phase commit is not completely reliable, because a coordinator failure can leave resources blocked. The Paxos commit protocol [20, 21] is a more reliable alternative, at some cost in performance. One small step further is the Read One Write All approach, which updates all replicas in a single transaction. It provides strong consistency and fault tolerance but sacrifices some performance and availability.
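The quorum conditions used in technique E (R+W>N for read-write overlap, W>N/2 for write-write overlap) can be checked by brute force on small replica sets. The following Python sketch is only an illustration; the function names are hypothetical, and it models replica sets abstractly rather than any particular database.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write set of size w shares a replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)          # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def writes_always_overlap(n: int, w: int) -> bool:
    """True if any two write sets of size w share a replica (conflicts are detectable)."""
    return quorums_always_overlap(n, w, w)


if __name__ == "__main__":
    print(quorums_always_overlap(4, w=2, r=3))  # R+W>N: a read always touches a written replica
    print(quorums_always_overlap(4, w=2, r=2))  # R+W=N: a read can miss the write entirely
    print(writes_always_overlap(4, w=2))        # W<=N/2: two writers can use disjoint nodes
    print(writes_always_overlap(4, w=3))        # W>N/2: any two writers share a node
```

Note that the overlap only covers completed writes. While a non-transactional write is still in progress, readers can see old values regardless of R and W, as described in the R=2, W=2, N=3 example.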
Some of the tradeoffs in the analysis above deserve to be emphasized:
- Consistency and availability. This strict tradeoff is formalized by the CAP theorem. In the case of a network partition, the database must either give up availability in all but one partition or accept the risk of conflicting updates and data loss.
- Consistency and scalability. Even read-write consistency guarantees reduce the scalability of a replica set, and write conflicts can be handled in a relatively scalable way only in the atomic write model. The atomic read-modify-write model avoids conflicts by placing temporary global locks on the data. This shows that a dependency between data items or operations, even one of very small scope or short duration, can damage scalability. It is therefore very important to design the data model carefully and to partition the data into independent shards.
- Consistency and latency. As shown above, a database that has to provide strong consistency or persistence tends toward Read All and Write All techniques. These guarantees, however, clearly come at the cost of higher request latency, so quorum techniques are a reasonable middle ground.
- Failover and consistency/scalability/latency. Interestingly, the conflict between fault tolerance and consistency, scalability, or latency is not severe. By giving up a reasonable amount of performance and consistency, a cluster can tolerate the failure of a considerable number of nodes. This tradeoff is visible in the difference between two-phase commit and the Paxos protocol. Another example is the ability to add specific consistency guarantees, such as read-your-writes, by using sticky sessions, which complicates failover [22].
Anti-entropy protocols and gossip algorithms
Let's start with the following scenario:
There are many nodes and many data items, and each data item has replicas on several of the nodes. Each node can serve update requests independently, and each node periodically synchronizes its state with other nodes, so that all replicas become consistent over time. How should this synchronization be organized? When does synchronization start? How is the peer to synchronize with chosen? How is data exchanged? We assume that two nodes can always merge their data, either by overwriting the old data with the newer version or by keeping both versions for the application layer to resolve.
This problem arises both in data consistency maintenance and in cluster state synchronization (for example, the propagation of cluster membership information). A coordinator that monitors the database and builds a global synchronization plan could solve it, but decentralized databases achieve better fault tolerance. The main decentralized approach is to use well-studied epidemic protocols [7], which are relatively simple, provide good convergence time, and tolerate the failure of any node as well as network partitions. Although there are many classes of epidemic algorithms, we focus only on anti-entropy protocols because they are the ones NoSQL databases use.
Anti-entropy protocols assume that synchronization runs on a fixed schedule: each node periodically picks another node, at random or according to some rule, and exchanges data with it to eliminate differences. There are three styles of anti-entropy protocol: push, pull, and push-pull. The idea of the push protocol is simply to pick a random node and send it the current state of the data. Pushing all of the data is obviously wasteful in real applications, so nodes generally work in the way shown in the figure.
Node A, the initiator of the synchronization, prepares a digest that contains fingerprints of the data on A. Node B receives the digest, compares it with its local data, and returns a digest of the differences to A. Finally, A sends the updates to B, and B updates its data. The pull and push-pull protocols work similarly, as shown in the figure.
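The three-step push exchange described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each store maps a key to a (version, value) pair, the newer version wins on merge, and the helper names `digest`, `diff_keys`, and `push_sync` are invented for this example. Real databases use more compact digests.

```python
import hashlib
from typing import Dict, List, Tuple

Store = Dict[str, Tuple[int, str]]  # key -> (version, value); an assumed data model


def digest(store: Store) -> Dict[str, str]:
    """Fingerprint of every entry: key -> short checksum of (version, value)."""
    return {
        key: hashlib.sha1(f"{version}:{value}".encode()).hexdigest()[:8]
        for key, (version, value) in store.items()
    }


def diff_keys(remote_digest: Dict[str, str], local: Store) -> List[str]:
    """Computed on B: keys whose fingerprint from A is missing or different locally."""
    local_digest = digest(local)
    return [k for k, d in remote_digest.items() if local_digest.get(k) != d]


def push_sync(a: Store, b: Store) -> None:
    """One push round from A to B."""
    wanted = diff_keys(digest(a), b)   # steps 1 and 2: A sends a digest, B answers with the difference
    for key in wanted:                 # step 3: A sends the entries, B applies the newer ones
        if key not in b or a[key][0] > b[key][0]:
            b[key] = a[key]


if __name__ == "__main__":
    a = {"x": (2, "new"), "y": (1, "same")}
    b = {"x": (1, "old"), "y": (1, "same")}
    push_sync(a, b)
    print(b)
```

Only the entries that differ travel over the network in step 3. In a pull round the roles are reversed, and a push-pull round performs both directions in one exchange.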
Anti-entropy protocols provide good convergence time and scalability. The figure shows simulation results for the propagation of an update in a 100-node cluster. In each iteration, each node contacts only one randomly selected peer.
The pull style converges better than the push style, which can be proved theoretically [7]. The push style also suffers from a "convergence tail": after many iterations almost all nodes have been reached, but a small percentage remain unaffected. The push-pull style is more efficient than pure push or pure pull, so it is the one commonly used in practice. Anti-entropy is scalable because the average convergence time grows as a logarithmic function of the cluster size.
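A simulation of this kind is easy to reproduce. The sketch below is not the simulation behind the figure; it is a minimal model that assumes synchronous rounds, a single update that starts on node 0, and uniformly random peer selection. The function name and parameters are chosen for this example.

```python
import random


def rounds_to_converge(n: int, mode: str, rng: random.Random, max_rounds: int = 1000) -> int:
    """Rounds until an update that starts on node 0 has reached all n nodes."""
    has_update = [False] * n
    has_update[0] = True
    for rounds in range(1, max_rounds + 1):
        snapshot = list(has_update)  # every exchange in a round sees the state at its start
        for node in range(n):
            peer = rng.randrange(n - 1)
            if peer >= node:
                peer += 1            # choose a peer other than the node itself
            if mode in ("push", "push-pull") and snapshot[node]:
                has_update[peer] = True
            if mode in ("pull", "push-pull") and snapshot[peer]:
                has_update[node] = True
        if all(has_update):
            return rounds
    raise RuntimeError("did not converge within max_rounds")


if __name__ == "__main__":
    rng = random.Random(1)
    for mode in ("push", "pull", "push-pull"):
        trials = [rounds_to_converge(100, mode, rng) for _ in range(30)]
        print(mode, sum(trials) / len(trials))
```

Running it for different cluster sizes gives a rough feel for the logarithmic growth of the convergence time; the exact numbers depend on the random seed.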
Although these techniques look simple, many studies have examined the performance of anti-entropy protocols under different constraints. One can use knowledge of the network topology to replace random peer selection with a more efficient scheme [10], or, when network bandwidth is limited, adjust the transfer rate or use advanced rules to select the data to synchronize [9]. Computing digests is also a challenge, so a database may maintain a log of recent updates to make digest computation easier.
Eventually consistent data types
In the previous section we assumed that two nodes can always merge their versions of the data. But resolving update conflicts is not easy, and it is surprisingly difficult to make all replicas converge to a semantically correct value. A well-known example is that deleted items can reappear in the Amazon Dynamo database [8].
Let's consider an example that illustrates the problem: the database maintains a logically global counter, and each node can increment or decrement it. Although each node can keep its own value locally, these local counts cannot be combined by simple addition and subtraction. Suppose there are three nodes A, B, and C, and the increment operation is applied once on each node. If A obtains the value from B and adds it to its local copy, then C obtains the value from B, and then C obtains the value from A, the final value on C is 4, which is wrong. One solution is to use a data structure similar to a vector clock [19] and maintain a pair of counters for each node [1]:
    class Counter {
        int[] plus      // increments applied by each node, indexed by node id
        int[] minus     // decrements applied by each node, indexed by node id
        int node_id     // id of the local node

        increment() {
            plus[node_id]++
        }

        decrement() {
            minus[node_id]++
        }

        get() {
            return sum(plus) - sum(minus)
        }

        merge(Counter other) {
            for i in 1..max_id {
                plus[i] = max(plus[i], other.plus[i])
                minus[i] = max(minus[i], other.minus[i])
            }
        }
    }
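For readers who want to replay the A, B, C scenario, here is a runnable Python rendering of the same counter. It is a sketch that follows the pseudocode above, with one assumed change: dictionaries keyed by node name replace the fixed-size arrays, so no `max_id` is needed.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PNCounter:
    node_id: str
    plus: Dict[str, int] = field(default_factory=dict)    # increments per node
    minus: Dict[str, int] = field(default_factory=dict)   # decrements per node

    def increment(self) -> None:
        self.plus[self.node_id] = self.plus.get(self.node_id, 0) + 1

    def decrement(self) -> None:
        self.minus[self.node_id] = self.minus.get(self.node_id, 0) + 1

    def value(self) -> int:
        return sum(self.plus.values()) - sum(self.minus.values())

    def merge(self, other: "PNCounter") -> None:
        """Take the per-node maximum, so merging the same state twice changes nothing."""
        for mine, theirs in ((self.plus, other.plus), (self.minus, other.minus)):
            for node, count in theirs.items():
                mine[node] = max(mine.get(node, 0), count)


if __name__ == "__main__":
    a, b, c = PNCounter("A"), PNCounter("B"), PNCounter("C")
    for node in (a, b, c):
        node.increment()
    a.merge(b)   # A pulls from B
    c.merge(b)   # C pulls from B
    c.merge(a)   # C pulls from A, which already contains B's increment
    print(c.value())
```

Because `merge` takes a maximum per node instead of adding, B's increment is counted once on C even though it arrives twice, directly and through A.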
Cassandra implements counters in a similar way [11]. More complex eventually consistent data structures can be designed using state-based or operation-based replication principles. For example, [1] catalogs a series of such data structures, including:
- Counters (increment and decrement operations)
- Sets (add and remove operations)
- Graphs (add or remove edges or vertices)
- Lists (insert at a position or remove at a position)
Eventually consistent data types are often limited in functionality and impose additional performance overhead.
Data placement
This section focuses on algorithms that control the placement of data in distributed databases. These algorithms are responsible for mapping data items to appropriate physical nodes, migrating data between nodes, and global provisioning of resources such as memory.
Data rebalancing
Let's start with a simple protocol that provides seamless data migration between cluster nodes. The need for migration arises in scenarios such as cluster expansion (new nodes join), failover (some nodes go down), or rebalancing (data has become unevenly distributed across nodes). Consider the scenario depicted in part (A) of the figure: there are three nodes, and the data is randomly distributed across the three nodes (we assume key-value data).
If the database does not support rebalancing internally, one can deploy several database instances on each node, as shown in part (B) of the figure. A manual cluster expansion then consists of stopping the instance to be migrated, transferring it to the new node, and starting it there, as shown in part (C). Although an automated database is capable of tracking every record individually, many systems, including MongoDB, Oracle Coherence, and the Redis Cluster that is in development, use the described technique internally for automatic rebalancing. That is, they group the data into shards and, for the sake of efficiency, use the shard as the smallest unit of migration. Obviously, the number of shards should be considerably larger than the number of nodes so that the shards can be distributed evenly among the nodes. Seamless shard migration can be achieved with a simple protocol that redirects clients from the exporting node to the importing node while a shard is being migrated. The figure depicts a state machine that implements the get(key) logic in Redis Cluster.
Assume that each node knows the cluster topology, so that it can map any key to a shard and any shard to a node. If a node determines that the requested key belongs to a local shard, it looks the key up locally (the upper box in the figure). If the node determines that the requested key belongs to another node X, it sends a permanent redirect command to the client (the lower box in the figure). Permanent redirection means that the client can cache the mapping between the shard and the node. If a shard migration is in progress, the exporting node and the importing node mark the shard accordingly, lock the data being moved, and start the transfer. The exporting node first looks the key up locally and, if it is not found, redirects the client to the importing node on the assumption that the key has already been migrated. This redirect is one-time and must not be cached. The importing node serves such redirected requests locally, but regular queries are permanently redirected until the migration is complete.
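The get(key) logic described above can be written down as a small decision function. The Python sketch below illustrates the state machine as described in the text, not Redis Cluster's actual code; the `Node` class, the `("value" | "moved" | "ask", ...)` result tuples, the shard count, and the use of CRC32 for the key-to-shard mapping are all assumptions of this example.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

NUM_SHARDS = 16  # hypothetical number of shards


def shard_of(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_SHARDS


@dataclass
class Node:
    name: str
    data: Dict[str, str] = field(default_factory=dict)
    exporting_to: Dict[int, str] = field(default_factory=dict)  # shard -> importing node
    importing: Set[int] = field(default_factory=set)            # shards being imported


def get(node: Node, key: str, owner: Dict[int, str],
        one_time_redirect: bool = False) -> Tuple[str, Optional[str]]:
    """Return ('value', v), ('moved', node) for a cacheable redirect, or ('ask', node) for a one-time redirect."""
    shard = shard_of(key)
    if owner[shard] == node.name:
        if key in node.data or shard not in node.exporting_to:
            return ("value", node.data.get(key))     # the key belongs here: serve locally
        return ("ask", node.exporting_to[shard])     # probably migrated already: one-time redirect
    if shard in node.importing and one_time_redirect:
        return ("value", node.data.get(key))         # importing node serves redirected requests
    return ("moved", owner[shard])                   # permanent redirect to the owner
```

In this model the `owner` table is updated only when the migration completes, at which point the importing node becomes the owner and starts to serve regular queries for the shard.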
Data sharding and replication in a dynamic environment
The next question is how to map records to physical nodes. A straightforward approach is to keep a table that maps each key range to a node, or to compute the node ID as the hash of the key modulo the number of nodes. However, the hash-modulo method does not cope well with cluster changes, because adding or removing a node reshuffles almost all of the data in the cluster. This makes replication and failure recovery difficult.
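The reshuffling caused by hash-modulo placement is easy to measure. The following Python sketch counts how many keys change nodes when a cluster grows from 3 to 4 nodes. The choice of MD5 as the hash function and the synthetic key names are assumptions of this example.

```python
import hashlib
from typing import Iterable


def node_for(key: str, total_nodes: int) -> int:
    """Node ID = hash(key) mod number of nodes."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % total_nodes


def moved_fraction(keys: Iterable[str], before: int, after: int) -> float:
    """Share of keys that land on a different node after the cluster size changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if node_for(k, before) != node_for(k, after))
    return moved / len(keys)


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(10_000)]
    print(moved_fraction(keys, before=3, after=4))
```

With a uniform hash, a key keeps its node only when its hash gives the same remainder modulo 3 and modulo 4, which happens for about a quarter of the keys. Roughly three quarters of the data therefore has to move when a single node is added.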
There are many ways to improve on this from the standpoint of replication and failure recovery. The best known is consistent hashing. There are many introductions to consistent hashing on the web, so I give only a basic description here for the sake of completeness. The figure depicts the basic principles of consistent hashing:
Consistent hashing is fundamentally a mapping scheme for a key-value store: it maps keys (usually hashed keys) to physical nodes. The space of hashed keys is an ordered space of fixed-length binary strings, so each range of keys in this space can be assigned to a node, as shown in part (A) of the figure for three nodes A, B, and C. To support replication, the key space is closed into a ring, and the ring is traversed clockwise until all replicas have been mapped to nodes, as shown in part (B). In other words, Y is placed on node B because its key falls within B's range, its first replica is placed on C, its second replica on A, and so on.
The benefit of this structure shows when a node is added or removed, because rebalancing is needed only in the adjacent regions of the ring. As shown in part (C), adding node D affects only data item X and has no effect on Y. Similarly, removing node B (or a failure of B) affects only Y and the replica of X, but not X itself. However, as noted in [8], this benefit has a downside: the whole burden of rebalancing falls on the neighboring nodes, which have to move large amounts of data. Mapping each node to multiple ranges instead of a single range mitigates this problem, as shown in part (D). This is a tradeoff: it avoids concentrating load on a few nodes during rebalancing while keeping the total amount of data moved reasonably low compared with modulo-based mapping.
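A minimal consistent hash ring, including the multiple ranges per node shown in part (D), might look like the following Python sketch. The class name `HashRing`, the number of ranges per node (`vnodes`), and the use of MD5 are assumptions of this example rather than any database's implementation.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _position(label: str) -> int:
    """Position of a label on the ring (a 64-bit integer derived from MD5)."""
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        self.vnodes = vnodes                      # ranges per physical node
        self._points: List[Tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._points, (_position(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if p[1] != node]

    def nodes_for(self, key: str, replicas: int = 1) -> List[str]:
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self._points, (_position(key), ""))
        found: List[str] = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found


if __name__ == "__main__":
    ring = HashRing(["A", "B", "C"])
    print(ring.nodes_for("Y", replicas=3))  # primary node first, then the replicas
```

Adding a node with `ring.add("D")` changes the primary node only for keys that now fall into one of D's ranges; every other key stays where it was.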
Maintaining a complete and coherent view of the hash ring can be difficult in a very large cluster. This is not a problem for relatively small database clusters, but it is interesting to study how data placement is combined with network routing in peer-to-peer networks. A good example is the Chord algorithm, which trades the completeness of each node's view of the ring for lookup efficiency. The Chord algorithm also uses the concept of a ring to map keys to nodes, which makes it similar to consistent hashing. The difference is that each node maintains only a short list of peers whose logical distance from it on the ring grows exponentially (see the figure). This makes it possible to locate a key in only a few network hops using a kind of binary search.
The figure shows a cluster of 16 nodes and depicts how node A finds a key that is stored on node D. Part (A) depicts the route, and part (B) depicts the partial views of the ring held by nodes A, B, and C. More information about data replication in decentralized systems is available in reference [15].
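The routing idea can be illustrated with a simplified, static model of a Chord ring. In the Python sketch below all finger tables are computed up front from a global node list, which a real Chord node never has; node joins, failures, and stabilization are left out. The class and helper names, the 6-bit identifier space, and the node identifiers are assumptions of this example.

```python
import bisect
from typing import Dict, Iterable, List, Tuple


def in_half_open(x: int, a: int, b: int) -> bool:
    """x lies in the ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)


def in_open(x: int, a: int, b: int) -> bool:
    """x lies in the ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


class ChordRing:
    def __init__(self, node_ids: Iterable[int], m: int) -> None:
        self.size = 2 ** m
        self.nodes: List[int] = sorted(node_ids)
        # finger i of node n points to the first node at or after n + 2^i
        self.fingers: Dict[int, List[int]] = {
            n: [self.successor((n + 2 ** i) % self.size) for i in range(m)]
            for n in self.nodes
        }

    def successor(self, ident: int) -> int:
        i = bisect.bisect_left(self.nodes, ident)
        return self.nodes[i % len(self.nodes)]

    def lookup(self, start: int, key: int) -> Tuple[int, List[int]]:
        """Return the node responsible for key and the list of nodes visited."""
        node, path = start, [start]
        while True:
            succ = self.fingers[node][0]
            if in_half_open(key, node, succ):
                return succ, path
            for finger in reversed(self.fingers[node]):  # farthest finger that precedes the key
                if in_open(finger, node, key):
                    node = finger
                    path.append(node)
                    break
            else:
                return succ, path


if __name__ == "__main__":
    ring = ChordRing(range(0, 64, 4), m=6)   # 16 nodes in a 64-slot identifier space
    print(ring.lookup(0, 45))
```

Each hop jumps to the farthest known peer that does not overshoot the key, so the remaining distance shrinks quickly and a lookup in this 16-node ring needs at most 4 hops.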
Data sharding by multiple attributes
Consistent hashing is an effective data placement strategy when data is accessed only by primary key, but things become much more complex when queries involve multiple attributes. A simple approach (used by MongoDB, for example) is to distribute data by primary key regardless of the other attributes. As a result, queries on the primary key can be routed to the appropriate nodes, but all other queries must be processed by every node in the cluster. This imbalance in query efficiency leads to the following problem:
There is a dataset in which each item has several attributes with corresponding values. Is there a data placement strategy that lets a query restricting any subset of the attributes be served by as few nodes as possible?
The HyperDex database provides one solution. The basic idea is to treat each attribute as an axis in a multidimensional space and to map regions of that space to physical nodes. A query corresponds to a hyperplane that intersects some of the regions, so only those regions need to be touched when processing the query. Let's look at an example from reference [6]:
Each data item is a user record with three attributes: first name, last name, and phone number. These attributes are treated as a three-dimensional space, and one feasible placement strategy is to map each octant to a physical node. A query such as "First Name = John" corresponds to a plane that intersects 4 octants, so only 4 nodes are involved in processing it. A query that restricts two attributes corresponds to a line that intersects two octants, as shown in the figure, so only 2 nodes are involved.
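The octant example can be reproduced with a few lines of Python. This sketch is not HyperDex code; it assumes that each attribute value is hashed to one of two halves of its axis, so that a record's three halves identify one of the 8 octants. The attribute names and the helper functions are invented for this illustration.

```python
import hashlib
from itertools import product
from typing import Dict, List, Tuple

ATTRS = ("first_name", "last_name", "phone")  # one axis per attribute


def half(value: str) -> int:
    """Which half (0 or 1) of its axis a value hashes to."""
    return hashlib.md5(value.encode()).digest()[0] & 1


def region_of(record: Dict[str, str]) -> Tuple[int, ...]:
    """The octant (and therefore the node) that stores the record."""
    return tuple(half(record[a]) for a in ATTRS)


def regions_for_query(query: Dict[str, str]) -> List[Tuple[int, ...]]:
    """Octants that may hold records matching every attribute=value pair in the query."""
    axes = [(half(query[a]),) if a in query else (0, 1) for a in ATTRS]
    return list(product(*axes))


if __name__ == "__main__":
    print(len(regions_for_query({"first_name": "John"})))
    print(len(regions_for_query({"first_name": "John", "last_name": "Smith"})))
```

Every attribute that the query restricts fixes one coordinate and halves the number of candidate octants: 4 for one attribute, 2 for two, and a single octant when all three are given.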
The problem with this approach is that the number of regions grows exponentially with the number of attributes. As a result, a query that restricts only a few attributes is projected onto many regions and therefore onto many servers. This can be mitigated to some extent by splitting a data item that has many attributes into several sub-items with fewer attributes each, and mapping each sub-item to a separate subspace instead of mapping the whole item to one large multidimensional space:
This gives a narrower query-to-node mapping but makes cluster coordination more complex, because a single data item is now scattered across several independent subspaces, each with its own set of physical nodes, and updates must be transactional. Reference [6] gives more details on this technique and its implementation.
Passivated replicas
Some applications have very heavy random-read workloads, which requires all the data to fit in memory. In this case, sharding the data with master-slave replication of each shard typically requires at least twice as much memory, because every piece of data has one copy on the master and one on the slave. So that it can replace the master when the master fails, the slave must have the same amount of memory as the master. However, if the system can tolerate a temporary outage or performance degradation when a node fails, the shards can be placed in a way that reduces the required memory.
The following figure depicts 16 shards on 4 nodes; each shard has one copy in memory and one copy on disk:
The gray arrows highlight the replication of the shards on node 2. The shards on the other nodes are replicated in the same way. The red arrows depict how the replicas are loaded into memory if node 2 fails. Because the replicas are distributed evenly within the cluster, only a small amount of memory has to be reserved for the replicas that are activated when a node fails. In the figure, the cluster reserves only 1/3 of its memory to withstand the failure of a single node. Note that activating a replica (loading it from disk into memory) takes some time, which can cause a short period of degraded performance or a partial outage for the data being recovered.
System coordination
In this section we will discuss two techniques related to system coordination. Distributed coordination is a relatively large area, and many people have studied it in depth for decades. This article only covers two types of technologies that have been put into practice. With regard to distributed locks, the consensus protocol and other basic technologies can be found in many books or Web resources, and can be viewed in reference materials [17, 18, 21].
Fault detection
Fault detection is a basic function of any fault-tolerant distributed system. Practically all fault detection protocols are based on a heartbeat mechanism. The principle is simple: monitored components periodically send heartbeat messages to the monitoring process (or the monitoring process polls the monitored components), and a component that has not produced a heartbeat for a certain period of time is considered failed. Beyond this, a real distributed system has further requirements:
- Self-adapting. Fault detection should cope with transient network failures and delays, as well as changes in cluster topology, load, and bandwidth. This is fundamentally difficult because there is no way to tell whether a process that has been silent for a long time has really failed. Fault detection therefore has to trade off the detection time (how long a process must be unresponsive before it is considered failed) against the false alarm rate. This tradeoff should be adjusted dynamically and automatically.
- Flexibility. At first glance, fault detection only needs to output a Boolean value that says whether the monitored process is working, but this is not enough in practice. Consider the MapReduce-like example from reference [12]. A distributed application consists of a master node and several worker nodes. The master maintains a list of jobs and assigns them to workers, and it can distinguish between different degrees of failure. First, if the master suspects that a worker is down, it stops assigning new jobs to that worker. Second, if no heartbeat arrives from the worker as more time passes, the master reassigns the jobs running on that worker to other workers. Finally, the master concludes that the worker has failed and releases all related resources.
- Scalability and robustness. As a system function, fault detection should scale with the system. It should also be robust and consistent: even when communication failures occur, all nodes in the system should share the same view of which nodes are available and which are not. It should not happen that some nodes know that node A is unavailable while other nodes are unaware of it.
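The graded reaction described under "Flexibility" can be sketched as a detector that maps heartbeat silence to escalating suspicion levels instead of a single Boolean. This is only an illustration: the class name, the level names, and the threshold values are assumptions made for the example and do not come from reference [12].

```python
from enum import Enum


class Status(Enum):
    ALIVE = 0       # heartbeats arrive normally
    SUSPECTED = 1   # master stops assigning new jobs
    REASSIGN = 2    # master moves running jobs to other workers
    DEAD = 3        # master releases all resources held for the worker


class GradedDetector:
    """Maps heartbeat silence to escalating levels (thresholds are hypothetical)."""

    def __init__(self, suspect_after=2.0, reassign_after=10.0, dead_after=60.0):
        # Ordered from the most severe level to the least severe one.
        self.thresholds = [
            (dead_after, Status.DEAD),
            (reassign_after, Status.REASSIGN),
            (suspect_after, Status.SUSPECTED),
        ]
        self.last_seen = {}

    def heartbeat(self, worker, now):
        """Record that a heartbeat from `worker` arrived at time `now`."""
        self.last_seen[worker] = now

    def status(self, worker, now):
        """Return the most severe level whose threshold the silence has reached."""
        silence = now - self.last_seen[worker]
        for limit, level in self.thresholds:
            if silence >= limit:
                return level
        return Status.ALIVE
```

A new heartbeat resets the silence, so a worker that was only suspected returns to `ALIVE` without any explicit recovery step.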
The so-called Phi accrual failure detector [12] addresses the first two requirements, and Cassandra [16] uses a modified version of it. The basic workflow is as follows:
- For each monitored resource, the detector records the arrival times T_i of its heartbeat messages.
- The mean and variance of the arrival times are computed over a sliding window of recent heartbeats.
- Assuming that the distribution of the arrival times is known (for example, a normal distribution), we can compute the probability of the current heartbeat delay, that is, of the difference between the current time T_now and the last arrival time T_last. This probability measures how confident we are that a failure has occurred. As suggested in reference [12], the value can be rescaled with a logarithmic function to make it easier to use. In that case, an output of 1 means that the probability of a wrong judgment (declaring a live node failed) is about 10%, an output of 2 means about 1%, and so on.
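The three steps above can be sketched as follows. This is a minimal illustration that assumes normally distributed inter-arrival times; the class name, the window size, and the `min_std` floor are assumptions made for the example, and it is not the implementation used by Cassandra.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Sketch of an accrual failure detector under a normal-distribution model."""

    def __init__(self, window=100, min_std=0.1):
        self.intervals = deque(maxlen=window)  # sliding window of inter-arrival times
        self.last_arrival = None
        self.min_std = min_std  # floor that avoids division by zero for very regular heartbeats

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return the suspicion level; higher means a failure is more likely."""
        if not self.intervals:
            return 0.0
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat arrives later than `elapsed`
        # (upper tail of the normal distribution).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) for very long silences
        return -math.log10(p_later)
```

A caller would compare `phi(now)` against a threshold of its choosing. With the logarithmic scale, a threshold of 1 accepts roughly a 10% chance of a false alarm, a threshold of 2 roughly 1%, and so on.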
Fault detection can be made scalable by organizing the monitoring zones hierarchically, which prevents heartbeat messages from flooding the network [14], and by synchronizing the zones through a gossip protocol or a central fault-tolerant repository. For example, six fault detectors can form two zones that communicate with each other through a gossip protocol or a robust repository such as ZooKeeper.
Coordinator election
Coordinator election is an important technique for databases with strong consistency guarantees. First, it makes it possible to organize failover of the master node in a master-slave system. Second, in the case of a network partition, it prevents write conflicts by shutting down the partitions that do not contain a majority of the nodes.
The Bully algorithm is a relatively simple coordinator election algorithm. MongoDB uses a version of this algorithm to elect the primary in a replica set. The main idea of the Bully algorithm is that any member of the cluster can declare itself the coordinator and notify the other nodes. The other nodes can either accept this claim or reject it and enter the competition themselves. A node whose claim is accepted by all other nodes becomes the coordinator. The nodes use some attribute to decide who should win. This attribute can be a static ID, or it can be a recency metric such as the last transaction ID (the most up-to-date node wins).
The following example walks through an execution of the Bully algorithm. A static ID is used as the comparison metric, and the node with the larger ID wins:
- Initially the cluster has 5 nodes, and node 5 is the globally accepted coordinator.
- Assume that node 5 goes down and that nodes 2 and 3 detect this at the same time. Both nodes start an election and send election messages to the nodes with larger IDs.
- Node 4 eliminates nodes 2 and 3 from the competition, and node 3 eliminates node 2.
- At this point node 1 detects that node 5 has failed and sends election messages to all nodes with larger IDs.
- Nodes 2, 3, and 4 all eliminate node 1.
- Node 4 sends an election message to node 5.
- Node 5 does not respond, so node 4 declares itself the coordinator and announces this to the other nodes.
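The walk-through above can be reproduced with a small simulation. This is a simplified, single-process sketch of the classic Bully algorithm with static IDs. The function name and the message labels are assumptions made for the example; real implementations (including MongoDB's variant) exchange messages over the network and handle timeouts.

```python
def bully_election(alive, all_ids, initiator):
    """Simulate one election and return (coordinator, log).

    alive     -- set of IDs of nodes that currently respond
    all_ids   -- sorted list of every node ID in the cluster
    initiator -- ID of the live node that noticed the coordinator failure
    """
    log = []
    pending = [initiator]   # nodes that still have to run their own election
    started = set()         # nodes that have already run it
    coordinator = None
    while pending:
        node = pending.pop(0)
        if node in started:
            continue
        started.add(node)
        higher = [n for n in all_ids if n > node]
        responders = [n for n in higher if n in alive]
        log.append((node, "ELECTION", higher))
        if responders:
            # Every live node with a larger ID answers, eliminates the sender,
            # and starts its own election.
            log.append((node, "ELIMINATED_BY", responders))
            pending.extend(responders)
        else:
            # Nobody with a larger ID answered, so this node wins.
            coordinator = node
            log.append((node, "COORDINATOR", sorted(alive - {node})))
    return coordinator, log
```

For the scenario in the list (nodes 1 to 5, node 5 down), calling `bully_election({1, 2, 3, 4}, [1, 2, 3, 4, 5], 2)` elects node 4, and the log shows node 2 being eliminated by nodes 3 and 4 before node 4 announces itself.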
The coordinator election process can count the nodes that take part in it and require that a majority of the cluster (more than half of the nodes) participates. This guarantees that, in the case of a network partition, at most one partition can elect a coordinator. If the network splits into several disconnected partitions, a coordinator can only be elected in a partition that contains more than half of the original nodes. If the cluster is split so that no partition holds more than half of the nodes, no coordinator can be elected, and the cluster cannot be expected to keep serving requests.
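The majority rule can be expressed as a one-line check. The sketch below is illustrative only; the function names are assumptions made for the example.

```python
def can_elect(partition_size, cluster_size):
    """A partition may elect a coordinator only with a strict majority."""
    return 2 * partition_size > cluster_size


def electing_partitions(partitions, cluster_size):
    """Return the partitions (lists of node IDs) that are allowed to elect."""
    return [p for p in partitions if can_elect(len(p), cluster_size)]
```

With five nodes split into `[1, 2, 3]` and `[4, 5]`, only the first partition qualifies. With four nodes split evenly into two pairs, neither side holds a strict majority, so no coordinator is elected.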
References
- [1] M. Shapiro et al. A Comprehensive Study of Convergent and Commutative Replicated Data Types
- [2] I. Stoica et al. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
- [3] R. J. Honicky, E. L. Miller. Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution
- [4] G. Shah. Distributed Data Structures for Peer-to-Peer Systems
- [5] A. Montresor. Gossip Protocols for Large-Scale Distributed Systems
- [6] R. Escriva, B. Wong, E. G. Sirer. HyperDex: A Distributed, Searchable Key-Value Store
- [7] A. Demers et al. Epidemic Algorithms for Replicated Database Maintenance
- [8] G. DeCandia et al. Dynamo: Amazon's Highly Available Key-value Store
- [9] R. van Renesse et al. Efficient Reconciliation and Flow Control for Anti-Entropy Protocols
- [10] S. Ranganathan et al. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters
- [11] http://www.slideshare.net/kakugawa/distributed-counters-in-cassandra-cassandra-summit-2010
- [12] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The Phi Accrual Failure Detector
- [13] M. J. Fischer, N. A. Lynch, M. S. Paterson. Impossibility of Distributed Consensus with One Faulty Process
- [14] N. Hayashibara, A. Cherif, T. Katayama. Failure Detectors for Large-Scale Distributed Systems
- [15] M. Leslie, J. Davies, T. Huffman. A Comparison of Replication Strategies for Reliable Decentralised Storage
- [16] A. Lakshman, P. Malik. Cassandra - A Decentralized Structured Storage System
- [17] N. A. Lynch. Distributed Algorithms
- [18] G. Tel. Introduction to Distributed Algorithms
- [19] http://basho.com/blog/technical/2010/04/05/why-vector-clocks-are-hard/
- [20] L. Lamport. Paxos Made Simple
- [21] J. Chase. Distributed Systems, Failures, and Consensus
- [22] W. Vogels. Eventually Consistent - Revisited
- [23] J. C. Corbett et al. Spanner: Google's Globally-Distributed Database