An in-depth analysis of the distributed algorithms behind NoSQL databases


System scalability is the main driving force behind the NoSQL movement, which covers distributed system coordination, failover, resource management, and many other capabilities; that makes NoSQL sound like a big basket into which anything can be stuffed. Although the NoSQL movement has not brought fundamentally new techniques to distributed data processing, it has triggered extensive research and practice around protocols and algorithms, and it is through these attempts that a number of effective ways of building databases have gradually been distilled. In this article I will give a systematic description of the distributed features of NoSQL databases.

Next we will look at a number of distributed strategies, such as replication and failure detection, grouped into three sections:

Data consistency. NoSQL systems have to trade consistency off against fault tolerance, performance, low latency, and high availability; since some degree of consistency is generally required, this section is mainly about data replication and data repair.

Data placement. A database product should be able to cope with different data distributions, cluster topologies, and hardware configurations. In this section we discuss how to distribute and rebalance data so that node failures can be handled quickly, persistence guarantees are maintained, queries stay efficient, and cluster resources such as memory and disk space are used evenly.

System coordination. Coordination techniques such as leader election are used in many database products to achieve fault tolerance and strong data consistency. However, even decentralized databases (with no central authority) have to track their global state and detect failures and topology changes. This section describes several techniques for keeping the system in a consistent state.

Data consistency

It is well known that distributed systems frequently experience network partitions and latency; a partitioned-off part of the system becomes unavailable, so high availability cannot be maintained without sacrificing consistency. This fact is commonly referred to as the CAP theorem. However, consistency is a very expensive thing in a distributed system, so concessions often have to be made on it, and not only in favour of availability: there is a whole range of trade-offs involved. To study these trade-offs, we note that consistency problems in distributed systems are caused by data partitioning and replication, so we start with the characteristics of replication:

Availability. Under a network partition, the remaining parts can still respond to read and write requests.

Read/write latency. Read and write requests are processed within a short period of time.

Read/write scalability. Read and write load can be spread evenly across multiple nodes.

Fault tolerance. The processing of a read or write request does not depend on any particular node.

Data persistence. Node failures under certain conditions do not cause data loss.

Consistency. Consistency is much more complex than the features above, and we need to discuss it from several different points of view. We will not go into consistency theory and concurrency models in depth, since that is beyond the scope of this article; instead we will use a small set of simple features to build up a simpler picture.

Consistency from the read/write perspective. From the reader's and writer's standpoint, the basic goal of a database is to keep the replica convergence time (the time an update needs to propagate to all replicas) as short as possible and to guarantee eventual consistency. Beyond this weak guarantee there are some stronger consistency properties:

Read-your-writes consistency. The effect of a write on data item X is always visible to subsequent reads of X.

Read-after-read consistency. After a read of data item X, subsequent reads of X return the same value or a newer one.

Write consistency. Partitioned databases frequently run into write conflicts. The database should either be able to handle such conflicts or guarantee that concurrent writes are not processed by different partitions. Databases provide several different consistency models for this:

Atomic writes. If the database API only allows a single write to be an isolated atomic assignment, the way to avoid write conflicts is to determine the "latest version" of each data item. This lets all nodes converge on the same version at the end of an update regardless of the order in which updates arrive, and network failures and delays often do cause updates to reach different nodes in different orders. The data version can be represented by a timestamp or a user-specified value. This is the approach Cassandra uses.
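Below is a minimal last-write-wins sketch in Python, in the spirit of the timestamp-based versioning just described; it is illustrative only (the (timestamp, value) tuple layout and function names are assumptions, not Cassandra's actual code):

import time

def new_version(value):
    # Tag a write with a timestamp; a client-supplied version number would also work.
    return (time.time(), value)

def merge(local, incoming):
    # Last-write-wins: whatever order replicas apply updates in, they all converge
    # on the largest (timestamp, value) pair; the value breaks timestamp ties deterministically.
    if local is None or incoming > local:
        return incoming
    return local

# Two replicas receive the same updates in different orders and still end up identical.
v1, v2 = new_version("a"), new_version("b")
replica_x = merge(merge(None, v1), v2)
replica_y = merge(merge(None, v2), v1)
assert replica_x == replica_y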

Atomic read-modify-write. Applications sometimes need to perform a read-modify-write sequence rather than an isolated atomic write. If two clients read the same version of a data item, modify it, and write the modified data back, then under the atomic write model the later update simply overwrites the earlier one. This behaviour is incorrect in some cases (for example, when two clients append new values to the same list). The database offers at least two solutions:

Conflict prevention. Read-modify-write can be considered a special case of a transaction, so distributed locks or consensus protocols such as Paxos can solve the problem. This technique supports atomic read-modify-write semantics and transactions at arbitrary isolation levels. Another approach is to avoid distributed concurrent writes altogether by routing all writes to a particular data item through a single node (either a global master or a partition master). To avoid conflicts, the database has to sacrifice availability under network partitions. This approach is common in systems that provide strong consistency guarantees (for example, most relational databases, HBase, MongoDB).

Conflict detection. The database tracks conflicting concurrent updates and either rolls one of them back or keeps both versions for the client to resolve. Concurrent updates are usually tracked with vector clocks (a form of optimistic locking) or by keeping the complete version history. This approach is used in Riak, Voldemort, and CouchDB.
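As a rough illustration of vector-clock-based conflict detection (a sketch only; the dict representation and helper names below are my own, not the API of Riak or Voldemort):

def vc_increment(clock, node_id):
    # A node ticks its own entry before writing; clocks are dicts of node_id -> counter.
    clock = dict(clock)
    clock[node_id] = clock.get(node_id, 0) + 1
    return clock

def vc_descends(a, b):
    # True if clock a has seen every event recorded in clock b.
    return all(a.get(n, 0) >= c for n, c in b.items())

def resolve(value_a, clock_a, value_b, clock_b):
    # Keep the causally newer value; on a true conflict return both siblings
    # so the client or application can reconcile them.
    if vc_descends(clock_a, clock_b):
        return [(value_a, clock_a)]
    if vc_descends(clock_b, clock_a):
        return [(value_b, clock_b)]
    return [(value_a, clock_a), (value_b, clock_b)]

# Concurrent writes by nodes 1 and 2 on top of the same ancestor are detected as siblings:
base = vc_increment({}, 0)
w1, w2 = vc_increment(base, 1), vc_increment(base, 2)
assert len(resolve("x", w1, "y", w2)) == 2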

Now let's take a closer look at the commonly used replication techniques and classify them according to the characteristics just described. The first figure depicts the logical relationships between the techniques and the trade-offs between consistency, scalability, availability, and latency; the second figure depicts each technique in detail.

[Figure: logical relationships and trade-offs between the replication techniques]

[Figure: each replication technique in detail]

The replication factor is 4. The read/write coordinator can be an external client or an internal proxy node.

We will go through all the techniques from the weakest to the strongest consistency:

(A, anti-entropy) The weakest guarantee, based on the following strategy: at write time, pick any node and update it; at read time, if the new data has not yet reached the node being read via the background anti-entropy protocol, the old data is returned. (The anti-entropy protocol is described in more detail in the next section.) The main features of this approach are:

High propagation latency makes it of limited use for data synchronization on its own, so it is more typically used as an auxiliary mechanism that detects and repairs unplanned inconsistencies. Cassandra uses an anti-entropy algorithm to pass the database topology and some other metadata between nodes.

The consistency guarantees are weak: write conflicts and read-write inconsistencies can occur even in the absence of failures.

High availability and robustness under network partitions. Asynchronous batches replace one-by-one updates, which makes performance excellent.

Durability guarantees are weak, because the new data initially exists as only a single copy.

(B) An improvement on the above pattern is to asynchronously send the update to all available nodes as soon as any node receives an update request. This can be regarded as directed anti-entropy.

Compared with pure anti-entropy, this greatly improves consistency at a small cost in performance. However, the formal consistency and durability guarantees remain unchanged.

If some nodes are unavailable at that moment because of network or node failures, the update will eventually reach them through the anti-entropy propagation process.

(C) In the previous pattern, failures of individual nodes can be handled better with the hinted handoff technique: updates intended for a failed node are recorded on an additional proxy node together with a hint that they should be delivered as soon as the target node becomes available again. This improves consistency and reduces the replication convergence time.

(D, one-time read/write) Since the proxy node responsible for a hinted handoff may itself fail before delivering the update, consistency also needs to be protected by so-called read repair. Each read operation starts an asynchronous process that requests a digest of the data (such as a signature or hash) from all nodes storing it; if the digests returned by the nodes disagree, the data versions on the nodes are reconciled. We use the term "one-time read/write" for the combination of techniques A, B, C, and D: they do not provide strict consistency guarantees, but as a self-contained approach they are already usable in practice.

(E, read and write quorums) The strategies above are heuristic enhancements that reduce the replication convergence time. To guarantee stronger consistency, availability must be sacrificed in exchange for a guaranteed overlap between reads and writes. The usual practice is to write W replicas at a time instead of one, and to read R replicas at a time.

First, the number of replicas written, W, can be configured to be greater than 1.

Second, because R + W > N, there is bound to be an overlap between the nodes written to and the nodes read from, so at least one of the R replicas read is up to date (in the figure above, W = 2, R = 3, N = 4). This guarantees consistency as long as reads and writes are done in sequence (the write completes before the read), giving read-your-writes consistency for a single user, but it does not guarantee global read consistency. In the example shown in the illustration below, with R = 2, W = 2, N = 3, because the update of the two replicas is not transactional, a read that arrives before the write completes may return either two old values, or one new and one old:

[Figure: with R = 2, W = 2, N = 3, a read overlapping an in-flight write can return two old values or one new and one old]

For a given read-latency requirement, choosing different values of R and W lets you trade write latency against durability, and vice versa.

If W <= N/2, concurrent writes can land on disjoint sets of nodes (for example, write A goes to the first N/2 nodes and write B to the last N/2). Setting W > N/2 guarantees that conflicts are detected promptly under the atomic read-modify-write with rollback model.

Strictly speaking, this scheme can tolerate the failure of individual nodes, but its fault tolerance under network partitions is poor. In practice, a "sloppy quorum" approach is often used to improve availability in some scenarios at the expense of consistency.
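A minimal sketch of the quorum idea with N = 4, W = 2, R = 3 (so R + W > N); the Replica class and its methods are invented purely to make the example self-contained:

class Replica:
    # In-memory stand-in for a storage node, keeping the highest version seen per key.
    def __init__(self):
        self.store = {}
        self.alive = True

    def try_put(self, key, version, value):
        if not self.alive:
            return False
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def try_get(self, key):
        return self.store.get(key) if self.alive else None

N, W, R = 4, 2, 3          # R + W > N guarantees that reads and writes overlap

def quorum_write(replicas, key, version, value):
    acks = sum(1 for r in replicas if r.try_put(key, version, value))
    return acks >= W       # fewer than W acks: fail, or hand off a hint as in pattern C

def quorum_read(replicas, key):
    answers = [a for a in (r.try_get(key) for r in replicas) if a is not None]
    if len(answers) < R:
        raise RuntimeError("read quorum not reached")
    # Because R + W > N, at least one answer carries the latest completed write.
    return max(answers)[1]

cluster = [Replica() for _ in range(N)]
quorum_write(cluster, "k", 1, "old")
cluster[0].alive = False                      # one node fails
quorum_write(cluster, "k", 2, "new")          # still acknowledged by W or more nodes
assert quorum_read(cluster, "k") == "new"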

(F, read all, write several) The read consistency problem can be mitigated by contacting all replicas when reading (reading the data or checking digests). This guarantees that the reader sees the new data as long as it exists on at least one node. But this guarantee does not hold under a network partition.

(G, master-slave) The techniques above are often used to provide consistency at the level of atomic writes or read-modify-write with conflict detection. To reach the conflict-prevention level, some form of centralization or locking must be used. The simplest strategy is asynchronous master-slave replication: writes to a particular data item are routed to a central node and executed there sequentially. In that case the master node becomes a bottleneck, so the data must be divided into independent shards (each shard with its own master) to provide scalability.

(H, transactional read quorum / write quorum, and read one / write all) Quorum-based approaches to updating multiple replicas can avoid write conflicts by using transactional control techniques. A well-known approach is the two-phase commit protocol. However, two-phase commit is not entirely reliable, because a coordinator failure can leave resources blocked. A Paxos-based commit protocol is a more reliable option, at the cost of a little performance. A small step further along this path is read-one-write-all, which puts the update of all replicas into a single transaction; it provides strong, fault-tolerant consistency but gives up some performance and availability.

Some of the trade-offs in the analysis above deserve emphasis:

Consistency vs. availability. This tight trade-off is given by the CAP theorem. Under a network partition, the database must either make part of the dataset unavailable or accept the risk of data loss.

Consistency vs. scalability. We have seen that even read-your-writes consistency guarantees reduce the scalability of a replica set, and write conflicts can only be handled in a relatively scalable way under the atomic write model. The atomic read-modify-write model avoids conflicts by putting a temporary global lock on the data. This shows that dependencies between data items or operations, even over a very small scope or a very short time, can damage scalability. So designing the data model carefully and storing shards separately is very important for scalability.

Consistency vs. latency. As mentioned above, when the database needs to provide strong consistency or durability, it should lean towards the read-all/write-all replica techniques. But consistency clearly works against request latency, so quorum-based (several-replica) techniques are usually the more acceptable approach.

Failover vs. consistency/scalability/latency. Interestingly, the conflict between fault tolerance and consistency, scalability, and latency is not dramatic: by reasonably giving up some performance and consistency, the cluster can tolerate the failure of a fairly large number of nodes. This compromise is evident in the difference between two-phase commit and the Paxos protocol. Another example of the compromise is adding specific consistency guarantees, such as read-your-writes tied to a particular session, which adds to the complexity of failover.

Anti-entropy protocols and gossip algorithms

Let's start with the following scenario:

There are a number of nodes, and each piece of data has copies on several of them. Each node can serve update requests independently, and each node periodically synchronizes its state with other nodes, so that after a while all replicas converge. How does synchronization happen? When does it start? How is the synchronization peer chosen? We assume that two nodes always overwrite the older data with the newer version, or that both versions are kept for the application tier to resolve.

This problem arises in scenarios such as maintaining data consistency and synchronizing cluster state (for example, propagating cluster membership information). It would be possible to introduce a coordinator that monitors the database and drafts synchronization plans, but a decentralized database provides better fault tolerance. The main approach there is to use carefully designed epidemic protocols, which are relatively simple but provide good convergence times and can tolerate the failure of any node as well as network partitions. Although there are many kinds of epidemic algorithms, we focus only on anti-entropy protocols, because they are the ones NoSQL databases use.

The anti-entropy protocol assumes that synchronization is performed on a fixed schedule: each node periodically picks another node, at random or according to some rule, exchanges data with it, and resolves the differences. There are three styles of anti-entropy protocol: push, pull, and hybrid. The idea of the push protocol is simply to pick a random peer and send it the current data state. Pushing all of the data out wholesale is obviously silly in real applications, so nodes generally work in the way shown in the following figure.

[Figure: push-style anti-entropy exchange between nodes A and B]

Node A, as the initiator of the synchronization, prepares a data digest containing fingerprints of the data it holds. Node B compares the data in the digest with its local data and returns a digest of the differences to A. Finally, A sends the corresponding updates to B, and B applies them. The pull and hybrid protocols work similarly, as shown in the figure above.
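A minimal sketch of that push-style exchange, with node state held in plain dicts; real systems use Merkle trees and per-entry versions rather than the whole-value hashes assumed here:

import hashlib

def fingerprint(value):
    return hashlib.sha1(repr(value).encode()).hexdigest()

def make_digest(store):
    # Step 1: node A summarizes its data as key -> fingerprint.
    return {key: fingerprint(value) for key, value in store.items()}

def diff_digest(store, digest):
    # Step 2: node B compares the digest against its local data and reports the differences.
    return [key for key, fp in digest.items()
            if key not in store or fingerprint(store[key]) != fp]

def push_round(a_store, b_store):
    # Step 3: node A sends the differing entries and B applies them.
    for key in diff_digest(b_store, make_digest(a_store)):
        b_store[key] = a_store[key]    # assumes A's copy wins; real nodes compare versions

a = {"x": 1, "y": 2}
b = {"x": 1}
push_round(a, b)
assert b == {"x": 1, "y": 2}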

The anti-entropy protocol provides good enough convergence times and scalability. The following figure shows a simulation of propagating an update in a 100-node cluster; in each iteration, each node contacts only one randomly chosen peer.

[Figure: simulated propagation of an update through a 100-node cluster for the push, pull, and hybrid protocols]

We can see that the convergence of the pull method is better than that of the push method, which can be proved theoretically. The push method also suffers from a "convergence tail" problem: after many iterations, although almost all nodes have been reached, a small number remain unaffected. The hybrid method is more efficient than pure push or pull, so it is the one usually used in practice. Anti-entropy is scalable because the average convergence time grows as a logarithmic function of the cluster size.

Although these techniques look simple, there is still a lot of research on the performance of anti-entropy protocols under different constraints. One line of work replaces random peer selection with a more efficient scheme that exploits the network topology. Others adjust the transmission rate or use advanced rules for selecting which data to synchronize when network bandwidth is limited. Digest computation is also challenging, so databases maintain a log of recent updates to assist with it.

Eventually consistent data types

In the previous section we assumed that two nodes can always merge their data versions. However, resolving update conflicts is not easy, and it is surprisingly difficult to get all replicas to converge on a semantically correct value. A well-known example is that items deleted from the Amazon Dynamo database can reappear.

Let us use an example to illustrate the problem: the database maintains a logically global counter, and each node can increment or decrement it. Although each node can maintain its own value locally, these local counts cannot be merged by simple addition and subtraction. Suppose there are three nodes A, B, and C, and each performs one increment. If A obtains a value from B and adds it to its local copy, then C obtains the value from B, and then C obtains the value from A, C ends up with 4, which is wrong. One solution is to maintain a pair of counters per node, using a data structure similar to a vector clock:

class Counter:
    # Per-node increment and decrement counts, merged with an element-wise max
    # so that replicas converge no matter how often or in what order they merge.

    def __init__(self, n_nodes, node_id):
        self.plus = [0] * n_nodes     # increments performed by each node
        self.minus = [0] * n_nodes    # decrements performed by each node
        self.node_id = node_id

    def increment(self):
        self.plus[self.node_id] += 1

    def decrement(self):
        self.minus[self.node_id] += 1

    def get(self):
        return sum(self.plus) - sum(self.minus)

    def merge(self, other):
        for i in range(len(self.plus)):
            self.plus[i] = max(self.plus[i], other.plus[i])
            self.minus[i] = max(self.minus[i], other.minus[i])
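Replaying the A, B, C scenario from the text with this structure (node ids 0, 1, and 2 stand in for A, B, and C):

a, b, c = Counter(3, 0), Counter(3, 1), Counter(3, 2)
for node in (a, b, c):
    node.increment()       # each node increments once
a.merge(b)                 # A picks up B's state
c.merge(b)                 # C picks up B's state
c.merge(a)                 # C later merges A, which already contains B's count
print(c.get())             # 3, not 4: the element-wise max makes merging idempotent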

Cassandra counters work in a similar way. More complex, eventually consistent data structures can also be designed on the basis of state-based or operation-based replication theory. For example, the literature mentions a list of such data structures, including:

Counters (increment and decrement operations)

Sets (add and remove operations)

Graphs (adding and removing edges or vertices)

Lists (inserting at a position or removing a position)

The functionality of eventually consistent data types is usually limited, and they can incur extra performance overhead.

Data placement

This section focuses on algorithms that control the placement of data in distributed databases. These algorithms are responsible for mapping data items to appropriate physical nodes, migrating data between nodes, and global provisioning of resources such as memory.

Data rebalancing

We start with a simple protocol that provides seamless data migration between cluster nodes. It is needed in scenarios such as cluster growth (adding new nodes), failover (some nodes going down), or rebalancing (data distributed unevenly across nodes). The scenario depicted in figure A below: there are three nodes, and the data is spread randomly across them (assume the data is of key-value type).

[Figure: (A) data spread randomly over three nodes, (B) multiple database instances deployed per node, (C) an instance being moved to a new node]

If the database does not support internal data rebalancing, multiple database instances have to be deployed on each node, as shown in figure B above. Scaling the cluster manually then means stopping the instance being migrated, transferring it to the new node, and starting it there, as shown in figure C. Although a database could track every record individually, many systems, including MongoDB, Oracle Coherence, and the still-in-development Redis Cluster, use automatic rebalancing techniques: the data is sharded, and each shard serves as the minimal unit of migration, which is done for efficiency. Obviously the number of shards will be larger than the number of nodes, so the shards can be distributed evenly among the nodes. Seamless data migration can then be achieved by a simple protocol that redirects clients between the exporting node and the importing node while a shard is being migrated. The following figure depicts a state machine for the get(key) logic implemented in Redis Cluster.

[Figure: state machine for the get(key) logic in Redis Cluster]

Assume that every node knows the cluster topology and can map any key to its shard and any shard to a node. If a node determines that the requested key belongs to a local shard, it looks the key up locally (the upper box in the figure above). If the node determines that the key belongs to another node X, it sends a permanent redirect command to the client (the lower box in the figure above). A permanent redirect means the client may cache the mapping between the shard and the node. If a shard migration is in progress, the exporting node and the importing node mark the shard accordingly, lock its data shard by shard, and start moving it. The exporting node first looks the key up locally and, if it is not found because the key has already been migrated, redirects the client to the importing node; this redirect is one-off and must not be cached. The importing node serves such redirected requests locally, but regular queries are permanently redirected until the migration completes.
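A rough sketch of that redirect logic. The MOVED/ASK reply names follow Redis Cluster conventions, but the Node class, its fields, and the 16-shard mapping are assumptions made up for this illustration:

class Node:
    def __init__(self, name, owned_shards, data,
                 migrating=None, target=None, owners=None):
        self.name = name
        self.owned = set(owned_shards)          # shards this node currently serves
        self.data = data                        # key -> value for keys still held locally
        self.migrating_shards = set(migrating or ())
        self.target = target                    # node receiving the migrating shard
        self.owners = owners or {}              # this node's view of shard -> owner

    def shard_of(self, key):
        return hash(key) % 16                   # toy mapping; Redis uses CRC16 mod 16384

def handle_get(node, key):
    shard = node.shard_of(key)
    if shard not in node.owned:
        return ("MOVED", node.owners[shard])    # permanent redirect, cacheable
    if shard not in node.migrating_shards:
        return ("VALUE", node.data[key])        # ordinary local lookup
    if key in node.data:
        return ("VALUE", node.data[key])        # shard is leaving but key not yet moved
    return ("ASK", node.target)                 # key already moved: one-time redirect

n = Node("n1", owned_shards=range(16), data={"user:1": "alice"})
print(handle_get(n, "user:1"))                  # ('VALUE', 'alice')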

Data sharding and replication in dynamic environments

Another issue we need to address is how to map records to physical nodes. A direct method is to keep a table that maps ranges of keys to nodes, with each key range assigned to one node, or to take the hash of the key modulo the number of nodes and use the result as the node ID. However, modulo hashing is not very useful when the cluster changes, because adding or removing a node causes almost all of the data in the cluster to be reshuffled, which makes replication and failure recovery difficult.

There are many ways to improve on this from the point of view of replication and failure recovery. The most famous is consistent hashing. There is already plenty of material about consistent hashing on the web, so here I give only a basic introduction, for the completeness of the article. The following figure depicts the basic principle of consistent hashing:

[Figure: (A) keys and nodes on the hash ring, (B) replicas placed clockwise along the ring, (C) the effect of adding or removing a node, (D) each node mapped to multiple ranges]

A consistent hash is fundamentally a key-to-node mapping structure: it maps a key (usually hashed first) to a physical node. The value space after hashing is an ordered, fixed-length binary string, so every key in this range is mapped to one of the three nodes A, B, and C in figure A. For replication, the value space is closed into a ring, and we move clockwise along the ring until all replicas are mapped to appropriate nodes, as shown in figure B. In other words, item Y is placed on node B because it falls within B's range, its first replica is placed on C, its second replica on A, and so on.

The benefit of this structure shows when a node is added or removed, because it triggers rebalancing only in the neighbouring regions. As shown in figure C, adding node D affects only data item X and has no effect on Y. Likewise, removing node B (or B failing) affects only the replicas of Y and X, not X itself. But the approach also has a weakness: the burden of rebalancing is carried by the neighbouring nodes, which have to move a large amount of data. Mapping each node to multiple ranges instead of a single range mitigates this to some extent, as shown in figure D. This is a compromise: it avoids an excessive concentration of load during rebalancing, but keeps the total amount of rebalancing suitably low compared with module-based mapping.
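A minimal consistent-hash ring sketch with virtual nodes (figure D's mitigation); MD5 is used only as a convenient hash, and the replica and vnode counts are arbitrary choices:

import bisect
import hashlib

def ring_hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several points (virtual nodes) on the ring.
        self.points = sorted((ring_hash(f"{node}#{i}"), node)
                             for node in nodes for i in range(vnodes))

    def preference_list(self, key, replicas=3):
        # Start at the key's position and walk clockwise, collecting distinct nodes:
        # the first node stores the item, the following ones store its replicas.
        start = bisect.bisect(self.points, (ring_hash(key), ""))
        chosen = []
        for i in range(start, start + len(self.points)):
            node = self.points[i % len(self.points)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == replicas:
                break
        return chosen

ring = HashRing(["A", "B", "C"])
print(ring.preference_list("user:42"))   # e.g. ['B', 'C', 'A'], depending on the hashes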

It is not easy to maintain a complete and coherent view of the hash ring in a large-scale cluster. This is not a problem for a relatively small database cluster, but it is interesting to study how data placement can be combined with network routing in peer-to-peer networks. A good example is the Chord algorithm, which trades the completeness of each node's view of the ring for lookup efficiency. The Chord algorithm also uses the idea of a ring that maps keys to nodes, and in this respect it is similar to consistent hashing. The difference is that each node maintains a short list of peers whose logical positions on the ring are exponentially spaced (as shown below). This makes it possible to locate a key with something like a binary search, using only a few network hops.

[Figure: Chord lookup in a 16-node cluster]

The figure shows a 16-node cluster and depicts how node A finds a key stored on node D. Part (A) depicts the route, and part (B) depicts the local views of the ring held by nodes A, B, and C. More information on data replication in decentralized systems is available in the references.
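A small sketch of the finger-table idea, assuming a 16-position identifier space as in the figure (the node IDs are made up):

M = 4                                   # identifier space of 2**M = 16 positions
NODES = sorted([1, 3, 6, 8, 11, 13])    # hypothetical node IDs placed on the ring

def successor(ident):
    # The node responsible for ident: the first node at or clockwise after it.
    for node in NODES:
        if node >= ident:
            return node
    return NODES[0]                     # wrap around the ring

def finger_table(node):
    # Chord-style fingers: the successors of node+1, node+2, node+4, node+8.
    # The distances double, so each hop roughly halves the remaining gap,
    # which is what gives lookups in O(log N) hops.
    return [successor((node + 2 ** k) % 2 ** M) for k in range(M)]

print(finger_table(1))                  # [3, 3, 6, 11]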

Data sharding by multiple attributes

Consistent hashing as a data placement strategy works well when data is accessed only by primary key, but things get much more complicated when queries over multiple attributes are needed. A simple approach (used by MongoDB) is to distribute the data by primary key without regard to the other attributes. The result is that queries by primary key can be routed to the appropriate node, but any other query has to traverse all nodes of the cluster. This imbalance in query efficiency leads to the following problem:

Given a dataset in which each item has several attributes and their values, is there a data distribution strategy such that a query restricting any subset of the attributes is delivered to as few nodes as possible?

The HyperDex database provides one solution. The basic idea is to treat each attribute as an axis of a multidimensional space and to map regions of that space to physical nodes. A query then corresponds to a hyperplane made up of several contiguous regions of the space, so only those regions are relevant to the query. Let's look at an example from the references:

[Figure: user records mapped into a three-dimensional space of first name, last name, and phone number]

Each item is a user record with three attributes: first name, last name, and phone number. Treating these attributes as a three-dimensional space, a viable data distribution strategy is to map each octant to a physical node. A query such as "first name = John" corresponds to a plane cutting through four octants, so only four nodes take part in answering it. A query restricting two attributes corresponds to a line cutting through two octants, as shown in the figure above, so only two nodes are involved.
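A rough sketch of that mapping: one coordinate bit per attribute decides which half of the axis a record falls into, giving 2^3 = 8 octants that can each be assigned to a physical node. The hashing and bit choice here are my own simplification, not HyperDex's actual scheme:

import hashlib

ATTRS = ("first_name", "last_name", "phone")

def axis_bit(value):
    # Which half of the attribute's axis the value hashes into.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) & 1

def octant(record):
    # Maps a record to one of the 8 regions of the three-dimensional space.
    return tuple(axis_bit(record[attr]) for attr in ATTRS)

def octants_for_query(conditions):
    # A query fixing k of the 3 attributes touches 2**(3 - k) regions.
    regions = [()]
    for attr in ATTRS:
        if attr in conditions:
            regions = [r + (axis_bit(conditions[attr]),) for r in regions]
        else:
            regions = [r + (bit,) for r in regions for bit in (0, 1)]
    return regions

print(len(octants_for_query({"first_name": "John"})))                       # 4 regions
print(len(octants_for_query({"first_name": "John", "phone": "555-0101"})))  # 2 regions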

The problem with this method is that the number of regions grows exponentially with the number of attributes. As a result, queries that restrict only a few attributes are projected onto many regions, and therefore onto many servers. This can be mitigated to some extent by splitting a data item with many attributes into several subkeys and mapping each subkey to its own subspace, instead of mapping the whole item into one high-dimensional space:

[Figure: splitting a record into subkeys that are mapped to separate subspaces]

This gives a better mapping of queries to nodes, but it increases the complexity of cluster coordination, because a single data item is now scattered across several independent subspaces, each with its own set of physical nodes, and updates to the item have to be treated as transactions.

Passivated replicas

Some applications have strong random-read requirements, which forces all data to be kept in memory. In such cases, sharding the data and replicating each shard usually requires at least twice the memory, because every piece of data has a copy on a master node and on a slave node, and a slave needs as much memory as its master in order to take over when the master fails. If, however, the system can tolerate a brief interruption or performance degradation when a node fails, this extra memory can be saved.

The following figure depicts 16 shards on 4 nodes; the active copy of each shard is kept in memory and its replica is stored on disk:

[Figure: 16 shards on 4 nodes, with active shards in memory and their replicas on disk]

The grey arrows highlight the replication of the shards on node 2; the shards on the other nodes are replicated in the same way. The red arrows depict how the replicas are loaded into memory when node 2 fails. Because the replicas are distributed evenly within the cluster, only a small amount of memory needs to be reserved for the replicas that get activated on failure. In the figure, the cluster reserves only 1/3 of its memory to withstand the failure of a single node. Note in particular that activating a replica (loading it from disk into memory) takes some time, which causes a short period of performance degradation, or partial unavailability of the data that is being recovered.

System coordination

In this section we will discuss two techniques related to system coordination. Distributed coordination is a large area that many people have studied in depth for decades; this article covers only two techniques that have been put into practice. Distributed locking, consensus protocols, and other fundamental topics are covered in many books and online resources.

Fault detection

Fault detection is a basic function of any distributed system with fault tolerance. Practically all fault-detection protocols are based on a heartbeat mechanism, and the principle is very simple: the monitored component periodically sends heartbeat messages to the monitoring process (or the monitoring process polls the monitored component), and if no heartbeat arrives for some time, the component is considered failed. A real distributed system places some additional requirements on fault detection:

Adaptivity. Fault detection should cope with temporary network failures and delays, as well as with changes in cluster topology, load, and bandwidth. This is hard because there is no way to tell whether a long-unresponsive process has really failed. Fault detection therefore has to trade off the failure-detection time (how long it takes to recognize a real failure, i.e. how long a process may remain unresponsive before it is declared dead) against the false-alarm rate, and this trade-off factor should be adjusted dynamically and automatically.

Flexibility. At first glance, fault detection only needs to output a boolean value saying whether the monitored process is alive, but in practice this is not enough. Consider a MapReduce-like example from the references: a distributed application consists of a master node and several worker nodes; the master maintains a list of jobs and assigns them to the workers. The master can distinguish between different degrees of failure. If it merely suspects that a worker has died, it stops assigning new jobs to that node. Then, as more time passes without heartbeats from the node, the master reassigns the jobs that were running on it to other nodes. Finally, the master confirms that the node has failed and releases all related resources.

Scalability and robustness. Fault detection, as a system-level function, should scale as the system grows. It should also be robust and consistent: even in the presence of communication failures, all nodes in the system should converge on a consistent view, that is, they should all agree on which nodes are available and which are not, and conflicting views, where some nodes consider node A unavailable while others are unaware of it, should not persist.

The so-called accrual failure detector solves the first two problems; Cassandra applies a modified version of it in the product. The basic workflow is as follows:

For every monitored resource, the detector records the arrival times Ti of its heartbeat messages.

It computes the mean and variance of the inter-arrival times over a recent statistical window.

Assuming the distribution of arrival times is known (the figure below includes a formula based on the normal distribution), we can compute the probability that the current heartbeat is merely delayed, from the gap between the current time t_now and the last arrival time T_last, and use that probability to decide whether a failure has occurred. A logarithmic scale can be applied to make the value easier to use: an output of 1 then means the probability that the failure verdict is wrong (the node is actually alive) is about 10%, 2 means 1%, and so on.

[Figure: distribution of heartbeat inter-arrival times (normal distribution) and the resulting suspicion formula]
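A sketch of that suspicion value in the formulation used by the accrual failure detector (here F_{\mu,\sigma} is the cumulative distribution function of a normal distribution whose mean \mu and standard deviation \sigma are estimated from the observed inter-arrival times, and T_{last} is the arrival time of the most recent heartbeat):

\varphi(t_{now}) = -\log_{10}\bigl(P_{later}(t_{now} - T_{last})\bigr),
\qquad
P_{later}(t) = 1 - F_{\mu,\sigma}(t)

With this scaling, \varphi = 1 corresponds to roughly a 10% chance that the node is in fact alive and its heartbeat is merely late, \varphi = 2 to about 1%, and so on, matching the interpretation above.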

Monitoring zones are organized hierarchically by importance, and each zone is synchronized through a gossip protocol or a central fault-tolerant repository. This satisfies the scalability requirement while preventing heartbeat messages from flooding the network. As shown in the following figure, six fault detectors form two zones that communicate with each other through a gossip protocol or a robust repository such as ZooKeeper:

[Figure: six fault detectors in two zones, linked by a gossip protocol or a fault-tolerant repository such as ZooKeeper]

Coordinator election

Coordinator election is an important technique for databases that aim at strong consistency. First, it organizes the failover of the master node in master-slave systems. Second, under a network partition, it allows the minority side to be disconnected in order to avoid write conflicts.

The bully algorithm is a relatively simple coordinator election algorithm. MongoDB has used this algorithm to determine the primary in a replica set. The main idea of the bully algorithm is that any member of the cluster can declare itself the coordinator and notify the other nodes; the other nodes can either accept the claim or reject it and enter the election themselves. A node that is accepted by all other nodes becomes the coordinator. Nodes use some attribute to decide who should win; this can be a static ID, or an up-to-date metric such as the most recent transaction ID (the most up-to-date node wins).

The example below shows an execution of the bully algorithm, using static IDs as the metric, so the node with the larger ID wins:

Initially the cluster has five nodes, and node 5 is the acknowledged coordinator.

Suppose node 5 dies and nodes 2 and 3 discover this at the same time. Both start an election and send election messages to the nodes with larger IDs.

Node 4 eliminates nodes 2 and 3, and node 3 eliminates node 2.

Node 1 also notices that node 5 has failed and sends election messages to all nodes with larger IDs.

Nodes 2, 3, and 4 eliminate node 1.

Node 4 sends an election message to node 5.

Node 5 does not respond, so node 4 declares itself elected and announces this to the other nodes.

[Figure: message flow of the bully election in the example above]
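A minimal sketch of this election flow (not MongoDB's actual implementation): every challenger sends election messages to the nodes with larger IDs, and a node that gets no answer from any larger node declares itself coordinator.

def bully_election(alive, starters):
    # alive: set of live node IDs; starters: non-empty set of nodes that noticed
    # the coordinator's failure.
    challengers = set(starters)
    while True:
        next_round = set()
        for node in sorted(challengers):
            higher = {n for n in alive if n > node}
            if not higher:
                return node          # nobody outranks this node: it wins the election
            next_round |= higher     # every larger live node takes over the election
        challengers = next_round

# Node 5 has died; nodes 2 and 3 notice it first, and node 4 ends up as coordinator.
print(bully_election(alive={1, 2, 3, 4}, starters={2, 3}))   # 4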

The election process counts the number of participating nodes and makes sure that at least half of the nodes in the cluster take part. This guarantees that under a network partition only one part can elect a coordinator: if the network splits into several mutually unreachable regions, an election can only succeed in the region whose reachable nodes make up more than half of the cluster. If the cluster is split into pieces none of which contains more than half of the original nodes, no coordinator can be elected, and of course the cluster cannot be expected to keep providing service.
