Distributed systems are characterized by exchanging state over links with high latency or unreliable delivery. If we want a system to operate reliably, it must be robust against both node and network failures, yet not every system provides the safety guarantees we would like. In this article, we'll explore some of the design considerations of distributed databases and how they respond to network partitions.
When two nodes exchange messages, the IP network may arbitrarily drop, delay, reorder, or duplicate them, so many distributed systems use TCP to prevent reordering and duplication. But TCP/IP is still fundamentally asynchronous: the network can delay messages arbitrarily, and connections can be cut off at any time. Moreover, failure detection is unreliable: it may be impossible to tell whether a node has died, whether the network link has been cut, or whether an operation is simply slower than expected.
When messages are arbitrarily delayed or dropped, we call it a network partition. Partitions occur in production networks for many reasons: garbage collection (GC) pressure, network card (NIC) failure, switch failure, misconfiguration, network congestion, and so on. Because partitions happen, the CAP theorem limits the strongest guarantees a distributed system can offer. When messages are being dropped, a "consistent" (CP) system preserves linearizability by rejecting requests on some nodes. An "available" (AP) system can serve requests on every node, but must sacrifice linearizability, because different nodes may disagree about the order of operations. When the network is healthy, a system can be both consistent and available; but since real networks inevitably partition, there is no truly "consistent and available" (CA) system.
It is also worth noting that the CAP theorem applies not only to a database as a whole, but also to subsystems such as tables, keys, and columns, and even to individual operations. For example, a database can provide consistency for each key independently without guaranteeing consistency across keys. This compromise lets the system serve more requests during a partition. Many databases also let you tune the consistency level of individual reads and writes, trading performance against correctness.
Testing partitions
The theory constrains the design space, but real software may not reach those bounds. We need to test a system's behavior to really understand how it performs.
First, we need a set of nodes to test against. I set up five LXC nodes on a Linux machine, but you could also use Solaris zones, VMs, EC2 nodes, physical hardware, and so on. The nodes need to share some kind of network; in my case, a single virtual bridge interface. I named the nodes n1, n2, ..., n5 and set up DNS entries for them, both between the nodes and on the host OS.
To create a partition, you need some way to drop or delay messages, for example with firewall rules. On Linux, you can use a command like iptables -A INPUT -s some-peer -j DROP to create a one-way partition, in which messages from a given peer to the local node are dropped. By applying such rules to several hosts, you can build artificial patterns of network packet loss.
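As a concrete illustration, here is a minimal Clojure sketch (the same language as the test program described below) that shells out over SSH to apply and clear such rules. The node names, and the assumption of passwordless SSH and sudo on the test nodes, are mine rather than part of the original setup.

(ns partition.control
  "Minimal sketch: create and heal one-way partitions via iptables over SSH.
   Assumes passwordless SSH and sudo on the test nodes; node names are examples."
  (:require [clojure.java.shell :refer [sh]]))

(def nodes ["n1" "n2" "n3" "n4" "n5"])

(defn drop-from!
  "On `host`, drop all inbound packets originating from `peer`."
  [host peer]
  (sh "ssh" host "sudo" "iptables" "-A" "INPUT" "-s" peer "-j" "DROP"))

(defn heal!
  "Flush the filter rules on every node, restoring full connectivity."
  []
  (doseq [host nodes]
    (sh "ssh" host "sudo" "iptables" "-F")))

;; Example: isolate {n1, n2} from {n3, n4, n5} in one direction.
(comment
  (doseq [host ["n1" "n2"] peer ["n3" "n4" "n5"]]
    (drop-from! host peer))
  (heal!))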
Running these commands repeatedly across several hosts takes a bit of effort. I used a tool I wrote myself, called Salticid, but you could use cssh or any other cluster automation system. The key factor is latency: you need to be able to start and end a partition quickly, so slow-converging systems like Chef are not much use here.
Next, you need a distributed system running on these nodes, and an application to test it. I wrote a simple test: a Clojure program that runs outside the cluster and uses multiple threads to simulate five isolated clients. The clients concurrently add N integers to a set in the distributed system; one writes 0, 5, 10, ..., another writes 1, 6, 11, ..., and so on. Each client records every write it attempts, successful or not, in a log. When all writes are complete, the program waits for the cluster to converge, then checks whether the clients' logs match the actual state of the database. This is a simple consistency check, but it can be used to test many different data models.
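To make the shape of this test concrete, here is a rough Clojure sketch under my own assumptions; it is not the downloadable harness. The functions add-element! and read-elements are hypothetical stand-ins for the database client under test, and the ten-second sleep is a crude stand-in for waiting for convergence.

(ns set-app.sketch
  "Rough sketch of the test described above -- not the downloadable code.
   `add-element!` and `read-elements` stand in for the database client under test.")

(defn worker
  "Client i of n-clients writes i, i+n-clients, i+2*n-clients, ... and returns
   a log of {:element x, :ok? bool} entries, one per attempted write."
  [add-element! i n-clients n-total]
  (vec (for [x (range i n-total n-clients)]
         {:element x
          :ok?     (try (add-element! x) true
                        (catch Exception _ false))})))

(defn run-test
  [add-element! read-elements n-clients n-total]
  (let [futures   (mapv #(future (worker add-element! % n-clients n-total))
                        (range n-clients))        ; start all clients concurrently
        logs      (mapv deref futures)            ; wait for every writer to finish
        acked     (set (map :element (filter :ok? (apply concat logs))))
        _         (Thread/sleep 10000)            ; crude stand-in for convergence
        survivors (set (read-elements))]
    {:acknowledged (count acked)
     :survivors    (count survivors)
     :lost         (sort (remove survivors acked))     ; acknowledged but missing
     :unacked-ok   (sort (remove acked survivors))}))  ; present but never acknowledged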
The client and the configuration automation used in these examples, including the scripts that simulate partitions and set up the databases, are freely available. Click here for code and instructions.
PostgreSQL
A single-node PostgreSQL instance is a CP system: it provides serializable consistency for transactions, at the cost of becoming unavailable when the node fails. However, the distributed system composed of the server and a client connected to it does not guarantee consistency.
Postgres's commit protocol is a special case of two-phase commit (2PC). In the first phase, the client votes to commit (or abort) the current transaction and sends that message to the server. The server checks whether its consistency constraints allow the transaction, and if so, commits it. After writing the transaction to storage, it notifies the client that the commit has been processed (or, as the case may be, failed). Now the client and the server agree on the outcome of the transaction.
But what happens if the message acknowledging the commit is cut off before it reaches the client? Then the client does not know whether the commit succeeded or failed! The 2PC protocol requires nodes to wait for the acknowledgment message in order to learn the outcome of the transaction. If the message never arrives, 2PC cannot complete correctly, so 2PC is not a partition-tolerant protocol. Real systems cannot wait indefinitely; at some point the client times out, leaving the commit protocol in an indeterminate state.
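To make the client's view of this concrete, here is a minimal Clojure sketch using clojure.java.jdbc as an illustrative choice; the table and column names mirror the test output below, and treating SQLState 08006 (connection failure) as "indeterminate" is the point being made, not something the driver decides for you.

(ns set-app.pg-client
  "Sketch of a Postgres writer whose outcome may be indeterminate.
   Uses clojure.java.jdbc; table/column names mirror the test output below."
  (:require [clojure.java.jdbc :as jdbc]))

(def db {:dbtype "postgresql" :host "n1" :dbname "jepsen" :user "jepsen"})

(defn add-element!
  "Returns :ok on an acknowledged commit, :indeterminate when an I/O error
   means we cannot tell whether the commit was applied."
  [x]
  (try
    (jdbc/insert! db :set_app {:element x})
    :ok
    (catch java.sql.SQLException e
      ;; SQLState 08006 = connection failure: the commit may or may not have happened.
      (if (= "08006" (.getSQLState e)) :indeterminate :failed))))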
When I artificially induce this partition, the Postgres JDBC driver throws exceptions like the following:
217 An I/O error occurred while sending to the backend.
Failure to execute query with SQL:
INSERT INTO "set_app" ("element") VALUES (?)  ::  [219]
PSQLException:
 Message: An I/O error occured while sending to the backend.
 SQLState: 08006
 Error Code: 0
218 An I/O error occured while sending to the backend.
We might read this as "writes 217 and 218 failed." But when the test application queries the database to find out which writes actually succeeded, it finds that both of these "failed" writes appear in the results:
1000 total
950 acknowledged
952 survivors
2 unacknowledged writes found! ヽ(ー`)ノ
(215 218)
0.95 ack rate
0.0 loss rate
0.002105263 unacknowledged but successful rate
Of the 1000 attempted writes, 950 were successfully acknowledged, and all 950 acknowledged writes appear in the result set. However, writes 215 and 218 also succeeded, even though they threw exceptions! Note that the exception does not guarantee whether the write succeeded or failed: write 217 also threw an I/O error while sending, but because the client's commit message was cut off before it reached the server, that transaction never took effect.
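For reference, this is roughly how those summary numbers relate to the sets above; the formulas are my reconstruction of the arithmetic, not the harness's exact code, but they reproduce the figures shown.

;; Sketch: deriving the summary rates from the counts above (my reconstruction).
(defn rates [{:keys [total acked unacked-ok lost]}]
  {:ack-rate        (double (/ acked total))        ; 950/1000 = 0.95
   :loss-rate       (double (/ lost acked))         ; 0/950    = 0.0
   :unacked-ok-rate (double (/ unacked-ok acked))}) ; 2/950    = 0.002105263

(rates {:total 1000 :acked 950 :unacked-ok 2 :lost 0})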
There is no way to reliably distinguish these cases from the client. A network partition, and indeed most network errors, does not mean failure; it simply means an absence of information. Without a partition-tolerant commit protocol, such as an extended three-phase commit, we cannot determine the state of these writes.
You can handle this uncertainty by making your operations idempotent and retrying them blindly. Alternatively, you can include the transaction's ID as part of the write itself, and query for that ID after the partition heals.
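A minimal sketch of the transaction-ID approach, again using clojure.java.jdbc as an illustrative choice; the txn_id column and the helper names are my own additions, not part of the test schema.

(ns set-app.recovery
  "Sketch of the transaction-ID strategy: tag each write with a client-generated
   ID, and after the partition heals, query for that ID to learn the outcome.
   The txn_id column is an illustrative addition."
  (:require [clojure.java.jdbc :as jdbc]))

(defn add-element!
  [db x]
  (let [txn-id (str (java.util.UUID/randomUUID))]
    (try
      (jdbc/insert! db :set_app {:element x :txn_id txn-id})
      {:status :ok :txn-id txn-id}
      (catch java.sql.SQLException _
        ;; Outcome unknown; remember the ID so we can check later.
        {:status :indeterminate :txn-id txn-id}))))

(defn committed?
  "After the partition heals, ask the database whether the write took effect."
  [db txn-id]
  (-> (jdbc/query db ["SELECT 1 FROM set_app WHERE txn_id = ?" txn-id])
      seq
      boolean))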
Redis
Redis is a data structure server, typically deployed as a shared heap. Because it runs single-threaded on a single server, it offers linearizable consistency by default: all operations happen in a single, well-defined order.
Redis also offers asynchronous primary-replica replication. One server is chosen as the primary and accepts writes; it then relays its state changes to the replicas. Asynchronous, in this context, means that the client is not blocked while the primary replicates a write: the write will "eventually" arrive at all the replicas.
To handle node discovery, leader election, and failover, Redis includes an additional system: Redis Sentinel. Sentinel nodes constantly exchange the state of the Redis servers they can reach, and try to promote and demote nodes so that there is a single authoritative primary. In this test, I installed Redis and Redis Sentinel on all five nodes. Initially, n1 is the primary and n2 through n5 are replicas, with all five clients writing to n1; then I partitioned n1 and n2 away from the other nodes.
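For illustration, a writer for this test might look like the following Clojure sketch, using the Carmine client as an arbitrary choice; the key name is an example, and the connection is simply pinned to n1 rather than following Sentinel's view of the current primary, as a real test client would.

(ns set-app.redis-client
  "Sketch of a Redis writer for the test above, using the Carmine client
   (illustrative choice; key name is an example)."
  (:require [taoensso.carmine :as car :refer [wcar]]))

(def conn {:pool {} :spec {:host "n1" :port 6379}})   ; pinned to the initial primary

(defn add-element!
  "Add x to a Redis set; returns 1 if newly added, 0 if already present."
  [x]
  (wcar conn (car/sadd "test-set" x)))

(defn read-elements
  "Read the final set after the cluster converges (elements come back as strings)."
  []
  (wcar conn (car/smembers "test-set")))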
If Redis were a CP system, n1 and n2 would become unavailable when the partition occurs, and a new primary would be elected in the majority component (n3, n4, n5). But that is not what happens: writes continue to succeed on n1. After a few seconds, the Sentinel nodes detect the partition and elect a new primary on the other side, say n5.
For the duration of the partition, there are two primaries in the system, one on each side of the partition, and both accept writes independently. This is a classic split-brain scenario, and it violates the C (consistency) in CP. Writes (and reads) in this state are not linearizable, because clients observe different states of the database depending on which node they happen to be connected to.