When it comes to distributed systems, there is one thing we must deeply understand and always remember: distributed systems are unreliable.
"Reliability" means that the system can run continuously without fault, and if a system is accidentally down or is not working properly, then he is an unreliable system, even if the downtime and unusable time is very short. We know that the distributed system is usually composed of independent servers through the network loosely coupled, and the network is essentially a complex I/O system, and generally, I/O failure probability and reliability is much higher than the CPU and memory of the host, coupled with the introduction of network equipment, It also increases the likelihood of extensive paralysis in the system. In short, the important theory and design of distributed system is based on the premise that "distributed system is unreliable", because the system is unreliable, so we need to add some additional complex design and function to ensure that the probability of the system being unavailable is minimized because of the unreliable of the distributed system. "Availability" is a calculated metric that if the system crashes at 1ms per hour, his availability is more than 5 9, and if a system is never crashed, but has to stop for two weeks each year, then he is highly reliable, but only 96% available;
After understanding the reliability principle of distributed systems, the next thing to explain is the principle of consistency. Consistency in a distributed cluster is "a huge stone that cannot be bypassed" in distributed systems: many important distributed systems run into the consistency problem, and the handful of consistency algorithms that solve it today are all quite complex.
The consistency problem in distributed systems is usually described by the following scenario:
n nodes form a distributed cluster, and we must ensure that all nodes can execute the same sequence of commands and reach a consistent state; that is, after every node executes the same command sequence, the result on every node is the same. In practice, because distributed systems are unreliable, it is usually only possible to guarantee that more than half of the nodes in the cluster (n/2 + 1) are healthy and consistent, and this is considered sufficient to meet the requirement.
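To make the "same command sequence, same state" idea concrete, here is a minimal sketch assuming each node is a deterministic state machine; the ReplicatedCounter class and its commands are hypothetical, not taken from any real consensus library:

```java
import java.util.List;

// If every node is deterministic and applies the same command log in the
// same order, all nodes end in the same state.
public class ReplicatedCounter {
    private long state = 0;

    // Each command deterministically transforms the node's state.
    void apply(String command) {
        switch (command) {
            case "inc" -> state++;
            case "dec" -> state--;
            case "dbl" -> state *= 2;
        }
    }

    public static void main(String[] args) {
        List<String> log = List.of("inc", "inc", "dbl", "dec"); // agreed-on order
        ReplicatedCounter nodeA = new ReplicatedCounter();
        ReplicatedCounter nodeB = new ReplicatedCounter();
        log.forEach(nodeA::apply);
        log.forEach(nodeB::apply);
        // Same commands, same order, same logic -> same state on both nodes.
        System.out.println(nodeA.state == nodeB.state); // true (both are 3)
    }
}
```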
As shown in the illustration below, a typical example of a distributed-cluster "consistency" algorithm comes from Kafka.
When a client initiates a write request to the Kafka cluster, the cluster's leader first writes a copy of the data locally and then issues remote write requests to multiple followers; during this process, some follower nodes may fail to acknowledge (ACK). At this point, according to the "consistency" algorithm, the operation succeeds if more than half of the nodes in the cluster respond correctly. In the illustration above, the leader and two followers succeed, which is a majority, so the leader commits the message data and returns success to the client; otherwise the data is not committed and the write request fails.
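The decision rule itself can be sketched in a few lines. This is only an illustration of the generic majority-ack rule described above, not Kafka's actual replication protocol (which in reality tracks an in-sync replica set); the QuorumWrite class and Ack record are hypothetical:

```java
import java.util.List;

// Commit a write only if a strict majority of the cluster has it.
public class QuorumWrite {

    record Ack(String follower, boolean ok) {}

    // The leader's local write counts as one vote; the write commits only
    // if the total number of successful replicas reaches n/2 + 1.
    static boolean commit(int clusterSize, List<Ack> followerAcks) {
        long successes = 1 + followerAcks.stream().filter(Ack::ok).count();
        return successes >= clusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        // 5-node cluster: leader plus two followers ack, two followers fail.
        List<Ack> acks = List.of(
                new Ack("f1", true),
                new Ack("f2", true),
                new Ack("f3", false),
                new Ack("f4", false));
        System.out.println(commit(5, acks)); // true: 3 of 5 is a majority
    }
}
```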
Because the scenario described by the "consistency" algorithm is so representative (almost every distributed system that persists data faces this problem), and because the algorithms themselves are complex and challenging, consistency has long been one of the hot research topics in the distributed field.
When we discuss consistency across the nodes of a distributed cluster, we generally mean "eventual consistency". Eventual consistency in effect lowers the consistency standard, accepting that "data consistency has latency" in exchange for better read/write performance. Today, eventual consistency is a design goal followed by more and more distributed systems, and the full description of its scenario is as follows:
In a distributed database cluster with read/write separation, suppose data item B is updated. Subsequent reads of B do not necessarily return the updated value; the period between the update of B and the moment subsequent reads return the new value is called the "inconsistency window". The maximum size of this window depends on factors such as communication latency, system load, and the number of replicas in the replication scheme. Eventual consistency guarantees that the inconsistency window is bounded and that all read operations will eventually return the latest value of B. DNS is a successful example of eventual consistency in practice.
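Here is a minimal sketch of the inconsistency window, assuming a primary map, a read replica map, and an artificial 100 ms replication delay (all names and the delay are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Writes hit the primary immediately; the read replica only sees them
// after a replication delay, which is the "inconsistency window".
public class EventualConsistency {
    static final Map<String, String> primary = new ConcurrentHashMap<>();
    static final Map<String, String> replica = new ConcurrentHashMap<>();
    static final ScheduledExecutorService replicator =
            Executors.newSingleThreadScheduledExecutor();

    static void write(String key, String value) {
        primary.put(key, value); // visible on the primary right away
        // Apply to the replica after an (illustrative) 100 ms delay.
        replicator.schedule(() -> replica.put(key, value),
                100, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        write("B", "v2");
        // Inside the window: the replica still serves the stale (absent) value.
        System.out.println("replica read now:   " + replica.get("B")); // null
        Thread.sleep(200); // wait past the replication delay
        System.out.println("replica read later: " + replica.get("B")); // "v2"
        replicator.shutdown();
    }
}
```

The first read falls inside the window and misses the update; once the window closes, the replica returns the latest value, which is exactly the guarantee eventual consistency makes.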
Reference: "Decoding the Architecture: From Distributed to Microservices"