I. Overview of CAP theory
A distributed system can satisfy at most two of the following three properties at the same time: consistency (C), availability (A), and partition tolerance (P).
II. Definition of CAP
1. Consistency
Consistency means "all nodes see the same data at the same time": once an update operation succeeds and returns to the client, every node holds exactly the same data at that moment.
Consistency can be viewed from two perspectives: the client side and the server side.
From the client side, consistency is mainly about how updated data becomes visible to multiple concurrent accesses.
From the server side, it is about how updates are replicated across the system so that the data eventually becomes consistent.
Consistency problems arise from concurrent reads and writes, so when reasoning about consistency it is important to consider concurrent read and write scenarios together.
From the client's perspective, when multiple processes access the data concurrently, the policy that governs how each process obtains updated data determines the consistency level. Strong consistency, as in relational databases, requires that updated data be visible to all subsequent accesses. Weak consistency tolerates some or all subsequent accesses not seeing the update. Eventual consistency requires that the updated data become visible after some period of time.
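The three consistency levels described above can be illustrated with a toy primary/replica pair. This is only a sketch under stated assumptions: the class `ReplicatedValue` and its methods are hypothetical names invented for illustration, replication is triggered manually, and a real system would replicate over a network.

```python
class ReplicatedValue:
    """Toy primary/replica pair with explicit, manual replication (hypothetical)."""

    def __init__(self, value):
        self.primary = value
        self.replica = value
        self.pending = []  # updates written to the primary but not yet replicated

    def write(self, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary = value
        self.pending.append(value)

    def replicate(self):
        # Apply all pending updates to the replica, in order.
        for value in self.pending:
            self.replica = value
        self.pending.clear()

    def read_strong(self):
        # Strong consistency: make sure replication has finished before reading,
        # so the read always sees the latest acknowledged write.
        self.replicate()
        return self.replica

    def read_weak(self):
        # Weak consistency: return whatever the replica holds, possibly stale.
        return self.replica


store = ReplicatedValue("V0")
store.write("V1")
stale = store.read_weak()      # the replica has not caught up yet
store.replicate()              # eventual consistency: the update arrives later
converged = store.read_weak()  # the replica now holds the updated value
```

Here `read_weak` may return the old value right after a write, while the same read returns the new value once `replicate` has run, which is the eventual-consistency behavior; `read_strong` never returns a stale value, at the cost of waiting for replication.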
2. Availability
Availability means "reads and writes always succeed": the service is always available and responds within a normal response time.
In an available distributed system, every non-failed node must respond to every request it receives. That is, any algorithm used by the system must eventually terminate. When partition tolerance is also required, this is a strong definition: even under severe network failures, every request must terminate.
Good availability mainly means that the system serves users well, without poor user experiences such as failed operations or access timeouts. Availability is usually closely tied to distributed data redundancy, load balancing, and similar techniques.
3. Partition tolerance
Partition tolerance means "the system continues to operate despite arbitrary message loss or failure of part of the system": when a distributed system encounters a node failure or a network partition, it can still provide service that satisfies consistency and availability. Partition tolerance is closely related to fault tolerance and scalability.
In distributed applications, the system may fail to operate properly for reasons that stem from distribution itself. Good partition tolerance requires that the application, although it is a distributed system, appear to function as a single whole. For example, if one or several machines in the system go down and the remaining machines can still operate normally and meet the system's requirements, or if a network failure splits the system into several isolated parts and those parts can still keep the system running, then the system has good partition tolerance.
III. Proof of CAP
We prove CAP with a basic scenario. The network has two nodes, N1 and N2, which can be thought of simply as two computers connected by a network. N1 hosts an application A and a database V; N2 hosts an application B and a database V. A and B are two parts of the distributed system, and the two V databases are the two sub-databases of its data store.
When consistency is satisfied, the data in N1 and N2 is the same: V0 = V0.
When availability is satisfied, the user receives a response immediately, whether the request goes to N1 or N2.
When partition tolerance is satisfied, the normal operation of N1 and N2 is not affected, whether one of them goes down or the network between them fails.
In the normal operation of the distributed system, a user sends a data update request to N1, and application A updates the database from V0 to V1. The distributed system then performs a data synchronization operation M, which copies V1 from N1 to N2, so that the data V0 in N2 is also updated to V1. N2 can then answer the requests it receives with the data V1.
Here, consistency is defined as whether the data in the databases V on N1 and N2 is the same, availability as whether external requests to N1 and N2 receive a response, and partition tolerance as whether the system tolerates failures of the network between N1 and N2. This is the normal, ideal scenario. But reality is harsh: when errors occur, can consistency, availability, and partition tolerance all be satisfied, or must a choice be made?
The biggest difference between a distributed system and a stand-alone system is the network. Now assume an extreme case: the network between N1 and N2 is disconnected. If we want to tolerate this network failure, that is, to satisfy partition tolerance, can we satisfy both consistency and availability at the same time, or must we choose between them?
Suppose the network between N1 and N2 is disconnected. A user sends a data update request to N1, and the data V0 in N1 is updated to V1. Because the network is disconnected, the synchronization operation M fails, so the data in N2 is still V0. Another user then sends a data read request to N2. Since the data has not been synchronized, the application cannot immediately return the latest data V1 to the user.
What do we do? There are two options:
First, sacrifice data consistency and respond to the user with the old data V0;
Second, sacrifice availability: block and wait until the network connection is restored and the data update operation M completes, and then respond to the user with the latest data V1.
This process shows that a distributed system that satisfies partition tolerance can choose only one of consistency and availability.
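The scenario in this proof can be replayed as a small simulation. The sketch below is illustrative only: the `Cluster` class, the `"CP"`/`"AP"` mode labels, and the `Unavailable` exception are hypothetical names, and the second option (blocking until the network recovers) is modeled by raising an exception instead of actually blocking.

```python
class Unavailable(Exception):
    """Stands in for a request that would block until the partition heals."""


class Cluster:
    """Two nodes, N1 and N2, each holding a copy of the database V (hypothetical)."""

    def __init__(self, mode):
        self.mode = mode                       # "CP" or "AP"
        self.data = {"N1": "V0", "N2": "V0"}
        self.connected = True                  # state of the N1 <-> N2 network

    def write_n1(self, value):
        self.data["N1"] = value
        if self.connected:
            self.data["N2"] = value            # synchronization operation M

    def read_n2(self):
        if not self.connected and self.mode == "CP":
            # Option 2: give up availability rather than risk stale data.
            raise Unavailable("N2 cannot confirm it holds the latest data")
        # Option 1 (AP), or no partition: answer from local data.
        return self.data["N2"]

    def heal(self):
        # The network recovers and M finally completes.
        self.connected = True
        self.data["N2"] = self.data["N1"]


def run(mode):
    cluster = Cluster(mode)
    cluster.connected = False                  # the network is disconnected
    cluster.write_n1("V1")                     # update reaches N1 only
    try:
        during = cluster.read_n2()
    except Unavailable:
        during = "unavailable"
    cluster.heal()
    return during, cluster.read_n2()


ap_result = run("AP")   # stale read during the partition, then the new value
cp_result = run("CP")   # no answer during the partition, then the new value
```

In the AP run the read during the partition returns the old data V0 (consistency is sacrificed); in the CP run the same read gets no answer until the partition heals (availability is sacrificed). Both runs return V1 once M has completed.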
IV. CAP Trade-offs
CAP theory tells us that we cannot satisfy consistency, availability, and partition tolerance all at once. So which one should we give up?
- CA without P: if P is not required (partitions are not allowed), then C (strong consistency) and A (availability) can both be guaranteed. But a partition is not something you get to choose; it will always exist. So a CA system is better understood as one in which each subsystem remains CA after a partition occurs.
- CP without A: if A (availability) is not required, every request must be strongly consistent across servers, and P (a partition) can make synchronization take indefinitely long; under those conditions CP can be guaranteed. Many traditional distributed database transactions belong to this model.
- AP without C: to be highly available and tolerate partitions, you must give up consistency. Once a partition occurs, nodes may lose contact with each other, and to stay highly available each node can only serve requests from its local data, which leads to global data inconsistency. Many NoSQL systems fall into this category.
For most large-scale Internet applications, hosts are numerous, deployments are scattered, and cluster sizes keep growing, so node failures and network failures are the norm. To guarantee service availability of N nines, such applications guarantee P and A and give up C (falling back to eventual consistency as the next best thing). Although this affects the customer experience in some places, it is not severe enough to break the user's workflow.
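To make "N nines" concrete, the short calculation below converts an availability target into the downtime it allows per year. It is a sketch that assumes the common reading of N nines as an availability of 1 - 10^(-N) (for example, three nines is 99.9%) and a 365-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def max_downtime_seconds_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of 1 - 10**(-nines)."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    # The unavailable fraction is 10**(-nines) of the year.
    return SECONDS_PER_YEAR / 10 ** nines


for n in range(2, 6):
    hours = max_downtime_seconds_per_year(n) / 3600
    print(f"{n} nines: at most {hours:.3f} hours of downtime per year")
```

Each additional nine shrinks the downtime budget by a factor of ten, which is why blocking on every partition quickly becomes incompatible with a high availability target.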
For scenarios involving money, where no compromise is possible, C must be guaranteed. When the network fails, it is better to stop the service; this guarantees CA and gives up P. It seems that the domestic banking industry has had no more than 10 such incidents in recent years; their impact was small and they were not widely reported, so the general public knows little about them. Another option is to guarantee CP and give up A, for example by becoming read-only (no writes) during a network failure.
There is no universal conclusion about which choice is better; it depends on the scenario, and whatever fits is best.
Reproduced
CAP Theory for Distributed Systems