The most common basic means of ensuring the reliability and availability of an online system are multi-copy and master-slave. Both solutions aim to eliminate the single point of failure (SPOF). A single point means that, for a certain service or function module in the system, only one instance is running. The problem is that once that instance goes offline, the entire system goes down; once that instance loses data, the entire system loses data.
Eliminating single points is nothing more than increasing the number of instances, which we usually call "redundancy". But redundancy is not that simple. Some services or modules have no persistent state (roughly speaking, they store no data); for them, adding redundancy is easy: just deploy as many instances as required. However, if the service or module needs to save data, the problem becomes complicated. Two redundancy solutions are commonly used in practice: master-slave and multi-copy.
The multi-copy solution is intuitive: if you don't want a single point, increase the number of data copies; when one is broken, the others are still there. However, because a server can go offline or be damaged at any moment, the data on different copies may diverge. We cannot assume that the data read from any one copy is up to date, and a write sent to multiple copies cannot be guaranteed to succeed on all of them. This is the familiar problem of "consistency". To obtain the correct data, the copies must be integrated: in general, a write must succeed on at least a certain number of copies, and a read must also consult at least a certain number of copies. If the two thresholds are set properly, so that every read set overlaps every successful write set, the latest data can always be read.
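As a minimal sketch of this idea (the names, the in-memory replica list, and the version stamp are all invented for illustration, not a real storage API), assume N copies, a write quorum W, and a read quorum R with W + R > N, so any R copies always include at least one copy that received the latest successful write:

```python
import time

N, W, R = 3, 2, 2                        # copies, write quorum, read quorum; W + R > N

replicas = [dict() for _ in range(N)]    # each replica: key -> (version, value)

def quorum_write(key, value):
    """Write to every replica; report success only if at least W acknowledge."""
    version = time.time_ns()             # a monotonically increasing version stamp
    acks = 0
    for rep in replicas:
        try:
            rep[key] = (version, value)  # in a real system this is a network call that may fail
            acks += 1
        except Exception:
            pass                         # an unreachable replica simply does not ack
    return acks >= W

def quorum_read(key):
    """Consult R replicas and keep the highest-versioned value among their answers."""
    answers = [rep.get(key) for rep in replicas[:R]]   # ask any R replicas
    answers = [a for a in answers if a is not None]    # a replica that missed the write answers nothing
    return max(answers)[1] if answers else None        # W + R > N: the newest write is always seen

if quorum_write("user:42", "alice"):
    print(quorum_read("user:42"))        # -> alice
```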
The master-slave solution is easier to understand: data is continuously backed up from the master server to the slave server, so if the master goes down or loses data, the slave can take over and keep working. There are two ways to back up data between the master and the slave: asynchronous backup and synchronous backup. In asynchronous mode, the master acknowledges the write to the client as soon as the data is written locally, and then synchronizes the data to the slave when appropriate. In synchronous mode, after the data is written to the master, the master forwards it to the slave; only when the slave has also written the data successfully does the master return the result to the client.
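To make the difference concrete, here is a hedged sketch (the Primary and Replica classes and their methods are invented for illustration) showing where each mode acknowledges the client:

```python
class Replica:
    def __init__(self):
        self.log = []
    def apply(self, record):
        self.log.append(record)
        return True                      # ack

class Primary:
    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas
        self.pending = []                # records still to be shipped asynchronously

    def write_async(self, record):
        self.log.append(record)          # 1. persist locally
        self.pending.append(record)      # 2. ship to slaves "when appropriate"
        return "ok"                      # 3. ack the client before any slave has the data

    def write_sync(self, record):
        self.log.append(record)          # 1. persist locally
        acks = sum(r.apply(record) for r in self.replicas)   # 2. forward and wait
        return "ok" if acks == len(self.replicas) else "error"   # 3. ack only after every slave wrote
```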
Obviously, asynchronous master-slave backup cannot guarantee reliability: when the master server loses data, there will always be some data that has not yet been backed up to the slave, and that data is lost forever.
Synchronous backup is therefore more reliable than asynchronous backup. Every acknowledged write exists in two copies; if the master goes down, there is still a copy on the slave. (Of course, the real situation is far more complex than this, and we will not go into it here.) The major problem with synchronous backup is performance: every write request must be forwarded by the master to the slave and acknowledged before the master can respond to the user, which greatly increases request latency.
Now let's look at the other aspects. After reliability, we have to consider availability. The synchronous mode requires both the master and the slave to write successfully before the client is notified of success, so when either of them is down, the system is unavailable. If the requirement is relaxed so that the slave does not have to write successfully, then some data exists only as the single copy on the master, and that data can still be lost.
To ensure reliability while improving availability, a reasonable choice is to increase the number of slave servers to 2, 3, or even 5, and to no longer require that every slave write succeeds. For example, with three slaves, as long as two of them write successfully, the client can be told that the data is safely stored. This provides sufficient reliability, and availability is no longer affected by the loss of any single server.
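Continuing the same illustrative sketch from above (the same invented Primary and Replica classes), this relaxation amounts to requiring only some number w of slave acknowledgements rather than all of them:

```python
def write_quorum(self, record, w=2):
    """Ack the client once the record is on the master plus at least w of the slaves."""
    self.log.append(record)
    acks = 0
    for r in self.replicas:
        try:
            if r.apply(record):
                acks += 1
        except Exception:
            continue                     # a dead slave simply does not count toward the quorum
    return "ok" if acks >= w else "error"

Primary.write_quorum = write_quorum      # attach to the sketch class defined earlier
```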
(Somehow I have a bad feeling about this.)
The next issue is consistency. The biggest advantage of the master-slave mode is that, in principle, you don't have to worry about consistency: once data is successfully written on the master server, it is consistent from the user's point of view.
Is that really true? It generally is, as long as nothing goes wrong. But when the master server runs into problems, such as a hardware failure or a system crash, a slave server must be promoted to master to keep the system available.
However, as mentioned earlier, the slave does not have to succeed on every write, so the data on the new master may be incomplete: it holds less data than the original master did. In the eyes of users, the system has lost data at this moment. The data still exists somewhere; it is just not on the current master.
Now let's solve this problem. Its root cause is that the data on the slaves is inconsistent with the data on the master, so the most direct solution is to try to make them consistent. When the master performs a synchronous write to a slave, it can retry several times to raise the success rate. But some retries will still ultimately fail, and then the inconsistency remains; and if the master keeps retrying indefinitely, it can never respond to the user. The only practical approach is to acknowledge the user first and keep retrying afterwards. To reduce system complexity, these subsequent retries are usually executed asynchronously in a separate service module. However, whether the retry runs in the master's own logic or in an independent asynchronous module, it cannot be guaranteed to succeed: system exceptions can interrupt the retry effort, and the retry task itself may even be lost. The retry is essentially a failover process, and the failover can itself fail. To ensure that the failover is executed correctly and on time, you need a failover for the failover; by the same logic, you then need a failover for that failover, and so on with no end. Therefore, the data on the master and slave servers cannot be made completely consistent without sacrificing availability. This road leads nowhere.
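As a toy illustration of such an asynchronous retry module (an in-memory queue and invented names; a real system would need durable state), with comments marking exactly the gaps described above:

```python
import queue, threading, time

retry_queue = queue.Queue()              # in practice this queue itself must be durable;
                                         # if the process holding it dies, retry tasks are lost

def enqueue_retry(slave, record):
    retry_queue.put((slave, record))     # the master hands off a failed sync and answers the client

def retry_worker():
    while True:
        slave, record = retry_queue.get()
        try:
            slave.apply(record)          # try again to bring the slave in sync
        except Exception:
            retry_queue.put((slave, record))   # the retry can fail too...
            time.sleep(1)                # ...and the worker can crash between get() and put(),
                                         # which is exactly the "failover of the failover" problem

threading.Thread(target=retry_worker, daemon=True).start()
```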
So let's take a step back: tolerate the inconsistency between master and slaves, and instead look for a way to eliminate it when it matters. As mentioned above, we do not require every write to succeed on every slave, only on a certain number of slaves. Therefore, integrating the data across enough slaves can recover a view that is consistent with the master. In this way, once the master goes down and a slave is promoted to be the new master, the new master handles each read request by also fetching the corresponding data from the other slaves, comparing it with its own copy, finding the correct (latest) version, and returning it to the user.
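A hedged sketch of this read path (hypothetical names; each server is assumed to keep a key -> (version, value) map, as in the quorum sketch earlier):

```python
def read_with_repair(new_master, slaves, key):
    """A newly promoted master answers a read by integrating the surviving copies."""
    candidates = []
    for server in [new_master] + list(slaves):
        if key in server:
            candidates.append(server[key])   # (version, value) from each copy that has the key
    if not candidates:
        return None
    version, value = max(candidates)         # the highest version is the correct data
    new_master[key] = (version, value)       # repair the new master's own copy in passing
    return value
```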
Wait, isn't this just the multi-copy solution? Indeed it is. Fundamentally, the master-slave solution satisfies reliability, availability, and consistency at the same time only under the premise that no server ever fails. That premise contradicts the basic fact that everything eventually fails. (The bloodier version of this fact is that "anything that cannot possibly happen will happen", and it always happens exactly when you believe it won't.) Moreover, the consistency baseline of the master-slave solution is built on the master server; after a master-slave switch, the cluster loses that baseline and is forced to seek consistency across all of the servers. The multi-copy solution, by contrast, makes it clear from the very beginning that servers are unreliable and that the data on any single server cannot be assumed complete and correct: the reliability, availability, and consistency of the entire system are built on integrating the data of all servers.
Once we accept the premise that everything will fail, we can see that the master-slave solution also ends up integrating the data of all servers to achieve consistency. At that point it is equivalent to the multi-copy scheme, but it pays extra costs, such as the master-slave switching logic and the extra hop of latency from forwarding every write from master to slave.
However, the multi-copy solution is not perfect either. In general, it can only be used for simple data structures and simple operation logic, such as a map with put/get/delete operations. Operations that involve multiple pieces of data, non-idempotent operations, or transactions make comparing and reconciling the copies extremely difficult.
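A small illustration of the difference, under the assumed (version, value) scheme used in the sketches above: an idempotent put can be reconciled between copies by simple comparison, while a non-idempotent operation that gets retried leaves copies that no comparison of stored values can reconcile.

```python
# Two replicas of a simple key -> (version, value) store.
replica_a = {"balance": (1, 100)}
replica_b = {"balance": (1, 100)}

# Idempotent put: even if the write reaches only one replica, the copies can be
# reconciled later just by comparing versions and keeping the newest value.
replica_a["balance"] = (2, 150)
replica_b["balance"] = max(replica_a["balance"], replica_b["balance"])   # -> (2, 150)

# Non-idempotent operation: "add 50" retried after an ambiguous failure may be
# applied twice on one replica and once on the other. Both copies now look
# plausible, and no comparison of the stored values can say which is correct
# without replaying the full history of operations.
def add(replica, key, delta):
    version, value = replica[key]
    replica[key] = (version + 1, value + delta)

add(replica_a, "balance", 50)            # A: 200
add(replica_b, "balance", 50)
add(replica_b, "balance", 50)            # B: 250 after a duplicated retry
```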
For scenarios where the reliability and availability requirements are not that high, we often have to settle for second best: use the master-slave mode to keep the business logic intact, at the cost of some reliability and availability. When using the master-slave solution, we usually also try to improve the reliability of the hardware itself, for example by using RAID to reduce the probability of failure.
Sometimes, however, the requirements for reliability, availability, and consistency are all high, and the master-slave mode cannot meet them. Then the only option is to change the business logic and simplify it to fit a multi-copy architecture. A large-scale object storage service (OSS) has very high requirements on reliability, availability, and consistency (it is an online service that earns its money precisely with reliability and availability, and cannot afford any carelessness here). Therefore, it has to discard all complicated data operations and simplify down to what multi-copy storage can support. That is why large object storage systems adopt the <key, value> form: not out of preference or cleverness, but entirely out of necessity. A cloud storage system inevitably adopts the multi-copy solution.
The comparison between the multi-copy and master-slave solutions is a very typical cloud computing problem (or "pitfall"). Many solutions can handle one specific requirement, such as reliability or availability, but once these requirements are combined, those solutions run into all kinds of obstacles. On the other hand, the design of an online system must consider not only how functions are implemented under normal circumstances, but also the various problems that arise when exceptions occur. The latter accounts for the bulk of the architecture and design workload of a cloud computing system. In other words, how close a cloud computing system gets to success depends on how many pitfalls it has escaped.