Background
Recently, @alizhengxiang (Mr. Yang) posted a pinned Weibo message that unexpectedly made waves. People from all corners of China's database community weighed in on it, and there was no shortage of heated debate and even attacks. Overall, the discussion centered on one point:
Without shared storage, can a traditional RDBMS (such as Oracle, MySQL, or PostgreSQL) achieve zero data loss when the master database fails?
After the topic blew up, our team had a heated debate of its own, with each side sticking to its position. At one point the debate came close to reenacting a scene from the Ukrainian parliament...
Fortunately, under my iron-fisted rule, everyone stayed focused on the technology, and the heated atmosphere never escalated into personal attacks. The debate turned out to be very rewarding, and I posted a Weibo message right afterwards to say how pleased I was.
After that Weibo post, some friends asked whether I could write up the content of the debate. I had the same idea. The debate touched on a whole series of topics, such as strong data synchronization, data consistency, performance, partition availability, Paxos, Raft, and CAP, and this is my first attempt to organize them and write them down. First, it deepens my own understanding. Second, it may earn me a little recognition.
The rest of this article steps back from any particular database and analyzes the following questions from the perspective of principles:
Question 1: Data consistency. Without shared storage, can a traditional RDBMS (such as Oracle, MySQL, or PostgreSQL) achieve zero data loss when the master database fails?
Question 2: partition availability. How can we ensure the continuous availability of the system when there are multiple copies of the database?
Question 3: performance. For an RDBMS that does not use shared storage, must performance be sacrificed to keep multiple copies consistent? How can we minimize the performance loss?
Question 4: analysis of an extreme scenario.
Question 1: Data consistency
Q: Is it impossible for traditional relational databases to achieve strong consistency between the master and slave databases without shared storage?
A: My answer is no. Even without shared storage, any database can achieve strong data consistency between the master and slave databases. This holds for Oracle, MySQL, PostgreSQL, and OceanBase alike.
How can we achieve strong consistency between the master and slave databases? Everyone knows the most important technique in databases: WAL (Write-Ahead Logging). Update operations write a log (Oracle redo log, MySQL binlog, and so on). When a transaction commits, the database makes sure that the log generated by the transaction is flushed to disk first, so that none of the transaction's updates are lost. The method for achieving strong data consistency between the master and slave databases is equally simple:
When a transaction commits, two log writes are initiated at the same time: one writes the log to the local disk, and the other ships the log to the slave database and makes sure it is persisted to disk there;
Only after both operations have returned successfully does the master database report to the application that the transaction was committed.
The logic of the entire transaction commit operation is shown in the following figure:
As the figure above shows, by the time the commit returns to the application, the log generated by the transaction exists on both the master and the slave database, strongly synchronized. Therefore, if the master database crashes and the slave database takes over, the slave's data is consistent with the master's and no transaction data is lost. That is strong data consistency between the master and slave databases. If you have used Oracle, you will know that Oracle Data Guard can work in three modes: maximum performance, maximum availability, and maximum protection. The third, maximum protection, follows the basic idea in the figure above.
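The commit logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LogStore` and `commit` are hypothetical names, an in-memory list stands in for a durable log, and a raised `ConnectionError` stands in for an unreachable node.

```python
from concurrent.futures import ThreadPoolExecutor


class LogStore:
    """Hypothetical stand-in for the durable log on one database node."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.records = []

    def append_and_sync(self, record):
        # A real node would write the record and fsync before returning.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.records.append(record)
        return True


def commit(record, master_log, slave_log):
    """Return True only if the log is persisted on BOTH master and slave."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Start the local write and the remote write at the same time.
        futures = [
            pool.submit(master_log.append_and_sync, record),
            pool.submit(slave_log.append_and_sync, record),
        ]
        try:
            return all(f.result() for f in futures)
        except ConnectionError:
            return False  # never report success to the application
```

Note that when the slave write fails, the record may already sit in the master's local log even though the application is told the commit did not succeed. That in-doubt state is exactly what Question 4 below is about.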
With strong data synchronization in place, the next concern is availability. We now have two databases with fully consistent data. The main purpose of the slave database is to take over application requests when the master database fails, so that the database as a whole keeps providing service: the master database crashes, and the slave database is promoted to master and serves external requests. This raises a decision-making question: who performs the master-slave switchover? You could, of course, switch manually after receiving an alarm that the master database has crashed. But manual handling is slow, and since a database can crash at any moment, relying on people for every failure is not good enough. This is where HA (High Availability) tools come in. An HA tool is generally deployed on a third server and connects to both the master and slave databases. When it detects that the master database cannot be reached, it switches over to the slave database. The processing logic is very simple, as shown in the following figure:
The HA software connects to both the master and slave databases and sends heartbeats at regular intervals. When the master database crashes, the HA software detects it and promotes the slave database to master (by moving the VIP or DNS entry to the slave database, and possibly by a series of other operations such as activating the slave database). The new master database then serves external requests. Because the master and slave databases are kept in sync through strong log synchronization, no data is lost and data consistency is guaranteed.
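A minimal sketch of such a heartbeat-driven failover decision might look like the following. The class name `HaMonitor` and the `max_missed` threshold are assumptions made for illustration; the text only says that the tool switches over when the master cannot be reached.

```python
class HaMonitor:
    """Naive HA tool: promotes the slave after repeated failed heartbeats."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed  # assumed threshold, not from the article
        self.missed = 0
        self.active = "master"

    def on_heartbeat(self, master_reachable):
        """Process one heartbeat result and return the node serving traffic."""
        if self.active != "master":
            return self.active  # already failed over
        if master_reachable:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                # A real tool would move the VIP or update DNS here.
                self.active = "slave"
        return self.active
```

The sketch sees only "reachable" or "not reachable". It has no way of knowing why a heartbeat failed, which is the weakness the following paragraphs explore.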
With strong log-based data synchronization and HA software for automatic master/slave failover, is everything solved? I would say this architecture handles more than 90% of the problems, but in some situations it also harbors several serious pitfalls.
First, an obvious one. After the master database crashes and the slave database is promoted, the database becomes a single point of failure, and it stays that way until the original master database has been restarted. If the new master database also crashes during this window, the entire system becomes unavailable. This can be solved by adding more replicas, that is, more slave databases, for example three replicas (one master and two slaves). I will not go into detail here.
Second, in a master-slave setup, a master database crash is relatively easy to handle and the decision is simple: the master crashes, so switch to the slave. But what if the master database has not crashed and the network has a problem instead, as shown in the following figure:
Suppose the network between the Master and the Slave has a problem, such as a disconnection or jitter. What should the database do? If the Master continues to provide service, the Slave is not receiving logs and data may be lost. If the Master stops providing service, the application becomes unavailable. In Oracle, maximum availability mode keeps providing service and tolerates data inconsistency, while maximum protection mode stops the Master from providing service. For this reason, when Oracle is set to maximum protection mode, it is generally recommended to configure two or more Slaves: as long as any one Slave synchronizes the log successfully, the Master can continue to provide service, which preserves system availability.
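The trade-off between the two modes can be written down as a small decision function. This is a simplification of the behavior described in the paragraph above, not Oracle configuration syntax; the mode labels and the function name are made up for this sketch.

```python
def master_may_commit(mode, slave_acks):
    """Decide whether the master may acknowledge a commit.

    mode: "max_availability" or "max_protection" (labels for this sketch only)
    slave_acks: one boolean per slave, True if that slave persisted the log
    """
    if mode == "max_availability":
        # Keep serving even when no slave is reachable; replicas may diverge.
        return True
    if mode == "max_protection":
        # At least one slave must hold the log, otherwise refuse the commit.
        return any(slave_acks)
    raise ValueError(f"unknown mode: {mode}")
```

With a single slave, `master_may_commit("max_protection", [False])` refuses the commit as soon as that slave is unreachable. With two slaves, `master_may_commit("max_protection", [False, True])` still succeeds, which is why a second slave is recommended in this mode.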
Network problems not only occur between Master and Slave, but also between HA and Master, HA and Slave. Consider the following situation:
There is a problem with the network between the HA and the Master. At this time, the HA faces two choices:
The HA cannot connect to the Master, concludes that the Master database has crashed, and promotes the slave database to master. In reality, only the network between the HA and the Master is broken. The original Master database is healthy (it has been neither demoted to a slave nor shut down) and can still serve external requests, and so can the new master database. Two master databases lead to dual writes, which is the most serious problem of all.
The HA cannot connect to the Master but assumes that the cause is the network and that the Master database has not crashed, so it does nothing. If the database really has crashed, however, the HA still does nothing and no database serves external requests. Dual writes are avoided, but application availability suffers.
Finally, the database can have problems and the network between databases can have problems, so the HA software can have problems too, as shown in the following figure:
What if the HA software itself fails? We deploy HA to keep the database system continuously available in all kinds of scenarios, but who keeps the HA itself continuously available? Do we need a master and a standby for the HA, and then another layer of HA on top of the HA? And so on, layer upon layer...
In fact, the problems described above are the classic Consensus problem in distributed systems. Lamport's Paxos protocol, which has become popular in recent years, and the Raft protocol recently published by Stanford University were both designed to solve it. (If you are interested in the Raft protocol, you can read an animated presentation of it: Raft: Understandable Distributed Consensus)
Question 2: partition availability
So far we have answered the first question: can strong data consistency between the master and slave databases be guaranteed without shared storage? The answer is yes. But the analysis above raised a second question: how do we keep the database continuously available under all circumstances? The HA mechanism described above, at least, cannot guarantee it. Can we bring in distributed consensus protocols such as Paxos and Raft to solve the problems described above?
The answer is yes. We can introduce Paxos or Raft to solve these problems and keep the entire database system continuously available. Consider again the strongly consistent master-slave system made up of two databases, which still uses HA for monitoring and failover, and recall the two new problems raised in the previous section:
How can I ensure the availability of the HA software itself?
If the HA software cannot reach the master database, has the master database crashed, or is there a network problem between the HA software and the master database? How can we make sure that there are never two master databases and never dual writes?
How can we ensure the continuous availability of the database while solving the above two problems?
To solve these problems, the new system is as follows:
Compared with the previous system, this one is clearly more complex, and by more than a little. The database is still a master-slave pair with strong data synchronization, but much else has changed, including:
An HA Client is deployed on each database server;
The single HA host is extended to three HA hosts. One is the HA Master, and the other two are HA Participants;
The HA hosts and the HA Clients communicate in both directions. The HA host checks whether the DB next to each HA Client can provide service, as before. What is new is the Master Lease communication from the HA Client to the HA host.
Can these changes solve the above two problems? Let's analyze them one by one. First, how can we ensure the availability of the HA software itself?
The number of HA hosts is increased to three precisely to solve this problem. The HA service is stateless, and the three HA hosts can elect an HA Master automatically through Paxos/Raft. I will not repeat the election logic here, as it is not the focus of this article. To learn more about how it is implemented, refer to the Paxos/Raft articles available online. In short, deploying three HA hosts and introducing the Paxos/Raft protocol makes the HA service highly available, so the availability of the HA software is guaranteed.
With the first problem solved, let's look at the second: how can we tell whether the current problem is a network fault or a master database crash? And how can we make sure that, under any circumstances, only one master database is serving external requests?
Deploying an HA Client on each database server and introducing a lease from the HA Client to the HA Master solves this second problem neatly as well. The lease mechanism means that the database instance in the figure does not hold the master (or slave) role indefinitely. The current master database may act as master only for a limited time, for example 10 seconds. Every 10 seconds, the HA Client must request a new lease from the HA Master to renew the master status of its local database. As long as it receives the HA Master's confirmation of the renewal every 10 seconds, the current master database will not be demoted to a slave database.
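As a rough illustration of the lease idea, the HA Client's side can be modeled as a small object that knows when the current lease runs out. `MasterLease` and its methods are hypothetical names for this sketch, and time is passed in explicitly so the example stays deterministic; a real client would presumably read a monotonic clock.

```python
class MasterLease:
    """Lease held by the HA Client on behalf of its local database."""

    def __init__(self, duration=10.0):
        self.duration = duration   # lease length in seconds (10 s as in the text)
        self.expires_at = None     # None means no lease is held

    def grant(self, now):
        """Called when the HA Master confirms a grant or a renewal."""
        self.expires_at = now + self.duration

    def is_master(self, now):
        """The database may act as master only while the lease is unexpired."""
        return self.expires_at is not None and now < self.expires_at
```

For example, after `lease.grant(0)` the database may act as master until time 10. If a renewal is confirmed at time 8, the right to act as master extends to time 18. If no renewal arrives, `is_master` turns false on its own, without any message from the HA Master.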
The second problem can be subdivided into three scenarios:
Scenario 1: the master database crashes, but the server hosting it and the HA Client keep running normally.
The master database has crashed, but the HA Client is running normally. In this scenario, the HA Client sends the HA Master a request to give up the master database lease. On receiving the request, the HA Master promotes the slave database to master right away. The original master database subsequently runs as a slave database.
Scenario 2: the host of the master database crashes (both the master database and the HA Client are down).
In this case, both the HA Client and the master database are down, so communication between the HA Master and the HA Client fails. The HA Master cannot promote the slave database immediately, because it cannot tell scenario 2 apart from scenario 3 (a network problem). It therefore waits out the lease time (for example, 12 seconds). If no renewal message has arrived by then, the HA Master promotes the slave database to master, and the new master serves external requests. After the host of the original master database restarts, the database on it runs as a slave.
Scenario 3: the master database is healthy, but the network between the master database and the HA Master fails.
The HA Master cannot tell scenario 2 from scenario 3, so it handles scenario 3 with the same logic: it waits until the lease expires and, if no renewal message has arrived, promotes the original slave database to master. Before that promotion happens, however, the HA Client of the original master database has some extra work to do. That HA Client has been sending lease renewal requests to the HA Master but, because of the network problem, has never received a response. Once the lease expires, it demotes its local master database to a slave. As a result, by the time the HA Master promotes the original slave database, the original master database has already been demoted by its HA Client. There is never a dual master, and applications cannot produce dual writes.
The analysis of these three scenarios shows that the second problem is also solved in this architecture. While solving it, the system has to wait at most for the configured lease time. If the lease is set to 10 seconds, then whatever goes wrong, the database stops serving for at most about 10 seconds, which is essentially continuous availability. The length of the service interruption depends entirely on the lease setting.
At this point we can say that a database system with continuous availability (guaranteed partition availability) and strong data consistency between the master and slave databases is entirely achievable, and that an existing database system can be adapted to it. In practice, however, the implementation is very complex. The master-slave switchover has to be implemented inside the database: the HA Master promotes the slave database; the HA Client demotes the master database; a database that recovers from a crash comes back as a slave database; the master lease is implemented through the HA Client; and the HA hosts themselves must be made highly available. Building all of this on top of an existing database is quite difficult. If you are interested, feel free to discuss it with me.
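The three scenarios can be summarized as a small timeline calculation. The sketch below is an illustration only: the function name, the scenario labels, and the `MARGIN` the HA Master waits beyond the lease are assumptions of this example, chosen so that the old master always steps down no later than the new master is promoted.

```python
LEASE = 10.0   # lease length in seconds, as in the example above
MARGIN = 2.0   # extra wait on the HA Master side; an assumption of this sketch


def failover_plan(scenario, last_renewal, failure_at):
    """Return (time the old master stops serving, time the slave is promoted).

    scenario: "db_crash"      database down, HA Client alive (scenario 1)
              "host_crash"    database and HA Client both down (scenario 2)
              "network_split" database alive, HA Master unreachable (scenario 3)
    """
    if scenario == "db_crash":
        # The HA Client gives up the lease at once, so promotion is immediate.
        return failure_at, failure_at
    promote_at = last_renewal + LEASE + MARGIN
    if scenario == "host_crash":
        return failure_at, promote_at
    if scenario == "network_split":
        # The HA Client demotes its database when the lease runs out.
        return last_renewal + LEASE, promote_at
    raise ValueError(f"unknown scenario: {scenario}")
```

In every branch the first value is less than or equal to the second, so at no time do two databases act as master. The price is the window between the two values, during which no master is available.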
Question 3: performance
Data consistency can be solved through strong log synchronization. Partition availability, meaning that the system stays available whatever exception occurs, can be solved by adding a distributed consensus protocol such as Paxos or Raft on top of strong data synchronization, although no mature implementation exists at present. Next, let's turn to a question many readers care about: how can we keep performance high while guaranteeing strong synchronization? Return to the first figure in this article:
To guarantee strong data synchronization, when the application issues a commit request, the transaction log must be synchronized to the Slave and persisted to disk there. Compared with writing to the Slave asynchronously, synchronous mode adds one network round trip from the Master to the Slave and one disk sync on the Slave. From the application's point of view, a Commit is bound to take longer. How much longer depends on the network latency between the master and slave databases and on the disk performance of the slave database.
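A back-of-the-envelope model makes the extra cost concrete. It assumes, as in the commit flow described earlier, that the local log write and the remote log write start at the same time; the function name and the sample numbers are illustrative, not measurements.

```python
def commit_latency(local_sync_ms, network_rtt_ms, slave_sync_ms, synchronous=True):
    """Estimate commit latency when local and remote log writes run in parallel."""
    if not synchronous:
        return local_sync_ms  # asynchronous replication: only the local flush counts
    remote_ms = network_rtt_ms + slave_sync_ms
    return max(local_sync_ms, remote_ms)
```

With a 1 ms local sync, a 0.5 ms round trip, and a 1 ms sync on the Slave, `commit_latency(1.0, 0.5, 1.0)` gives 1.5 ms against 1.0 ms in asynchronous mode. The difference is driven entirely by the network and the Slave's disk.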
To improve performance, the first simple idea is to deploy multiple Slaves. As soon as one Slave reports that the log has been synchronized, and the Master's local log is on disk, the commit can return. Deploying multiple Slaves is very effective at absorbing transient network jitter. Oracle's official recommendation for maximum protection mode is likewise to deploy multiple Slaves to minimize the impact of network jitter. With two Slaves deployed, the new architecture looks as follows:
A new Slave is added, giving three copies of the data. With two Slaves, a transaction can commit as soon as either of them has synchronized the log, which greatly reduces the impact of network jitter on any one link. The extra replica also addresses data safety after the master database crashes: even then, two replicas remain to provide service, so there is no single point of failure.
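The effect of a second Slave on jitter can be illustrated with a small calculation. The function name, the `required_acks` parameter, and the sample latencies are assumptions of this sketch; the default of one acknowledgement follows the rule that the commit returns once one Slave has synchronized the log.

```python
def commit_latency_multi(local_sync_ms, slave_ack_ms, required_acks=1):
    """Latency when the commit needs `required_acks` slave acknowledgements."""
    if required_acks > len(slave_ack_ms):
        raise ValueError("not enough slaves to satisfy required_acks")
    # The commit waits for the k-th fastest slave, not the slowest one.
    kth_fastest = sorted(slave_ack_ms)[required_acks - 1]
    return max(local_sync_ms, kth_fastest)
```

If one link jitters and its acknowledgement takes 50 ms, a single-Slave setup pays the full 50 ms, while `commit_latency_multi(1.0, [1.5, 50.0])` stays at 1.5 ms because the healthy Slave answers first.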
However, three copies of the data raise a new question: when the master database crashes, which slave database becomes the new master? The right to choose is still exercised by the HA Master, of course, but how should the HA Master choose? The following criteria can be used:
Log first. The Slave with the latest log is selected as the new master database.
Host priority. What if both Slaves have the same latest log? In that case either one will do. For example, you can choose by the IP address of the Slave host and select the Slave with the smaller IP address as the new master database. That settles the question.
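The two criteria above translate directly into a sort key. In this sketch the function name and the `(ip, last_log_position)` tuple layout are assumptions; the log position stands for whatever the database uses to order its log (for example, a sequence number).

```python
import ipaddress


def choose_new_master(slaves):
    """Pick the slave with the newest log; break ties by the smallest IP.

    slaves: list of (ip_string, last_log_position) tuples
    """
    return min(
        slaves,
        # Negate the position so the newest log sorts first, then compare IPs
        # numerically rather than as strings.
        key=lambda s: (-s[1], ipaddress.ip_address(s[0])),
    )[0]
```

Comparing parsed addresses matters for the tie-break: as plain strings "10.0.0.12" sorts before "10.0.0.3", whereas numerically 10.0.0.3 is the smaller address.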
Once the new master database has been selected, the first thing to do is to synchronize the logs of the new master and the remaining Slave so that the two databases agree, and only then provide service to applications. At this point, the three-copy problem degrades into the two-copy problem. The protection against network jitter that three copies provided is gone, but because the two remaining copies are strongly synchronized, data reliability and consistency are still guaranteed.
Beyond this simple three-copy optimization, of course, more can be done. The general direction is to turn synchronous processing into asynchronous processing, for example for transaction commit and log writes; to use finer-grained locks; and to use lock-free programming on critical paths.
Strong synchronization across multiple replicas does not necessarily cost system performance. What should a good implementation look like? My view is:
For a single transaction, RT increases. Response latency is bound to grow (by at least one network RT and one disk sync);
For the database system as a whole, throughput stays the same. Remote network RT and disk sync do not consume local CPU, so local CPU overhead does not increase. As long as the asynchronous processing is done well, introducing strong synchronization does not reduce the throughput of the whole system.
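One way to see why latency and throughput can move independently is a deliberately simplified single-core model. It is an assumption of this sketch, not something the article states: throughput is capped either by CPU time per transaction or by how many transactions may wait for log synchronization at once.

```python
def throughput_tps(cpu_ms_per_txn, commit_latency_ms, in_flight):
    """Upper bound on transactions per second for one CPU core.

    in_flight: number of transactions allowed to wait for log sync at once
    """
    cpu_bound = 1000.0 / cpu_ms_per_txn
    latency_bound = in_flight * 1000.0 / commit_latency_ms
    return min(cpu_bound, latency_bound)
```

With 0.5 ms of CPU per transaction and 64 commits in flight, raising commit latency from 1 ms to 3 ms leaves the bound at 2000 transactions per second, because the CPU remains the limit. With only one commit in flight, that is, fully synchronous processing, the bound at 4 ms latency collapses to 250 per second. In this model, that gap is what good asynchronous processing recovers.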
Question 4: analysis of an extreme scenario
Here is a small exercise for the readers who have stuck with me this far. Consider the following figure: after the user submits a commit, the master database crashes partway through the commit sequence, either before its local log write completes or after the local write completes but before the commit finishes. At that moment the slave database has the latest commit record, while the crashed master database may have it (the local write completed before the crash) or may not (the crash came before the local write). What should the system do?
After this article was published on my blog, I found that readers were particularly interested in this last question. Here are a few of their comments on it, for reference only.
@ Taobao Dingqi
The last question is essentially unrelated to master and slave. To simplify, take a single-database scenario: the DB commits the local transaction and then crashes before the ack is returned, or the client decides the request has timed out before the ack packet arrives... So as long as the client has not received an explicit success or failure, either outcome for the in-doubt transaction is acceptable. In the master-slave environment, you only need to make sure that the system ends up consistent.
The following figure shows Ding Qi's opinion in a graphical manner:
This figure is much simpler than the one in Question 4. There is no master or slave, only a single database. The application issues a Commit and the log is persisted on the database, but the response never reaches the application (a network problem? a timeout?). The architecture is simpler, but the problem is the same: the application cannot tell whether the Commit succeeded or failed, and it must handle this state in its error-handling logic.
@ Arthur UrhG
The key to the last question is to resolve consistency on the server side first. The master can catch up from the slave, or the slave can roll back; since the client never received a success message, either is acceptable. Once the server side has reached agreement, the client can resubmit. To make the resubmission idempotent, each transaction is assigned a unique ID. Alternatively, the client can query first and decide from the result whether to resubmit.
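The unique-ID approach mentioned in this comment can be sketched as follows. `Server`, `submit`, and `commit_with_retry` are hypothetical names, the balance update is an arbitrary example operation, and the lost acknowledgement is simulated with a flag rather than a real network failure.

```python
import uuid


class Server:
    """Hypothetical database front end that deduplicates by transaction ID."""

    def __init__(self):
        self.applied = {}  # transaction ID -> result of the first execution
        self.balance = 0

    def submit(self, txn_id, amount):
        if txn_id in self.applied:
            return self.applied[txn_id]  # replay: do not apply twice
        self.balance += amount
        self.applied[txn_id] = self.balance
        return self.balance


def commit_with_retry(server, amount, lose_first_ack=False):
    """Submit once; if the acknowledgement is lost, resubmit with the same ID."""
    txn_id = str(uuid.uuid4())
    result = server.submit(txn_id, amount)
    if lose_first_ack:
        # The client never saw `result`, so it retries with the same ID.
        result = server.submit(txn_id, amount)
    return result
```

Because the retry carries the same ID, the server returns the stored result instead of applying the change a second time, so the client can resubmit safely without knowing whether the first attempt went through.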
Summary
I have rambled on at length, so here is a short summary:
Everyone who has read this far is a true fan. Thank you!
Can mainstream relational database systems achieve strong consistency between the master and slave databases, and so guarantee data consistency, without relying on shared storage?
Yes. Similar functions are available in Oracle, MySQL 5.7, Alibaba Cloud RDS, and NetEase RDS.
Can relational database systems provide continuous availability and high performance on top of strong data consistency between the master and slave databases?
Yes, but it is difficult. Currently, mainstream relational databases lack this capability.