In a distributed storage system, availability is one of the most important metrics. To keep the system available when a machine fails, the data must be stored in multiple replicas distributed across different machines; as long as the replicas stay consistent, the surviving replicas can continue to serve requests when a failure takes some replicas offline. This article describes how data is backed up, how consistency across replicas is maintained, and how the system remains highly available in the face of machine or network failures.
Data backup
Data backup means keeping multiple copies of the data. Backups fall into two categories: hot standby and cold backup. A hot standby is a replica that serves requests directly, or that can take over service immediately when the primary fails; a cold backup is a copy used only to recover data, typically produced by a dump.
By how replicas are distributed, hot-standby designs can be divided into homogeneous and heterogeneous systems. In a homogeneous system, the storage nodes are organized into groups; every node in a group stores the same data, with one node acting as the primary and the others as standbys. In a heterogeneous system, the data is split into many shards, and the replicas of each shard are spread across different storage nodes; the nodes are heterogeneous in the sense that no two nodes store the same set of shards. In a homogeneous system only the primary node serves writes and the standby nodes serve reads; each group can have a different number of standbys, which gives more flexibility in deployment. In a heterogeneous system every node can serve writes, and when a node fails many nodes participate in recovering its data, but this requires more metadata to locate the node holding the primary replica of each shard, and the data synchronization mechanism is more complex. In short, a heterogeneous system offers better write performance but is harder to implement, while a homogeneous system is simpler and more flexible to deploy. Since most Internet business workloads are read-heavy and write-light, we chose the more easily implemented homogeneous design.
As shown in the system data backup architecture, each node is a physical machine. All nodes are divided into groups according to the data distribution; the primary and standby nodes within a group store the same data, only the primary node serves writes, the primary is responsible for synchronizing data changes to all of its standbys, and every node can serve reads. The primary nodes together hold the full data set, so the number of primary nodes determines how much data the system can store; when capacity runs out, more primary nodes must be added. As for processing capacity, insufficient write capacity can only be solved by adding primary nodes, while insufficient read capacity can be improved by adding standby nodes. Each primary node can have a different number of standbys, which is especially useful when data hotness differs across nodes: adding more standbys to the hot groups raises the system's processing capacity where it is needed.
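A minimal sketch of how such a group layout could be represented. The type and field names (NodeGroup, Primary, Standbys) are illustrative assumptions, not the system's actual configuration format.

```go
// Illustrative layout of the primary/standby groups described above.
package topology

// Node identifies one physical storage machine.
type Node struct {
	Addr string // host:port of the storage node
}

// NodeGroup is one primary/standby group: all members store the same data.
// Only Primary accepts writes; every member can serve reads. Groups may have
// different numbers of standbys, e.g. hot groups get more.
type NodeGroup struct {
	GroupID  int
	Primary  Node
	Standbys []Node
}

// Storage capacity grows with the number of groups (primaries); read capacity
// of a single group grows with the number of standbys in that group.
type Cluster struct {
	Groups []NodeGroup
}
```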
Synchronization mechanism
In the backup architecture above, only the primary node receives write requests, and the primary is responsible for synchronizing the data to all of its standbys. As shown in the figure, the primary synchronizes in a one-to-many fashion rather than by cascading, so the failure of one standby does not affect synchronization to the others. In CAP terms, availability and consistency are at odds: here the primary replies to the client as soon as the write completes locally and then synchronizes the data to the standbys asynchronously, so strong consistency between primary and standby is not guaranteed and the replicas may briefly diverge; some consistency is sacrificed to keep the system available. Under this mechanism a client may read stale data from a standby. If a business requires strong consistency, it can set a read-from-primary option on its read requests; the interface layer then forwards such reads to the primary node, and in that case the standbys serve only as disaster-recovery replicas and do not serve reads.
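A minimal sketch of this write path and read routing, under assumed types; the field ReadFromPrimary stands in for the read-from-primary option and is not the system's real API.

```go
// Asynchronous primary/standby replication and consistency-aware read routing (sketch).
package main

import "fmt"

type Replica struct {
	addr string
	data map[string][]byte
}

func (r *Replica) apply(key string, val []byte) { r.data[key] = val }

type WriteReq struct {
	Key   string
	Value []byte
}

type Group struct {
	primary   *Replica
	standbys  []*Replica
	syncQueue chan WriteReq // drained by a background sync process (not shown)
}

// HandleWrite applies the write on the primary and acks immediately; the change
// is pushed to the standbys asynchronously, so replicas may briefly lag.
func (g *Group) HandleWrite(req WriteReq) {
	g.primary.apply(req.Key, req.Value)
	g.syncQueue <- req // asynchronous replication: the client is not blocked on standbys
}

// RouteRead returns the primary for strongly consistent reads (read-from-primary
// option) and otherwise any replica.
func (g *Group) RouteRead(readFromPrimary bool) *Replica {
	if readFromPrimary || len(g.standbys) == 0 {
		return g.primary
	}
	return g.standbys[0] // a real router would balance across standbys
}

func main() {
	g := &Group{
		primary:   &Replica{addr: "primary:7000", data: map[string][]byte{}},
		standbys:  []*Replica{{addr: "standby:7000", data: map[string][]byte{}}},
		syncQueue: make(chan WriteReq, 16),
	}
	g.HandleWrite(WriteReq{Key: "k", Value: []byte("v")})
	fmt.Println(g.RouteRead(true).addr) // strongly consistent read goes to primary:7000
}
```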
To keep the primary and standby data consistent, an efficient and reliable synchronization mechanism is needed. Synchronization comes in two forms: incremental synchronization, in which the primary forwards the write request directly to the standby to be executed, and full synchronization, in which the primary sends its local data to the standby to overwrite the standby's copy. The implementation of the synchronization mechanism is described in detail below, as shown in the overall synchronization flow.
The unit of data sharding in the system is the vnode (virtual node) of the consistent hash ring. Each vnode maintains a monotonically increasing synchronization sequence number, SyncSeq; every write to data belonging to a vnode increments that vnode's SyncSeq, so each write within a vnode gets its own SyncSeq, and the ordering of SyncSeq values reflects the order in which writes were executed. Besides modifying the data, every write also stores its SyncSeq alongside the data; as will become clear later, SyncSeq is the foundation of the synchronization mechanism's reliability.
When the write process on the primary node receives a write request, it modifies the data, increments the current vnode's SyncSeq by 1, and stores the new SyncSeq into the data. It then records a binlog entry; a binlog entry is a triple <vnode, SyncSeq, key> that uniquely identifies a write operation in the whole system. The binlog entry is written both to the binlog queue, which another process flushes to the binlog file, and to the binlog cache, an evicting hash table used for fast lookups. The write request itself is then cached in another evicting hash table to be used for incremental synchronization; this table stores triples <vnode, SyncSeq, req>, and requests larger than a size limit are not cached. Finally, the write process updates the synchronization progress table. As shown, the progress table records, for each vnode, the primary's synchronization progress toward each standby: the primary's SyncSeq is the SyncSeq of the vnode's latest write, and a standby's SyncSeq is the largest SyncSeq that has already been synchronized to it.
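A sketch of this primary-side write path under assumed type and field names (Primary, BinlogEntry, reqCache, maxCachedReqSize); it mirrors the steps above rather than the system's actual code.

```go
// Primary-side write path: bump the vnode's SyncSeq, record the binlog triple,
// cache the request for incremental sync, and track synchronization progress.
package storage

type BinlogEntry struct {
	Vnode   int
	SyncSeq uint64
	Key     string
}

type WriteReq struct {
	Vnode int
	Key   string
	Value []byte
}

type Primary struct {
	syncSeq     map[int]uint64            // current SyncSeq per vnode
	data        map[string][]byte         // user data (SyncSeq is stored with it in the real system)
	binlogQueue chan BinlogEntry          // flushed to the binlog file by another process
	binlogCache map[BinlogEntry]struct{}  // evicting cache for fast binlog lookups
	reqCache    map[BinlogEntry]WriteReq  // <vnode, SyncSeq, key> -> request, for incremental sync
	progress    map[int]map[string]uint64 // vnode -> standby addr -> synced SyncSeq
}

const maxCachedReqSize = 64 << 10 // assumed size limit; larger requests are not cached

func (p *Primary) handleWrite(req WriteReq) {
	p.syncSeq[req.Vnode]++ // 1. increment the vnode's SyncSeq
	seq := p.syncSeq[req.Vnode]
	p.data[req.Key] = req.Value // 2. apply the write (SyncSeq saved with the data)

	entry := BinlogEntry{req.Vnode, seq, req.Key}
	p.binlogQueue <- entry            // 3. binlog queue -> binlog file
	p.binlogCache[entry] = struct{}{} //    binlog cache for fast lookups

	if len(req.Value) <= maxCachedReqSize {
		p.reqCache[entry] = req // 4. cache the request for incremental sync
	}
	// 5. The progress table's "primary SyncSeq" for this vnode is now seq;
	//    each standby's synced SyncSeq is advanced later by the sync process.
}
```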
Synchronization from primary to standbys is performed asynchronously by a sync process on the primary node. The sync process scans the progress table for SyncSeq differences between primary and standbys to determine which data each standby still needs. For each pair <vnode, SyncSeq> that must be synchronized, it first looks in the write request cache; if the request is found, it is sent to the standby as an incremental synchronization. If it is not found in the write request cache, the sync process looks up the corresponding binlog entry, first in the binlog cache and then in the binlog file, and uses that entry to perform a full synchronization of the data. Finally it updates the standby's synced SyncSeq in the progress table, completing one round of data synchronization.
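A rough, self-contained sketch of one synchronization round for a single standby and vnode, with the cache, binlog lookup, and send operations passed in as assumed function parameters rather than the real interfaces.

```go
// One sync round: prefer incremental sync from the request cache, fall back to
// full sync via the binlog entry and the primary's current data.
package syncer

type SyncItem struct {
	Vnode   int
	SyncSeq uint64
}

// syncVnode advances one standby from syncedSeq to latestSeq for one vnode and
// returns the new synced SyncSeq to record in the progress table.
func syncVnode(
	vnode int,
	syncedSeq, latestSeq uint64, // taken from the synchronization progress table
	reqCache map[SyncItem][]byte, // <vnode, SyncSeq> -> cached write request
	binlogKey func(SyncItem) (string, bool), // binlog cache/file lookup -> key
	data map[string][]byte, // current data on the primary
	sendIncremental func(SyncItem, []byte), // replay the original write on the standby
	sendFull func(SyncItem, string, []byte), // overwrite the standby's data for the key
) uint64 {
	for seq := syncedSeq + 1; seq <= latestSeq; seq++ {
		item := SyncItem{vnode, seq}
		if req, ok := reqCache[item]; ok {
			sendIncremental(item, req) // incremental synchronization
		} else if key, ok := binlogKey(item); ok {
			sendFull(item, key, data[key]) // full synchronization
		}
		syncedSeq = seq
	}
	return syncedSeq
}
```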
The next question is how the synchronization protocol achieves efficient and reliable synchronization. To make synchronization packets arrive at the standby strictly in the order the primary sent them, TCP is used: for each vnode, the primary keeps one connection to each standby, called a synchronization connection. On each synchronization connection the primary sends multiple synchronization packets in a batch. The standby also records its synced SyncSeq and, for each synchronization packet, checks whether the carried SyncSeq is the one it expects; if so, it performs the write and updates its synced SyncSeq. In this normal case the standby does not need to respond to the primary; the primary assumes synchronization is proceeding normally as long as it receives no response from the standby. The standby responds to the primary only in the following exceptional cases:
- The first time the standby receives an unexpected SyncSeq after normal synchronization, it replies to the primary with the SyncSeq it expects; on receiving the reply, the primary restarts synchronization from that expected SyncSeq. Note that when the standby receives a run of consecutive unexpected SyncSeqs it responds only to the first one, otherwise the primary would synchronize the same data repeatedly;
- When a synchronization connection is re-established after a disconnect, the standby tells the primary which SyncSeq it expects to resume from, and the primary starts synchronizing from that SyncSeq;
- When the SyncSeq is as expected but executing the write fails, which in general can only happen during incremental synchronization, the standby reports the synchronization error to the primary; on receiving the response, the primary retries the failed synchronization packet as a full synchronization.
Incremental and full synchronization can interleave. If a full synchronization has already delivered the newest data, a subsequent incremental synchronization could apply an older write a second time. To avoid this, the standby compares the SyncSeq in the synchronization packet with the SyncSeq stored in the data; if the former is smaller, the write has already been applied and the packet is simply skipped, with no response to the primary. This is exactly why the SyncSeq must be stored in the data.
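A minimal sketch, under assumed types, of how a standby might process one synchronization packet: verify the expected SyncSeq, skip writes already covered by a newer full synchronization, and respond to the primary only in the exceptional cases listed above (the apply-error case is omitted for brevity).

```go
// Standby-side handling of synchronization packets (sketch).
package standby

type SyncPacket struct {
	Vnode   int
	SyncSeq uint64
	Key     string
	Value   []byte
	Full    bool // true: full sync (overwrite); false: incremental (replay write)
}

type Record struct {
	Value   []byte
	SyncSeq uint64 // SyncSeq saved with the data by the last write
}

type Standby struct {
	synced        map[int]uint64 // per-vnode synced SyncSeq
	data          map[string]Record
	reportedError map[int]bool // only the first unexpected SyncSeq is reported
}

// handle returns (respond, expectedSeq): respond is true only in the exceptional
// cases; normal synchronization sends no reply to the primary.
func (s *Standby) handle(p SyncPacket) (bool, uint64) {
	if p.SyncSeq != s.synced[p.Vnode]+1 {
		if s.reportedError[p.Vnode] {
			return false, 0 // already reported; stay silent to avoid repeated resyncs
		}
		s.reportedError[p.Vnode] = true
		return true, s.synced[p.Vnode] + 1 // tell the primary where to resume
	}
	s.reportedError[p.Vnode] = false

	// Crossover case: the data already carries a newer SyncSeq from a full sync.
	if old, ok := s.data[p.Key]; ok && p.SyncSeq < old.SyncSeq {
		s.synced[p.Vnode] = p.SyncSeq
		return false, 0 // already written; skip without responding
	}
	s.data[p.Key] = Record{Value: p.Value, SyncSeq: p.SyncSeq}
	s.synced[p.Vnode] = p.SyncSeq
	return false, 0
}
```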
From the description above it can be seen that with per-vnode synchronization connections and batched synchronization, the normal case involves only one-way synchronization traffic, which is very efficient; in abnormal cases, the error responses and the SyncSeq verification together guarantee the reliability of synchronization.
Disaster tolerance mechanism
For the system to tolerate disasters, that is, to remain essentially available when a machine fails, every piece of data must have at least two replicas and the system's processing capacity must have a certain amount of redundancy, so that the system does not become overloaded when the failed machines stop serving. In general, the more replicas the data has and the more redundant the processing capacity, the stronger the system's disaster tolerance. Going further, physical deployment must be considered: distributing the replicas of the same data across different racks, different data centers, and even different cities raises the system's disaster tolerance to correspondingly higher levels.
The configuration and operations center monitors the status of every node in the storage layer. Storage nodes report heartbeats periodically; if the center has not received a node's heartbeat for some time, it marks the node as failed and starts the failure-handling process. First, the failed node must be barred from serving: the interface layer is told to stop forwarding client requests to it. If the failed node is a primary, the configuration and operations center queries and compares the synchronization progress of all its standbys, selects the standby with the newest data, and promotes it to primary. Since the standbys also record binlogs, the newly promoted primary can synchronize directly to the other standbys. This primary/standby switchover may lose a small amount of data; if the business cannot tolerate such loss, a different, strongly consistent scheme is required.
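A small sketch of the failover choice described above: among the failed primary's standbys, promote the one whose synced SyncSeq is largest, i.e. whose data is newest. Types and names are illustrative; a real comparison would look at per-vnode progress.

```go
// Pick the standby with the newest data as the new primary.
package failover

type StandbyProgress struct {
	Addr    string
	SyncSeq uint64 // largest SyncSeq already synchronized to this standby
}

// pickNewPrimary returns the standby with the newest data, or "" if none exist.
func pickNewPrimary(standbys []StandbyProgress) string {
	best := ""
	var bestSeq uint64
	for _, s := range standbys {
		if best == "" || s.SyncSeq > bestSeq {
			best, bestSeq = s.Addr, s.SyncSeq
		}
	}
	return best
}
```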
After the disaster-tolerant switchover, the failed node also needs to be repaired so that the system can return to its normal state. Once the failed machine comes back, it enters the failure recovery process; regardless of whether it was a primary or a standby before the failure, it comes back as a standby. First, the node being recovered wipes all data on the machine. The primary then copies its current per-vnode SyncSeqs to the recovering node and replicates all of its data. When the full copy is complete, data synchronization starts; as the synchronization mechanism described earlier implies, synchronization catches up from the SyncSeq state that was copied to the recovering node at the beginning. When the SyncSeq gap between the primary and the recovering node shrinks back into the normal range, the recovering node officially becomes a standby and starts serving.
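A rough, self-contained outline of this failure recovery flow; the Store type and its fields are assumptions used only to show the order of the steps.

```go
// Failure recovery: wipe, full copy from the primary, then catch up via normal sync.
package recovery

type Store struct {
	Data         map[string][]byte
	VnodeSyncSeq map[int]uint64 // synced SyncSeq per vnode
	Serving      bool
}

// Recover rebuilds a returning node from the primary; it comes back as a standby.
func Recover(primary, node *Store) {
	// 1. Empty all data on the recovering node.
	node.Data = map[string][]byte{}
	node.VnodeSyncSeq = map[int]uint64{}

	// 2. Copy the primary's current per-vnode SyncSeqs, then replicate all data.
	for v, seq := range primary.VnodeSyncSeq {
		node.VnodeSyncSeq[v] = seq
	}
	for k, val := range primary.Data {
		node.Data[k] = val
	}

	// 3. Normal synchronization now catches up from the copied SyncSeq state;
	//    once the lag is back within the normal range the node serves as a standby.
	node.Serving = true
}
```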
The configuration and operations center also monitors the SyncSeq gap between primary and standby nodes. If a standby's gap reaches a certain threshold, that standby is barred from serving; if the gap still has not recovered after a long time, the failure recovery process is triggered.
Data rollback
Finally, we introduce cold backup and data rollback, which are handled mainly by a backup system. Backup tasks are initiated manually or on a schedule and operate at the business level. When the backup system receives a backup task for a business, it backs up all of that business's data remotely. The process is straightforward: traverse all storage nodes and write every piece of data belonging to the business to a remote file system. Each backup records its start time and end time, which serve as the reference points for data rollback.
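A simple sketch of this business-level cold backup: traverse all storage nodes, write the business's records to a remote file system, and record the start and end timestamps for later rollback. All names here are assumptions.

```go
// Business-level cold backup (sketch).
package coldbackup

import "time"

type Record struct {
	Business string
	Key      string
	Value    []byte
}

type BackupMeta struct {
	Business  string
	StartTime time.Time // rollback reference: replay the write log from here
	EndTime   time.Time
}

func backupBusiness(business string, nodes [][]Record, writeRemote func(Record)) BackupMeta {
	meta := BackupMeta{Business: business, StartTime: time.Now()}
	for _, node := range nodes { // traverse every storage node
		for _, rec := range node {
			if rec.Business == business {
				writeRemote(rec) // copy to the remote file system
			}
		}
	}
	meta.EndTime = time.Now()
	return meta
}
```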
Every write operation in the system is also recorded in a remote write log, each entry carrying the timestamp of the write; these logs are stored centrally by a logging service. Combining the cold backup with the write log makes it possible to restore the data to any point in time after the backup was taken. When the backup system receives a rollback task for a business, it first stops that business's service, then clears all of the business's data, restores the full data from the cold backup, and finally replays the write log up to the specified point in time to complete the rollback. Note that the cold backup is not a snapshot: writes continue to execute normally while the backup is being taken, so replaying the log from the backup's start time would apply many writes a second time. This is avoided by a data version check: version information is stored with the data, the write log records the data version reached after each write completes, and during replay a log entry is skipped if its version is not newer than the version in the data, which keeps the rollback accurate.
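A small sketch of the rollback replay with version checking described above: log entries whose recorded version is not newer than the version already in the restored data are skipped, so writes the cold backup already contains are not applied twice. Types and field names are assumptions.

```go
// Replay the write log up to the rollback point, skipping already-applied writes.
package rollback

import "time"

type LogEntry struct {
	Key     string
	Value   []byte
	Version uint64    // data version recorded after the original write completed
	Time    time.Time // timestamp of the original write
}

type Record struct {
	Value   []byte
	Version uint64
}

// replay applies log entries up to the rollback point over the restored cold backup.
func replay(data map[string]Record, log []LogEntry, until time.Time) {
	for _, e := range log {
		if e.Time.After(until) {
			break // only replay up to the specified point in time
		}
		if cur, ok := data[e.Key]; ok && e.Version <= cur.Version {
			continue // not newer than the data's version: already applied, skip
		}
		data[e.Key] = Record{Value: e.Value, Version: e.Version}
	}
}
```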
Distributed Storage System Design (4) -- Backup and Disaster Recovery