The principles and implementation of data consistency in distributed databases


Original: http://database.51cto.com/art/201710/554743.htm

Objective

Data consistency management is one of the most important kernel technologies of a distributed database, and it is what allows a distributed database to satisfy the "C" (consistency) of the database's basic ACID properties. As distributed technology has developed, the solutions and techniques for data consistency have evolved with it. Based on a distributed database developed by the author, this article introduces the principles and practical implementation of data consistency in distributed databases.

1. Data consistency

1.1 What is data consistency

When most DBAs who use traditional relational databases see "data consistency," their first reaction is probably the consistency of data across tables within a transaction. However, the "data consistency" described in this article refers to a different scenario: how data is guaranteed to be consistent when it is stored as multiple copies.

In the big data domain, data safety is no longer guaranteed by hardware but by software: the data is written to multiple replicas at the same time to keep it safe. When a database writes a record to multiple replicas simultaneously, ensuring that every replica holds the same data is what we call "data consistency."

1.2 How relational databases guarantee data consistency

Traditional relational databases place high demands on the operating environment, especially the hardware. For example, Oracle recommends that users run the database on minicomputers with shared storage, and DB2 DPF likewise recommends building the database environment on high-end servers and high-end storage. In other words, under the requirement of storing data safely, traditional relational databases rely heavily on hardware to guarantee data security.

Because a relational database bases its data safety on hardware rather than on keeping multiple copies of the data, its users can assume by default that the stored data is consistent.

1.3 How distributed storage guarantees data consistency

When we discuss distributed storage in this article, we mainly mean the distributed file systems and distributed databases among big data products, for example SequoiaDB and HDFS.

To understand the data consistency principles of distributed storage, users must first understand why data consistency is needed at all, and how data storage in distributed systems differs from data storage in relational databases.

The advent of big data technology was a genuine performance breakthrough: it lets hardware scale horizontally so that performance and storage grow linearly, something traditional relational databases could never offer. In addition, big data technology abandoned the requirement that the operating environment be high-end; instead it lets users build clusters from large numbers of inexpensive x86 servers with local disks, obtaining more computing power and more storage space than vertical hardware scaling used to provide.

The core idea of big data technology is distribution: break a large task into many small tasks and complete them with distributed, concurrent execution, thereby improving the efficiency or the storage capacity of the whole system. In a distributed environment, because the hardware requirements have been lowered, big data products must themselves provide another essential capability: data security.

Big data products solve data security in broadly similar ways: in short, a copy of the data is saved on multiple machines, either synchronously or asynchronously, so that the data remains safe.

Having solved the technical difficulty of data security, distributed storage introduces a new technical problem: how to keep the data consistent across multiple replicas. SequoiaDB currently uses the Raft algorithm to ensure that data stays consistent across replicas.

2. Raft algorithm

2.1 Raft algorithm background

In distributed environments, the best-known consistency algorithm is the Paxos algorithm, but Paxos is notoriously obscure and hard to implement. So in 2013, Diego Ongaro and John Ousterhout designed a consensus algorithm, Raft, with understandability as its explicit goal. Raft's biggest feature is that it is simple to understand and simple to implement.

2.2 Raft algorithm overview

Unlike Paxos, Raft emphasizes simplicity; like Paxos, Raft can keep providing service as long as n/2+1 nodes are alive.

As is well known, a complex problem can be handled by decomposing it into several smaller ones, and Raft likewise applies divide and conquer. The Raft algorithm focuses on solving three sub-problems: leader election, log replication, and safety.

Raft strengthens the role of the leader node: a follower can only obtain data from the leader, which makes the follower's implementation simple. A follower only needs to maintain communication with the leader and accept the data the leader pushes.

2.3 Raft algorithm principles

2.3.1 Node roles

In the Raft algorithm, a node can be in one of 3 roles: leader, follower, and candidate.

Leader: processes requests from clients, synchronizes the log to the followers, and maintains heartbeat connections with them;

Follower: when the cluster first starts, every node is in the follower state. A follower mainly responds to the leader's log synchronization requests, responds to candidates' vote requests, and forwards any transaction requests it receives to the leader;

Candidate: responsible for soliciting votes during leader election; after winning the election, the node changes from the candidate state to the leader state.
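The three roles above and the legal transitions between them can be sketched as a small state machine. This is a minimal illustration of Raft's role transitions, not SequoiaDB's actual implementation; the event names are invented for the sketch:

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

# Legal role transitions in Raft: a follower whose election timer fires
# becomes a candidate; a candidate that wins a majority becomes leader;
# any node that observes a higher term steps back down to follower.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "won_election"): Role.LEADER,
    (Role.CANDIDATE, "saw_higher_term"): Role.FOLLOWER,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # restart election
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
}

def step(role: Role, event: str) -> Role:
    """Return the new role after an event; unknown events keep the role."""
    return TRANSITIONS.get((role, event), role)
```

Note that there is no transition from follower directly to leader: a node must always pass through the candidate state.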

2.3.2 Terms

In a distributed environment, "time synchronization" has always been a hard technical problem. To sidestep it, Raft divides time into terms (which can be understood as "logical time") and uses them to reason about data consistency across different periods.

Terms obey the following principles:

    • In each term, there is at most one leader.
    • In some terms, election failure may leave no leader at all.
    • Each node maintains its own local currentTerm.
    • Term numbers increase monotonically.
    • If a node sees a term number larger than its own, it updates its local currentTerm to the larger value to stay consistent with the rest of the cluster.
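The last rule can be stated precisely as a pure function. This is a minimal sketch of Raft's term-update rule, with the step-down flag included because a leader or candidate that observes a higher term must revert to follower:

```python
def update_term(current_term: int, observed_term: int) -> tuple[int, bool]:
    """Apply Raft's term rule: on seeing a higher term, adopt it.
    Returns (new_term, must_step_down); a leader or candidate that
    receives must_step_down=True reverts to the follower role."""
    if observed_term > current_term:
        return observed_term, True
    return current_term, False
```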

2.3.3 Elections

Raft elections are triggered by timers, and each node's timer fires at a different (randomized) time.

All nodes start in the follower state. When a node's timer triggers an election, it increments its term number, changes its state from follower to candidate, and sends RequestVote RPC requests to the other nodes. The election can then unfold in 3 ways:

The node that initiated the RequestVote receives votes from n/2+1 nodes (a majority); it changes from the candidate state to the leader state and starts sending heartbeats to the other nodes to maintain its leadership.

A node that receives a vote request and finds that the requester's term number is larger than its own changes its own state from candidate to follower; otherwise it remains a candidate and rejects the vote request.

If the election times out, the term number is incremented and the election is restarted.
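The election rules above can be sketched as a simplified RequestVote handler plus a majority check. This is an illustrative sketch, not SequoiaDB's code; it omits the log up-to-date comparison that full Raft also applies before granting a vote:

```python
def handle_request_vote(my_term, voted_for, candidate_id, candidate_term):
    """Simplified RequestVote handler (log up-to-date check omitted).
    Returns (new_term, new_voted_for, vote_granted)."""
    if candidate_term < my_term:
        return my_term, voted_for, False           # stale candidate: reject
    if candidate_term > my_term:
        my_term, voted_for = candidate_term, None  # newer term: reset vote
    if voted_for in (None, candidate_id):
        return my_term, candidate_id, True         # at most one vote per term
    return my_term, voted_for, False

def won_election(votes: int, cluster_size: int) -> bool:
    """A candidate wins with n/2 + 1 votes (a majority, counting itself)."""
    return votes >= cluster_size // 2 + 1
```

Because each node grants at most one vote per term, two candidates can never both collect a majority in the same term, which is what guarantees at most one leader per term.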

2.3.4 Log Replication

The primary role of log replication is to ensure data consistency and high availability across nodes.

Once a leader has been elected, all transaction operations must be processed by the leader. As these transaction operations succeed, they are written sequentially to the log, each entry carrying an index number.

After the leader's log changes, it synchronizes the new log entries to the followers via heartbeat. A follower sends an ACK message back to the leader after receiving the log. When the leader has received ACKs from a majority (n/2+1) of nodes, it marks the log entry as committed and appends it to its local disk.

In the next heartbeat, the leader also notifies all followers to store the log entry on their own local disks.
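The commit rule above can be sketched as follows: given the highest log index each replica is known to have stored (including the leader's own), the leader may mark entries committed up to the largest index held by a majority. This is a minimal illustration, not SequoiaDB's actual commit logic:

```python
def commit_index(match_index: list[int], cluster_size: int) -> int:
    """Return the largest log index stored on at least n/2+1 replicas.
    `match_index` lists each replica's highest replicated index,
    including the leader itself; entries up to the returned index
    are safe to mark as committed."""
    needed = cluster_size // 2 + 1
    # Sort descending: the (needed)-th highest value is replicated on
    # at least `needed` nodes, i.e. on a majority.
    return sorted(match_index, reverse=True)[needed - 1]
```

For example, in a 5-node group where the replicas have stored up to indexes 5, 4, 3, 2, and 1 respectively, entries up to index 3 are on a majority and may be committed.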

2.3.5 Security

The safety mechanism ensures that every node executes the same log sequence.

Suppose a follower falls behind while synchronizing the leader's log but is later elected leader; it could then overwrite log entries the previous leader had already committed, causing nodes to execute different log sequences.

Raft's safety mechanism guarantees that any elected leader must contain all previously committed log entries, which it achieves mainly through the following principles:

Each term elects at most one leader;

Leader log completeness: when a candidate is elected as the new leader, it must contain all previously committed log entries;

When electing a new leader, candidates use the term number to guarantee the completeness of the log.
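The way Raft enforces leader log completeness is its election restriction: a voter grants its vote only if the candidate's log is at least as up-to-date as its own, comparing the term of the last log entry first and the log length second. A minimal sketch of that comparison (illustrative, not SequoiaDB's code):

```python
def log_up_to_date(cand_last_term: int, cand_last_index: int,
                   my_last_term: int, my_last_index: int) -> bool:
    """Raft's election restriction: return True if the candidate's log
    is at least as up-to-date as this voter's log. The last entry's
    term is compared first; on a tie, the longer log wins."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Since a committed entry is on a majority of nodes, any candidate that wins a majority of such votes must already hold every committed entry, which is exactly the leader-completeness guarantee stated above.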

3. Technical implementation of data consistency in a distributed database

Take the distributed database SequoiaDB as an example: when SequoiaDB is deployed with multiple replicas, it uses the Raft algorithm to keep data consistent across the replicas.

A SequoiaDB cluster contains 3 node roles: coordination nodes, catalog nodes, and data nodes. Because a coordination node holds no data itself, only catalog nodes and data nodes perform transactional operations; in other words, the replicas within a catalog partition group or a data partition group use the Raft algorithm to keep their data consistent.

3.1 Transaction logs of catalog nodes and data nodes

Both catalog nodes and data nodes need to store data, and in a cluster deployment it is recommended to deploy them in a distributed fashion to keep the data safe; data synchronization between their replicas therefore follows the basic principles of the Raft algorithm.

When a catalog node or a data node stores data, the storage consists of two parts: the real data files and the transaction log files.

By default, a SequoiaDB node's transaction log consists of 20 files of 64 MB each (1.25 GB in total). A transaction log record consists mainly of an index number and the data-manipulation content, and the index number increments forever.

Moreover, a SequoiaDB node's transaction log is not kept forever: once all the transaction log files are full, writing wraps around and overwrites the first file again.
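This wrap-around behavior is a circular log: the index (LSN) keeps growing, but only a fixed window of recent records is retained. A toy model, sized in entry counts rather than the 20 x 64 MB files SequoiaDB actually uses:

```python
class CircularLog:
    """Toy model of a fixed-size, circularly overwritten transaction
    log: the LSN grows monotonically forever, but only the most recent
    `capacity` entries survive (older ones are overwritten)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}   # lsn -> operation content
        self.next_lsn = 0

    def append(self, op) -> int:
        lsn = self.next_lsn
        self.entries[lsn] = op
        self.next_lsn += 1
        # Wrap around: drop the oldest entry once capacity is exceeded.
        if len(self.entries) > self.capacity:
            del self.entries[lsn - self.capacity]
        return lsn
```

One practical consequence: a slave node that falls further behind than the retained window can no longer catch up from the log alone and must resynchronize its data in full.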

3.2 Data consistency of the catalog partition group

Because the catalog partition group holds the meta-information of the SequoiaDB cluster, its data synchronization requirements are high, so the catalog partition group requires strong consistency: a transaction operation against the catalog partition group is deemed successful only if it succeeds on all catalog nodes. Otherwise, the operation's transaction log is rolled back throughout the catalog partition group to keep the data within the group consistent.

The catalog partition group also has another important property: it can only work properly while it has a master node. If the old primary node goes down, then for as long as the catalog partition group has no master node, it cannot serve any transaction operations or data queries.

3.3 Data consistency of data partition groups

By default, the data consistency of a data partition group is eventual consistency: an operation is deemed successful as soon as the primary node executes the transaction successfully, and the master node synchronizes the replica log to the slave nodes afterwards.

3.4 Transaction log synchronization between master and slave nodes

SequoiaDB's master and slave nodes keep their data consistent through transaction log synchronization, and the transaction log synchronization between master and slave is performed by a single thread.

If the LSN gap between the primary node and a slave node is a single record, the primary node proactively pushes the most recent transaction log record to the slave node.

If the LSN gap between the primary and a slave node exceeds one record, the slave node proactively requests transaction log synchronization from the primary node. After receiving the request, the master node sends the slave, in one batch, all the transaction log records from the slave's current LSN up to the primary's latest LSN.
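The push-versus-pull decision described above can be sketched as a simple function of the LSN gap. This is an illustrative sketch of the rule, not SequoiaDB's actual code:

```python
def sync_strategy(primary_lsn: int, slave_lsn: int) -> str:
    """Decide how a slave catches up, based on the LSN gap:
    a one-record gap is pushed by the primary; a larger gap makes
    the slave pull a batch of log records from the primary."""
    gap = primary_lsn - slave_lsn
    if gap <= 0:
        return "in_sync"
    if gap == 1:
        return "primary_push"
    return "slave_pull_batch"
```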

3.5 Log replay on slave nodes

Once a slave node has fetched the transaction log from the primary node, it replays it automatically. By default, a slave node replays the transaction log with 10 concurrent threads.

Concurrent log replay on a slave node is subject to a condition: INSERT, DELETE, UPDATE, LOB WRITE, LOB UPDATE, and LOB REMOVE operations can be replayed concurrently only when the collection has at most one unique index. During concurrent replay, the node dispatches each operation by hashing the record's OID, which guarantees that operations on the same record are never replayed concurrently and therefore cannot produce inconsistent data.

Users should note, however, that the DROP CL operation does not support concurrent replay on a slave node.
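The OID-based dispatch described above can be sketched as follows: hash each record's OID to pick one of the replay workers, so all operations on the same record replay serially in one worker while different records replay in parallel. The hash function here is illustrative; SequoiaDB's actual partitioning function is not documented in this article:

```python
import hashlib

def replay_worker(oid: str, workers: int = 10) -> int:
    """Map a log record to a replay thread by hashing its OID.
    Deterministic hashing guarantees that every operation on the same
    record lands in the same worker, preserving per-record ordering."""
    digest = hashlib.md5(oid.encode()).digest()
    return int.from_bytes(digest[:4], "big") % workers
```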

4. SequoiaDB data consistency in practice

Data consistency for SequoiaDB data partition groups is currently configured at the collection level. Users can adjust the strength of data consistency at any time while using SequoiaDB.

4.1 Specifying consistency when creating a collection

In a multi-replica SequoiaDB cluster, a collection's default data consistency level is eventual consistency. The user can explicitly specify the consistency strength when creating the collection; for example, the following command can be executed in the SequoiaDB shell:

db.CSName.createCL("CLName", {ReplSize: 3})

[Table: ReplSize parameter value range]

4.2 Modifying an existing collection

If the collection did not set the ReplSize data consistency parameter when it was created, the user can also modify the existing collection; in the SequoiaDB shell the command is as follows:

db.CSName.CLName.alter({ReplSize: 3})

The value range of ReplSize is the same as when creating the collection.

4.3 Viewing a collection's ReplSize parameter

If the user wants to check a collection's current ReplSize value, it can be viewed through a database snapshot; in the SequoiaDB shell the command is as follows:

db.snapshot(SDB_SNAP_CATALOG, {}, {"Name": null, "IsMainCL": null, "MainCLName": null, "ReplSize": null})

The printed information looks like this:

  {
    "MainCLName": "test.main2",
    "Name": "foo.bar2",
    "IsMainCL": null,
    "ReplSize": null
  }
  {
    "IsMainCL": true,
    "Name": "test.main2",
    "MainCLName": null,
    "ReplSize": null
  }
  {
    "Name": "foo.tt",
    "ReplSize": 3,
    "IsMainCL": null,
    "MainCLName": null
  }

5. Summary

A distributed database uses the Raft algorithm to keep data consistent in a distributed setting, but catalog partition groups and data partition groups have different consistency requirements. The catalog partition group always requires strong consistency across its replicas, while for data partition groups the user can choose the consistency strength when creating a collection: the higher the strength, the safer the data, but the lower the execution efficiency, and vice versa.

At present SequoiaDB gives the user plenty of room for adjustment in data consistency scenarios: the consistency strength can be tuned to different business requirements, satisfying either the pursuit of optimal performance or the demand for maximum data safety.

