Distributed Transactions in OceanBase 1.0

A database is a powerful and complex piece of software, and the transaction is a feature users rely on, often without noticing. As engineers who build databases, we devote a great deal of energy and time to transactions, and we know that supporting them carries a heavy price: not only the development work, but also the runtime cost of the database itself. Put another way, if a database gave up transactions, it could achieve better performance with everything else unchanged. The earliest database software had no transaction support, and applications built on it either could not guarantee the correctness and consistency of their data or became very complex trying to. Databases therefore added transactions, giving users the ability to bundle many database operations together with guarantees such as atomicity: within a transaction, multiple modifications either all take effect or are all rolled back. The database pays a substantial cost, but user applications are greatly simplified. Moreover, as the ways users manipulate data grow more complex, it becomes even harder, without transactional guarantees, for users to be sure their data in the database is still correct.

A transaction has four important properties, known collectively as ACID: atomicity, consistency, isolation, and durability.

    • Atomicity: the operations in a transaction either all take effect or none do; no intermediate state is ever visible.
    • Consistency: a transaction's operations do not violate the database's consistency constraints.
    • Isolation: concurrently executing transactions do not interfere with one another.
    • Durability: once a transaction commits successfully, its effects are never lost.

Database systems use many engineering techniques to implement these properties. Durability is guaranteed by storing all data on persistent devices; the common ones, disks and SSDs, retain data across power loss. Consistency concerns the database's data constraints, the most familiar being foreign keys, and the system guarantees that these constraints still hold after a transaction executes. Once atomicity and isolation are in place, these two properties are comparatively simple to implement. This article therefore describes in detail only the two harder ones: atomicity and isolation.

Atomicity

Atomicity is the most important property of a transaction. When everything runs normally, guaranteeing it is not complicated; the complexity lies in guaranteeing atomicity through exceptional situations such as crash recovery and primary-standby switchover. Take a transaction that modifies two rows: if the machine crashes after the first row is modified but before the second, and no other mechanism intervenes when the service recovers, only the first modification will have been persisted, violating atomicity.

What is atomicity?

One way to achieve atomicity is to compress the effective point of multiple operations into a single atomic operation. For in-memory operations, the computer provides atomic primitives such as assigning to a single memory variable or a CAS (compare-and-swap). Storage hardware has atomic operations of its own: a disk typically writes a 512-byte sector atomically, and an SSD typically writes a 4KB page atomically. The following data structure example illustrates the mechanism.

    struct Balance {
        int account_a;  // balance of account A
        int account_b;  // balance of account B
    };
    Balance* balance = new Balance();

The above is a C++ structure representing the balances of two accounts, A and B. To transfer 5 yuan from A to B, the corresponding C++ code is as follows:

    balance->account_a -= 5;
    balance->account_b += 5;

This code is not atomic: if another thread reads account_a and account_b between the two statements, it sees a transfer that is only half done. With the following code, however, a reader can never observe a half-finished transfer no matter when it reads:

    Balance* tmp = new Balance();
    tmp->account_a = balance->account_a - 5;
    tmp->account_b = balance->account_b + 5;
    balance = tmp;

The effective point of the whole operation is the single statement balance = tmp;. Because assigning one variable is atomic, the entire transfer becomes atomic.
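
To make the pattern concrete, here is a minimal, runnable C++ sketch of the same pointer-swap idea (our illustration, not OceanBase code). It assumes a single writer; std::atomic makes the swap safe to observe from other threads, and safe reclamation of the old object is deliberately left out:

    #include <atomic>
    #include <cstdio>

    struct Balance {
        int account_a;
        int account_b;
    };

    // Readers and the writer share this pointer; swapping it is the
    // single atomic "effective point" of the transfer.
    std::atomic<Balance*> g_balance{new Balance{100, 200}};

    void transfer(int amount) {
        Balance* old_b = g_balance.load();
        Balance* new_b = new Balance{old_b->account_a - amount,
                                     old_b->account_b + amount};
        g_balance.store(new_b);  // the transfer takes effect here, atomically
        // (a real system would also reclaim old_b safely)
    }

    int main() {
        transfer(5);
        Balance* b = g_balance.load();
        // A concurrent reader sees either (100, 200) or (95, 205),
        // never a half-done transfer.
        std::printf("%d %d\n", b->account_a, b->account_b);
        return 0;
    }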

Using logs to achieve atomicity

The atomic-operation approach above is common in data structure design, but modern database systems implement atomicity with logging, because logs are more flexible, they solve transactional durability at the same time, and they perform better than persisting the data itself directly. Almost all database systems use logging, as do file systems, queueing systems, and more.

With logging, all of a transaction's operations are encoded into a contiguous piece of log content, and that content is written to a persistent device. The moment the log is written successfully is the moment the transaction's atomicity is assured. For the transfer above, the encoded log content looks like this:

    <account_a, balance, -5>, <account_b, balance, +5>

Once this content is persisted, the database system applies the changes to the two accounts. Persisting the account data itself to disk can be postponed indefinitely, because even if the data has not been persisted before a crash, the system can still recover the transfer from the log after restarting. The commit point of the atomic transfer transaction is therefore the moment the log is persisted successfully.

If a transaction produces only one log record and the record is shorter than one atomic write of the device (for example, one 512-byte disk sector), atomicity is easy to guarantee: if the write succeeds the transaction is committed, and if it fails the transaction simply never happened.

If the single record is longer than one atomic write, extra measures are needed to make the log write itself atomic. A common method is length plus checksum: the log header stores the length of the whole record and a checksum over its content. During recovery, the system first reads the header to obtain the length and checksum, then reads the full record according to that length and recomputes the checksum. If the values match, the record is considered complete and the corresponding transaction committed; if they do not match, the transaction is considered not committed.

If a transaction's information is written to disk in several batches, the transaction spans multiple log records. Each record is validated as above, but whether the transaction as a whole is atomic depends on all of them succeeding. The database therefore writes one final confirmation record after all the others have been persisted. Only if this last record is written successfully is the transaction committed; otherwise it is rolled back.
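
The length-plus-checksum framing can be sketched as follows (an illustration under our own assumptions; a real system would typically use CRC32 and a richer header):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical on-disk record header.
    struct LogHeader {
        uint32_t body_len;
        uint32_t checksum;
    };

    // Toy checksum, standing in for CRC32.
    uint32_t simple_checksum(const uint8_t* data, std::size_t n) {
        uint32_t sum = 0;
        for (std::size_t i = 0; i < n; ++i) sum = sum * 131 + data[i];
        return sum;
    }

    // Frame a log body with its length and checksum before writing it.
    std::vector<uint8_t> encode_record(const std::vector<uint8_t>& body) {
        LogHeader h{static_cast<uint32_t>(body.size()),
                    simple_checksum(body.data(), body.size())};
        std::vector<uint8_t> rec(sizeof(h) + body.size());
        std::memcpy(rec.data(), &h, sizeof(h));
        std::memcpy(rec.data() + sizeof(h), body.data(), body.size());
        return rec;
    }

    // On recovery, a record (and its transaction) counts as committed
    // only if it is fully present and the checksum matches.
    bool record_is_complete(const uint8_t* rec, std::size_t avail) {
        LogHeader h;
        if (avail < sizeof(h)) return false;               // torn header
        std::memcpy(&h, rec, sizeof(h));
        if (avail < sizeof(h) + h.body_len) return false;  // torn body
        return simple_checksum(rec + sizeof(h), h.body_len) == h.checksum;
    }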

Principles of distributed transactions

Everything so far assumes that all of a transaction's log records live in a single log sequence. When a database spans multiple machines, each machine has its own log, and a transaction that touches several machines writes log records on all of them; such a transaction is called a distributed transaction. In theory, even with the log scattered across machines, the transaction is committed, and atomicity holds, as soon as every machine has persisted its log successfully. The trouble comes at recovery: to determine the transaction's state from the logs after a restart, every machine would have to confirm that all of its records persisted, and a real system cannot afford a round of multi-machine communication for every transaction whose state needs recovering. Distributed transactions therefore use two-phase commit, which extends the single-machine logging process to the distributed setting.

The typical two-phase commit process is as follows:

On the left, (a) is the coordinator state machine; on the right, (b) is the participant state machine. The coordinator drives the entire commit process; it can live on any machine, including the same machine as one of the participants. The participants are the executors that actually carry out the transaction's operations. The coordinator first tells all participants to persist their state (the prepare command in the diagram). When a participant has persisted the transaction's log it replies prepare OK, and once every participant has replied prepare OK the transaction's outcome is decided: the coordinator writes a commit log record and sends commit to all participants. If any participant instead reports failure, the coordinator judges the transaction a failure, writes a rollback log record, and sends abort to all participants.

In this process, the commit latency of a distributed transaction is two log writes (the participants' prepare log plus the coordinator's commit log) and two network round trips (the coordinator's prepare round plus its commit round).
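
A bare-bones rendition of the classic protocol's control flow (a sketch with hypothetical interfaces, sequential rather than parallel for brevity):

    #include <vector>

    enum class Vote { kPrepareOk, kAbort };

    // Hypothetical participant interface; the names are illustrative.
    struct Participant {
        virtual Vote prepare() = 0;  // persist prepare log, then vote
        virtual void commit() = 0;   // persist commit log, apply changes
        virtual void abort() = 0;    // roll the transaction back
        virtual ~Participant() = default;
    };

    // Classic coordinator: note its own log write between the two rounds;
    // this is exactly the write OceanBase later removes.
    bool two_phase_commit(std::vector<Participant*>& participants,
                          void (*write_decision_log)(bool committed)) {
        bool all_ok = true;
        for (auto* p : participants)            // round 1: prepare
            if (p->prepare() != Vote::kPrepareOk) { all_ok = false; break; }

        write_decision_log(all_ok);             // coordinator's commit/abort log

        for (auto* p : participants) {          // round 2: commit or abort
            if (all_ok) p->commit(); else p->abort();
        }
        return all_ok;
    }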

OceanBase's optimization of distributed transactions

OceanBase optimizes this two-phase commit protocol. Whether a transaction commits depends, in essence, only on whether all of the transaction's logs are persisted to disk; once they all are, the outcome is determined without any coordinator log. OceanBase's two-phase commit therefore drops the coordinator's log write entirely, turning the coordinator into a state machine with no persistent state, as follows:

The diagram looks complicated because a real implementation must handle many kinds of exceptions, and it also carries various optimizations. The participant state machine is as follows:

Compared with the classic protocol, these coordinator and participant state machines differ not only in their extensive exception handling and optimizations but, most significantly, in an extra CLEAR phase. In theory the protocol is finished once the coordinator completes the commit operation. In practice, even after the transaction's outcome is determined, network drops or machine failures can leave the coordinator and participants unsure whether messages were delivered, so they must query each other's state or retry. The CLEAR state exists to ensure that every state machine has reached a definite final state before the data structures backing it are cleaned up.

OceanBase's optimization of the coordinator

As described earlier, in the single-machine case a transaction that writes multiple log records writes one final record, after all the others have succeeded, to confirm the commit; that confirmation can even be merged into the last ordinary log record of the transaction. In a distributed transaction, however, the commit record that the classic two-phase commit coordinator writes after confirming all participants' logs cannot be merged with any participant's log. Yet whether the transaction commits is already determined once all participants' logs are persisted. A major optimization in OceanBase is therefore to omit the coordinator's commit log altogether: after receiving prepare OK from every participant, the coordinator writes no local commit record and sends commit requests directly to all participants. Each participant writes a commit log record when it receives the commit request, exactly as in the original protocol.

In this scheme the extreme case is this: every participant has successfully written its prepare log, but the system crashes before the coordinator sends the commit messages. How is this transaction, which by rights already succeeded, recovered? After the restart, a participant finds the transaction in the PREPARED state and asks the coordinator for its status. The coordinator crashed too, and since it persisted nothing, no coordinator state exists. On receiving the participant's inquiry, however, the machine rebuilds the coordinator state machine and queries all the other participants for the transaction's state. Finding every participant PREPARED, it concludes that the transaction's outcome is commit and sends commit requests to all participants, and the state machines proceed normally.
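
That recovery rule is simple enough to spell out. The sketch below (our reconstruction, with a hypothetical query interface) derives the outcome purely from the participants' persisted states, which is what lets the coordinator stay stateless:

    #include <vector>

    enum class TxnState { kUnknown, kPrepared, kCommitted, kAborted };

    // Hypothetical interface: each participant reports the state it
    // recovered for the transaction from its own log.
    using StateQuery = TxnState (*)(int participant_id);

    // Rebuilt, stateless coordinator deciding a transaction's fate.
    TxnState decide_after_restart(const std::vector<int>& participants,
                                  StateQuery query) {
        for (int id : participants) {
            TxnState s = query(id);
            if (s == TxnState::kCommitted) return TxnState::kCommitted;
            if (s == TxnState::kAborted)   return TxnState::kAborted;
            // No prepare record: the transaction cannot have committed.
            if (s == TxnState::kUnknown)   return TxnState::kAborted;
            // s == kPrepared: keep checking the rest.
        }
        // Every participant persisted its prepare log, so the outcome
        // was already determined: commit.
        return TxnState::kCommitted;
    }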

OceanBase's optimization of single-machine multi-partition transactions

OceanBase is deployed by partition, and multiple partitions on the same machine belong to separate Paxos groups, so even a transaction confined to one machine is a distributed transaction if it touches several partitions. A multi-machine distributed transaction answers the user once all participants have replied commit OK to the coordinator. OceanBase uses MVCC for concurrency control, and a participant must update the global version number when the transaction commits (described in the next section); on receiving a commit request, a participant may update the global version number and reply to the coordinator before its commit log is persisted. Even so, the commit round still costs a network round trip.

For a single-machine multi-partition transaction, the global version number that the multiple partitions would each update is in fact one and the same, and the coordinator of such a transaction necessarily sits on the same machine. So once the coordinator has received prepare OK from every participant, and thus knows the transaction can commit, it can update the global version number on the participants' behalf and answer the user that the commit succeeded as soon as the prepare OK replies arrive, saving the round trip of the commit phase.
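
Under stated assumptions (coordinator co-located with all partitions, version numbers assigned monotonically), the fast path might look like this; it is our sketch of the idea, not OceanBase source:

    #include <atomic>
    #include <vector>

    std::atomic<long> g_committed_version{0};  // the GCV of the next section

    struct PartitionTxn {
        bool prepare_ok = false;              // prepare log already durable
        void write_commit_log_async() { /* persist commit record later */ }
    };

    // Single-machine multi-partition fast path: once every local
    // partition's prepare log is durable, the co-located coordinator
    // bumps the shared global version itself and answers the user,
    // skipping the commit round trip.
    bool commit_local_multi_partition(std::vector<PartitionTxn>& parts,
                                      long txn_version) {
        for (auto& p : parts)
            if (!p.prepare_ok) return false;   // not decided yet
        g_committed_version.store(txn_version);
        for (auto& p : parts) p.write_commit_log_async();
        return true;                           // user can be answered now
    }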

Isolation

A database system cannot serve only one user; it must let many users access data at the same time, and keeping the data correct under concurrent access, while also preserving transaction atomicity, is very challenging. Databases use concurrency control to coordinate concurrently executing transactions. The common algorithms are lock-based concurrency control and multi-version concurrency control (MVCC).

Lock-based Concurrency Control

This method resembles the locking commonly used in data structure design. During a transaction, the database locks every row the user touches: a read lock for reads and a write lock for writes. If two transactions try to modify the same row, the one that acquires the write lock first proceeds normally, while the second waits; only when the first transaction finishes and releases the row lock does the second resume. This ordering is what keeps the data correct.

    Account A: 100 yuan        Account B: 200 yuan

Take the transfer operation again: as shown above, accounts A and B hold 100 yuan and 200 yuan respectively. If transaction 1 transfers 10 yuan from account A to account B, both rows are write-locked for the duration of the transaction. If a second transaction then tries another transfer of 50 yuan from A to B, it cannot acquire the write lock on row A and must wait.

Now suppose that while transaction 1 or 2 is executing, a third transaction queries both accounts in order to compute the sum of their balances. It must read rows A and B, taking a read lock before each read; but during transactions 1 and 2 both rows hold write locks, so transaction 3 has to wait for them to finish before it can proceed.
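
The row-lock behavior in these two examples can be modeled with per-row reader-writer locks; a minimal sketch (ignoring deadlock handling, which a real lock manager must add):

    #include <mutex>
    #include <shared_mutex>

    struct Row {
        std::shared_mutex latch;  // read lock = shared, write lock = exclusive
        int balance = 0;
    };

    Row account_a, account_b;     // e.g. 100 and 200 yuan

    // Transactions 1 and 2: write-lock both rows for the duration.
    void transfer(Row& from, Row& to, int amount) {
        std::unique_lock<std::shared_mutex> lock_from(from.latch);
        std::unique_lock<std::shared_mutex> lock_to(to.latch);
        from.balance -= amount;
        to.balance += amount;
    }   // locks released here, at "transaction end"

    // Transaction 3: read-locks both rows, so it blocks until any
    // in-flight transfer releases its write locks.
    int sum_balances() {
        std::shared_lock<std::shared_mutex> la(account_a.latch);
        std::shared_lock<std::shared_mutex> lb(account_b.latch);
        return account_a.balance + account_b.balance;  // always 300
    }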

Multi-Version Concurrency Control

In the example above, the sum of the two balances is deterministic whether or not transactions 1 and 2 have executed: it is always 300 yuan. Since the database system itself schedules the parallel execution of transactions, multi-version concurrency control (MVCC) takes advantage of this by keeping a version for every modification of the data, so that reads can run against a historical version and proceed entirely unaffected by concurrent modifications.

Continue with transaction 3 in the example. Under MVCC, every transaction has a commit version number; suppose transaction 1's is 100 and transaction 2's is 101. After both modifications complete, the data in the database looks like this:

    Account A: <version 101: 40>  -> <version 100: 90>  -> <version 98: 100>
    Account B: <version 101: 260> -> <version 100: 210> -> <version 98: 200>

The initial version of the data is 98, and the record of each modification is chained to the previous one. A global variable, the committed version (GCV), records the most recently committed version number across the system: before transaction 1 executes the GCV is 98; after transaction 1 commits it becomes 100; after transaction 2 commits it becomes 101. A transaction reads the GCV when it starts executing and then reads data according to that version number.
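
Reading at a snapshot then amounts to walking the version chain for the newest version no greater than the reader's GCV. A small self-contained sketch (illustrative data layout, not OceanBase's):

    #include <iterator>
    #include <map>

    // A row's committed history: commit version -> value.
    // Account A above: {98: 100, 100: 90, 101: 40}.
    using VersionChain = std::map<long, int>;

    // Snapshot read: newest version with commit version <= snapshot.
    // Assumes the snapshot is no older than the oldest kept version.
    int mvcc_read(const VersionChain& chain, long snapshot_version) {
        auto it = chain.upper_bound(snapshot_version);
        return std::prev(it)->second;
    }

    // mvcc_read(a_chain, 98)  == 100   (before transaction 1)
    // mvcc_read(a_chain, 100) == 90    (transaction 1 visible)
    // mvcc_read(a_chain, 101) == 40    (both transfers visible)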

MVCC modifications still rely on the locking described above, so conflicting writes must still wait for one another. The great advantage of MVCC is that reads are completely decoupled from writes and the two never block each other, which matters enormously for database performance and concurrency.

OceanBase's Solution

Under MVCC, the GCV must advance in order: moving it to 101 asserts that every transaction with version 101 or earlier has committed. With distributed transactions in OceanBase this guarantee becomes hard to maintain. Suppose the transaction with version 100 is distributed while the one with version 101 is single-machine; since a distributed commit takes markedly longer, transaction 101 may finish first. If the GCV were then advanced to 101 while transaction 100 was still mid-commit, readers at version 101 would need version-100 data that cannot yet be read.

OceanBase therefore combines MVCC with the lock-based approach. A read first obtains the GCV and then reads each row at that version number. If a row has no pending modification, or has one whose version number is greater than the GCV, the row can be read directly at the chosen version. But if the row carries an uncommitted transaction whose eventual version number may turn out to be no greater than the read version, the read must wait for that transaction to finish, just as a read lock waits for a write lock in the lock-based scheme. Because this situation is rare, overall performance remains essentially that of MVCC.
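
The read rule can be stated compactly. Below is our sketch of the decision (hypothetical field names; OceanBase's actual bookkeeping is more involved):

    // Per-row bookkeeping assumed for illustration.
    struct RowWriteInfo {
        bool has_uncommitted_writer;       // a writer between prepare and commit
        long min_possible_commit_version;  // lower bound on its final version
    };

    // True if the row can be read at `read_version` without waiting.
    bool can_read_without_waiting(const RowWriteInfo& row, long read_version) {
        if (!row.has_uncommitted_writer)
            return true;                                  // plain MVCC read
        if (row.min_possible_commit_version > read_version)
            return true;  // the pending write can never be visible to us
        return false;     // wait, like a read lock waiting on a write lock
    }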

Summary

This article has looked at just two angles, atomicity and isolation, tracing the path from the common implementation techniques to the trade-offs and optimizations OceanBase makes in a distributed environment. Many more of OceanBase's optimizations live in the details; interested readers are welcome to get in touch and discuss.
