"Rocksdb" transactiondb Source analysis



RocksDB version: v5.13.4

    1. Overview
      Thanks to its LSM-tree structure, RocksDB never updates data in place, which makes transactions relatively easy to support. The main idea is to use a WriteBatch to buffer a transaction's writes in memory; at commit the whole WriteBatch is written in one shot, guaranteeing atomicity, while sequence numbers and key locks are used to resolve conflicts and implement isolation.

RocksDB transactions come in two flavors, pessimistic and optimistic, analogous to pessimistic and optimistic locks. A PessimisticTransaction performs conflict detection and locking before every write operation in the transaction (locks are released after commit), and the operation fails on conflict. An OptimisticTransaction takes no locks; conflict detection is deferred to the commit phase, and Commit fails when a conflict is found.

Which flavor to use depends on the workload: if concurrent transactions rarely write overlapping keys, optimistic is more appropriate, since it eliminates the extra locking overhead of pessimistic.

    2. Usage
      Before introducing the implementation, let's look at how the API is used:

"1. Basic usage "

Options options;
TransactionDBOptions txn_db_options;
options.create_if_missing = true;
TransactionDB* txn_db;

// Open the DB (Pessimistic by default)
Status s = TransactionDB::Open(options, txn_db_options, kDBPath, &txn_db);
assert(s.ok());

// Create a transaction
Transaction* txn = txn_db->BeginTransaction(write_options);
assert(txn);

// Read a key inside transaction txn
s = txn->Get(read_options, "abc", &value);
assert(s.IsNotFound());

// Write a key inside transaction txn
s = txn->Put("abc", "def");
assert(s.ok());

// Read a key outside the transaction via TransactionDB::Get
s = txn_db->Get(read_options, "abc", &value);

// Write a key outside the transaction via TransactionDB::Put.
// This has no effect on txn, because the key written is not "abc",
// so there is no conflict. If it were "abc", the Put would block
// until it timed out or the transaction committed (in this example
// it would time out).
s = txn_db->Put(write_options, "xyz", "zzz");

s = txn->Commit();
assert(s.ok());

// Destroy the transaction
delete txn;
delete txn_db;

Open a transaction with BeginTransaction, call Put, Get, and the other interfaces to operate within the transaction, and finally call Commit to commit it.

"2. Roll Back "

...
// Transaction txn writes "abc"
s = txn->Put("abc", "def");
assert(s.ok());

// Set a savepoint
txn->SetSavePoint();

// Transaction txn writes "cba"
s = txn->Put("cba", "fed");
assert(s.ok());

// Roll back to the savepoint
s = txn->RollbackToSavePoint();

// Commit; the transaction no longer contains the write to "cba"
s = txn->Commit();
assert(s.ok());
...

"3. Getforupdate "

...
// Transaction txn reads "abc" and takes exclusive ownership of the key,
// ensuring no outside transaction can modify it
s = txn->GetForUpdate(read_options, "abc", &value);
assert(s.ok());

// Write "abc" outside the transaction via TransactionDB::Put
// This will not succeed
s = txn_db->Put(write_options, "abc", "value0");

s = txn->Commit();
assert(s.ok());
...

Sometimes a transaction needs to read a key first and write it later. In that case exclusivity and conflict detection for the key cannot wait until write time, so the GetForUpdate interface is used to read the key and take exclusive ownership at read time.

"4. Setsnapshot "

txn = txn_db->BeginTransaction(write_options);

// Set the transaction's snapshot to the current global sequence number
txn->SetSnapshot();

// Write "key1" outside the transaction via TransactionDB::Put;
// the global sequence number is incremented
s = db->Put(write_options, "key1", "value0");
assert(s.ok());

// Transaction txn writes "key1"
s = txn->Put("key1", "value1");
s = txn->Commit();
// This fails: after the transaction set its snapshot, a key the
// transaction later writes was modified outside the transaction.
// Pessimistic fails at Put; Optimistic fails at Commit.

As mentioned earlier, TransactionDB takes exclusive ownership of a key, or detects conflicts on it, only when the key is written inside the transaction. Sometimes it is desirable to protect every key the transaction will write from the very start of the transaction, which SetSnapshot achieves: once the snapshot is set, if a key the transaction later writes has been modified externally, the transaction eventually fails. The failure point depends on the flavor: pessimistic does conflict detection at Put, so the Put fails; optimistic detects the conflict at Commit and fails there.

    3. Implementation
      3.1 WriteBatch & WriteBatchWithIndex
      WriteBatch needs no elaboration: the transaction appends all of its writes to the same WriteBatch until commit, when the batch is written to the DB atomically.

WriteBatchWithIndex additionally maintains a skiplist alongside the WriteBatch, recording the offset (and other information) of each operation in the batch. Before the transaction commits, its data is not in the memtable; it exists only in the WriteBatch. When needed, WriteBatchWithIndex can therefore be used to read back data the transaction has written but not yet committed.
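To make the read-your-own-writes idea concrete, here is a toy model (my own sketch, not RocksDB code): an append-only op log standing in for the WriteBatch, plus an index mapping each key to its latest offset in the log, consulted before falling back to the committed "db".

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy model of WriteBatchWithIndex: ops is the append-only batch, index
// maps each key to the offset of its latest operation in the batch.
struct IndexedBatch {
  std::vector<std::pair<std::string, std::string>> ops;
  std::map<std::string, size_t> index;

  void Put(const std::string& key, const std::string& value) {
    index[key] = ops.size();
    ops.emplace_back(key, value);
  }

  // Read-your-own-writes: consult the batch index before the committed db.
  std::optional<std::string> GetFromBatchAndDB(
      const std::map<std::string, std::string>& db,
      const std::string& key) const {
    auto it = index.find(key);
    if (it != index.end()) return ops[it->second].second;
    auto dbit = db.find(key);
    if (dbit != db.end()) return dbit->second;
    return std::nullopt;
  }
};
```

Uncommitted writes shadow the committed value for the same key, which is exactly what a transaction's own Get must observe.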

The transaction's SetSavePoint and RollbackToSavePoint are also implemented through the WriteBatch: SetSavePoint records the current WriteBatch size and statistics; after further operations, rolling back only requires truncating the WriteBatch back to the recorded size and restoring the statistics.
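The truncation trick can be sketched in a few lines (a toy model of the mechanism, not RocksDB code; statistics are omitted and the batch is modeled as a plain string):

```cpp
#include <cassert>
#include <stack>
#include <string>

// Toy model of savepoints over an append-only batch: SetSavePoint records
// the current batch size; RollbackToSavePoint truncates back to that size.
struct BatchWithSavePoints {
  std::string rep;                 // serialized batch contents
  std::stack<size_t> save_points;  // recorded sizes

  void Append(const std::string& record) { rep += record; }
  void SetSavePoint() { save_points.push(rep.size()); }
  bool RollbackToSavePoint() {
    if (save_points.empty()) return false;  // no savepoint to roll back to
    rep.resize(save_points.top());          // drop everything after it
    save_points.pop();
    return true;
  }
};
```

Because the batch is append-only, truncation alone is enough to discard every operation made after the savepoint.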

3.2 PessimisticTransaction
PessimisticTransactionDB manages row locks through TransactionLockMgr. Before each write operation in a transaction, TryLock must be called on the key for exclusivity and conflict detection. Take Put as an example:

Status TransactionBaseImpl::Put(ColumnFamilyHandle* column_family,
                                const Slice& key, const Slice& value) {
  // Call TryLock for lock acquisition and conflict detection
  Status s =
      TryLock(column_family, key, false /* read_only */, true /* exclusive */);
  if (s.ok()) {
    s = GetBatchForWrite()->Put(column_family, key, value);
    if (s.ok()) {
      num_puts_++;
    }
  }
  return s;
}

Note that Put is defined in TransactionBaseImpl, so both pessimistic and optimistic Put share this logic; the difference lies in the override of TryLock. Look at pessimistic first: the call chain runs TransactionBaseImpl::TryLock -> PessimisticTransaction::TryLock -> PessimisticTransactionDB::TryLock -> TransactionLockMgr::TryLock, where the key is finally locked. A successful lock gives the transaction exclusive ownership of the key: until the transaction commits, no other transaction can modify it.

A successful lock, however, only guarantees the key will not be modified externally from now until the end of the transaction. If the transaction called SetSnapshot at the start, and the same key was modified (and committed) externally between the snapshot and this Put, the snapshot guarantee is already broken and the Put must not succeed. This conflict detection is also done in PessimisticTransaction::TryLock, as follows:

Status PessimisticTransaction::TryLock(ColumnFamilyHandle* column_family,
                                       const Slice& key, bool read_only,
                                       bool exclusive, bool skip_validate) {
  ...
  // locking
  if (!previously_locked || lock_upgrade) {
    s = txn_db_impl_->TryLock(this, cfh_id, key_str, exclusive);
  }

  SetSnapshotIfNeeded();
  ...
  // Compare the transaction's snapshot sequence (seq1) with the key's
  // latest sequence in the db (seq2); if seq2 > seq1, the key was written
  // externally after the snapshot was taken -- a conflict!
  s = ValidateSnapshot(column_family, key, &tracked_at_seq);

  if (!s.ok()) {
    // Conflict detected; failed to validate key
    if (!previously_locked) {
      // Unlock key we just locked
      if (lock_upgrade) {
        s = txn_db_impl_->TryLock(this, cfh_id, key_str,
                                  false /* exclusive */);
        assert(s.ok());
      } else {
        txn_db_impl_->UnLock(this, cfh_id, key.ToString());
      }
    }
  }

  if (s.ok()) {
    // Locking and conflict detection passed; record this key so its lock
    // can be released at the end of the transaction. We must track all the
    // locked keys so that we can unlock them later. If the key is already
    // locked, this func will update some stats on the tracked key. It could
    // also update tracked_at_seq if it is lower than the existing tracked seq.
    TrackKey(cfh_id, key_str, tracked_at_seq, read_only, exclusive);
  }
}

ValidateSnapshot performs the conflict detection by comparing the transaction's snapshot sequence with the key's latest sequence: if the snapshot is older than the key's latest sequence, some external transaction modified the key after the snapshot was set, which is a conflict. Getting the key's latest sequence is simple and brute-force: traverse the memtable, the immutable memtable, the memtable list history, and the SST files until it is found.
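The core comparison can be summarized as a toy model (my own sketch, not RocksDB code), with the key-to-latest-sequence lookup standing in for the memtable/SST traversal:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using SequenceNumber = uint64_t;

// Toy model of snapshot validation: a write made under snapshot_seq
// conflicts if the key was last modified at a higher sequence.
bool ValidateSnapshot(const std::map<std::string, SequenceNumber>& latest_seq,
                      const std::string& key, SequenceNumber snapshot_seq) {
  auto it = latest_seq.find(key);
  if (it == latest_seq.end()) return true;  // key never written: no conflict
  return it->second <= snapshot_seq;        // ok iff last write predates snapshot
}
```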

The GetForUpdate logic is almost identical to Put: it performs the same locking and conflict detection before doing the read.

Next, let's look at the structure of TransactionLockMgr.

The outermost layer is a std::unordered_map mapping each column family to a LockMap. Each LockMap holds 16 LockMapStripes by default, and each LockMapStripe contains a std::unordered_map keyed by key, storing the lock information for each key. Each locking operation then proceeds roughly as follows:

1. Get the lock_maps cache via thread-local storage.
2. Look up the corresponding LockMap by column family ID.
3. Hash the key to a LockMapStripe and lock that stripe (all keys in the same stripe contend for one lock, so the granularity is slightly coarse).
4. Operate on the stripe's std::unordered_map to complete the per-key lock.
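The steps above can be sketched as a toy striped lock map (my own model, not RocksDB code; real TransactionLockMgr also handles timeouts, shared locks, and deadlock detection):

```cpp
#include <cassert>
#include <cstdint>
#include <array>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Toy model: a key hashes to one of kNumStripes stripes; all keys in a
// stripe share one mutex, and the stripe's map records the owner of each key.
struct LockMap {
  static constexpr size_t kNumStripes = 16;
  struct Stripe {
    std::mutex mu;
    std::unordered_map<std::string, uint64_t> keys;  // key -> owning txn id
  };
  std::array<Stripe, kNumStripes> stripes;

  size_t StripeOf(const std::string& key) const {
    return std::hash<std::string>{}(key) % kNumStripes;
  }

  // Returns true if txn_id acquired (or already held) the key's lock.
  bool TryLock(uint64_t txn_id, const std::string& key) {
    Stripe& s = stripes[StripeOf(key)];
    std::lock_guard<std::mutex> guard(s.mu);
    auto it = s.keys.find(key);
    if (it != s.keys.end()) return it->second == txn_id;
    s.keys.emplace(key, txn_id);
    return true;
  }

  void UnLock(uint64_t txn_id, const std::string& key) {
    Stripe& s = stripes[StripeOf(key)];
    std::lock_guard<std::mutex> guard(s.mu);
    auto it = s.keys.find(key);
    if (it != s.keys.end() && it->second == txn_id) s.keys.erase(it);
  }
};
```

Striping trades a little contention (unrelated keys in one stripe share a mutex) for a bounded number of mutexes regardless of how many keys are locked.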
3.3 OptimisticTransaction
OptimisticTransactionDB takes no locks for key exclusivity; conflict detection happens only at commit. OptimisticTransaction::TryLock therefore looks like this:

Status OptimisticTransaction::TryLock(ColumnFamilyHandle* column_family,
                                      const Slice& key, bool read_only,
                                      bool exclusive, bool untracked) {
  if (untracked) {
    return Status::OK();
  }
  uint32_t cfh_id = GetColumnFamilyID(column_family);

  SetSnapshotIfNeeded();

  // If the transaction set a snapshot earlier, use it as this key's seq;
  // otherwise use the current global sequence number as the key's seq
  SequenceNumber seq;
  if (snapshot_) {
    seq = snapshot_->GetSequenceNumber();
  } else {
    seq = db_->GetLatestSequenceNumber();
  }

  std::string key_str = key.ToString();

  // Record this key and its seq; at commit time this seq is compared with
  // the key's current sequence to do conflict detection
  TrackKey(cfh_id, key_str, seq, read_only, exclusive);

  // Always return OK. Conflict checking will happen at commit time.
  return Status::OK();
}

So TryLock here merely tags the key with a sequence number and records it, for conflict detection at commit time. Commit is implemented as follows:

Status OptimisticTransaction::Commit() {
  // Set up callback which will call CheckTransactionForConflicts() to
  // check whether this transaction is safe to be committed.
  OptimisticTransactionCallback callback(this);

  DBImpl* db_impl = static_cast_with_check<DBImpl, DB>(db_->GetRootDB());

  // Call WriteWithCallback for conflict detection; if there is no
  // conflict, write to the db
  Status s = db_impl->WriteWithCallback(
      write_options_, GetWriteBatch()->GetWriteBatch(), &callback);

  if (s.ok()) {
    Clear();
  }

  return s;
}

Conflict detection is implemented in OptimisticTransactionCallback; like PessimisticTransaction with a snapshot set, it ultimately calls TransactionUtil::CheckKeysForConflicts to detect conflicts, i.e. by comparing sequence numbers.
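The commit-time check can be modeled as follows (my own sketch, not RocksDB code): every tracked key remembers the sequence it was observed at, and commit aborts if any tracked key has since been written at a higher sequence.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using SequenceNumber = uint64_t;

// Toy model of optimistic conflict detection at commit time.
// tracked:    key -> sequence recorded by TryLock/TrackKey
// latest_seq: key -> sequence of the key's latest write in the db
bool CheckKeysForConflicts(
    const std::map<std::string, SequenceNumber>& tracked,
    const std::map<std::string, SequenceNumber>& latest_seq) {
  for (const auto& kv : tracked) {
    auto it = latest_seq.find(kv.first);
    if (it != latest_seq.end() && it->second > kv.second) {
      return false;  // key was written after we tracked it: conflict
    }
  }
  return true;  // safe to commit
}
```

The comparison is the same as in the pessimistic case; the difference is purely when it runs (once, over all tracked keys, at commit) rather than at each write.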

3.4 Two-phase commit (2PC)
When using PessimisticTransaction in a distributed scenario, we may need two-phase commit (2PC) to ensure a transaction succeeds on multiple nodes, so PessimisticTransaction also supports 2PC. The idea is straightforward: split the previous commit into Prepare and Commit. The Prepare phase writes the WAL; the Commit phase writes the memtable (after which the writes are visible to others). The flow of a transaction now becomes:

BeginTransaction
GetForUpdate
Put
...
Prepare
Commit

To use 2PC, we first give the transaction a unique identifier via SetName, which registers it in a global mapping table recording all in-flight 2PC transactions; the entry is deleted from the table at Commit.

The 2PC implementation itself mostly manipulates the WriteBatch, using special marker records to control what is written to the WAL and the memtable. Briefly:

A normal WriteBatch has the following structure:

Sequence(0);NumRecords(3);Put(a,1);Merge(a,1);Delete(a);
The WriteBatch at the start of a 2PC transaction looks like this:

Sequence(0);NumRecords(0);Noop;
It begins with a Noop placeholder; the reason will become clear shortly. Operations are then appended, after which the WriteBatch looks like:

Sequence(0);NumRecords(3);Noop;Put(a,1);Merge(a,1);Delete(a);
Then Prepare executes and the WAL is written. Before the WAL write, the WriteBatch is modified: Prepare and EndPrepare records are inserted, as follows:

Sequence(0);NumRecords(3);Prepare();Put(a,1);Merge(a,1);Delete(a);EndPrepare(xid)
The earlier Noop placeholder has been replaced by Prepare(), and EndPrepare(xid) is appended at the end. Once the WriteBatch is constructed, WriteImpl is called to write it directly to the WAL. Note that the sequence this log carries into the WAL is larger than the VersionSet's last_sequence_, yet SetLastSequence is not called after the WAL write; the VersionSet is updated only after the later memtable write. Concretely, the VersionSet gets, besides last_sequence_, a last_allocated_sequence_. The two start out equal; the WAL write advances the latter, which is not externally visible, while the former advances only after commit. So once a PessimisticTransactionDB uses 2PC, all of its transactions must use 2PC, or last_sequence_ could become inconsistent. (Correction: if two_write_queues_ is used, sequence allocation is based on last_allocated_sequence_ whether a transaction prepares then commits or commits directly, and that value is used at the end to adjust last_sequence_; if two_write_queues_ is not used, last_sequence_ is advanced directly. Either way no sequence inconsistency arises, so prepare-then-commit and direct commit can be mixed.)
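The placeholder-replacement step can be sketched as a toy transformation (my own model, not RocksDB code; records are modeled as strings rather than the real binary batch format):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of how 2PC rewrites the batch before the WAL write: the batch
// was created with a Noop placeholder at the front; Prepare() replaces the
// placeholder and EndPrepare(xid) is appended at the end.
std::vector<std::string> MarkBatchForPrepare(std::vector<std::string> records,
                                             const std::string& xid) {
  // records[0] is the Noop placeholder reserved when the batch was created
  records[0] = "Prepare()";
  records.push_back("EndPrepare(" + xid + ")");
  return records;
}
```

This is why the Noop placeholder is reserved up front: it lets Prepare() be patched in without shifting the operation records that follow it.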

Once the WAL is written, it is fine to crash even before Commit: on restart, recovery reloads the transactions from the WAL into the global recovered-transactions map, where they wait for a Commit.

Finally, Commit. The commit phase uses a new CommitTime WriteBatch, merges it with the previous WriteBatch, and ultimately writes the memtable from the CommitTime WriteBatch.

After the merge, the CommitTime WriteBatch looks like this:

Sequence(0);NumRecords(3);Commit(xid);Prepare();Put(a,1);Merge(a,1);Delete(a);EndPrepare(xid);

The CommitTime WriteBatch's WAL termination point is set to the Commit(xid) record, telling the writer to stop there when writing the WAL. In effect only the Commit record goes into the WAL, because the operation records were already written to the WAL in the Prepare phase.

Finally, MemTableInserter traverses this CommitTime WriteBatch and writes it into the memtable (details omitted). After the write succeeds, the VersionSet's last_sequence_ is updated and the transaction is committed.

    4. WritePrepared & WriteUnprepared
      As we have seen, pessimistic and optimistic transactions share a drawback: until the final commit, all of a transaction's data is cached in memory (the WriteBatch). For a large transaction this consumes a lot of memory, and it pushes all of the actual write pressure onto the commit phase, creating a performance bottleneck. RocksDB is therefore adding the WritePrepared and WriteUnprepared write policies for PessimisticTransaction; the main idea is to write the memtable earlier:

If the memtable write is moved to the Prepare phase, that is WritePrepared.

If it is moved even earlier, writing the memtable on every operation, that is WriteUnprepared.

WriteUnprepared does best on both memory usage and spreading out the write pressure; WritePrepared achieves somewhat less.
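For reference, the write policy is selected through TransactionDBOptions when opening the database. A minimal sketch, assuming a RocksDB build that exposes TxnDBWritePolicy (the path here is illustrative; the default policy is WRITE_COMMITTED, i.e. the classic commit-time memtable write):

```cpp
#include "rocksdb/utilities/transaction_db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::TransactionDBOptions txn_db_options;
  // WRITE_PREPARED moves the memtable write to the Prepare phase.
  txn_db_options.write_policy = rocksdb::TxnDBWritePolicy::WRITE_PREPARED;

  rocksdb::TransactionDB* txn_db = nullptr;
  rocksdb::Status s = rocksdb::TransactionDB::Open(
      options, txn_db_options, "/tmp/txn_db_example", &txn_db);
  if (s.ok()) {
    delete txn_db;
  }
  return 0;
}
```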

The difficulty with these new write policies is ensuring that data written to the memtable but not yet committed is invisible to other transactions, which again requires playing tricks with sequence numbers. RocksDB currently supports WritePrepared, while WriteUnprepared is not yet supported; stay tuned.

    5. Isolation level
      After the preceding discussion, this needs little elaboration:

TransactionDB supports the ReadCommitted and RepeatableReads isolation levels.
This article is original content from the Yunqi (Alibaba Cloud) community and may not be reproduced without permission.

"Rocksdb" transactiondb Source analysis

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.