RocksDB locking mechanism


As an open-source storage engine, RocksDB supports ACID transactions, and supporting the I (isolation) in ACID makes concurrency control indispensable. This article discusses the implementation of the RocksDB lock mechanism; the details involve source-code analysis, and the hope is that readers come away with insight into RocksDB's concurrency control principles. The article covers four aspects: first the basic structure of RocksDB locks, then the design of the row-lock data structure and the memory overhead of locks, followed by the locking process in several typical scenarios, and finally the deadlock detection mechanism that any locking scheme requires.

1. Row lock data structure
    RocksDB's smallest lock granularity is the row; for a KV store, the lock object is the key, and each key corresponds to a LockInfo structure. All keys are managed through a hash table, so looking up a lock is just a hash-table probe to determine whether the key is already locked. If there were only one global hash table, however, accesses to it would conflict heavily and hurt concurrency. RocksDB therefore splits locks first by column family: the locks of each column family are managed by a LockMap, and each LockMap is in turn split into several shards, each managed by a LockMapStripe. The hash table (std::unordered_map<std::string, LockInfo>) lives inside the stripe structure, which also contains a mutex and a condition_variable. Their main functions are, respectively, to serialize access to the hash table and, when a lock conflict occurs, to suspend the thread and wake the suspended threads after the lock is released. This design is simple but has an obvious problem: multiple unrelated locks share one condition_variable, so releasing a lock needlessly wakes a batch of threads which, after re-checking, find they still have to wait, causing useless context switches. Comparing with the InnoDB lock mechanism we discussed earlier, InnoDB lets the records within a page share one lock object, with reuse under a condition: locks taken by the same transaction on several records of the same page can be reused; and InnoDB's lock wait queue waits precisely, down to the record level, so it causes no invalid wakeups.
Although the RocksDB lock design is coarse, it does include some optimizations. For example, when managing LockMaps, each thread caches a local copy, lock_maps_cache_, and the per-thread caches are chained together through a global linked list; when the LockMaps change (e.g. a column family is deleted), every thread's copy is invalidated. Because column families change rarely, most accesses to LockMaps need no lock, which improves concurrency.

struct LockInfo {
  bool exclusive;                     // exclusive or shared lock
  autovector<TransactionID> txn_ids;  // transaction list; for shared locks,
                                      // one key can belong to several txns
  uint64_t expiration_time;           // the lock is no longer valid after this
};

struct LockMapStripe {
  // Mutex must be held before modifying the keys map
  std::shared_ptr<TransactionDBMutex> stripe_mutex;
  // Condition variable per stripe for waiting on a lock
  std::shared_ptr<TransactionDBCondVar> stripe_cv;
  // Locked keys mapped to the info on the transactions that locked them
  std::unordered_map<std::string, LockInfo> keys;
};

struct LockMap {
  const size_t num_stripes_;                      // number of shards
  std::atomic<int64_t> lock_cnt{0};               // number of locks
  std::vector<LockMapStripe*> lock_map_stripes_;  // lock shards
};

class TransactionLockMgr {
  using LockMaps = std::unordered_map<uint32_t, std::shared_ptr<LockMap>>;
  LockMaps lock_maps_;
  // Thread-local cache of entries in lock_maps_. This is an optimization
  // to avoid acquiring a mutex in order to look up a LockMap.
  std::unique_ptr<ThreadLocalPtr> lock_maps_cache_;
};
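To illustrate the sharding described above, here is a minimal sketch (hypothetical names, not the RocksDB source) of how a key is routed to a stripe: hash the key and take the result modulo the stripe count, so unrelated keys mostly contend on different stripe mutexes.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// In RocksDB each stripe would hold stripe_mutex, stripe_cv and the keys map.
struct StripeSketch {};

class LockMapSketch {
 public:
  explicit LockMapSketch(std::size_t num_stripes) : stripes_(num_stripes) {}

  // Same idea as the real LockMap: pick a shard by hashing the key.
  std::size_t GetStripeIndex(const std::string& key) const {
    return std::hash<std::string>{}(key) % stripes_.size();
  }

 private:
  std::vector<StripeSketch> stripes_;
};
```

The same key always maps to the same stripe, so two transactions locking the same key serialize on one stripe mutex while lockers of other keys usually proceed in parallel.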

2. Row lock space cost
    Because lock information is memory-resident, let us briefly analyze the memory a RocksDB lock occupies. Each lock is an element in the unordered_map, occupying about key_length + 8 + 8 + 1 bytes. Assuming the key is a bigint, which takes 8 bytes, 1 million locked rows consume roughly 22 MB of memory. Since consumption grows with key_length, RocksDB's memory use is not bounded in general. We can estimate the range of key_length when RocksDB is used as a MySQL storage engine: for a single-column index the maximum is 2048 bytes (see the max_supported_key_part_length implementation); for a composite index the maximum index length is 3072 bytes (see the max_supported_key_length implementation). In the worst case, key_length = 3072, 1 million locked rows consume about 3 GB, and locking 100 million rows consumes about 300 GB, at which point memory is at risk of being exhausted. RocksDB therefore provides the parameter max_row_locks to keep memory bounded; with the default rdb_max_row_locks of 1G, the common bigint-key scenario can in the extreme still consume 22 GB of memory. InnoDB is friendlier in this respect: its lock hash table is keyed by (space_id, page_no), so no matter how large the key, the memory consumed by the key part is constant. As mentioned earlier, InnoDB is also optimized for transactions that lock many records, letting multiple records share one lock object, which indirectly reduces memory as well.
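To make the arithmetic above concrete, here is a small sketch using the per-lock footprint formula quoted above; real usage is somewhat higher because of unordered_map node and allocator overhead.

```cpp
#include <cstddef>

// Per-lock footprint from the article: key bytes + two 8-byte fields + 1 flag.
constexpr std::size_t PerLockBytes(std::size_t key_length) {
  return key_length + 8 + 8 + 1;
}

constexpr std::size_t TotalLockBytes(std::size_t rows, std::size_t key_length) {
  return rows * PerLockBytes(key_length);
}

// bigint key (8 bytes), 1,000,000 rows: 1e6 * 25 B = 25 MB, the same ballpark
// as the ~22 MB quoted above.
// Worst-case MySQL composite key (3072 bytes), 1,000,000 rows: about 3 GB.
```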

3. Lock Process Analysis
    Having taken a quick look at the design of RocksDB's lock data structures and the memory cost of locks, this section describes how RocksDB takes locks in several typical scenarios. Like InnoDB, RocksDB supports MVCC, and reads are not locked. For convenience, the discussion below assumes RocksDB is used as a MySQL storage engine and covers three cases: updates by primary key, updates through a secondary index, and range updates by primary key. Before starting, note one important difference from InnoDB: RocksDB updates are also snapshot-based, whereas InnoDB updates use current reads, which leads to behavioral differences in practice under the same isolation level. In RocksDB, at the RC isolation level a snapshot is taken at the beginning of every statement; at the RR isolation level, a snapshot is taken only at the beginning of the first statement, and all statements share it until the transaction ends.
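The snapshot-timing rule just described can be written down as a tiny sketch (hypothetical names, not the RocksDB API): at RC every statement starts with a fresh snapshot, while at RR only the first statement of the transaction takes one, shared until commit.

```cpp
enum class IsolationLevel { kReadCommitted, kRepeatableRead };

// Returns whether a new snapshot is taken at the start of this statement.
bool TakesNewSnapshot(IsolationLevel level, bool is_first_statement) {
  if (level == IsolationLevel::kReadCommitted) return true;  // every statement
  return is_first_statement;  // RR: only the first statement takes a snapshot
}
```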

3.1. Primary key-based updates
The main interface here is TransactionBaseImpl::GetForUpdate:
1). Try to lock the key; if the lock is held by another transaction, wait.
2). Create a snapshot.
3). Call ValidateSnapshot: get the key and compare its sequence number against the snapshot to decide whether it has been updated.
4). Because the lock is taken before the snapshot, this check is guaranteed to succeed.
5). Perform the update.
There is a mechanism here for delaying snapshot acquisition. Normally acquire_snapshot would be called at the beginning of the statement to take the snapshot, but to avoid retries caused by conflicts, the snapshot is taken only after the key is locked. This guarantees that in the primary-key update scenario there is no case where ValidateSnapshot fails.

The stack is as follows:

1 myrocks::ha_rocksdb::get_row_by_rowid
2   myrocks::ha_rocksdb::get_for_update
3     myrocks::Rdb_transaction_impl::get_for_update
4       rocksdb::TransactionBaseImpl::GetForUpdate
          // Take the lock
5         rocksdb::TransactionImpl::TryLock
6           rocksdb::TransactionDBImpl::TryLock
7             rocksdb::TransactionLockMgr::TryLock
          // Deferred snapshot, used together with acquire_snapshot()
6         SetSnapshotIfNeeded()
          // Check whether the key is newer than the snapshot
6         ValidateSnapshot
7           rocksdb::TransactionUtil::CheckKeyForConflict
8             rocksdb::TransactionUtil::CheckKey
9               rocksdb::DBImpl::GetLatestSequenceForKey  // first read
          // Read the key
5         rocksdb::TransactionBaseImpl::Get
6           rocksdb::WriteBatchWithIndex::GetFromBatchAndDB
7             rocksdb::DB::Get
8               rocksdb::DBImpl::Get
9                 rocksdb::DBImpl::GetImpl  // second read
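The validation step in the stack above boils down to a sequence-number comparison. Here is a minimal sketch (hypothetical stand-ins for TransactionUtil::CheckKey and DBImpl::GetLatestSequenceForKey, not the real code): after the key is locked, the transaction compares the key's latest write sequence with its snapshot sequence; a newer write means another transaction committed in between and validation fails.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy stand-in for the LSM's per-key latest write sequence.
struct SeqIndex {
  std::map<std::string, std::uint64_t> latest_seq;  // key -> last write seq
};

// Returns true if the key has not been written after the snapshot was taken.
bool ValidateKeyAgainstSnapshot(const SeqIndex& idx, const std::string& key,
                                std::uint64_t snapshot_seq) {
  auto it = idx.latest_seq.find(key);
  if (it == idx.latest_seq.end()) return true;  // never written: no conflict
  return it->second <= snapshot_seq;            // fails if written afterwards
}
```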

3.2. Range update based on primary key
1). Create a snapshot and scan the primary key with an iterator.
2). Call get_for_update to lock the key.
3). Call ValidateSnapshot: get the key and compare its sequence number against the snapshot to decide whether it has been updated.
4). If the key was updated by another transaction (the key's SequenceNumber is newer than the snapshot), retry.
5). On retry, release the old snapshot and the lock, then call tx->acquire_snapshot(false) to defer the snapshot (lock first, then take the snapshot).
6). Call get_for_update again; since the key is already locked this time, the retry must succeed.
7). Perform the update.
8). Go back to step 1 and continue until the primary key no longer satisfies the condition.
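The retry in the steps above can be sketched as follows (hypothetical names: acquire_snapshot and key_latest_seq stand in for tx->acquire_snapshot() and DBImpl::GetLatestSequenceForKey). On a conflict the old snapshot is dropped while the key stays locked, so the second attempt cannot conflict on this key again.

```cpp
#include <cstdint>
#include <functional>

struct SimTxn {
  std::uint64_t snapshot_seq;  // sequence of the txn's current snapshot
};

// Returns the number of attempts needed to validate (and then update) one key.
int UpdateKeyWithRetry(SimTxn& txn,
                       const std::function<std::uint64_t()>& acquire_snapshot,
                       const std::function<std::uint64_t()>& key_latest_seq) {
  int attempts = 0;
  for (;;) {
    ++attempts;
    // Validate the key against the current snapshot (step 3).
    if (key_latest_seq() <= txn.snapshot_seq) {
      return attempts;  // validated; the actual update would happen here
    }
    // Conflict (steps 4-5): re-acquire the snapshot after the lock and retry.
    txn.snapshot_seq = acquire_snapshot();
  }
}
```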

3.3. Updates based on a secondary index
This scenario is similar to 3.2, with the extra step of locating the primary key from the secondary index.
1). Create a snapshot and scan the secondary index with an iterator.
2). Locate the primary key from the secondary index entry; this actually calls get_row_by_rowid, which tries to lock the key.
3). Continue to the next entry of the secondary index, locate the next primary key, and try to lock it.
4). End when the returned secondary index entries no longer satisfy the condition.

3.4 Differences from InnoDB locking
As mentioned earlier, the difference between RocksDB and InnoDB is that for update scenarios RocksDB still uses snapshot reads while InnoDB uses current reads, which leads to behavioral differences. For example, in a range update at the RC isolation level, suppose a transaction updates 1000 records; since it locks as it scans, it may reach record 999 and find that the key's sequence number is newer than the scan snapshot (the key was updated by another transaction). This triggers re-acquiring a snapshot and then fetching the latest key value under the new snapshot. InnoDB has no such problem: with current reads, if record 999 is updated during the scan, InnoDB directly sees the latest record, so in this case RocksDB and InnoDB arrive at the same result. Now consider another case under the same range scan: a newly inserted key's sequence number is necessarily newer than the scan snapshot, so the key is filtered out during the scan; no conflict detection takes place and the key is simply not found. If records with IDs 1 and 900 are inserted during the update, the record with ID 900 is not updated because it is not visible. For InnoDB, which uses current reads, the newly inserted record with ID 900 is seen and updated, and this is where the behavior differs from InnoDB.

4. Deadlock Detection algorithm
Deadlock detection uses DFS (depth-first search). The basic idea is to start from a wait-for relationship and follow the chain of wait-for edges transitively; if a ring is found, a deadlock is assumed to exist. Under high concurrency, of course, the wait-for graph can be very complex, so to keep the resource consumption of deadlock detection within bounds, deadlock_detect_depth can be set to limit the search depth; alternatively, in business scenarios believed to be deadlock-free, deadlock detection can be disabled, which helps concurrency to some extent. Note that if deadlock detection is disabled, it is best to lower the lock wait timeout, so that when a deadlock does occur transactions are not stuck for a long time. The basic flow of deadlock detection is as follows:
1. Locate the specific shard and acquire the stripe mutex.
2. Call AcquireLocked to try to take the lock.
3. If locking fails, deadlock detection is triggered.
4. Call IncrementWaiters to add a waiter.
5. If the transaction being waited for is not itself in the wait map, there can be no deadlock; return.
6. Otherwise, follow the wait-for relationships down wait_txn_map_ and check for a ring.
7. If a loop is found, call DecrementWaitersImpl to undo the newly added wait relationship and report a deadlock error.

Related Data structures:

class TransactionLockMgr {
  // Must be held when modifying wait_txn_map_ and rev_wait_txn_map_
  std::mutex wait_txn_map_mutex_;
  // Maps from waitee to number of waiters
  HashMap<TransactionID, int> rev_wait_txn_map_;
  // Maps from waiter to waitees
  HashMap<TransactionID, autovector<TransactionID>> wait_txn_map_;
  // DecrementWaiters / IncrementWaiters
};

struct TransactionOptions {
  bool deadlock_detect = false;        // whether to detect deadlocks
  int64_t deadlock_detect_depth = 50;  // search depth of deadlock detection
  int64_t lock_timeout = -1;           // lock wait time; online, usually ~5s
  int64_t expiration = -1;             // lock hold time
};
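Steps 6 and 7 above can be sketched as a bounded DFS over the wait map (a minimal sketch, not the RocksDB source): starting from the newly added waiter, follow waiter-to-waitee edges; reaching the starting transaction again means a ring, i.e. a deadlock. The depth bound plays the role of deadlock_detect_depth.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using TransactionID = std::uint64_t;
using WaitMap = std::unordered_map<TransactionID, std::vector<TransactionID>>;

// Returns true if following wait-for edges from `current` can reach `start`
// within `depth` hops, i.e. the new wait edge closes a ring.
bool HasWaitCycle(const WaitMap& wait_map, TransactionID start,
                  TransactionID current, int depth) {
  if (depth <= 0) return false;  // search budget exhausted: report no deadlock
  auto it = wait_map.find(current);
  if (it == wait_map.end()) return false;  // this txn is not waiting on anyone
  for (TransactionID next : it->second) {
    if (next == start) return true;  // ring closed back to the start
    if (HasWaitCycle(wait_map, start, next, depth - 1)) return true;
  }
  return false;
}
```

Note how the depth cap trades accuracy for bounded cost: a real cycle deeper than the cap goes undetected, which is why a small lock wait timeout remains the safety net.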

