HBase Write Request Analysis
As a distributed NoSQL database, HBase supports wide tables and also delivers high performance for random reads and writes. It maintains transactional consistency while doing so, although at present only at the row level. This article discusses the HBase write request path, based on the implementation in version 0.98.8.
Client Write Request
The Java client API provided by HBase is built around HTable, which corresponds to an HBase table. The main write request APIs are HTable.put (insert and update) and HTable.delete. Taking HTable.put as an example, we first look at how the client sends requests to the HRegionServer.
Each put request carries KeyValue data. Because a client may have a large amount of data to write to an HBase table, HTable.put by default places each put request in a local buffer. When the buffer size exceeds the threshold (2 MB by default), the buffer is flushed, that is, the buffered put requests are sent to their HRegionServers. Put requests bound for different HRegionServers are sent concurrently through a thread pool. However, if many requests target the same HRegionServer, or even the same HRegion, they can put pressure on the server. To avoid this, the client API limits the number of concurrent write requests, both per HRegionServer and per HRegion. The implementation is in AsyncProcess, and the main parameters are as follows:
- hbase.client.max.total.tasks: maximum number of concurrent write requests on the client. The default value is 100.
- hbase.client.max.perserver.tasks: maximum number of concurrent write requests to each HRegionServer from the client. The default value is 2.
- hbase.client.max.perregion.tasks: maximum number of concurrent write requests to each HRegion from the client. The default value is 1.
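To make the three limits concrete, here is a minimal sketch of how such counters could gate new tasks. It is not the AsyncProcess code: the class TaskLimiterSketch and its methods are hypothetical names used only for illustration, and the server and region are identified by plain strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the three client-side limits; not HBase code.
class TaskLimiterSketch {
    private final int maxTotal;
    private final int maxPerServer;
    private final int maxPerRegion;
    private int totalTasks = 0;
    private final Map<String, Integer> perServer = new HashMap<>();
    private final Map<String, Integer> perRegion = new HashMap<>();

    TaskLimiterSketch(int maxTotal, int maxPerServer, int maxPerRegion) {
        this.maxTotal = maxTotal;
        this.maxPerServer = maxPerServer;
        this.maxPerRegion = maxPerRegion;
    }

    // Returns true and counts the task only if all three limits allow it.
    synchronized boolean tryAcquire(String server, String region) {
        int onServer = perServer.getOrDefault(server, 0);
        int onRegion = perRegion.getOrDefault(region, 0);
        if (totalTasks >= maxTotal || onServer >= maxPerServer || onRegion >= maxPerRegion) {
            return false;
        }
        totalTasks++;
        perServer.put(server, onServer + 1);
        perRegion.put(region, onRegion + 1);
        return true;
    }

    // Must be called once for every successful tryAcquire when the task finishes.
    synchronized void release(String server, String region) {
        totalTasks--;
        perServer.merge(server, -1, Integer::sum);
        perRegion.merge(region, -1, Integer::sum);
    }
}
```

With the default values (100, 2, 1), a second task for a region that already has one in flight is refused until the first one is released.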
To improve I/O efficiency, AsyncProcess merges the put requests that target the same HRegion and sends them to the corresponding HRegionServer together. AsyncProcess also provides synchronization methods such as waitUntilDone, which make it easier to process requests synchronously in some scenarios. As with read requests, each put request consults the hbase:meta table to locate its HRegionServer and HRegion. This lookup works the same way as for read requests; see the article on read requests for details.
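The buffering and merging described above can be sketched as follows. This is an illustration under stated assumptions, not the HTable or AsyncProcess implementation: WriteBufferSketch, PendingPut, and regionFor are hypothetical, and the region lookup is a fixed rule on the row key rather than a query against hbase:meta.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side write buffer that groups puts by region on flush.
class WriteBufferSketch {
    // Stand-in for a Put: a row key plus its approximate size in bytes.
    static final class PendingPut {
        final String row;
        final int sizeBytes;
        PendingPut(String row, int sizeBytes) {
            this.row = row;
            this.sizeBytes = sizeBytes;
        }
    }

    private final long flushThresholdBytes;
    private final List<PendingPut> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    // Each flush produces one map: region name -> puts to send to that region.
    final List<Map<String, List<PendingPut>>> flushedBatches = new ArrayList<>();

    WriteBufferSketch(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    // Assumption: two fixed regions split at "m"; a real client asks hbase:meta.
    static String regionFor(String row) {
        return row.compareTo("m") < 0 ? "region-a-to-l" : "region-m-to-z";
    }

    void put(PendingPut p) {
        buffer.add(p);
        bufferedBytes += p.sizeBytes;
        if (bufferedBytes > flushThresholdBytes) {
            flush();
        }
    }

    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        Map<String, List<PendingPut>> byRegion = new LinkedHashMap<>();
        for (PendingPut p : buffer) {
            byRegion.computeIfAbsent(regionFor(p.row), k -> new ArrayList<>()).add(p);
        }
        flushedBatches.add(byRegion);
        buffer.clear();
        bufferedBytes = 0;
    }
}
```

Grouping before sending means each region receives one batch per flush instead of one request per put.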
Server Write Request
When the client's write request reaches the server, the server begins processing it. HRegionServer forwards write requests to the target HRegion for execution. Each HRegion processes write requests in batches. The main process is implemented in HRegion.doMiniBatchMutation and is roughly as follows:
1. Acquire the row locks for the rows named in the write requests. A batch of write requests is not atomic as a whole (only each row is), so each iteration blocks only until it holds the row lock for at least one write request. Requests whose row locks are already held elsewhere are skipped in this round and retried in the next iteration.
2. Set the timestamp of every write request that holds its row lock to the current time.
3. Acquire the read lock of the HRegion's updatesLock.
4. Obtain the latest MVCC (Multi-Version Concurrency Control) write number and write it to MemStore together with the KeyValue data of the write request.
5. Construct the WAL (Write-Ahead Log) edit object.
6. Append the WAL edit object to HLog asynchronously and obtain a txid.
7. Release the updatesLock read lock acquired in step 3 and the row locks acquired in step 1.
8. Sync HLog up to the txid obtained in step 6.
9. Commit the transaction by advancing the MVCC read sequence number to the write sequence number obtained in step 4.
10. If any of the above steps fails, roll back the data already written to MemStore.
11. If the MemStore size exceeds the threshold, request a flush of the current HRegion's MemStore.
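The order of locking, writing, and committing in these steps can be sketched in a few lines. This is a heavily simplified, single-row illustration and not HRegion.doMiniBatchMutation: the class MiniBatchSketch and all of its fields are hypothetical, the WAL is a pair of in-memory lists, and the rollback restores the previous value of a map entry, which is an assumption of this sketch rather than how MemStore stores versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical single-row walk through the write steps; not HBase code.
class MiniBatchSketch {
    final ReentrantLock rowLock = new ReentrantLock();              // one row only, for brevity
    final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    final ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<>();
    final List<String> walBuffer = new ArrayList<>();               // appended, not yet synced
    final List<String> walSynced = new ArrayList<>();               // stands in for HDFS
    long nextWriteNumber = 0;
    long readPoint = 0;

    boolean write(String row, String value, boolean simulateSyncFailure) {
        String previous = null;
        long myWriteNumber = 0;
        rowLock.lock();                                             // acquire the row lock
        updatesLock.readLock().lock();                              // acquire updatesLock (read)
        try {
            myWriteNumber = ++nextWriteNumber;                      // obtain a write number
            previous = memstore.put(row, value);                    // write to MemStore
            walBuffer.add(row + "=" + value);                       // build and append the WAL edit
        } finally {
            updatesLock.readLock().unlock();                        // release both locks
            rowLock.unlock();                                       // before the slow sync
        }
        try {
            if (simulateSyncFailure) {
                throw new IllegalStateException("simulated sync failure");
            }
            walSynced.addAll(walBuffer);                            // sync the WAL
            walBuffer.clear();
            readPoint = myWriteNumber;                              // commit: advance the read point
            return true;
        } catch (IllegalStateException e) {
            walBuffer.clear();
            if (previous == null) {                                 // roll back the MemStore write
                memstore.remove(row);
            } else {
                memstore.put(row, previous);
            }
            return false;
        }
    }
}
```

The point to notice is that both locks are released before the sync, so the slowest step runs without blocking other writers.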
After these steps the write request is a committed transaction, and subsequent read requests can see its data. The steps draw on several HBase features, chiefly aimed at keeping write performance high while preserving the ACID properties of row-level transactions. Next, we look at some of the major steps in detail.
updatesLock of HRegion
The updatesLock of HRegion is acquired in step 3 to prevent thread conflicts between a MemStore flush and write request transactions.
First, consider the role MemStore plays in a write request. To improve read performance, HBase keeps the data it stores on HDFS sorted, so that techniques such as binary search can be used. HDFS, however, does not support in-place modification, so some way is needed to turn random writes into sequential writes. MemStore is designed to solve this problem. Randomly written data first goes into MemStore, where it is sorted in memory. When the MemStore size exceeds the threshold, its contents are flushed to HDFS and stored in HFile format. The data in the HFile is ordered, so random writes have become a sequential write. MemStore is also part of HBase's implementation of the LSM Tree (Log-Structured Merge Tree).
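As a small illustration of how an in-memory sorted structure turns random writes into ordered output, the sketch below uses a ConcurrentSkipListMap. The class SortedMemStoreSketch is hypothetical and keeps only string keys and values; it is not the MemStore class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical in-memory store: accepts keys in any order, emits them sorted.
class SortedMemStoreSketch {
    private final ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();

    void put(String key, String value) {
        kvset.put(key, value);
    }

    // "Flush": return the keys in sorted order, as a sequential file write would see them.
    List<String> flushKeys() {
        List<String> sortedKeys = new ArrayList<>(kvset.keySet());
        kvset.clear();
        return sortedKeys;
    }
}
```

Whatever order the puts arrive in, the flush walks the keys in ascending order, which is what allows the file on disk to be written once, front to back.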
When MemStore is flushed, it creates a snapshot of the current in-memory data (kvset) and clears the kvset, so that read requests are not affected. A read request that looks up a KeyValue also queries the snapshot, so it is largely unaffected. Write requests, however, write their data to the kvset, so a lock is needed to avoid conflicting access. Because multiple write requests may run at the same time, a write request takes the read lock of the updatesLock. Because only one snapshot may be taken at a time, the snapshot takes the write lock of the updatesLock.
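The read lock and write lock roles can be sketched as below. This is an illustration only: SnapshotSketch is a hypothetical class, it allows just one outstanding snapshot, and it ignores the HFile write that would follow.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical kvset/snapshot pair guarded by a read-write lock.
class SnapshotSketch {
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    private volatile ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();
    private volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();

    // Writers share the read lock, so they do not block each other.
    void write(String key, String value) {
        updatesLock.readLock().lock();
        try {
            kvset.put(key, value);
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    // The snapshot takes the write lock, so no writer is mid-update during the swap.
    void takeSnapshot() {
        updatesLock.writeLock().lock();
        try {
            snapshot = kvset;
            kvset = new ConcurrentSkipListMap<>();
        } finally {
            updatesLock.writeLock().unlock();
        }
    }

    // Readers take no lock: they check the live kvset first, then the snapshot.
    String read(String key) {
        String value = kvset.get(key);
        return value != null ? value : snapshot.get(key);
    }

    int snapshotSize() {
        return snapshot.size();
    }
}
```

The write lock is held only for the reference swap, which is why the pause seen by writers stays short.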
Obtain the MVCC write number
MVCC is a concurrency control mechanism that HBase uses to guarantee row-level transaction consistency while improving read performance. The mechanism itself is not difficult to understand.
The biggest advantage of MVCC is that read requests and write requests do not block or conflict with each other. Read requests therefore generally need no lock (only two write requests that write to the same row need to lock against each other), and the data of a write request becomes visible to read requests only once that write request has been committed. This avoids dirty reads and ensures transaction consistency. For more information about the MVCC implementation, see the article on the subject by an HBase PMC member.
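The visibility rule can be shown with a toy model in which every cell carries the write number it was inserted with and a reader ignores cells above the current read point. MvccVisibilitySketch and its methods are hypothetical names, and the store is a plain list rather than MemStore.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of MVCC visibility: readers see only committed write numbers.
class MvccVisibilitySketch {
    static final class Cell {
        final String row;
        final String value;
        final long writeNumber;
        Cell(String row, String value, long writeNumber) {
            this.row = row;
            this.value = value;
            this.writeNumber = writeNumber;
        }
    }

    private final List<Cell> cells = new ArrayList<>();
    private long readPoint = 0;

    // Data is inserted before commit, tagged with its write number.
    synchronized void insert(String row, String value, long writeNumber) {
        cells.add(new Cell(row, value, writeNumber));
    }

    // Commit makes everything up to writeNumber visible.
    synchronized void commitUpTo(long writeNumber) {
        readPoint = Math.max(readPoint, writeNumber);
    }

    // Returns the newest committed value for the row, or null if none is visible.
    synchronized String read(String row) {
        Cell best = null;
        for (Cell c : cells) {
            boolean visible = c.writeNumber <= readPoint;
            if (c.row.equals(row) && visible && (best == null || c.writeNumber > best.writeNumber)) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }
}
```

A value that has been inserted but not yet committed is simply skipped by readers, which is how dirty reads are avoided without a read lock.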
WAL (Write-Ahead Logging) and HLog
WAL is the mechanism that lets HBase recover a failed node's data on other nodes, so that a node failure does not leave data unavailable. By default, HBase puts the KeyValue data into a WALEdit object and serializes it to HLog. In version 0.98.8, ProtoBuf is used to serialize the WAL. HLog is the log file that records HBase modifications. Like the HFile data files, HLog is stored on HDFS, which ensures the reliability of the HLog files. If a machine goes down, the KeyValue data held in MemStore is lost, but HBase can use the modification log recorded in HLog to recover it.
Each HRegionServer has only one HLog object, so the modifications of all HRegions on that HRegionServer are recorded in the same log file. When data has to be recovered, the modification log in the HLog is split by HRegion (log splitting).
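The idea behind log splitting can be sketched as a simple grouping of log entries by region. This is an illustration, not the HBase log splitting code: LogSplitSketch and WalEntry are hypothetical, and the log is an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical log splitting: one shared log, grouped back into per-region edit lists.
class LogSplitSketch {
    static final class WalEntry {
        final String region;
        final long txid;
        final String edit;
        WalEntry(String region, long txid, String edit) {
            this.region = region;
            this.txid = txid;
            this.edit = edit;
        }
    }

    // Entries keep their original log order inside each region's list.
    static Map<String, List<WalEntry>> splitByRegion(List<WalEntry> log) {
        Map<String, List<WalEntry>> perRegion = new TreeMap<>();
        for (WalEntry entry : log) {
            perRegion.computeIfAbsent(entry.region, k -> new ArrayList<>()).add(entry);
        }
        return perRegion;
    }
}
```

Each region's list can then be replayed independently by whichever server takes over that region.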
Throughout the write request, serializing the WALEdit object to the HLog is the only step that causes I/O, and it has a large effect on write performance. If the business scenario does not require strong durability and write throughput is the priority, you can call Put.setDurability(Durability.SKIP_WAL) to skip this step.
To reduce the impact of I/O on HLog writes, HBase adopts a finer-grained multi-threaded concurrency model (see HBASE-8755 for details). The HLog implementation is FSHLog. The main process involves three objects: AsyncWriter, AsyncSyncer, and AsyncNotifier. The write process covers steps 5-8.
- HRegion calls FSHLog.appendNoSync, which adds the modification record to a local buffer, notifies AsyncWriter that a record has been inserted, and returns a monotonically increasing long txid that identifies the modification record. Note that this is an asynchronous call.
- HRegion then immediately releases the updatesLock read lock and the row locks, and calls FSHLog.sync(txid) to wait until the modification record has been written to the HLog.
- AsyncWriter takes modification records from the local buffer, compresses them, serializes them with ProtoBuf into the FSDataOutputStream cache, and then notifies AsyncSyncer. Because AsyncSyncer carries a heavy workload, there are five AsyncSyncer threads in total, and AsyncWriter picks one of them to wake up.
- AsyncSyncer checks whether another AsyncSyncer thread has already completed the sync. If so, it goes back to waiting for a sync request from AsyncWriter. Otherwise it writes the FSDataOutputStream cache to HDFS and then wakes up AsyncNotifier.
- The task of AsyncNotifier is relatively simple: it only wakes up all write request threads waiting for a sync. That step is time-consuming too, which is why a separate AsyncNotifier thread handles it rather than AsyncSyncer.
- When HRegion is woken up and finds that its txid has been synced, that is, the modification record has been written to the HLog, it continues with the remaining operations.
In the write process above, HRegion first writes the record to the HLog buffer and releases the locks it holds before waiting for the sync to complete. This shortens the time locks are held and increases concurrency among write requests. In the new write model, AsyncWriter, AsyncSyncer, and AsyncNotifier carry out the HDFS writes. In the old model, every write request thread wrote to HDFS itself, and the large number of threads led to severe lock contention. The main gain of the new model is that it greatly reduces lock contention during thread synchronization and so improves throughput. It raises throughput under a large number of write requests, but when write concurrency is low and there is little thread contention, every write request still has to wait for the Async* threads, which adds thread context-switch overhead and slightly reduces performance. (Version 0.99 reworked FSHLog around the LMAX Disruptor synchronization model; see HBASE-10156.)
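The append-then-sync handshake above can be sketched with one background thread. This is a deliberately reduced model and not FSHLog: the hypothetical class GroupCommitWalSketch folds the roles of AsyncWriter, AsyncSyncer, and AsyncNotifier into a single writer thread, and an in-memory list stands in for HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical group-commit log: append returns a txid at once, sync(txid) waits for durability.
class GroupCommitWalSketch {
    private final List<String> pending = new ArrayList<>();   // local buffer
    private final List<String> durable = new ArrayList<>();   // stands in for HDFS
    private long lastAppendedTxid = 0;
    private long lastSyncedTxid = 0;
    private boolean closed = false;

    private GroupCommitWalSketch() { }

    static GroupCommitWalSketch start() {
        GroupCommitWalSketch wal = new GroupCommitWalSketch();
        Thread writer = new Thread(wal::writerLoop, "wal-writer-sketch");
        writer.setDaemon(true);
        writer.start();
        return wal;
    }

    // Buffers the edit and returns its txid without waiting for I/O.
    synchronized long append(String edit) {
        pending.add(edit);
        long txid = ++lastAppendedTxid;
        notifyAll();                       // wake the writer thread
        return txid;
    }

    // Blocks until every edit up to txid has been made durable.
    synchronized void sync(long txid) {
        try {
            while (lastSyncedTxid < txid) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for sync", e);
        }
    }

    private void writerLoop() {
        try {
            while (true) {
                List<String> batch;
                long batchTxid;
                synchronized (this) {
                    while (pending.isEmpty() && !closed) {
                        wait();
                    }
                    if (pending.isEmpty()) {
                        return;            // closed and fully drained
                    }
                    batch = new ArrayList<>(pending);
                    pending.clear();
                    batchTxid = lastAppendedTxid;
                }
                synchronized (durable) {   // the slow write happens outside the main lock
                    durable.addAll(batch);
                }
                synchronized (this) {
                    lastSyncedTxid = batchTxid;
                    notifyAll();           // wake every thread blocked in sync()
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    synchronized void close() {
        closed = true;
        notifyAll();
    }

    int durableCount() {
        synchronized (durable) {
            return durable.size();
        }
    }
}
```

Several callers that append while a write is in progress are covered by the next batch, so one write to storage can satisfy many sync calls.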
Advancing the MVCC read sequence number
Once the HLog has been written, the write request transaction has finished its work, so the transaction must be committed for other read requests to see its data. The role of MVCC was introduced above. Here we look at how MVCC advances the read sequence number.
MVCC maintains a long write sequence number memstoreWrite, a long read sequence number memstoreRead, and a queue writeQueue. When HRegion calls beginMemstoreInsert to allocate a write sequence number, MVCC increments the write sequence number by 1, returns it, and adds a write request entry to the tail of writeQueue. The code is as follows:
    public WriteEntry beginMemstoreInsert() {
      synchronized (writeQueue) {
        long nextWriteNumber = ++memstoreWrite;
        WriteEntry e = new WriteEntry(nextWriteNumber);
        writeQueue.add(e);
        return e;
      }
    }
HRegion associates the write sequence number with each newly inserted KeyValue. When the write request is complete, HRegion calls completeMemstoreInsert to advance the read sequence number. MVCC first marks the write request as completed, then examines writeQueue and removes every completed write request from the head of the queue. The sequence number of the last completed write request removed is assigned to memstoreRead, which becomes the largest sequence number that can currently be read. If the write sequence number of HRegion's own write request is less than or equal to the read sequence number, the transaction is committed. Otherwise HRegion waits in a loop until it is. The related code is as follows:
    public void completeMemstoreInsert(WriteEntry e) {
      advanceMemstore(e);
      waitForRead(e);
    }

    boolean advanceMemstore(WriteEntry e) {
      synchronized (writeQueue) {
        e.markCompleted();
        long nextReadValue = -1;
        // Pop completed entries from the head; stop at the first one still in flight.
        while (!writeQueue.isEmpty()) {
          ranOnce = true;  // ranOnce is declared in code omitted from this excerpt
          WriteEntry queueFirst = writeQueue.getFirst();
          //...
          if (queueFirst.isCompleted()) {
            nextReadValue = queueFirst.getWriteNumber();
            writeQueue.removeFirst();
          } else {
            break;
          }
        }
        // Publish the new read point and wake threads blocked in waitForRead.
        if (nextReadValue > 0) {
          synchronized (readWaiters) {
            memstoreRead = nextReadValue;
            readWaiters.notifyAll();
          }
        }
        if (memstoreRead >= e.getWriteNumber()) {
          return true;
        }
        return false;
      }
    }

    public void waitForRead(WriteEntry e) {
      boolean interrupted = false;
      synchronized (readWaiters) {
        // Block until the read point has caught up with this entry's write number.
        while (memstoreRead < e.getWriteNumber()) {
          try {
            readWaiters.wait(0);
          } catch (InterruptedException ie) {
            //...
          }
        }
      }
    }
It can be seen that MVCC makes transactions commit in serial order. If a write request has been committed, every write request with a smaller write sequence number has also been committed. A read request therefore only needs to obtain the MVCC read sequence number to see the data of every committed write request. MVCC serializes only the commit step. The other steps of a write request can run concurrently, so the impact on performance is small.
At this point the transaction commit process of an HBase write request is complete. Throughout the write process, many techniques are used to avoid lock contention, shorten the time locks are held, and ensure transaction consistency. Because MemStore has a limited amount of memory, when it exceeds the threshold HBase must flush its data to HDFS to form a new HFile. Next, let's take a look at this process.
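A worked example helps here: what happens when write 2 completes before write 1? The sketch below models only the queue and the two sequence numbers. MvccAdvanceSketch is a hypothetical, non-blocking reduction of the code shown earlier: it leaves out waitForRead, so complete simply reports whether the entry is visible yet.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical reduction of the MVCC queue: the read point advances only over a completed prefix.
class MvccAdvanceSketch {
    static final class Entry {
        final long writeNumber;
        boolean completed;
        Entry(long writeNumber) {
            this.writeNumber = writeNumber;
        }
    }

    private final Deque<Entry> writeQueue = new ArrayDeque<>();
    private long memstoreWrite = 0;
    private long memstoreRead = 0;

    synchronized Entry begin() {
        Entry e = new Entry(++memstoreWrite);
        writeQueue.addLast(e);
        return e;
    }

    // Marks e completed, advances the read point, and reports whether e is now visible.
    synchronized boolean complete(Entry e) {
        e.completed = true;
        while (!writeQueue.isEmpty() && writeQueue.peekFirst().completed) {
            memstoreRead = writeQueue.removeFirst().writeNumber;
        }
        return memstoreRead >= e.writeNumber;
    }

    synchronized long readPoint() {
        return memstoreRead;
    }
}
```

Completing write 2 first leaves the read point where it was, because write 1 is still at the head of the queue. Once write 1 completes, the read point jumps past both.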
MemStore flush
When a large amount of write request data has been added to MemStore and MemStore exceeds the threshold, HRegion requests that the MemStore data be flushed to HDFS. Note that the unit of a flush is a single HRegion: if the HRegion has multiple HStores, all of them must flush as soon as one MemStore exceeds the threshold.
- HRegion first acquires the updatesLock write lock to keep new write requests from arriving.
- Obtain an MVCC write number.
- Ask MemStore to generate a snapshot.
- Release the updatesLock write lock.
- Commit the MVCC write number obtained earlier and wait for the preceding transactions to complete, so that data from rolled-back transactions is not written to HFile.
- Write the KeyValue data of the snapshot into HFile.
The step to focus on is writing the KeyValue data of the snapshot into HFile, so first recall the HFile format.
The HFile format was introduced in the earlier article on read requests. HFile keeps each HBlock at roughly 64 KB. The file is organized with a multi-level index over the DataBlocks and a single-level index for the BloomFilter. The write process itself is relatively simple. A loop iterates over the KeyValue data in the MemStore snapshot and keeps writing it into the current DataBlock. When the total size of the current DataBlock exceeds 64 KB, the DataBlock stops accepting data (and is compressed if compression is configured), its index entry is calculated, and the entry is added to memory. If the BloomFilter attribute is enabled, the corresponding BloomBlock is written as well. FileInfo data such as the uncompressed size is also recorded during this process.
When all the snapshot data has been written into DataBlocks, the multi-level index of the DataBlocks is written. HBase calculates the number of index levels from the index entries saved earlier. If there are only a few entries, there may be just one level, the RootIndexBlock. The MidKey data is also derived from the RootIndexBlock. Finally, FileInfo, the BloomFilter index, and the Trailer are written in sequence.
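The data block loop can be sketched as follows. It is an illustration under loud assumptions, not the HFile writer: BlockWriterSketch and IndexEntry are hypothetical, sizes are estimated from string lengths rather than serialized bytes, and compression, Bloom blocks, and the multi-level index are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical block cutter: closes a block at the size limit and records an index entry.
class BlockWriterSketch {
    static final class IndexEntry {
        final String firstKey;
        final long offset;
        final int size;
        IndexEntry(String firstKey, long offset, int size) {
            this.firstKey = firstKey;
            this.offset = offset;
            this.size = size;
        }
    }

    private final int blockSizeLimit;
    final List<IndexEntry> index = new ArrayList<>();
    private String currentFirstKey = null;
    private int currentSize = 0;
    private long fileOffset = 0;

    BlockWriterSketch(int blockSizeLimit) {
        this.blockSizeLimit = blockSizeLimit;
    }

    void append(String key, String value) {
        if (currentFirstKey == null) {
            currentFirstKey = key;                       // first key of a new block
        }
        currentSize += key.length() + value.length();    // crude size estimate
        if (currentSize >= blockSizeLimit) {
            finishBlock();
        }
    }

    // Closes the current block, if any, and remembers where it starts in the file.
    void finishBlock() {
        if (currentFirstKey == null) {
            return;
        }
        index.add(new IndexEntry(currentFirstKey, fileOffset, currentSize));
        fileOffset += currentSize;
        currentFirstKey = null;
        currentSize = 0;
    }

    // Writes a sorted snapshot and returns the in-memory index of its blocks.
    static List<IndexEntry> writeAll(TreeMap<String, String> snapshot, int blockSizeLimit) {
        BlockWriterSketch writer = new BlockWriterSketch(blockSizeLimit);
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            writer.append(e.getKey(), e.getValue());
        }
        writer.finishBlock();                            // flush the last partial block
        return writer.index;
    }
}
```

Because the snapshot is already sorted, the first key of each block is enough for a reader to pick the block that may contain a given key.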
Summary
HBase uses MemStore to turn random writes into sequential writes, which also helps the efficiency of read requests. To avoid data loss, it records a modification log in HLog. Throughout the write process, several techniques are used to reduce lock contention, raise thread throughput, and shorten the time locks are held, so that concurrency is as high as possible. MVCC keeps read requests and write requests from interfering with each other.