HBase Write Request Analysis

Source: Internet
Author: User

HBase is a distributed NoSQL database that supports wide tables and delivers high-performance random reads and writes. Alongside that performance, HBase also maintains transactional consistency, although it currently guarantees consistency only at the row level. This article discusses the HBase write request flow, based mainly on the implementation in version 0.98.8.

Client-Side Write Requests

HBase provides a Java client API whose primary interface is HTable, which corresponds to a table in HBase. The write APIs are mainly HTable.put (insert and update), HTable.delete, and so on. Taking HTable.put as an example, let us first see how the client sends the request to the HRegionServer.

Each Put request carries KeyValue data. Because a client may have a large amount of data to write to an HBase table, HTable.put by default places each Put into a local write buffer. When the buffer size exceeds the threshold (2 MB by default), the buffer is flushed, that is, the buffered Puts are sent to their target HRegionServers. A thread pool is used so that Puts can be sent to different HRegionServers concurrently. However, if many requests target the same HRegionServer, or even the same HRegion, they put pressure on the server. To avoid this, the client API limits the number of concurrent write requests, both per HRegionServer and per HRegion. The limits are implemented in AsyncProcess, and the main parameters are:

    • hbase.client.max.total.tasks: maximum number of concurrent write requests for the client as a whole; the default is 100
    • hbase.client.max.perserver.tasks: maximum number of concurrent write requests the client sends to a single HRegionServer; the default is 2
    • hbase.client.max.perregion.tasks: maximum number of concurrent write requests the client sends to a single HRegion; the default is 1

To improve I/O efficiency, AsyncProcess groups the Put requests that target the same HRegion and sends each group to the corresponding HRegionServer in one request. AsyncProcess also provides several synchronization methods, such as waitUntilDone, for scenarios in which the caller needs to wait for requests to finish. As with read requests, each Put has to consult the hbase:meta table to locate the target HRegionServer and HRegion; the lookup is the same as for reads and is described in the article on read requests.
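The buffering, grouping, and throttling described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not HBase's AsyncProcess: the classes BufferedPut and ClientBufferSketch and all of their methods are hypothetical names, target servers and regions are passed in directly instead of being looked up in hbase:meta, and the three limits are simply the defaults quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a buffered Put: its target and its size in bytes.
final class BufferedPut {
    final String server;
    final String region;
    final long sizeBytes;

    BufferedPut(String server, String region, long sizeBytes) {
        this.server = server;
        this.region = region;
        this.sizeBytes = sizeBytes;
    }
}

// Illustrative only; this is not the real AsyncProcess.
final class ClientBufferSketch {
    static final long WRITE_BUFFER_BYTES = 2L * 1024 * 1024; // 2 MB threshold from the text
    static final int MAX_TOTAL_TASKS = 100;     // hbase.client.max.total.tasks
    static final int MAX_PER_SERVER_TASKS = 2;  // hbase.client.max.perserver.tasks
    static final int MAX_PER_REGION_TASKS = 1;  // hbase.client.max.perregion.tasks

    private final List<BufferedPut> buffer = new ArrayList<>();
    private long bufferedBytes = 0;

    // Buffers a put; returns true once the buffer has grown past the threshold
    // and the caller should flush.
    boolean add(BufferedPut put) {
        buffer.add(put);
        bufferedBytes += put.sizeBytes;
        return bufferedBytes > WRITE_BUFFER_BYTES;
    }

    // Groups the buffered puts by target region and empties the buffer.
    Map<String, List<BufferedPut>> drainByRegion() {
        Map<String, List<BufferedPut>> byRegion = new LinkedHashMap<>();
        for (BufferedPut put : buffer) {
            byRegion.computeIfAbsent(put.region, r -> new ArrayList<>()).add(put);
        }
        buffer.clear();
        bufferedBytes = 0;
        return byRegion;
    }

    // Decides whether one more task may start, given the tasks already in flight.
    static boolean canSubmit(String server, String region, int totalInFlight,
                             Map<String, Integer> perServer,
                             Map<String, Integer> perRegion) {
        return totalInFlight < MAX_TOTAL_TASKS
            && perServer.getOrDefault(server, 0) < MAX_PER_SERVER_TASKS
            && perRegion.getOrDefault(region, 0) < MAX_PER_REGION_TASKS;
    }
}
```

With a per-region limit of 1, a second batch for a region is held back until the first one returns, which is what keeps a single hot HRegion from being flooded by one client.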

Server-Side Write Requests

When the client sends a write request to the server, the server begins to execute it. The HRegionServer forwards the write request to the target HRegion, and the HRegion processes write requests in batches. The main flow is implemented in HRegion.doMiniBatchMutation and is roughly as follows:

    1. Acquire the row locks for the rows named in the write requests. Because consistency across the requests in a batch is not guaranteed (only row-level consistency is), each attempt blocks only until at least one row lock has been acquired. Write requests whose row locks could not be acquired are skipped and retried in the next iteration.
    2. Set the timestamp of the write requests whose row locks were acquired to the current time.
    3. Acquire the read lock of the HRegion's updatesLock.
    4. Obtain the latest MVCC (Multi-Version Concurrency Control) write sequence number and write the KeyValue data of the write requests to the MemStore together with it.
    5. Construct the WAL (Write-Ahead Logging) edit object.
    6. Append the WAL edit object to the HLog asynchronously and obtain a txid.
    7. Release the updatesLock read lock from step 3 and the row locks from step 1.
    8. Sync the HLog up to the txid obtained in step 6.
    9. Commit the transaction by advancing the MVCC read sequence number to the write sequence number obtained in step 4.
    10. If any of the above steps fails, roll back the data already written to the MemStore.
    11. If the MemStore size exceeds the threshold, request a flush of the current HRegion's MemStore.
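The ordering of these steps, in particular that the locks are released before the log is synced and that a failed sync rolls the MemStore back, can be sketched as follows. This is a single-row, single-threaded walk-through under heavy simplification, not HRegion.doMiniBatchMutation: the class RegionWriteSketch, the failSync test hook, and FLUSH_THRESHOLD_ENTRIES are hypothetical, one lock stands in for all row locks, and the rollback shown would not be safe under real concurrency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: shows the order of the steps, not the real implementation.
final class RegionWriteSketch {
    static final int FLUSH_THRESHOLD_ENTRIES = 1000;   // hypothetical flush threshold

    final ConcurrentSkipListMap<String, String> memstore = new ConcurrentSkipListMap<>();
    final List<String> wal = new ArrayList<>();         // stand-in for the HLog
    final ReentrantLock rowLock = new ReentrantLock();  // one lock standing in for row locks
    final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    final AtomicLong writePoint = new AtomicLong();     // MVCC write sequence number
    long readPoint = 0;                                 // MVCC read sequence number
    boolean failSync = false;                           // hook to simulate a failed sync
    boolean flushRequested = false;

    boolean put(String row, String value) {
        long seq;
        String previous;
        rowLock.lock();                                       // step 1: row lock
        try {
            long timestamp = System.currentTimeMillis();      // step 2: stamp with current time
            updatesLock.readLock().lock();                    // step 3: updatesLock read lock
            try {
                seq = writePoint.incrementAndGet();           // step 4: write number, then MemStore
                previous = memstore.put(row, value);
                wal.add(seq + "@" + timestamp + ":" + row + "=" + value); // steps 5-6: append edit
            } finally {
                updatesLock.readLock().unlock();              // step 7: release before syncing
            }
        } finally {
            rowLock.unlock();
        }
        if (failSync) {                                       // step 8 failed, so step 10: roll back
            if (previous == null) {
                memstore.remove(row);
            } else {
                memstore.put(row, previous);
            }
            wal.remove(wal.size() - 1);
            return false;
        }
        readPoint = seq;                                      // step 9: make the write visible
        flushRequested = memstore.size() > FLUSH_THRESHOLD_ENTRIES; // step 11: flush check
        return true;
    }
}
```

The point to notice is that no lock is held while the sync in step 8 is pending, so other writers can proceed during the only step that involves I/O.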

After these steps the write request is a committed transaction, and subsequent read requests can see its data. The steps reflect several characteristics of HBase: they are designed to give write requests good performance while still guaranteeing row-level ACID properties. The sections below analyze the main steps in more detail.

HRegion's updatesLock

The HRegion's updatesLock is acquired in step 3 to prevent thread conflicts between write request transactions and a MemStore flush.

First, consider the role MemStore plays in a write request. To improve read performance, HBase requires the data it stores on HDFS to be sorted, so that techniques such as binary search can be used. However, HDFS does not support in-place modification, so something has to turn random writes into sequential writes. MemStore solves this problem. Randomly written data is first written to the in-memory MemStore, where it is kept sorted. When the MemStore size exceeds the threshold, its contents are flushed to HDFS and stored in HFile format. The data in that HFile is sorted, so the random writes have become one sequential write. MemStore is also one component of HBase's LSM (Log-Structured Merge-tree) implementation.

When the MemStore is flushed, it creates a snapshot of the current in-memory data (kvset) and empties kvset, so that read requests are not disturbed. Read requests query the snapshot as well as kvset when looking up a KeyValue, so they are largely unaffected. Write requests, however, write their data into kvset, so a lock is needed to avoid conflicting thread access. Because multiple write requests may run at the same time, write requests acquire the read lock of updatesLock. Only one snapshot can be taken at a time, so the snapshot operation acquires the write lock of updatesLock.
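The locking scheme just described can be sketched with a standard ReentrantReadWriteLock. This is a minimal illustration, assuming string keys and values; the class MemStoreSketch and its methods are hypothetical names, and the real MemStore stores KeyValue objects and has considerably more bookkeeping.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: writers share the read lock, the snapshot takes the write lock.
final class MemStoreSketch {
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    private volatile ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();
    private volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();

    // Many writers may hold the read lock at once, so writes do not block each other here.
    void put(String key, String value) {
        updatesLock.readLock().lock();
        try {
            kvset.put(key, value);
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    // The write lock is exclusive: no write can land in kvset while it is being swapped.
    ConcurrentSkipListMap<String, String> takeSnapshot() {
        updatesLock.writeLock().lock();
        try {
            snapshot = kvset;
            kvset = new ConcurrentSkipListMap<>();
            return snapshot;
        } finally {
            updatesLock.writeLock().unlock();
        }
    }

    // Reads take no lock; they consult the live set first and then the snapshot.
    String get(String key) {
        String value = kvset.get(key);
        return value != null ? value : snapshot.get(key);
    }
}
```

The sorted map is what makes the later flush sequential: iterating the snapshot yields keys in order regardless of the order in which they were written.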

Obtaining the MVCC Write Sequence Number

MVCC is the concurrency control mechanism that HBase uses to improve the concurrency of read requests while maintaining row-level transactional consistency. The mechanism itself is not hard to understand.

The biggest advantage of MVCC is that read requests and write requests do not block each other, so read requests generally need no locks (only two write requests that write the same row need to lock). A read request can see the data of a write request only after that write request has been committed, which avoids dirty reads and ensures transactional consistency. For the details of the implementation, refer to the article on the subject written by an HBase PMC member.

WAL (Write-Ahead Logging) and HLog

WAL is the mechanism that lets other nodes recover data when a node fails and can no longer serve requests. When HBase executes a write request, by default it wraps the KeyValue data in a WALEdit object and serializes that object to the HLog. In version 0.98.8 the WAL is serialized in Protobuf format. The HLog is a log file that records modifications to HBase. Like the HFile data files, it is stored on HDFS, which guarantees the reliability of the HLog file. If a machine goes down and the KeyValue data held in its MemStore is lost, HBase can replay the modifications recorded in the HLog to recover the data.

Each HRegionServer has only one HLog object, so the modifications of all HRegions on that HRegionServer are recorded in the same log file. When data recovery is required, the log must first be split by HRegion (log splitting).
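The idea behind log splitting, regrouping one shared log into per-HRegion lists that can be replayed independently, can be sketched as below. This is only an in-memory illustration: the classes WalEntry and LogSplitSketch are hypothetical, and the real process reads HLog files from HDFS and writes per-region edit files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical log record: which region it belongs to, its sequence number, and the edit.
final class WalEntry {
    final String region;
    final long seq;
    final String edit;

    WalEntry(String region, long seq, String edit) {
        this.region = region;
        this.seq = seq;
        this.edit = edit;
    }
}

// Illustrative only: splits a shared log into one ordered list per region.
final class LogSplitSketch {
    static Map<String, List<WalEntry>> splitByRegion(List<WalEntry> sharedLog) {
        Map<String, List<WalEntry>> perRegion = new TreeMap<>();
        for (WalEntry entry : sharedLog) {
            // Entries keep their original log order within each region.
            perRegion.computeIfAbsent(entry.region, r -> new ArrayList<>()).add(entry);
        }
        return perRegion;
    }
}
```

Sharing one log keeps writes sequential on the HRegionServer, at the price of this extra regrouping step during recovery.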

In the entire write request, serializing the WALEdit object to the HLog is the only step that incurs I/O, and it can greatly affect write performance. If the business scenario does not need strong durability and write performance is what matters, you can call Put.setDurability(Durability.SKIP_WAL) to skip this step.

To reduce the impact of the I/O generated by writing the HLog, HBase uses a finer-grained multithreaded concurrency model (see HBASE-8755 for details). The HLog implementation is FSHLog, and the main flow involves three kinds of objects: AsyncWriter, AsyncSyncer, and AsyncNotifier. The write process corresponds to steps 5 to 8 above.

    1. HRegion calls FSHLog.appendNoSync, which adds the modification record to a local buffer, notifies AsyncWriter that a record has been inserted, and returns a monotonically increasing txid of type long that identifies the record. Note that this is an asynchronous call.
    2. HRegion immediately releases the updatesLock read lock and the row locks it holds, then calls FSHLog.sync(txid) to wait until the modification record has been written to the HLog.
    3. AsyncWriter takes the modification records out of the local buffer, compresses them, writes them in Protobuf format to the FSDataOutputStream buffer, and then notifies AsyncSyncer. Because AsyncSyncer has a heavy workload, there are 5 AsyncSyncer threads in total, and AsyncWriter picks one of them to wake up.
    4. AsyncSyncer checks whether another AsyncSyncer thread has already completed the sync. If so, it goes back to waiting for the next sync request from AsyncWriter. Otherwise, it syncs the FSDataOutputStream buffer to HDFS and wakes up AsyncNotifier.
    5. AsyncNotifier has a simpler task: it wakes up all the write request threads that are waiting for the sync. In practice this is also time-consuming, which is why a separate AsyncNotifier thread, rather than AsyncSyncer, performs the notification.
    6. HRegion is woken up, finds that its txid has been synced, meaning that the modification record has been written to the HLog, and continues with the remaining operations.
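The append-then-sync handoff above can be sketched with a single background thread. This is a deliberately reduced model and not FSHLog: the class GroupCommitWalSketch is hypothetical, one thread plays the roles of AsyncWriter, AsyncSyncer, and AsyncNotifier together, and an in-memory list stands in for HDFS. The method names appendNoSync and sync follow the text, but their signatures here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: append returns at once with a txid; sync(txid) waits for durability.
final class GroupCommitWalSketch {
    private final Object lock = new Object();
    private final List<String> pending = new ArrayList<>();  // local buffer
    private final List<String> durable = new ArrayList<>();  // stand-in for HDFS
    private long lastAppendedTxid = 0;
    private long lastSyncedTxid = 0;
    private boolean closed = false;

    GroupCommitWalSketch() {
        Thread syncer = new Thread(this::syncLoop, "wal-syncer-sketch");
        syncer.setDaemon(true);
        syncer.start();
    }

    // Non-blocking: buffer the edit, wake the background thread, return the txid.
    long appendNoSync(String edit) {
        synchronized (lock) {
            pending.add(edit);
            lastAppendedTxid++;
            lock.notifyAll();
            return lastAppendedTxid;
        }
    }

    // Blocks until every edit up to and including txid has been made durable.
    void sync(long txid) {
        synchronized (lock) {
            try {
                while (lastSyncedTxid < txid) {
                    lock.wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for sync", e);
            }
        }
    }

    private void syncLoop() {
        try {
            while (true) {
                List<String> batch;
                long batchTxid;
                synchronized (lock) {
                    while (pending.isEmpty() && !closed) {
                        lock.wait();
                    }
                    if (pending.isEmpty()) {
                        return;                       // closed and fully drained
                    }
                    batch = new ArrayList<>(pending); // take everything buffered so far
                    pending.clear();
                    batchTxid = lastAppendedTxid;
                }
                // A real implementation would do its slow file I/O here, outside the lock.
                synchronized (lock) {
                    durable.addAll(batch);
                    lastSyncedTxid = batchTxid;
                    lock.notifyAll();                 // wake every writer waiting in sync()
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> durableEdits() {
        synchronized (lock) {
            return new ArrayList<>(durable);
        }
    }

    void close() {
        synchronized (lock) {
            closed = true;
            lock.notifyAll();
        }
    }
}
```

Because one batch covers every edit buffered so far, a single sync can release many waiting writers at once, which is where the throughput gain under heavy write load comes from.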

In the write process above, HRegion appends the record to the HLog buffer and releases the previously acquired locks before it waits for the sync to complete, which effectively reduces lock holding time and improves the concurrency of other write requests. In addition, the new write model made up of AsyncWriter, AsyncSyncer, and AsyncNotifier takes over the HDFS write operations. Compared with the old write model, in which every write request thread wrote to HDFS itself and the large number of threads caused serious lock contention, its main benefit is that it greatly reduces lock contention during thread synchronization and so improves throughput. This model improves throughput under heavy write load. Where write concurrency is low and there is little contention between threads, however, every write request still has to wait for the Async* threads to hand off to one another, which adds thread context-switch overhead and slightly degrades performance. (Version 0.99 adopted the LMAX Disruptor synchronization model and refactored FSHLog; see HBASE-10156.)

Advancing the MVCC Read Sequence Number

Once the HLog write has completed, the write request transaction has finished its work, so the transaction must be committed to let other read requests see the written data. The role of MVCC was described briefly above; this section looks at how MVCC advances the read sequence number.

MVCC maintains a write sequence number memstoreWrite of type long, a read sequence number memstoreRead of type long, and a queue writeQueue. When HRegion calls beginMemstoreInsert to obtain a write sequence number, MVCC increments the write sequence number by 1, returns it, and appends a write entry to the tail of writeQueue. The code is as follows:

    public WriteEntry beginMemstoreInsert() {
      synchronized (writeQueue) {
        long nextWriteNumber = ++memstoreWrite;
        WriteEntry e = new WriteEntry(nextWriteNumber);
        writeQueue.add(e);
        return e;
      }
    }

HRegion associates this write sequence number with every newly inserted KeyValue. When the write request completes, HRegion calls completeMemstoreInsert to request that the read sequence number be advanced. MVCC first marks the write request as completed, then walks writeQueue from the head and removes every consecutive completed write request. The sequence number of the last one removed is assigned to memstoreRead, which makes it the current maximum readable sequence number. If the write sequence number of HRegion's request is less than or equal to the read sequence number, the transaction commit is complete; otherwise HRegion waits until the commit completes. The relevant code is as follows:

    public void completeMemstoreInsert(WriteEntry e) {
      advanceMemstore(e);
      waitForRead(e);
    }

    boolean advanceMemstore(WriteEntry e) {
      synchronized (writeQueue) {
        e.markCompleted();
        long nextReadValue = -1;
        while (!writeQueue.isEmpty()) {
          ranOnce = true;
          WriteEntry queueFirst = writeQueue.getFirst();
          ...
          if (queueFirst.isCompleted()) {
            nextReadValue = queueFirst.getWriteNumber();
            writeQueue.removeFirst();
          } else {
            break;
          }
        }
        if (nextReadValue > 0) {
          synchronized (readWaiters) {
            memstoreRead = nextReadValue;
            readWaiters.notifyAll();
          }
        }
        if (memstoreRead >= e.getWriteNumber()) {
          return true;
        }
        return false;
      }
    }

    public void waitForRead(WriteEntry e) {
      boolean interrupted = false;
      synchronized (readWaiters) {
        while (memstoreRead < e.getWriteNumber()) {
          try {
            readWaiters.wait(0);
          } catch (InterruptedException ie) {
            // ...
          }
        }
      }
    }

Thus MVCC guarantees that transactions are committed in serial order: if a write request has been committed successfully, every write request with a smaller write sequence number has also completed. A read request therefore only needs to obtain the MVCC read sequence number in order to see the data of every write request committed so far. MVCC serializes only the commit step. The other steps of a write request are allowed to run concurrently, so the performance impact is small.
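The effect of out-of-order completion on the read sequence number can be seen in a small self-contained sketch. It is a simplified, non-blocking variant written for illustration: the class MvccSketch and its methods begin, complete, and readPoint are hypothetical names, and unlike the HBase code above, complete returns immediately instead of waiting for the read sequence number to catch up.

```java
import java.util.LinkedList;

// Illustrative only: the read point advances over the completed prefix of the queue.
final class MvccSketch {
    static final class Entry {
        final long writeNumber;
        boolean completed;

        Entry(long writeNumber) {
            this.writeNumber = writeNumber;
        }
    }

    private final LinkedList<Entry> writeQueue = new LinkedList<>();
    private long writePoint = 0;
    private long readPoint = 0;

    // Assigns the next write number and queues the entry at the tail.
    synchronized Entry begin() {
        Entry e = new Entry(++writePoint);
        writeQueue.add(e);
        return e;
    }

    // Marks e completed, then advances the read point past every completed entry
    // at the head of the queue. Returns true if e itself is now visible to readers.
    synchronized boolean complete(Entry e) {
        e.completed = true;
        while (!writeQueue.isEmpty() && writeQueue.getFirst().completed) {
            readPoint = writeQueue.removeFirst().writeNumber;
        }
        return readPoint >= e.writeNumber;
    }

    synchronized long readPoint() {
        return readPoint;
    }
}
```

For example, if writes 1 and 2 have begun and write 2 completes first, the read point stays at 0 because write 1 is still at the head of the queue; once write 1 completes, the read point jumps straight to 2.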

At this point the transaction commit for an HBase write request is complete. Throughout the write path, several techniques are used to avoid lock contention, shorten lock holding time, and preserve transactional consistency. Because the MemStore is an in-memory cache with a size limit, HBase flushes its data to HDFS as a new HFile once the MemStore exceeds the threshold. The next section looks at that process.

Flushing the MemStore

When a large amount of write data has been added and the MemStore exceeds the threshold, HRegion requests that the MemStore data be flushed to HDFS. Note that the unit of a flush here is a single HRegion: if the HRegion has more than one HStore, then as soon as one MemStore exceeds the threshold, all HStores of that HRegion are flushed.

    • HRegion first acquires the updatesLock write lock, which blocks newly arriving write requests
    • Obtain an MVCC write sequence number
    • Ask each MemStore to create a snapshot
    • Release the updatesLock write lock
    • Commit the MVCC write sequence number obtained earlier and wait for the preceding transactions to complete, so that data from a rolled-back transaction is not written to the HFile
    • Write the snapshot's KeyValue data to an HFile

The main part to examine is how the snapshot's KeyValue data is written to the HFile, starting with the HFile format.

The HFile format was introduced in the earlier article on read requests. An HFile keeps each block at approximately 64 KB, and its structure is built from DataBlocks, a multi-level index over them, and a Bloom filter index. The write process itself is fairly simple. HBase iterates over the KeyValue data in the MemStore snapshot and keeps appending it to the current DataBlock. When the size of the current DataBlock exceeds 64 KB, the DataBlock stops accepting data (and is compressed if compression is enabled), its index entry is computed and kept in memory, and the corresponding Bloom block is written if the Bloom filter property is turned on. Along the way, FileInfo data such as the uncompressed size is recorded.

When all the snapshot data has been written to DataBlocks, HBase starts writing the multi-level index for the DataBlocks. It determines the number of index levels from the index entries saved earlier; if there are not many entries, there may be only a single level, the root index block. The midkey is also obtained from the root index block. Finally, FileInfo and the Bloom filter index are written in turn, followed by the trailer.
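The block-cutting loop can be sketched as follows. This is a minimal illustration of a single-level index only: the classes BlockIndexEntry and BlockWriterSketch are hypothetical, sizes are counted as key bytes plus value bytes, and compression, Bloom blocks, multi-level indexes, FileInfo, and the trailer are all left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical index entry: first key of a block, its file offset, and its size.
final class BlockIndexEntry {
    final String firstKey;
    final long offset;
    final int size;

    BlockIndexEntry(String firstKey, long offset, int size) {
        this.firstKey = firstKey;
        this.offset = offset;
        this.size = size;
    }
}

// Illustrative only: walks sorted data and closes a block once it reaches blockSize.
final class BlockWriterSketch {
    static final int DEFAULT_BLOCK_SIZE = 64 * 1024;  // the 64 KB target from the text

    static List<BlockIndexEntry> writeBlocks(SortedMap<String, byte[]> snapshot, int blockSize) {
        List<BlockIndexEntry> index = new ArrayList<>();
        long offset = 0;        // where the current block starts in the file
        String firstKey = null; // first key of the current block, null if the block is empty
        int current = 0;        // bytes accumulated in the current block
        for (Map.Entry<String, byte[]> kv : snapshot.entrySet()) {
            if (firstKey == null) {
                firstKey = kv.getKey();
            }
            current += kv.getKey().getBytes(StandardCharsets.UTF_8).length + kv.getValue().length;
            if (current >= blockSize) {
                // Block is full: record its index entry and start a new block.
                index.add(new BlockIndexEntry(firstKey, offset, current));
                offset += current;
                firstKey = null;
                current = 0;
            }
        }
        if (firstKey != null) {
            index.add(new BlockIndexEntry(firstKey, offset, current)); // trailing partial block
        }
        return index;
    }
}
```

Because the input is sorted, each index entry's first key is enough to locate the block that may contain a given key with a binary search over the index.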

Summary

HBase uses MemStore to turn random writes into sequential writes, which also helps improve the efficiency of read requests. To avoid data loss, it records modifications in the HLog. Throughout the write path, it uses several techniques to reduce lock contention and improve thread throughput, and it keeps lock holding time short to maximize concurrency. Through MVCC, it also keeps read requests and write requests from interfering with each other.

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
