Blog notes: 1) researched against HBase version 0.94.12; 2) the quoted source code may be trimmed, keeping only the key lines.
This post discusses the HBase write path from two sides: the client and the server.
Client Side
1. Write Data API
Writing data goes through two main HTable APIs, single put and batch put; the source is as follows:
Single put API
public void put(final Put put) throws IOException {
  doPut(put);
  if (autoFlush) {
    flushCommits();
  }
}
Batch put API
public void put(final List<Put> puts) throws IOException {
  for (Put put : puts) {
    doPut(put);
  }
  if (autoFlush) {
    flushCommits();
  }
}
The underlying put implementation
private void doPut(Put put) throws IOException {
  validatePut(put);
  writeBuffer.add(put);
  currentWriteBufferSize += put.heapSize();
  if (currentWriteBufferSize > writeBufferSize) {
    flushCommits();
  }
}
public void close() throws IOException {
  if (this.closed) {
    return;
  }
  flushCommits();
  ...
}
From the two put APIs you can see that, when autoFlush is false, single and batch writes behave essentially the same: a write request is actually submitted once the buffered data exceeds the configured writeBufferSize (set by hbase.client.write.buffer, default 2 MB), and any final data still under 2 MB is submitted when close() is called. Of course, if you use the batch put API and control flushCommits() yourself, say once per 1000 puts, the effect differs: if those 1000 puts total more than 2 MB, multiple submissions actually occur, so more submissions happen than if writeBufferSize alone controlled them. Therefore, in practice, if write performance matters more than real-time visibility and zero data loss, set autoFlush to false and use the single put(final Put) API; this simplifies the write code and writes efficiently. Note that if real-time queries and no data loss are required, you should set autoFlush to true and use the single put API, which guarantees one submission per write.
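The buffering rules above can be sketched with a small self-contained simulation (hypothetical code, not HBase source; WriteBufferSim and its fields are invented for illustration): 1000 puts of 1 KB stay under the 2 MB buffer, so only the close() call produces a submission.

```java
// Hypothetical sketch of the client-side buffering rule with autoFlush = false.
public class WriteBufferSim {
    long writeBufferSize;            // analog of hbase.client.write.buffer (default 2 MB)
    long currentWriteBufferSize = 0;
    int flushCount = 0;              // how many submissions actually happened

    WriteBufferSim(long writeBufferSize) { this.writeBufferSize = writeBufferSize; }

    void put(long heapSize) {        // analog of doPut: buffer, flush past the threshold
        currentWriteBufferSize += heapSize;
        if (currentWriteBufferSize > writeBufferSize) flush();
    }

    void close() {                   // analog of HTable.close(): flush the remainder
        if (currentWriteBufferSize > 0) flush();
    }

    void flush() { flushCount++; currentWriteBufferSize = 0; }

    public static void main(String[] args) {
        WriteBufferSim t = new WriteBufferSim(2L * 1024 * 1024);
        for (int i = 0; i < 1000; i++) t.put(1024);  // 1000 KB total: under 2 MB
        t.close();
        System.out.println(t.flushCount);            // single submission, at close()
    }
}
```

With 3000 such puts the buffer crosses 2 MB once mid-stream, so two submissions occur, which is the "more submissions than writeBufferSize alone would give" effect described above.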
2. Multithreaded writes
In version 0.94.12, the HBase client is already multithreaded internally for writes: the number of threads equals the number of distinct regions the submitted batch touches, so you usually do not need to write your own multithreaded code. Hand-rolled multithreading mainly addresses the performance of feeding Puts into the HTable; data accumulates in the put buffer, and only when it reaches writeBufferSize (if you do not flush manually) is a real write issued. During that write, the number of threads equals the number of regions this batch involves, and all relevant regions are written in parallel, which is generally not a performance problem. However, if the batch involves too many regions, too many threads are created, consuming large amounts of memory and possibly even exhausting threads to the point of an OutOfMemory error. The ideal write pattern is to increase writeBufferSize and write batches that touch an appropriate number of regions on different region servers, so that write pressure is fully spread across multiple servers.
The client-side core of the HBase write path is HConnectionManager's processBatchCallback method; the relevant source is as follows:
public void flushCommits() throws IOException {
  try {
    Object[] results = new Object[writeBuffer.size()];
    try {
      this.connection.processBatch(writeBuffer, tableName, pool, results);
    } catch (InterruptedException e) {
      throw new IOException(e);
    } finally {
      ...
    }
  } finally {
    ...
  }
}
public void processBatch(List<? extends Row> list, final byte[] tableName, ExecutorService pool,
    Object[] results) throws IOException, InterruptedException {
  ...
  processBatchCallback(list, tableName, pool, results, null);
}
public <R> void processBatchCallback(List<? extends Row> list, byte[] tableName, ExecutorService pool,
    Object[] results, Batch.Callback<R> callback) throws IOException, InterruptedException {
  ...
  HRegionLocation[] lastServers = new HRegionLocation[results.length];
  for (int tries = 0; tries < numRetries && retry; ++tries) {
    ...
    // step 1: break up into regionserver-sized chunks and build the data structs
    Map<HRegionLocation, MultiAction<R>> actionsByServer =
        new HashMap<HRegionLocation, MultiAction<R>>();
    for (int i = 0; i < workingList.size(); i++) {
      Row row = workingList.get(i);
      if (row != null) {
        HRegionLocation loc = locateRegion(tableName, row.getRow());
        byte[] regionName = loc.getRegionInfo().getRegionName();
        MultiAction<R> actions = actionsByServer.get(loc);
        if (actions == null) {
          actions = new MultiAction<R>();
          // each region corresponds to one MultiAction object, which holds all the
          // put actions for that region
          actionsByServer.put(loc, actions);
        }
        Action<R> action = new Action<R>(row, i);
        lastServers[i] = loc;
        actions.add(regionName, action);
      }
    }

    // step 2: make the requests; each region gets its own thread
    Map<HRegionLocation, Future<MultiResponse>> futures =
        new HashMap<HRegionLocation, Future<MultiResponse>>(actionsByServer.size());
    for (Entry<HRegionLocation, MultiAction<R>> e : actionsByServer.entrySet()) {
      futures.put(e.getKey(), pool.submit(createCallable(e.getKey(), e.getValue(), tableName)));
    }

    // step 3: collect the failures and successes and prepare for retry
    ...
    // step 4: identify failures and prep for a retry (if applicable)
    ...
  }
  ...
}
3. Before writing, the client must locate the region the data belongs to. The core method locates the region from the cache, implemented with a NavigableMap; on a cache miss it queries the .META. table:
HRegionLocation getCachedLocation(final byte[] tableName,
    final byte[] row) {
  SoftValueSortedMap<byte[], HRegionLocation> tableLocations =
      getTableLocations(tableName);
  ...
  // find the region whose startKey is below and closest to the rowkey,
  // implemented via the NavigableMap
  possibleRegion = tableLocations.lowerValueByKey(row);
  if (possibleRegion == null) {
    return null;
  }
  // the endKey of a table's last region is the empty string; for any other region,
  // the location is returned only if the rowkey is less than the region's endKey
  byte[] endKey = possibleRegion.getRegionInfo().getEndKey();
  if (Bytes.equals(endKey, HConstants.EMPTY_END_ROW) ||
      KeyValue.getRowComparator(tableName).compareRows(
          endKey, 0, endKey.length, row, 0, row.length) > 0) {
    return possibleRegion;
  }
  return null;
}
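A minimal sketch of the same lookup logic, using a plain TreeMap in place of SoftValueSortedMap (RegionCacheLookup and its String keys are invented for illustration; real HBase compares byte[] row keys):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: "closest startKey at or below the row, then check endKey",
// the same shape as getCachedLocation above.
public class RegionCacheLookup {
    // regions: startKey -> endKey ("" as endKey marks the table's open-ended last region)
    public static String locate(NavigableMap<String, String> regions, String row) {
        Map.Entry<String, String> e = regions.floorEntry(row); // startKey closest below row
        if (e == null) return null;                            // cache miss: would query .META.
        String endKey = e.getValue();
        // hit only if the row is before endKey, or endKey is the last-region marker
        if (endKey.isEmpty() || row.compareTo(endKey) < 0) return e.getKey();
        return null;                                           // stale entry or hole in cache
    }

    public static void main(String[] args) {
        NavigableMap<String, String> regions = new TreeMap<>();
        regions.put("", "g");   // region ["", "g")
        regions.put("g", "p");  // region ["g", "p")
        regions.put("p", "");   // last region ["p", +inf)
        System.out.println(locate(regions, "monkey")); // falls in the region starting at "g"
    }
}
```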
Server Side
The main server-side write flow is: write the WAL (unless WAL writing is disabled) → write the MemStore → trigger a MemStore flush (if the MemStore size exceeds hbase.hregion.memstore.flush.size); the flush may in turn trigger compact and split operations. The following sections cover the put method, MemStore flush, compact, and split.
1. The server side of the HTableInterface data-access APIs is implemented by the HRegionServer class; the source is as follows:
Single put
public void put(final byte[] regionName, final Put put) throws IOException {
  HRegion region = getRegion(regionName);
  if (!region.getRegionInfo().isMetaTable()) {
    // check whether the total MemStore footprint of this HRegionServer has exceeded
    // hbase.regionserver.global.memstore.upperLimit (default 0.4) or
    // hbase.regionserver.global.memstore.lowerLimit (default 0.35); if so, a task is
    // added to the flush queue, and if the upper limit is exceeded all MemStore
    // writes block until memory falls below the lower limit
    this.cacheFlusher.reclaimMemStoreMemory();
  }
  boolean writeToWAL = put.getWriteToWAL();
  // the region calls the store's add() method to save the data into the MemStore of
  // the relevant store; after the data is saved, the region checks whether a MemStore
  // flush is needed and, if so, issues a flush request, executed asynchronously by
  // the HRegionServer's flush daemon thread
  region.put(put, getLockFromId(put.getLockId()), writeToWAL);
}
Batch put
public int put(final byte[] regionName, final List<Put> puts) throws IOException {
  region = getRegion(regionName);
  if (!region.getRegionInfo().isMetaTable()) {
    this.cacheFlusher.reclaimMemStoreMemory();
  }
  OperationStatus codes[] = region.batchMutate(putsWithLocks);
  for (i = 0; i < codes.length; i++) {
    if (codes[i].getOperationStatusCode() != OperationStatusCode.SUCCESS) {
      return i;
    }
  }
  return -1;
}
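The return convention of the batch put above (-1 means every put succeeded; otherwise the index of the first failed operation is returned, from which the caller can resubmit) can be sketched as follows (hypothetical helper, not HBase source):

```java
// Hypothetical sketch of the batch-put return convention shown above.
public class BatchPutResult {
    // -1 if all operations succeeded, otherwise the index of the first failure
    public static int firstFailure(boolean[] success) {
        for (int i = 0; i < success.length; i++) {
            if (!success[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstFailure(new boolean[]{true, false, true})); // prints 1
    }
}
```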
2. MemStore flush
MemStore flushing is controlled by the MemStoreFlusher class, a Runnable implementation started as a daemon thread when the HRegionServer starts. Every 10 s it polls flush tasks from flushQueue; any code that needs a MemStore flush simply calls MemStoreFlusher's requestFlush or requestDelayedFlush method to add the flush request to the flush queue, and the actual flush executes asynchronously.
The size of the MemStore is controlled at two levels:
1) Region level
a. hbase.hregion.memstore.flush.size: default 128 MB; once exceeded, the MemStore is flushed to disk.
b. hbase.hregion.memstore.block.multiplier: default 2; if the MemStore has exceeded hbase.hregion.memstore.flush.size times this multiplier, writes to the region are blocked until the size drops back below that value.
2) RegionServer level
a. hbase.regionserver.global.memstore.lowerLimit: default 0.35; the lower bound on the fraction of total HRegionServer memory that all MemStores may occupy. When reached, a flush across the region server is triggered (it does not really flush all regions; see below) until the total memory fraction drops below the limit.
b. hbase.regionserver.global.memstore.upperLimit: default 0.4; the upper bound on the fraction of total memory that all MemStores of the HRegionServer may occupy. When reached, a region-server-wide flush is triggered until the fraction falls below the limit, and all MemStore writes are blocked until it does.
An HRegionServer-wide flush does not flush every region; each pass it uses factors such as MemStore size and storefile count to pick the region most worth flushing, flushes it, then re-checks the total memory fraction; if it still has not dropped below the lower limit, it picks another region to flush.
Flushing a region flushes all stores under that region, even though some stores may hold very little MemStore content.
A MemStore flush takes the write lock of updatesLock (an HRegion field, implemented with the JDK's ReentrantReadWriteLock); the write lock is released only after the MemStore snapshot has been captured. While it is held, every operation needing the updatesLock write or read lock blocks, and the effect covers the whole HRegion. So if a table has too few HRegions, or writes are hot-spotted, each region's MemStore flush produces a region-wide write-exclusive window (although taking the MemStore snapshot is very fast), which reduces the region's overall write throughput.
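A minimal sketch of the updatesLock pattern just described, with invented names (MemStoreSketch): concurrent puts share the read lock, and the flush holds the write lock only while swapping in a fresh MemStore, which is why the blocking window is short but region-wide:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not HBase source: the ReentrantReadWriteLock snapshot pattern.
public class MemStoreSketch {
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    private List<String> memstore = new ArrayList<>();

    public void put(String kv) {
        updatesLock.readLock().lock();   // many writers may hold the read lock at once
        try {
            synchronized (memstore) { memstore.add(kv); }
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    // returns the snapshot that would then be written to disk without holding the lock
    public List<String> snapshotForFlush() {
        updatesLock.writeLock().lock();  // blocks all puts, but only while snapshotting
        try {
            List<String> snapshot = memstore;
            memstore = new ArrayList<>(); // fresh MemStore; snapshot is flushed async
            return snapshot;
        } finally {
            updatesLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        MemStoreSketch m = new MemStoreSketch();
        m.put("row1/cf:q/ts=1");
        m.put("row2/cf:q/ts=2");
        System.out.println(m.snapshotForFlush().size()); // snapshot of both entries
    }
}
```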
3. Compact operations
HBase has two kinds of compaction: minor and major. A minor compaction typically merges several small storefiles into one large storefile and does not remove data marked as deleted or expired data; a major compaction does remove it. After a major compaction, a store has only one storefile; it rewrites all the data in the store, so its resource overhead is large. Major compactions execute by default once a day; the period can be configured via hbase.hregion.majorcompaction. Typically this value is set to 0 to disable automatic major compactions and run them manually, which avoids cluster-wide major compactions while the cluster is busy. Major compaction is an operation that must eventually run, because removing data marked for deletion and expired data happens only during it. The Merge tool can combine two regions of a table to reduce the region count; the command is:
$ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>
The <region1> parameter takes a full region name, for example:
gd500m,4-605-52-78641,1384227418983.ccf74696ef8a241088356039a65e1aca
The HBase cluster must be stopped before running this operation. If HDFS does not run under the same user and group as HBase, and HDFS permission checking is enabled (controlled by the dfs.permissions configuration, default true), you need to switch to the hdfs Linux user to run it. After it completes, use the hadoop dfs -chown command to change the owner of the newly merged region back to the hbase user, otherwise the HBase web UI will show no tables after restart. The screenshot (omitted here) shows the HDFS path information of the newly merged region:
As can be seen there, the new region of table gd500m is owned by user hdfs. Run the chown command to fix the owner; it is best to apply the ownership change to the whole HBase root directory, because not only the new region but also some log files produced during the merge are owned by hdfs.
Checking the new region's ownership after the change shows it is now hbase:
The levels at which compaction is triggered
1) The whole HBase cluster
When HRegionServer starts, it launches a daemon thread that scans the storefiles of every store of every online region, requesting compaction for each store that satisfies store.needsCompaction() or store.isMajorCompaction(). The default scan period is 10,000 seconds (about 2.7 hours); if hbase.hregion.majorcompaction is configured to 0, this daemon never triggers a major compaction. Source:
// threadWakeFrequency defaults to 10 * 1000, multiplier to 1000; unit: milliseconds
this.compactionChecker = new CompactionChecker(this, this.threadWakeFrequency * multiplier, this);
chore() is the CompactionChecker method executed on this timer to request minor and major compactions; if hbase.hregion.majorcompaction is configured to 0, no major compaction is performed, except when a minor is upgraded to a major:
protected void chore() {
  for (HRegion r : this.instance.onlineRegions.values()) {
    if (r == null)
      continue;
    for (Store s : r.getStores().values()) {
      try {
        if (s.needsCompaction()) {
          // if all storefiles in the store need merging, the request is automatically
          // upgraded to a major compaction
          this.instance.compactSplitThread.requestCompaction(r, s, getName()
              + " requests compaction", null);
        } else if (s.isMajorCompaction()) {
          if (majorCompactPriority == DEFAULT_PRIORITY
              || majorCompactPriority > r.getCompactPriority()) {
            this.instance.compactSplitThread.requestCompaction(r, s, getName()
                + " requests major compaction; use default priority", null);
          } else {
            this.instance.compactSplitThread.requestCompaction(r, s, getName()
                + " requests major compaction; use configured priority",
                this.majorCompactPriority, null);
          }
        }
      } catch (IOException e) {
        LOG.warn("Failed major compaction check on " + r, e);
      }
    }
  }
}
A store can be compacted if, after removing the storefiles currently being compacted, the remaining storefile count exceeds the configured minimum number of files per compaction. That minimum is configured by hbase.hstore.compactionThreshold, with a default of 3 and a minimum value of 2.
public boolean needsCompaction() {
  return (storefiles.size() - filesCompacting.size()) > minFilesToCompact;
}
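With the default quoted above (hbase.hstore.compactionThreshold = 3), the arithmetic works out as in this hypothetical sketch: 4 storefiles with none compacting qualifies, while 6 storefiles with 3 already compacting does not:

```java
// Hypothetical sketch of the needsCompaction() arithmetic above.
public class CompactionCheck {
    public static boolean needsCompaction(int storefiles, int filesCompacting,
                                          int minFilesToCompact) {
        // only files not already being compacted count toward the threshold
        return (storefiles - filesCompacting) > minFilesToCompact;
    }

    public static void main(String[] args) {
        System.out.println(needsCompaction(4, 0, 3)); // prints true
    }
}
```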
Whether to major-compact:
private boolean isMajorCompaction(final List<StoreFile> filesToCompact) throws IOException {
  boolean result = false;
  // compute the time of the next major compaction from the hbase.hregion.majorcompaction
  // period; if it is set to 0, there is no major compaction
  long mcTime = getNextMajorCompactTime();
  if (filesToCompact == null || filesToCompact.isEmpty() || mcTime == 0) {
    return result;
  }
  // TODO: Use better method for determining stamp of last major (HBASE-2990)
  // the modification time of the longest-unmodified storefile in the store is used as
  // the time of the last major compaction when deciding the next one; this is not
  // entirely reasonable and may defer the major compaction, in extreme cases forever
  long lowTimestamp = getLowestTimestamp(filesToCompact);
  long now = System.currentTimeMillis();
  // a major compaction is possible only once the major compaction time has been reached
  if (lowTimestamp > 0L && lowTimestamp < (now - mcTime)) {
    // Major compaction time has elapsed.
    if (filesToCompact.size() == 1) {
      StoreFile sf = filesToCompact.get(0);
      // how far the oldest timestamp in the store lags behind the current time
      long oldest = (sf.getReader().timeRangeTracker == null) ?
          Long.MIN_VALUE :
          now - sf.getReader().timeRangeTracker.minimumTimestamp;
      if (sf.isMajorCompaction() && (this.ttl == HConstants.FOREVER || oldest < this.ttl)) {
        // if the column family sets no TTL (via HColumnDescriptor.setTimeToLive()),
        // there is no expired data for a major compaction to remove
      } else if (this.ttl != HConstants.FOREVER && oldest > this.ttl) {
        result = true;
      }
    } else {
      result = true;
    }
  }
  return result;
}
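The timing test at the heart of isMajorCompaction() can be sketched as follows (hypothetical helper, not HBase source): a major compaction is due only when the oldest storefile predates now minus the configured period, and never when the period is 0:

```java
// Hypothetical sketch of the "has the major compaction time elapsed" check above.
public class MajorDue {
    public static boolean majorDue(long lowTimestamp, long now, long mcTime) {
        // mcTime == 0 means hbase.hregion.majorcompaction is 0: never due
        return mcTime != 0 && lowTimestamp > 0L && lowTimestamp < (now - mcTime);
    }

    public static void main(String[] args) {
        long day = 86400000L;
        long now = System.currentTimeMillis();
        // oldest storefile written two days ago, one-day period: due
        System.out.println(majorDue(now - 2 * day, now, day)); // prints true
    }
}
```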
2) Table level
All regions and column families of a table can be compacted (minor or major) via HBaseAdmin or CompactionTool; HBaseAdmin can also trigger a compaction of one specified column family of a table.
3) Region level
All column families under a specified region can be compacted (minor or major) via HBaseAdmin or CompactionTool; HBaseAdmin can also trigger a compaction of one specified column family of a region.
The Merge tool can combine any two regions of a given table into one region, and triggers a major compaction of each region before merging them.
The MemStore flush process triggers a compaction check on the current region; writing data or splitting a region can trigger a MemStore flush.
4) Column family (store) level
Many situations trigger a store-level compaction check, for example running the CompactionTool, flushing a MemStore, and so on.
Note: the four levels above only refer to triggering the compaction check; a compaction actually runs only if needsCompaction() or isMajorCompaction() is also satisfied.
Compaction summary:
1. By extent, compactions divide into minor and major;
2. By scope, into four levels: whole cluster, table, region, and column family (store);
3. By trigger source, into:
HBase-internal automatic triggers (the HRegionServer timer, MemStore flushes, etc.)
External triggers such as clients (HBase management tools, HBaseAdmin (the client-side admin class), CompactionTool, etc.)
The implementation logic of compaction is as follows:
The CompactSplitThread class, held only by the HRegionServer class, is called from three places:
1. HRegionServer's compaction checker daemon thread
2. MemStoreFlusher's flushRegion
3. CompactionRequest's run method
public synchronized CompactionRequest requestCompaction(final HRegion r, final Store s,
    final String why, int priority, CompactionRequest request) throws IOException {
  ...
  CompactionRequest cr = s.requestCompaction(priority, request);
  ...
  cr.setServer(server);
  ...
  // which pool handles the merge depends only on the total size of the files involved:
  // above a threshold it goes through the large-compaction thread pool. Note this is
  // independent of minor vs major: either pool may run a minor or a major compaction.
  // both pools default to 1 thread, configurable via
  // hbase.regionserver.thread.compaction.large and hbase.regionserver.thread.compaction.small
  ThreadPoolExecutor pool = s.throttleCompaction(cr.getSize()) ? largeCompactions : smallCompactions;
  pool.execute(cr);
  ...
  return cr;
}
Store class
public CompactionRequest requestCompaction(int priority, CompactionRequest request)
    throws IOException {
  ...
  this.lock.readLock().lock();
  try {
    synchronized (filesCompacting) {
      // candidates = all storefiles not already in compaction queue
      List<StoreFile> candidates = Lists.newArrayList(storefiles);
      if (!filesCompacting.isEmpty()) {
        // exclude all files older than the newest file we're currently
        // compacting. this allows us to preserve contiguity (HBASE-2856)
        StoreFile last = filesCompacting.get(filesCompacting.size() - 1);
        int idx = candidates.indexOf(last);
        Preconditions.checkArgument(idx != -1);
        candidates.subList(0, idx + 1).clear();
      }

      boolean override = false;
      if (region.getCoprocessorHost() != null) {
        override = region.getCoprocessorHost().preCompactSelection(this, candidates, request);
      }
      CompactSelection filesToCompact;
      if (override) {
        // coprocessor is overriding normal file selection
        filesToCompact = new CompactSelection(conf, candidates);
      } else {
        filesToCompact = compactSelection(candidates, priority);
      }

      if (region.getCoprocessorHost() != null) {
        region.getCoprocessorHost().postCompactSelection(this,
            ImmutableList.copyOf(filesToCompact.getFilesToCompact()), request);
      }

      // no files to compact
      if (filesToCompact.getFilesToCompact().isEmpty()) {
        return null;
      }

      // basic sanity check: do not try to compact the same StoreFile twice
      if (!Collections.disjoint(filesCompacting, filesToCompact.getFilesToCompact())) {
        // TODO: change this from an IAE to LOG.error after sufficient testing
        Preconditions.checkArgument(false, "%s overlaps with %s",
            filesToCompact, filesCompacting);
      }
      filesCompacting.addAll(filesToCompact.getFilesToCompact());
      Collections.sort(filesCompacting, StoreFile.Comparators.FLUSH_TIME);

      // major compaction iff all storefiles are included
      boolean isMajor = (filesToCompact.getFilesToCompact().size() == this.storefiles.size());
      if (isMajor) {
        // since we're enqueuing a major, update the compaction wait interval
        this.forceMajor = false;
      }

      // everything went better than expected. create a compaction request
      int pri = getCompactPriority(priority);
      // not a special compaction request, so we need to make one
      if (request == null) {
        request = new CompactionRequest(region, this, filesToCompact, isMajor, pri);
      } else {
        // update the request with what the system thinks the request should be;
        // it's up to the request if it wants to listen
        request.setSelection(filesToCompact);
        request.setIsMajor(isMajor);
        request.setPriority(pri);
      }
    }
  } finally {
    this.lock.readLock().unlock();
  }
  if (request != null) {
    CompactionRequest.preRequest(request);
  }
  return request;
}
There are two compaction thread pools in total; if the total size of the selected files exceeds 2 * this.minFilesToCompact * this.region.memstoreFlushSize, the compaction runs in the large-compaction pool:
ThreadPoolExecutor pool = s.throttleCompaction(cr.getSize()) ? largeCompactions : smallCompactions;
// minFilesToCompact defaults to 3, memstoreFlushSize to 128 MB
boolean throttleCompaction(long compactionSize) {
  long throttlePoint = conf.getLong(
      "hbase.regionserver.thread.compaction.throttle",
      2 * this.minFilesToCompact * this.region.memstoreFlushSize);
  return compactionSize > throttlePoint;
}
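With the defaults quoted above (minFilesToCompact = 3, memstoreFlushSize = 128 MB), the throttle point works out to 768 MB; a hypothetical sketch of the arithmetic:

```java
// Hypothetical sketch of the throttleCompaction() arithmetic with 0.94 defaults.
public class CompactionThrottle {
    static final long MB = 1024L * 1024L;

    public static long throttlePoint(long minFilesToCompact, long memstoreFlushSize) {
        return 2 * minFilesToCompact * memstoreFlushSize;
    }

    // true -> large-compaction pool, false -> small-compaction pool
    public static boolean useLargePool(long compactionSize, long throttlePoint) {
        return compactionSize > throttlePoint;
    }

    public static void main(String[] args) {
        long tp = throttlePoint(3, 128 * MB);
        System.out.println(tp / MB); // prints 768
    }
}
```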
4. Split
HBase's default split policy class is IncreasingToUpperBoundRegionSplitPolicy. The policy can be changed via the hbase.regionserver.region.split.policy configuration, or specified per table through HTableDescriptor at table-creation time; the HTableDescriptor setting has the highest priority. Below is the source in this class that computes the split threshold size.
The threshold is the smaller of desiredMaxFileSize and a value computed from the region count and the MemStore flush size. So when writing data you will find that although the configured maximum region size may be 10 GB, HBase does not actually wait for a region to reach that size before splitting; the trigger size varies. With only one region, reaching the MemStore flush size is enough to split. This design ensures that writes quickly split out multiple regions and make full use of cluster resources, and splitting early consumes fewer server resources than splitting late, since there is little data early on.
// IncreasingToUpperBoundRegionSplitPolicy class
// returns the storefile size beyond which a split may be triggered
long getSizeToCheck(final int tableRegionsCount) {
  return tableRegionsCount == 0 ? getDesiredMaxFileSize() :
      Math.min(getDesiredMaxFileSize(),
          this.flushSize * (tableRegionsCount * tableRegionsCount));
}
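Plugging the defaults from the text (flushSize 128 MB, max region size 10 GB) into this formula shows how the split threshold grows with the square of the region count until it caps at the maximum; a hypothetical sketch:

```java
// Hypothetical sketch of getSizeToCheck() with the defaults quoted in the text.
public class SplitSizeCheck {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    public static long sizeToCheck(int tableRegionsCount, long flushSize,
                                   long desiredMaxFileSize) {
        // one region: split at flushSize; n regions: flushSize * n^2, capped at the max
        return tableRegionsCount == 0 ? desiredMaxFileSize :
            Math.min(desiredMaxFileSize,
                flushSize * tableRegionsCount * (long) tableRegionsCount);
    }

    public static void main(String[] args) {
        // 1 region -> 128 MB, 2 -> 512 MB, 9 -> capped at 10 GB
        System.out.println(sizeToCheck(1, 128 * MB, 10 * GB) / MB); // prints 128
    }
}
```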
The logic of getDesiredMaxFileSize is: if a maximum region size was specified at table creation, that value is used; otherwise the value of the hbase.hregion.max.filesize configuration is used.
HTableDescriptor desc = region.getTableDesc();
if (desc != null) {
  this.desiredMaxFileSize = desc.getMaxFileSize();
}
if (this.desiredMaxFileSize <= 0) {
  this.desiredMaxFileSize = conf.getLong(HConstants.HREGION_MAX_FILESIZE,
      HConstants.DEFAULT_MAX_FILE_SIZE);
}