HBase RegionServer Overall Architecture


The overall architecture of the RegionServer

This article first describes the overall structure of a region, and then walks through the concrete implementation and source code of each of its components.

RegionServer logical architecture diagram

RegionServer responsibilities

1. Monitoring and coordination: through ZooKeeper, the RegionServer watches for changes to the master, the location of the Meta table, cluster status, and other information, and updates its local state accordingly.

2. Region management: taking regions offline and online, opening and closing them. These operations are coordinated with the HMaster. A region moves through states such as:

    • OFFLINE, OPENING, OPEN, CLOSING, CLOSED, and so on. The HMaster coordinates with the RegionServer through ZooKeeper to bring specific regions online or offline and to handle operation timeouts.

3. RPC service: dispatching incoming read and write requests to the specific region that should execute them.

Since 1.0, the RPC service distinguishes request types and priorities, with different handlers serving requests of different priorities.

4. Background maintenance: a number of global threads monitor regions and trigger and execute the three core region operations: flush, compaction, and split.

5. LogRoller: a long-running thread that rolls the WAL. It periodically rolls all WALs, also accepts external roll requests, and the rolled-out logs then become eligible for splitting.

The WAL is the write-operations log. The entire RegionServer maintains a single WAL; the edits of all its regions are written into this one log.

After a WAL is rolled, flush requests are sent to the affected regions.

6. Leases: a lease-management mechanism. All timeout-sensitive region operations are registered as leases; a periodic sweep removes expired leases and invokes their expiration handlers.
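The lease mechanism above can be sketched as follows. This is an illustrative stand-in, not the real Leases class: the class and method names (MiniLeases, add, renew, sweep) are invented for the sketch.

```python
import time

class MiniLeases:
    """Sketch of the RegionServer lease mechanism: timeout-sensitive
    resources (e.g. scanners) register a lease; a periodic sweep removes
    expired leases and fires their expiration handlers."""

    def __init__(self, now=time.monotonic):
        self.now = now
        self.leases = {}   # name -> (deadline, on_expire handler)

    def add(self, name, timeout_s, on_expire):
        self.leases[name] = (self.now() + timeout_s, on_expire)

    def renew(self, name, timeout_s):
        # e.g. each scanner.next() call renews the scanner's lease
        _, handler = self.leases[name]
        self.leases[name] = (self.now() + timeout_s, handler)

    def sweep(self):
        # the periodic check: remove expired leases and call their handlers
        expired = [n for n, (d, _) in self.leases.items() if d <= self.now()]
        for name in expired:
            _, handler = self.leases.pop(name)
            handler(name)
        return expired
```

A lease that is never renewed eventually expires and its handler runs exactly once; renewing pushes the deadline forward, which is how long-running scanners stay alive.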

7. Pending additions

RegionServer internal threads

    • HealthChecker: checks node health by periodically running a user-supplied script (its path is configurable) and inspecting the return value.
    • PauseMonitor: periodically queries the JVM's GC status over JMX (collection counts and time spent) and raises alerts.
    • CacheFlusher: picks up flush requests and flushes MemStores.
    • CompactSplitThread: executes compaction and split requests.
    • CompactionChecker: periodically checks the stores and submits CompactionRequests.
    • PeriodicFlusher: periodically submits FlushRequests.
    • Leases: manages all leases in the RegionServer, e.g. scanner leases.
    • StorefileRefresher: periodically refreshes the store files of secondary region replicas.

Other notes about the RegionServer

Each time a RegionServer starts it is assigned a startcode; host, port, and startcode together form the RegionServer's unique identifier. A machine before and after a restart is therefore treated as two different RegionServers, which keeps them distinguishable.

Region

Region logical architecture diagram

Region components and processes

Region composition

Each region holds one partition of a table's data. A region contains multiple stores, one per column family; each store handles the reads and writes for its family and consists of one MemStore and multiple StoreFiles (HFiles). Written data goes directly into the MemStore and is periodically flushed to storage (HDFS) to form an HFile; reads merge the data in the MemStore with the data in all the HFiles.

Each region provides the split, flush, and compaction strategies and the methods that carry them out, but triggering and execution are handled by the threads in the RegionServer.

MVCC: multi-version concurrency control

HBase uses multi-version concurrency control to support rollback of operations, to make operations atomic, and to keep in-flight operations invisible to readers.

In HBase, the MultiVersionConsistencyControl class manages this. It supports multiple concurrent writes, and if one of the writes fails, that write can be rolled back.

Concretely, each operation (or batch of operations) gets a unique MVCC version number. The data is first written into the MemStore, then to the WAL, then any remaining steps run. If any step fails, the entries already written to the MemStore can be deleted by their MVCC version (the MemStore can in fact be thought of as a simple SortedSet&lt;KeyValue&gt;, so rolling back just means deleting the corresponding KeyValues).

MultiVersionConsistencyControl maintains a single readpoint for the whole region. Reads that need atomicity (unfinished data must not be visible) simply read only entries whose MVCC version is &lt;= the readpoint.

As for updating the readpoint: it is advanced when writes complete. Because multiple writes can be in flight but there is only one readpoint, MultiVersionConsistencyControl keeps a queue of write operations. A write registers itself in the queue when it starts; when a write finishes, its queue entry is marked complete, the queue is traversed from the head, completed entries are removed, and the readpoint is advanced to the newest version all of whose predecessors have completed.

There is exactly one MultiVersionConsistencyControl per region, so operation atomicity is currently supported at the region level.
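The write-queue and readpoint bookkeeping described above can be sketched in a few lines. This is a simplified stand-in for MultiVersionConsistencyControl; the names (SimpleMVCC, begin_write, complete_write) are invented for the sketch.

```python
from collections import deque

class SimpleMVCC:
    """Sketch of the MVCC idea: writers register in a queue in version
    order; the readpoint only advances past a version once that write
    and all earlier ones have completed, so readers never see partially
    committed data."""

    def __init__(self):
        self.write_point = 0     # last version handed out
        self.read_point = 0      # highest version visible to readers
        self.queue = deque()     # pending writes, in version order

    def begin_write(self):
        self.write_point += 1
        entry = {"version": self.write_point, "done": False}
        self.queue.append(entry)
        return entry

    def complete_write(self, entry):
        entry["done"] = True
        # advance the readpoint over the longest completed prefix
        while self.queue and self.queue[0]["done"]:
            self.read_point = self.queue.popleft()["version"]

    def visible(self, version):
        return version <= self.read_point
```

Note that if a later write completes before an earlier one, its data stays invisible until the earlier write also finishes: the readpoint cannot jump over a pending version.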

The KeyValue data structure is as follows:

In memory, each KeyValue carries an MVCC version number; when it is written to an HFile, this MVCC version number is not written out.

HStore (one per column family)

There is not much to say about the store: it is the store for one column family, responsible for that family's reads and writes. A store contains one MemStore and a number of StoreFiles.

Memstore

The MemStore is the buffer HBase writes data into (every write in HBase is a KeyValue; even a delete is actually written as a KeyValue whose type is Delete, and the corresponding KVs are excluded at retrieval time). All KeyValues are written to the MemStore first.

Put plainly, the MemStore is a SortedSet&lt;KeyValue&gt; (a ConcurrentNavigableMap&lt;KeyValue, KeyValue&gt; underneath), and all written data goes directly into this set. The MemStore also maintains a MemStoreLAB: when a KeyValue is added, its bytes are copied into the global MemStoreChunkPool (a large pool of memory divided into pieces, each called a chunk). This is primarily to avoid heap fragmentation, because under high write concurrency MemStore flushes are quite frequent.

Regarding the MemStore flush operation: the MemStore maintains two KV sets, the normal kvset and a snapshot kvset. Normally data is written to the kvset and the snapshot is empty. On flush, the current kvset is assigned to the snapshot and a new kvset is created (the switch requires the MemStore's global lock, but only very briefly). Because the snapshot then no longer changes, it can be written out to HDFS slowly. Two problems must be solved during the flush:

1. While flushing, the snapshot must not change (a rollback may still occur at this time, and a rollback would also need to find and delete entries from the snapshot). The solution is to wait for all in-flight MVCC write requests to complete before taking the snapshot and flushing.

2. Reading during a flush: the MemStoreScanner scans both KV sets; it is in fact a heap scanner over the two.
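The double-set flush trick above can be sketched as follows. This is an illustrative model, not the real MemStore; MiniMemStore, prepare_flush, and the tuple representation of KeyValues are invented for the sketch.

```python
import bisect
import threading

class MiniMemStore:
    """Sketch of the MemStore two-set flush: writes go to the active
    set; flush atomically swaps it into an immutable snapshot (under a
    short lock) so the snapshot can be written to disk slowly while new
    writes continue, and scans merge both sets."""

    def __init__(self):
        self.active = []      # sorted list of (key, value) pairs
        self.snapshot = []    # immutable set currently being flushed
        self.lock = threading.Lock()

    def put(self, key, value):
        bisect.insort(self.active, (key, value))

    def prepare_flush(self):
        # the swap itself is brief; writing the snapshot to HDFS is slow
        with self.lock:
            self.snapshot, self.active = self.active, []
        return self.snapshot

    def scan(self):
        # the real MemStoreScanner heap-merges both sets; a merge-sort
        # of two sorted lists gives the same ordering here
        return sorted(self.active + self.snapshot)
```

Writes arriving after prepare_flush() land in the fresh active set, so the snapshot stays stable for the duration of the disk write.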

MemStoreChunkPool (unified buffer)

The unified buffer behind the MemStores: KeyValues put into a MemStore's kvset are first copied (System.arraycopy) into a chunk here. Beyond backing the kvset, the main purpose is to reduce memory fragmentation. The MemStoreChunkPool is also a globally unique singleton; all MemStores allocate their space from it.
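The copy-into-a-chunk allocation can be sketched as below. This is a toy model of the MemStoreLAB idea, not the real implementation; MiniMSLAB and copy_in are invented names, and the 2 MB chunk size is only an illustrative stand-in for the configured default.

```python
class MiniMSLAB:
    """Sketch of the MemStoreLAB idea: copy each KeyValue's bytes into
    a large shared chunk so a MemStore's data lives in a few big
    allocations instead of many small ones, reducing heap fragmentation
    (oversize values are ignored for brevity)."""

    CHUNK_SIZE = 2 * 1024 * 1024   # illustrative chunk size

    def __init__(self):
        self.chunks = [bytearray(self.CHUNK_SIZE)]
        self.offset = 0

    def copy_in(self, data: bytes):
        # start a new chunk when the current one cannot fit the value
        if self.offset + len(data) > self.CHUNK_SIZE:
            self.chunks.append(bytearray(self.CHUNK_SIZE))
            self.offset = 0
        chunk = self.chunks[-1]
        chunk[self.offset:self.offset + len(data)] = data
        view = memoryview(chunk)[self.offset:self.offset + len(data)]
        self.offset += len(data)
        return view   # a reference into the chunk, like the copied KV
```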

StoreFile

A StoreFile is a wrapper around an HFile; the HFile format is shown in an attached PDF.

Put/delete

Put and delete follow the same path: the operation is constructed as a KeyValue and then written into the MemStore. For a transactional batch, the row locks for all rows involved are requested together; otherwise each row is processed on its own, acquiring its row lock individually.

Atomic mutations

HBase supports atomic batches of mutations, again controlled through MVCC: the whole batch shares one MVCC version, so either all of it succeeds or all of it is rolled back. On the HBase client this corresponds to the mutate interface; in general, atomicity does not extend across rows.

Scan/Get (the LSM heap scanner)

Get and scan in HBase are both eventually executed as scans; a get simply retrieves a single row. The scanner's search is a typical LSM-tree search, as follows.

The heap scanner exists to search the LSM-tree structure: each subtree of the LSM tree is an ordered scanner, and the HeapScanner makes the KeyValues returned by its next() method come out in order across all of them. Its children may themselves be HeapScanners, so the structure nests recursively.

Internally, the HeapScanner puts all the scanners into a PriorityQueue&lt;KeyValueScanner&gt; heap whose comparator uses each scanner's first element (obtained via peek). Each poll therefore yields the scanner whose head element is smallest; that head element is taken, the scanner is pushed back into the heap, and the next poll again yields the smallest.
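The heap-of-scanners merge above can be sketched as follows. This is a minimal model of the KeyValueHeap idea, using Python iterators in place of KeyValueScanners; HeapScanner here is an invented name for the sketch.

```python
import heapq

class HeapScanner:
    """Sketch of KeyValueHeap: each child scanner is itself sorted; the
    heap is ordered by each scanner's peeked head element, so next()
    always returns the globally smallest item. Children may themselves
    be HeapScanners, so the structure nests like the LSM tree."""

    def __init__(self, scanners):
        self.heap = []
        for s in scanners:
            it = iter(s)
            head = next(it, None)
            if head is not None:
                # (head, tie-breaker, iterator): heap is keyed on head
                heapq.heappush(self.heap, (head, id(it), it))

    def __iter__(self):
        return self

    def __next__(self):
        if not self.heap:
            raise StopIteration
        head, tag, it = heapq.heappop(self.heap)
        nxt = next(it, None)
        if nxt is not None:
            # push the scanner back, keyed on its new head element
            heapq.heappush(self.heap, (nxt, tag, it))
        return head
```

Because a HeapScanner is itself an ordered iterable, it can be a child of another HeapScanner, mirroring the nested structure described above.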

The general flow of the whole scanner is as follows:

HRegion.getScanner(scan)

---> HRegion.getScanner(scan, additionalScanners) (family check)

---> HRegion.instantiateRegionScanner(scan, additionalScanners)

---> new RegionScannerImpl(scan, additionalScanners, region)

readPt: determined by the IsolationLevel: either read only the data that was MVCC-complete when the scanner was created (the latest readpoint at that moment), or read the very latest data, including rows updated while the scan is in progress.

isScan = scan.isGetScan() ? -1 : 0;

scanners = new ArrayList&lt;KeyValueScanner&gt;();

joinedScanners = new ArrayList&lt;KeyValueScanner&gt;();

In a typical LSM-tree scan the joinedScanners case generally does not arise; everything goes into scanners, which holds one scanner per store plus the additional scanners.

Each store obtains its scanner via store.getScanner(scan, entry.getValue(), this.readPt), with parameters scan, qualifiers, and readpoint.

storeHeap = new KeyValueHeap(scanners, region.comparator);

RegionScannerImpl drives the actual retrieval; a filter can also declare where to stop, by returning true from its filterAllRemaining() method.

---> RegionScannerImpl.nextInternal(outResults, limit)

Everything eventually lands in this method.

---> KeyValueHeap.next(List&lt;Cell&gt; result, int limit)

This is the heap scanner's retrieval logic proper. As shown in the diagram, the middle layers are basically heap scanners wrapping heap scanners; the leaves are the MemStoreScanner and the StoreFileScanners. Above those two store scanners sits the ScanQueryMatcher, whose role is to handle filtering for expiration, deletions, maxVersions, and so on.

Memstorescanner

The MemStore is itself an ordered set, so it can be searched directly. Note, however, that without MVCC version control a scan would see the very latest writes, including uncommitted ones.

Storefilescanner

The HFile scanner logic is more complex. Each scanner opens an HFile reader (on a single disk, too many scanners will be very slow, because every get actually turns into a scan and each scanner may cost a disk seek). Positioning starts from the requested KeyValue: the HFile index (the start and end row of each block) is consulted first to locate the block. When a specific block is needed, the cache is queried first; only on a miss is the block read from disk, after which it is placed into the cache.

StoreFile retrieval call flow

StoreFileScanner.next() ---> HFileScanner.next() (the concrete implementation is ScannerV2): first look within the current buffered block; if the KeyValue is not found, move to the next block (after checking whether the current block is the last one) ---> AbstractScannerV2.readNextDataBlock() ---> ScannerV2.readBlock(). This method constructs the block's cache key (HFile name + offset + encoding) and looks it up in the cache first.

BlockCache (unified cache management)

The cache configuration is handled by CacheConfig. An LruBlockCache is used, and depending on configuration a BucketCache can be added as a second-level cache (configured through hbase.bucketcache.ioengine and a series of related parameters; see the CacheConfig.instantiateBlockCache() method). The cache is a singleton, globally unique. By default there is only the LRU cache and no BucketCache second level (verified by tracing through a standalone deployment).
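The cache-then-disk lookup keyed by (HFile name, offset) can be sketched with a small LRU cache. This is an illustrative model, not the real LruBlockCache; MiniBlockCache and its parameters are invented, and the real cache key also folds in the block encoding.

```python
from collections import OrderedDict

class MiniBlockCache:
    """Sketch of the block-cache lookup path: blocks are keyed by
    (hfileName, offset); a hit refreshes recency, a miss reads from
    "disk" and caches the block, evicting the least recently used
    entry once capacity is exceeded."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk   # fallback loader
        self.cache = OrderedDict()             # insertion order = recency
        self.hits = self.misses = 0

    def get_block(self, hfile_name, offset):
        key = (hfile_name, offset)
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)        # mark as most recently used
            return self.cache[key]
        self.misses += 1
        block = self.read_from_disk(hfile_name, offset)
        self.cache[key] = block                # read once, then cache
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return block
```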

A blog about several caches in HBase

http://www.cnblogs.com/cenyuhai/p/3707971.html

Flush

The flush operation is defined inside the region, but its invocation and the memory accounting are managed externally by the RegionServer. Externally, a FlushRequester tracks the regions that need flushing, and flushes are triggered periodically. As mentioned above, the flush operation itself is defined within the region but is invoked from a separate thread.

Flush happens at region granularity: a region may contain several MemStores, and if one of them reaches the flush condition, all the MemStores in that region are flushed together. The usual flush triggers are:

    • The time since the last flush has reached its limit;

    • The amount of data written has reached its limit;

    • The number of changes (mutation count) since the last flush has reached its limit;

    • The WAL has just been rolled.

The data-volume trigger is checked before each write operation: when the limit is reached, a FlushRequest is issued. The other conditions are checked periodically by a dedicated thread in the RegionServer, which calls shouldFlush() on each region. A flush for a given region is single and synchronous: at most one FlushRequest for a region can be outstanding (duplicates are simply not accepted), and at most one flush can be executing at a time. The flush itself happens in Region.internalFlushCache; the details were covered in the MemStore section.
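The four triggers listed above can be sketched as a single predicate. The constants below are invented stand-ins, not the real hbase-site.xml defaults, and should_flush is a hypothetical helper, not the real shouldFlush() signature.

```python
# Hypothetical limits standing in for the real configuration values
MEMSTORE_FLUSH_SIZE = 128 * 1024 * 1024   # bytes written before flushing
FLUSH_INTERVAL_MS   = 3600 * 1000         # max time between flushes
MAX_CHANGES         = 30_000_000          # mutations since last flush

def should_flush(memstore_bytes, ms_since_flush, changes_since_flush,
                 wal_rolled):
    """Sketch of the per-region flush decision: any one of data size,
    elapsed time, change count, or a WAL roll is enough to trigger a
    flush of all the region's MemStores."""
    return (memstore_bytes >= MEMSTORE_FLUSH_SIZE
            or ms_since_flush >= FLUSH_INTERVAL_MS
            or changes_since_flush >= MAX_CHANGES
            or wal_rolled)
```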

Compact

The compact operation is similar: the file-selection logic, selectCompactFile (if the selected set is empty, no compaction is needed), lives in the region. For the concrete selection logic see the HStore.requestCompaction() method: the appropriate HFiles are selected, read, and merged into a single new HFile, and finally the store switches its readers over to it.

The selection process decides whether a minor or a major compaction is needed and then submits a CompactionRequest to the RegionServer, which schedules the actual compaction. Based on the total size of the HFiles to compact, requests are divided into large and small compactions, executed by different threads.

offPeakHours, a very interesting feature: a time window [startHour, endHour] can be configured during which compactions are not carried out.

startHour configuration: hbase.offpeak.start.hour

endHour configuration: hbase.offpeak.end.hour

Selection strategy for minor compaction

The minor-compaction selection strategy is to choose as many small HFiles as possible to compact together.
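The "many small files" preference can be sketched with a ratio-based selection, in the spirit of (but not identical to) HBase's ratio-based compaction policy. The function name and the ratio/min_files/max_files parameters are illustrative stand-ins for the real hbase.hstore.compaction.* settings.

```python
def select_minor_compaction(file_sizes, ratio=1.2, min_files=3, max_files=10):
    """Sketch of ratio-based minor-compaction selection: file_sizes is
    ordered oldest -> newest. A file much larger than the combined size
    of everything newer than it is skipped, so the compaction prefers
    runs of many small files."""
    start = 0
    # skip large old files that dwarf all the files newer than them
    while start < len(file_sizes) and \
            file_sizes[start] > ratio * sum(file_sizes[start + 1:]):
        start += 1
    selected = file_sizes[start:start + max_files]
    # too few candidates: not worth compacting
    return selected if len(selected) >= min_files else []
```

With sizes like [1000, 10, 12, 9], the 1000-unit file is left alone and only the three small files are rewritten, which is exactly the cheap, incremental behaviour a minor compaction is after.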

Split

1. Whether to split is decided by the strategy provided by a RegionSplitPolicy. Since 0.94 the default policy is IncreasingToUpperBoundRegionSplitPolicy. This can of course be customized: globally via hbase.regionserver.region.split.policy, or per table by writing the policy into the table's meta info at CREATE TABLE time, which overrides the global configuration.

2. Key methods of RegionSplitPolicy:

byte[] getSplitPoint(): the default implementation takes the split point of the largest store; each store manages its own split point.

shouldSplit(): decides whether the region needs to split.

3. The default implementation, IncreasingToUpperBoundRegionSplitPolicy

Whether to split is determined by a few parameters:

maxFileSize: defined in the table meta, or by hbase.hregion.max.filesize; the former takes priority.

4. The split decision strategy

At first there is one region, and the first split happens when a store reaches the flush size; after this split there are two regions.

The second split happens when a store reaches min(2*2*flushSize, splitSize); after this split there are three regions.

The third split happens at min(3*3*flushSize, splitSize); after it there are four regions.

And so on.
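The progression above reduces to one formula. The sketch below models the 0.94-era behaviour as the text describes it; the function name is invented, and region_count stands for the number of this table's regions on the server.

```python
def split_size(region_count, flush_size, max_file_size):
    """Sketch of the IncreasingToUpperBoundRegionSplitPolicy threshold
    as described in the text: with N regions of the table on this
    server, split once a store exceeds min(N^2 * flushSize, maxFileSize)."""
    return min(region_count ** 2 * flush_size, max_file_size)
```

So early regions split quickly (spreading load fast), while the threshold grows quadratically until it is capped by maxFileSize.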

WAL and edit log replay

The edit log is the WAL, stored in HDFS. Because writes land in the MemStore, an unexpected crash would lose that data, so whatever is written to memory must also have a copy on disk. The whole RegionServer has a single WAL; the edit logs of all its regions are written together. What the edit log stores is actually HLog.Entry records, whose structure is:

WALEdit: a list of KeyValues

HLogKey: tableName, region encoded name, sequenceId, writeTime, etc.

Each entry is tagged with the region it belongs to and carries a sequenceId. When a region needs to be recovered (after a crash), the edit log is first split by region, each region's edits being written to ${region_dir}/recovered.edits. When the region starts and finds this file, it begins the replay process (see the Region.replayRecoveredEditsIfAny method). Replay compares each log entry's sequenceId with the sequenceIds of the HFiles (each HFile records the maximum and minimum sequenceId it contains) to see whether the edit has already been persisted: if not, the operation is written back into the MemStore; otherwise it is skipped.

For coprocessors, only the preWALRestore and postWALRestore hooks are invoked during replay; no others are executed.
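The sequenceId comparison at the heart of the replay can be sketched as a simple filter. replay_edits and the dict representation of entries are invented for the sketch; the real code also handles per-family flushed sequenceIds.

```python
def replay_edits(entries, max_flushed_seq_id):
    """Sketch of the replayRecoveredEditsIfAny check: an edit whose
    sequenceId is <= the highest sequenceId already persisted in the
    region's HFiles was flushed before the crash and is skipped; the
    rest are re-applied to the MemStore."""
    return [e for e in entries if e["seq_id"] > max_flushed_seq_id]
```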

