An in-depth look at the HBase Architecture


Source: https://www.mapr.com/blog/in-depth-look-hbase-architecture

August 7, by Carol McDonald

In this blog post, I'll give you an in-depth look at the HBase architecture and its main benefits over other NoSQL data store solutions. Be sure to read the first blog post in this series, titled
"HBase and MapR-DB: Designed for Distribution, Scale, and Speed."

HBase Architectural Components

Physically, HBase is composed of three types of servers in a master-slave type of architecture. Region Servers serve data for reads and writes. When accessing data, clients communicate with HBase Region Servers directly. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, a distributed coordination service, maintains a live cluster state.

The Hadoop DataNode stores the data that the Region Server is managing. All HBase data is stored in HDFS files. Region Servers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the Region Servers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.

The NameNode maintains metadata information for all of the physical data blocks that comprise the files.

Regions

HBase tables are divided horizontally by row key range into "regions." A region contains all rows in the table between the region's start key and end key. Regions are assigned to the nodes in the cluster, called "Region Servers," and these serve data for reads and writes. A region server can serve about 1,000 regions.

HBase HMaster

Region assignment and DDL (create, delete tables) operations are handled by the HBase Master.

A Master is responsible for:

    • Coordinating the Region Servers: assigning regions on startup, re-assigning regions for recovery or load balancing, and monitoring all Region Server instances in the cluster (it listens for notifications from ZooKeeper)
    • Admin functions: an interface for creating, deleting, and updating tables (a DDL sketch using the Java client follows this list)
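
As a hedged illustration of the DDL path, the sketch below uses the HBase Java client (2.x-style API) to create and then delete a table through the Master; the table and column family names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class DdlSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();      // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {                   // admin (DDL) operations go through the HMaster
      TableName name = TableName.valueOf("blog_demo");      // hypothetical table name
      TableDescriptor table = TableDescriptorBuilder.newBuilder(name)
          .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
          .build();
      admin.createTable(table);                             // DDL: create
      admin.disableTable(name);                             // a table must be disabled before deletion
      admin.deleteTable(name);                              // DDL: delete
    }
  }
}
```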

ZooKeeper: The Coordinator

HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster. ZooKeeper keeps track of which servers are alive and available, and provides server failure notification. ZooKeeper uses consensus to guarantee common shared state. Note that there should be three or five machines for consensus.
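
Clients locate the cluster through this ZooKeeper ensemble. Below is a minimal sketch of pointing the client at a three-node quorum; the hostnames are placeholders, and in practice these values usually come from hbase-site.xml rather than being set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZkQuorumSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Point the client at the ZooKeeper ensemble (hostnames are hypothetical)
    conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      System.out.println("Connected via the ZooKeeper quorum");
    }
  }
}
```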

How the Components Work Together

ZooKeeper is used to coordinate shared state information for members of distributed systems. Region Servers and the active HMaster connect with a session to ZooKeeper. ZooKeeper maintains ephemeral nodes for active sessions via heartbeats.

Each Region Server creates an ephemeral node. The HMaster monitors these nodes to discover available Region Servers, and it also monitors these nodes for server failures. HMasters vie to create an ephemeral node. ZooKeeper determines the first one and uses it to make sure that only one master is active. The active HMaster sends heartbeats to ZooKeeper, and the inactive HMaster listens for notifications of the active HMaster's failure.

If a Region Server or the active HMaster fails to send a heartbeat, the session is expired and the corresponding ephemeral node is deleted. Listeners for updates will be notified of the deleted nodes. The active HMaster listens for Region Servers, and will recover Region Servers on failure. The inactive HMaster listens for active HMaster failure, and if the active HMaster fails, the inactive HMaster becomes active.

HBase First Read or Write

There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.

This is what happens the first time a client reads or writes to HBase:

    1. The client gets the region server that hosts the META table from ZooKeeper.
    2. The client queries the .META. server to get the Region Server corresponding to the row key it wants to access. The client caches this information along with the META table location.
    3. It gets the row from the corresponding Region Server.

For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it re-queries and updates the cache.
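
From the application's point of view, the META lookup and the location cache are handled inside the client library; a single Get call is enough. A minimal sketch, with hypothetical table, column family, qualifier, and row key names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstReadSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("blog_demo"))) {   // hypothetical table
      Get get = new Get(Bytes.toBytes("row-0001"));                       // hypothetical row key
      get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("title"));
      Result result = table.get(get);
      // The client library resolved the region location via ZooKeeper and the META table,
      // cached it, and then fetched the row from the owning Region Server.
      System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("title"))));
    }
  }
}
```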

HBase Meta Table
    • This META table is an HBase table that keeps a list of all regions in the system; a sketch that looks up region locations through the client API follows this list.
    • The .META. table is like a B-tree.
    • The .META. table structure is as follows:
      • Key: region start key, region id
      • Values: Region Server
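
As a hedged sketch of how an application can inspect this mapping, the HBase Java client exposes a RegionLocator that returns each region's start key and its hosting Region Server; the table name below is hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLocationsSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         RegionLocator locator = conn.getRegionLocator(TableName.valueOf("blog_demo"))) {
      for (HRegionLocation loc : locator.getAllRegionLocations()) {
        // Each region is identified by its start key and served by exactly one Region Server
        System.out.println(Bytes.toStringBinary(loc.getRegion().getStartKey())
            + " -> " + loc.getServerName());
      }
    }
  }
}
```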

Region Server Components

A Region Server runs on an HDFS data node and has the following components:

    • WAL: the Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of a failure.
    • BlockCache: the read cache. It stores frequently read data in memory. Least recently used data is evicted when the cache is full.
    • MemStore: the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region. (The sizing knobs for the BlockCache and MemStore are sketched after this list.)
    • HFiles store the rows as sorted KeyValues on disk.
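
The BlockCache and MemStore sizes are governed by Region Server configuration. The sketch below simply prints the relevant settings; the property names (hfile.block.cache.size, hbase.hregion.memstore.flush.size, hbase.regionserver.global.memstore.size) are an assumption matching recent Apache HBase releases, and the defaults may differ between versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RegionServerTuningSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Fraction of the Region Server heap reserved for the BlockCache (read cache)
    System.out.println("hfile.block.cache.size = " + conf.get("hfile.block.cache.size"));
    // MemStore size (bytes) at which a per-region flush to a new HFile is triggered
    System.out.println("hbase.hregion.memstore.flush.size = " + conf.get("hbase.hregion.memstore.flush.size"));
    // Upper bound on the total heap that all MemStores on a Region Server may use
    System.out.println("hbase.regionserver.global.memstore.size = " + conf.get("hbase.regionserver.global.memstore.size"));
  }
}
```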

HBase Write Steps (1)

When the client issues a Put request, the first step is to write the data to the write-ahead log, the WAL:

    • Edits are appended to the end of the WAL file, which is stored on disk.
    • The WAL is used to recover not-yet-persisted data in case a server crashes.

HBase Write Steps (2)

Once the data is written to the WAL, it is placed in the MemStore. Then, the Put request acknowledgement returns to the client.
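
The write path described above (WAL first, then MemStore, then acknowledgement) is what happens underneath an ordinary Put call. A minimal sketch with hypothetical table, column family, and row key names; the WAL and MemStore handling is entirely on the server side.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("blog_demo"))) {   // hypothetical table
      Put put = new Put(Bytes.toBytes("row-0001"));                       // hypothetical row key
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("title"), Bytes.toBytes("HBase architecture"));
      // By the time put() returns, the edit has been appended to the WAL on the
      // Region Server and placed in the MemStore; it is not yet in an HFile.
      table.put(put);
    }
  }
}
```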

HBase Memstore

The MemStore stores updates in memory as sorted KeyValues, the same as they would be stored in an HFile. There is one MemStore per column family. The updates are sorted per column family.

HBase Region Flush

When the MemStore accumulates enough data, the entire sorted set is written to a new HFile in HDFS. HBase uses multiple HFiles per column family, which contain the actual cells, or KeyValue instances. These files are created over time as the KeyValue edits sorted in the MemStores are flushed as files to disk.

Note that this is one reason why there is a limit to the number of column families in HBase. There is one MemStore per CF; when one is full, they all flush. The flush also saves the last written sequence number, so the system knows what has been persisted so far.

The highest sequence number is stored as a meta field in each HFile, to reflect where persisting has ended and where to continue. On region startup, the sequence number is read, and the highest is used as the sequence number for new edits.
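
Flushes normally happen automatically when a MemStore reaches its configured size, but the admin API can also request one explicitly, which is handy when you want to see the resulting HFiles. A hedged sketch with a hypothetical table name:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FlushSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Ask the Region Servers hosting this table's regions to flush their MemStores to new HFiles
      admin.flush(TableName.valueOf("blog_demo"));   // hypothetical table
    }
  }
}
```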

HBase HFile

Data is stored in an HFile, which contains sorted key/values. When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the disk drive head.

HBase HFile Structure

An HFile contains a multi-layered index which allows HBase to seek to the data without having to read the whole file. The multi-level index is like a B+ tree:

    • Key value pairs are stored in increasing order
    • Indexes point by row key to the key value data in 64KB "blocks"
    • Each block has its own leaf-index
    • The last key of each block is put in the intermediate index
    • The root index points to the intermediate index

The trailer points to the meta blocks, and is written at the end of persisting the data to the file. The trailer also has information like bloom filters and time range info. Bloom filters help to skip files that do not contain a certain row key. The time range info is useful for skipping the file if it is not in the time range the read is looking for.
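
Bloom filters are configured per column family. As a hedged sketch, the column family below is created with a row-level bloom filter so reads can skip HFiles that cannot contain the requested row key; the table name is hypothetical, and ROW is already the default bloom filter type in current HBase versions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
          .newBuilder(Bytes.toBytes("cf"))
          .setBloomFilterType(BloomType.ROW)   // ROW checks row keys; ROWCOL also checks column qualifiers
          .build();
      TableDescriptor table = TableDescriptorBuilder
          .newBuilder(TableName.valueOf("blog_demo_bloom"))   // hypothetical table
          .setColumnFamily(cf)
          .build();
      admin.createTable(table);
    }
  }
}
```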

HFile Index

The index, which we just discussed, is loaded when the HFile is opened and kept in memory. This allows lookups to be performed with a single disk seek.

HBase Read Merge

We have seen that the KeyValue cells corresponding to one row can be in multiple places: row cells already persisted are in HFiles, recently updated cells are in the MemStore, and recently read cells are in the BlockCache. So when you read a row, how does the system get the corresponding cells to return? A read merges KeyValues from the BlockCache, MemStore, and HFiles in the following steps:

    1. First, the scanner looks for the row cells in the BlockCache, the read cache. Recently read KeyValues are cached here, and the least recently used are evicted when memory is needed.
    2. Next, the scanner looks in the MemStore, the write cache in memory containing the most recent writes.
    3. If the scanner does not find all of the row cells in the MemStore and BlockCache, then HBase will use the BlockCache indexes and bloom filters to load into memory the HFiles which may contain the target row cells.

HBase Read Amplification

As discussed earlier, there may be many HFiles per MemStore, which means multiple files may have to be examined for a read, which can affect performance. This is called read amplification.

HBase Minor Compaction

HBase will automatically pick some smaller HFiles and rewrite them into fewer, bigger HFiles. This process is called minor compaction. Minor compaction reduces the number of storage files by rewriting smaller files into fewer but larger ones, performing a merge sort.
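
Compactions are normally triggered by the Region Server itself, but one can also be requested through the admin API. A hedged sketch with a hypothetical table name; HBase decides which store files to merge.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MinorCompactionSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Queue a compaction for the table; the Region Servers pick smaller HFiles
      // and merge-sort them into fewer, larger ones.
      admin.compact(TableName.valueOf("blog_demo"));   // hypothetical table
    }
  }
}
```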

HBase Major Compaction

Major compaction merges and rewrites all the HFiles of a region into one HFile per column family, and in the process drops deleted or expired cells. This improves read performance; however, since major compaction rewrites all of the files, lots of disk I/O and network traffic might occur during the process. This is called write amplification.

Major compactions can be scheduled to run automatically. Due to write amplification, major compactions are usually scheduled for weekends or evenings. Note that MapR-DB has made improvements and does not need to do compactions. A major compaction also makes any data files that were remote, due to server failure or load balancing, local to the Region Server.
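
A major compaction can likewise be requested explicitly, and the automatic schedule is controlled by configuration. A hedged sketch: the property name hbase.hregion.majorcompaction (interval in milliseconds, 0 to disable periodic major compactions) is an assumption matching recent Apache HBase releases, and the table name is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MajorCompactionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Setting this interval to 0 in hbase-site.xml disables periodic major compactions,
    // so they can instead be triggered off-peak (for example, from a weekend cron job).
    System.out.println("hbase.hregion.majorcompaction = " + conf.get("hbase.hregion.majorcompaction"));
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Rewrite all HFiles of each region of the table into one HFile per column family,
      // dropping deleted and expired cells along the way.
      admin.majorCompact(TableName.valueOf("blog_demo"));   // hypothetical table
    }
  }
}
```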

Region = Contiguous Keys

Let's do a quick review of regions:

    • A table can be divided horizontally into one or more regions. A region contains a contiguous, sorted range of rows between a start key and an end key
    • Each region is 1GB in size (default)
    • A region of a table is served to the client by a Region Server
    • A region server can serve about 1,000 regions (which may belong to the same table or different tables); a sketch that pre-splits a table into regions follows this list
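
Since a region is just a contiguous key range, a table can be created pre-split into several regions by supplying split keys. A hedged sketch with hypothetical table, column family, and split keys:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      TableDescriptor table = TableDescriptorBuilder
          .newBuilder(TableName.valueOf("blog_demo_presplit"))          // hypothetical table
          .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
          .build();
      // Four split keys produce five regions: (-inf,"e"), ["e","j"), ["j","o"), ["o","t"), ["t",+inf)
      byte[][] splitKeys = {
          Bytes.toBytes("e"), Bytes.toBytes("j"), Bytes.toBytes("o"), Bytes.toBytes("t")
      };
      admin.createTable(table, splitKeys);
    }
  }
}
```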

Region Split

Initially there is one region per table. When a region grows too large, it splits into two child regions. Both child regions, each representing one half of the original region, are opened in parallel on the same Region Server, and then the split is reported to the HMaster. For load balancing reasons, the HMaster may schedule the new regions to be moved off to other servers. A split can also be requested manually, as sketched below.
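
Splits are usually automatic, but the admin API can request one, optionally at an explicit split point. A hedged sketch with hypothetical table and row key names:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // Split the region containing "row-5000" at that key; both child regions open on the
      // same Region Server, and the split is then reported to the HMaster.
      admin.split(TableName.valueOf("blog_demo"), Bytes.toBytes("row-5000"));
    }
  }
}
```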

Read Load Balancing

Splitting happens initially on the same Region Server, but for load balancing reasons, the HMaster may schedule the new regions to be moved off to other servers. This results in the new Region Server serving data from a remote HDFS node until a major compaction moves the data files to the Region Server's local node. HBase data is local when it is written, but when a region is moved (for load balancing or recovery), it is not local until major compaction.

HDFS Data Replication

All writes and reads are to/from the primary node. HDFS replicates the WAL and HFile blocks. HFile block replication happens automatically. HBase relies on HDFS to provide data safety as it stores its files. When data is written to HDFS, one copy is written locally, then it is replicated to a secondary node, and a third copy is written to a tertiary node.

HDFS Data Replication (2)

The WAL file and the HFiles are persisted on disk and replicated, so how does HBase recover the MemStore updates not yet persisted to HFiles? See the next sections for the answer.

HBase Crash Recovery

When a Region Server fails, crashed regions are unavailable until detection and recovery steps have happened. ZooKeeper determines node failure when it loses Region Server heartbeats. The HMaster is then notified that the Region Server has failed.

When the HMaster detects that a Region Server has crashed, it reassigns the regions from the crashed server to active Region Servers. In order to recover the crashed Region Server's MemStore edits that were not flushed to disk, the HMaster splits the WAL belonging to the crashed Region Server into separate files and stores these files on the new Region Servers' data nodes. Each of the Region Servers then replays the edits from its respective split WAL to rebuild the MemStore for that region.

Data Recovery

WAL files contain a list of edits, with one edit representing a single put or delete. Edits are written chronologically, so, for persistence, additions are appended to the end of the WAL file that is stored on disk.

What happens if there is a failure while the data is still in memory and not persisted to an HFile? The WAL is replayed. Replaying a WAL is done by reading the WAL, then adding and sorting the contained edits into the current MemStore. At the end, the MemStore is flushed to write the changes to an HFile.

Apache HBase Architecture Benefits

HBase provides the following benefits:

    • Strong consistency model: when a write returns, all readers will see the same value
    • Scales automatically: regions split when data grows too large; uses HDFS to spread and replicate data
    • Built-in recovery: using the Write Ahead Log (similar to journaling on a file system)
    • Integrated with Hadoop: MapReduce over HBase is straightforward
Apache HBase has problems too...
    • Business continuity reliability: WAL replay is slow, crash recovery is slow and complex, and major compactions cause I/O storms
MapR-DB with MapR-FS does not have these problems

The diagram below compares the application stacks for Apache HBase on top of HDFS on the left, Apache HBase on top of MapR's read/write file system MapR-FS in the middle, and MapR-DB with MapR-FS in a unified storage layer on the right.

MapR-DB exposes the same HBase API, and the data model for MapR-DB is the same as for Apache HBase. However, the MapR-DB implementation integrates table storage into the MapR file system, eliminating all JVM layers and interacting directly with disks for both file and table storage.

MapR-DB offers many benefits over HBase, while maintaining the virtues of the HBase API and the idea of data being sorted according to primary key. MapR-DB provides operational benefits such as no compaction delays and automated region splits that do not impact the performance of the database. The tables in MapR-DB can also be isolated to certain machines in a cluster by utilizing the topology feature of MapR. The final differentiator is that MapR-DB is just plain fast, due primarily to the fact that it is tightly integrated into the MapR file system itself, rather than being layered on top of a distributed file system that is layered on top of a conventional file system.

Key differences between MapR-DB and Apache HBase

    • Tables are part of the MapR read/write file system
      • Guaranteed data locality
    • Smarter load balancing
      • Uses container replicas
    • Smarter failover
      • Uses container replicas
    • Multiple small WALs
      • Faster recovery
    • MemStore flushes merged into the read/write file system
      • No compaction!
Take the on-demand training to learn more about MapR-FS and MapR-DB.

In this blog post, you learned more about the HBase architecture and its main benefits over other NoSQL data store solutions. If you have any questions about HBase, please ask them in the comments section below.
