I. Overview of HBase
Overview
HBase is a distributed, column-oriented storage system built on HDFS;
HBase is a typical key/value system modeled on Google's BigTable;
HBase is an important part of the Apache Hadoop ecosystem and is used primarily for storing massive amounts of structured data;
Logically, HBase organizes data into tables, rows, and columns.
Like Hadoop, HBase relies primarily on scale-out: compute and storage capacity are increased by adding inexpensive commodity servers.
Features of HBase tables
Large: a table can have billions of rows and millions of columns;
Schema-less: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have different columns;
Column-oriented: storage and permission control are organized by column (family), and each column (family) is retrieved independently;
Sparse: empty (NULL) columns occupy no storage space, so tables can be designed to be very sparse;
Multiple versions: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was inserted;
Single data type: all data in HBase is stored as untyped byte strings; there are no data types.
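The features above can be pictured as a sorted, sparse, multi-version map. The following is a minimal in-memory sketch of that logical model only; the class name SparseTable and its methods are hypothetical and are not part of the HBase API or its storage format.

```python
import time
from collections import defaultdict


class SparseTable:
    """Toy model of an HBase table's logical view (hypothetical, not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        # Only cells that were actually written exist, so the table is sparse.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp=None):
        # By default the version is the insertion timestamp in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._rows[row][column][timestamp] = value

    def get(self, row, column, versions=1):
        # Return up to `versions` (timestamp, value) pairs, newest first.
        cells = self._rows.get(row, {}).get(column, {})
        return sorted(cells.items(), reverse=True)[:versions]

    def scan(self):
        # Rows are returned in lexicographic row-key order.
        for row in sorted(self._rows):
            yield row, sorted(self._rows[row])


if __name__ == "__main__":
    table = SparseTable()
    table.put(b"row2", "info:name", b"bob", timestamp=1)
    table.put(b"row1", "info:name", b"alice", timestamp=1)
    table.put(b"row1", "info:name", b"alicia", timestamp=2)
    table.put(b"row1", "contact:email", b"a@example.com", timestamp=1)
    print(table.get(b"row1", "info:name", versions=2))
    print(list(table.scan()))
```

Values are plain bytes, rows carry different sets of columns, and a column that was never written costs nothing, which mirrors the sparse and untyped properties listed above.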
HBase Physical Model
1. All rows in a table are sorted in lexicographic order of the row key;
2. A table is split by row into multiple regions;
3. Regions are split by size. Each table starts with a single region; as data is added the region grows, and when it reaches a threshold it splits into two new regions, so the number of regions keeps increasing;
4. The region is the smallest unit of distributed storage and load balancing in HBase; different regions are distributed to different RegionServers;
5. Although the region is the smallest unit of distribution, it is not the smallest unit of storage. A region consists of one or more stores, each holding one column family. Each store consists of one MemStore and zero or more StoreFiles; a StoreFile wraps an HFile. The MemStore lives in memory, and StoreFiles are stored on HDFS.
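The way a table starts with one region and splits as it grows can be sketched as follows. This is a simplified illustration under stated assumptions: the Region and Table classes are hypothetical, the threshold counts rows rather than bytes, and the split point is simply the middle row key, which is not necessarily how HBase chooses it.

```python
from bisect import insort


class Region:
    """A contiguous, sorted range of row keys (hypothetical toy class)."""

    def __init__(self, start_key="", end_key=""):
        self.start_key = start_key  # inclusive; "" means unbounded
        self.end_key = end_key      # exclusive; "" means unbounded
        self.row_keys = []          # kept in sorted order

    def split(self):
        # Split at the middle row key into two adjacent regions.
        mid = self.row_keys[len(self.row_keys) // 2]
        left = Region(self.start_key, mid)
        right = Region(mid, self.end_key)
        left.row_keys = [k for k in self.row_keys if k < mid]
        right.row_keys = [k for k in self.row_keys if k >= mid]
        return left, right


class Table:
    def __init__(self, max_rows_per_region=4):
        # Assumed threshold, counted in rows for simplicity.
        self.max_rows = max_rows_per_region
        self.regions = [Region()]  # every table starts with one region

    def put(self, row_key):
        # Regions are contiguous and sorted, so the first region whose
        # end key is greater than the row key (or unbounded) contains it.
        idx = next(i for i, r in enumerate(self.regions)
                   if r.end_key == "" or row_key < r.end_key)
        region = self.regions[idx]
        if row_key not in region.row_keys:
            insort(region.row_keys, row_key)
        if len(region.row_keys) > self.max_rows:
            # Replace the oversized region with its two halves.
            self.regions[idx:idx + 1] = region.split()


if __name__ == "__main__":
    table = Table(max_rows_per_region=4)
    for key in "abcdefg":
        table.put(key)
    for r in table.regions:
        print(repr(r.start_key), repr(r.end_key), r.row_keys)
```

In this sketch the region boundaries stay contiguous after every split, so each row key always belongs to exactly one region.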
II. HBase Architecture and Basic Components
Client
Provides the interfaces for accessing HBase and maintains a cache, such as region location information, to speed up access to HBase.
The HMaster node is used to:
1. Manage HRegionServers and balance load across them.
2. Manage and assign HRegions, for example assigning the new HRegions when an HRegion splits, and migrating HRegions to other HRegionServers when an HRegionServer exits.
3. Implement DDL operations (Data Definition Language: creating, deleting, and modifying namespaces and tables, adding and removing column families, etc.).
4. Manage the metadata of namespaces and tables (actually stored on HDFS).
5. Permission control (ACL).
The HRegionServer node is used to:
1. Store and manage local HRegions.
2. Read and write HDFS to manage the data in tables.
3. Serve client reads and writes directly (the client first obtains the metadata from HMaster and locates the HRegion/HRegionServer that holds the rowkey).
The ZooKeeper cluster is a coordination system used to:
1. Store the metadata and status information of the entire HBase cluster.
2. Implement failover between the active and standby HMaster nodes.
The HBase client communicates with HMaster and HRegionServer via RPC. One HRegionServer can hold about 1000 HRegions, and the underlying table data is stored in HDFS. The data served by an HRegion is kept, as far as possible, on the DataNode of the same machine, so that reads are local. Data locality cannot always be achieved: for example, after an HRegion moves (e.g. because of a split), locality is only restored by the next compaction.
The architecture diagram shows that both HMaster and NameNode support multiple hot standbys, coordinated through ZooKeeper. ZooKeeper is nothing mysterious: a cluster usually consists of three machines and internally uses a Paxos-like consensus algorithm, which tolerates the failure of one of the three servers. Five machines are also used, which tolerates two simultaneous failures; in general, fewer than half of the servers may be down. However, performance decreases as more machines are added. RegionServer and DataNode are usually placed on the same server to achieve data locality.
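The "fewer than half may be down" rule can be checked with a little arithmetic. The sketch below assumes plain majority voting; the function names are illustrative and are not ZooKeeper APIs.

```python
def quorum_size(ensemble_size):
    # A strict majority of the servers must be up and agree.
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    # Servers that may fail while a majority still remains.
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7):
        print(n, "servers: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failure(s)")
```

Under this assumption, four servers tolerate no more failures than three, and six no more than five, which is one reason ensembles are usually sized with an odd number of machines.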
HRegion
HBase uses the rowkey to split a table horizontally into multiple HRegions. From HMaster's point of view, each HRegion records its StartKey and EndKey (the StartKey of the first HRegion is empty, and the EndKey of the last HRegion is empty). Because rowkeys are sorted, the client can quickly determine through HMaster which HRegion a given rowkey belongs to. HMaster assigns each HRegion to an HRegionServer, which is then responsible for starting and managing the HRegion, communicating with clients, and reading the data (through HDFS).
Each HRegionServer can manage about 1000 HRegions at the same time. Where does this number come from? No such limit is visible in the code, so it appears to come from experience: more than 1000 may cause performance problems. A likely answer is that the figure comes from the BigTable paper (section 5, Implementation): "Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server)."
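Because the HRegions cover contiguous, sorted key ranges, finding the HRegion for a rowkey is a binary search over the start keys. The following is a minimal sketch of that idea; the region boundaries and server names are made up for illustration, and this is not the actual HBase client lookup code.

```python
from bisect import bisect_right

# (StartKey, server) pairs sorted by StartKey; both are hypothetical values.
# The first StartKey is empty, and each region's EndKey is the next StartKey.
REGIONS = [("", "rs1"), ("g", "rs2"), ("n", "rs3"), ("t", "rs1")]
START_KEYS = [start for start, _ in REGIONS]


def locate(row_key):
    """Return (StartKey, EndKey, server) of the region containing row_key."""
    # Index of the last region whose StartKey is <= row_key.
    idx = bisect_right(START_KEYS, row_key) - 1
    start = START_KEYS[idx]
    # The last region has an empty (unbounded) EndKey.
    end = START_KEYS[idx + 1] if idx + 1 < len(START_KEYS) else ""
    return start, end, REGIONS[idx][1]


if __name__ == "__main__":
    for key in ("apple", "g", "melon", "zebra"):
        print(key, "->", locate(key))
```

The lookup costs O(log n) in the number of regions, and a client that caches the result can skip it entirely on later requests for the same key range.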
HMaster
HMaster has no single point of failure: multiple HMasters can be started, and ZooKeeper's master election mechanism guarantees that only one HMaster is active while the others are hot standbys. Typically two HMasters are started. A non-active HMaster periodically communicates with the active HMaster to obtain its latest state and stay up to date, so starting too many HMasters increases the load on the active HMaster. As described earlier, HMaster is mainly used for HRegion assignment and management and for implementing DDL (Data Definition Language, i.e. creating, deleting, and modifying tables). It has two main responsibilities:
1. Coordinate HRegionServers
(1). Assign HRegions at startup, and reassign HRegions during load balancing and recovery.
(2). Monitor the status of all HRegionServers in the cluster (through heartbeats and by watching their state in ZooKeeper).
2. Admin functions
(1). Create, delete, and modify table definitions.