HBase is an Apache Hadoop database that provides random, real-time read/write access to big data. HBase aims to store and process large-scale data. It is an open-source, distributed, multi-version, column-oriented store that holds sparse data.
HBase features:
1. High reliability
2. High performance
3. Column orientation
4. Scalability
5. Large-scale structured storage clusters can be built on inexpensive commodity PC servers.
HBase is an open-source implementation of Google BigTable, with the following correspondence:

                                    Google       HBase
File storage system                 GFS          HDFS
Massive data processing            MapReduce    Hadoop MapReduce
Collaborative service management   Chubby       Zookeeper
HBase relationship diagram:
HBase sits on the structured storage layer; the following Hadoop components provide supporting functions for HBase:
Hadoop component   Function for HBase
HDFS               High-reliability underlying storage support
MapReduce          High-performance computing capability
Zookeeper          Stable service and failover mechanism
Pig & Hive         High-level language support for data statistics
Sqoop              RDBMS data import, making it easy to migrate data from traditional databases into HBase
HBase access interfaces:
1. Native Java API: the most common and efficient interface; suitable for Hadoop MapReduce jobs that process HBase table data in parallel (a minimal sketch follows this list).
2. HBase Shell: HBase's command-line tool and the simplest interface; suitable for HBase management and everyday use.
3. Thrift Gateway: uses Thrift serialization and supports multiple languages; suitable for online access to HBase table data from heterogeneous systems.
4. REST Gateway: exposes a REST-style HTTP API, removing language restrictions on access.
5. Pig: uses the Pig Latin programming language to process data; suitable for data statistics.
6. Hive: simple, SQL-like access; suitable for data statistics.
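For the Native Java API above, a minimal read sketch might look like the following. It assumes the HBase 2.x client API, cluster settings available on the classpath (hbase-site.xml), and a hypothetical table "user" with column family "info" and row key "row1".

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NativeApiRead {
    public static void main(String[] args) throws Exception {
        // Connection settings come from hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {   // hypothetical table
            Get get = new Get(Bytes.toBytes("row1"));                    // hypothetical row key
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```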
HBase Data Model
Component Description:
Row Key: the table's primary key; records in a table are sorted by Row Key.
Timestamp: the timestamp of each data operation, which serves as the data's version number.
Column Family: a column family. A table has one or more column families in the horizontal direction; a column family can contain any number of columns, and columns support dynamic expansion, so there is no need to pre-define their number or type. All columns are stored in binary form, and users handle type conversion themselves.
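To make the model concrete, here is a hedged sketch of writing and reading one cell with the HBase 2.x client API; the table name "user", family "info", qualifier "name", and row key "row1" are assumptions for illustration. The Put carries row key, column family, column qualifier, timestamp (version number), and an untyped binary value, matching the description above.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {    // hypothetical table
            Put put = new Put(Bytes.toBytes("row1"));                     // Row Key: defines sort order
            put.addColumn(Bytes.toBytes("info"),                          // Column Family (fixed in schema)
                          Bytes.toBytes("name"),                          // column qualifier (added dynamically)
                          System.currentTimeMillis(),                     // Timestamp = version number
                          Bytes.toBytes("alice"));                        // value: plain bytes, untyped
            table.put(put);

            Get get = new Get(Bytes.toBytes("row1"));
            get.readVersions(3);                                          // request up to 3 versions per cell
            Result result = table.get(get);
            System.out.println(result);
        }
    }
}
```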
Table & Region
1. As the number of records grows, the table gradually splits into multiple splits, which become Regions.
2. A Region is identified by the interval [startkey, endkey) (see the sketch after this list).
3. Different Regions are assigned by the Master to the corresponding RegionServers for management.
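As a sketch of how the [startkey, endkey) boundaries work, a table can be created already pre-split into Regions by supplying split keys. This assumes the HBase 2.x Admin API; the table "user", family "info", and the split keys "g" and "p" are arbitrary examples.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Split keys define the initial [startkey, endkey) Region boundaries:
            // (-inf, "g"), ["g", "p"), ["p", +inf)
            byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("p") };
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("user"))       // hypothetical table
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                    .build(),
                splitKeys);
        }
    }
}
```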
Two special tables: -ROOT- and .META.
.META.: records the Region information of user tables; .META. itself can also have multiple Regions.
-ROOT-: records the Region information of the .META. table; -ROOT- has only one Region.
The location of the -ROOT- table is recorded in Zookeeper.
The process of accessing data from the client:
Client -> Zookeeper -> -ROOT- -> .META. -> user data table
This takes multiple network round trips, but the client caches the lookup results.
HBase System Architecture
Component Description
Client:
Uses the HBase RPC mechanism to communicate with the HMaster and HRegionServers.
Communicates with the HMaster to perform management operations.
Communicates with HRegionServers to perform data read/write operations.
Zookeeper:
The Zookeeper Quorum stores the address of the -ROOT- table and the address of the HMaster.
Each HRegionServer registers itself in Zookeeper as an ephemeral node, so the HMaster can detect the health status of each HRegionServer at any time.
Zookeeper avoids a single point of failure for the HMaster.
HMaster:
HMaster has no single point of failure: multiple HMasters can be started in HBase, and the Zookeeper Master Election mechanism ensures that exactly one Master is always running.
Mainly responsible for Table and Region management:
1. Manage the addition, deletion, modification, and query operations on tables.
2. Manages load balancing across HRegionServers and adjusts the Region distribution.
3. After a Region is split, it is responsible for assigning the new Regions.
4. After an HRegionServer goes down, it is responsible for migrating the Regions that were on the failed HRegionServer (a management-API sketch follows this list).
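As an illustration (not the HMaster's internal code), the table-management operations described above are the ones a client reaches through the Admin interface, which the HMaster serves. A minimal sketch, assuming the HBase 2.x client API and the hypothetical "user" table from the earlier examples:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MasterAdminOps {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName tn = TableName.valueOf("user");   // hypothetical table name
            admin.balance();                            // ask the master to rebalance Regions across RegionServers
            admin.disableTable(tn);                     // schema changes and deletion require a disabled table
            admin.deleteTable(tn);                      // table deletion is coordinated by the HMaster
        }
    }
}
```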
HRegionServer:
The core module of HBase, responsible for responding to user I/O requests and reading and writing data to the HDFS file system.
An HRegionServer manages a series of HRegion objects;
each HRegion corresponds to one Region of a table; an HRegion consists of multiple HStores;
each HStore corresponds to the storage of one Column Family in the table;
a Column Family is a centralized storage unit, so it is more efficient to put columns with the same I/O characteristics into the same Column Family (a schema sketch follows).
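A hedged schema sketch of grouping columns by I/O characteristics: each column family below becomes its own HStore, so columns with similar access patterns are stored together. The table name "events", the family names "meta" and "raw", and the specific settings are assumptions for illustration; the builder calls are from the HBase 2.x API.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class FamilyPerIoProfile {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))          // hypothetical table
                    // "meta": small, frequently read columns -> keep several versions, no compression
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("meta"))
                        .setMaxVersions(3)
                        .build())
                    // "raw": large, rarely read payloads -> single version, compressed on disk
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("raw"))
                        .setMaxVersions(1)
                        .setCompressionType(Compression.Algorithm.GZ)
                        .build())
                    .build());
        }
    }
}
```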
HStore:
The core of HBase storage. It consists of a MemStore and StoreFiles.
MemStore is a sorted memory buffer. The data writing process is:
1. The client writes data, which is saved to the MemStore, until the MemStore is full.
2. A full MemStore is flushed into a StoreFile.
3. When the number of StoreFiles reaches a certain threshold, a Compact merge operation starts: multiple StoreFiles are merged into one, and version merging and data deletion take place at this stage.
4. As StoreFiles are compacted, progressively larger StoreFiles are formed.
5. When the size of a single StoreFile exceeds a certain threshold, a Split operation is triggered: the current Region is split into two Regions, the parent Region goes offline, and the two new Regions are assigned by the HMaster to the corresponding HRegionServers, so that the load of the original Region is spread across two Regions.
From this process, we can see that HBase only ever appends data; updates and deletes are applied during the Compact phase. As a result, a user write only needs to reach memory before it can return, which guarantees high I/O performance.
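The flush and compaction stages described above normally run automatically based on size thresholds, but they can also be requested explicitly through the Admin API, which is a simple way to observe them. A minimal sketch, assuming the HBase 2.x client and the hypothetical "user" table:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FlushAndCompact {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName tn = TableName.valueOf("user");   // hypothetical table name
            admin.flush(tn);          // force MemStore -> StoreFile flush (normally triggered by size)
            admin.majorCompact(tn);   // merge StoreFiles; old versions and deleted cells are dropped here
        }
    }
}
```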