Original blog address: http://blog.csdn.net/woshiwanxin102213/article/details/17584043

Overview
HBase is a distributed storage system built on HDFS;
HBase is a typical key/value system, developed on the model of Google's BigTable.
HBase is an important member of the Apache Hadoop ecosystem and is mainly used to store large amounts of structured data.
Logically, HBase organizes data into tables, rows, and columns.
Like Hadoop, HBase relies on scaling out: computing and storage capacity are increased by adding inexpensive commodity servers.
Characteristics of HBase Tables
Big: a table can have billions of rows and millions of columns;
Schema-free: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have different sets of columns;
Column-oriented: storage and permission control are per column (family), and column (family) data can be retrieved independently;
Sparse: empty (NULL) columns occupy no storage space, so tables can be designed to be very sparse;
Multi-versioned: the data in each cell can have multiple versions; by default the version number is assigned automatically and is the timestamp at which the cell was inserted;
Untyped: all data in HBase is stored as uninterpreted strings of bytes; there are no data types.
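The sparse, schema-free behavior described above can be pictured with a toy model. This is a rough sketch in plain Python, not HBase code or its API; the table, rows, and column names are made up for illustration:

```python
# Toy sketch of a sparse table as a dict of dicts: only cells that are
# actually written consume space, so different rows may have completely
# different column sets, and values are plain byte strings.
table = {}

def put(row, column, value):
    table.setdefault(row, {})[column] = value

put("row1", "info:name", b"alice")
put("row1", "info:age", b"30")
put("row2", "stats:clicks", b"7")   # row2 shares no columns with row1

# A NULL / absent column simply isn't stored:
assert "info:name" not in table.get("row2", {})
assert table["row1"]["info:name"] == b"alice"
```

Note that nothing is reserved for the columns a row does not use, which is what lets an HBase table be "designed very sparse".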
HBase Data Model
HBase Logical View

HBase Basic Concepts
Rowkey: a byte array that serves as the "primary key" of each record in the table and enables fast lookup; rowkey design is very important.
Column Family: a named (string) group containing one or more related columns
Column: belongs to a column family, addressed as familyname:columnname; columns can be added dynamically per record
Version Number: of type long; defaults to the system timestamp and can be customized by the user
Value (Cell): a byte array
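Putting these concepts together, a cell is addressed by (rowkey, family:qualifier, timestamp), and a read without an explicit version returns the newest one. A minimal sketch, assuming illustrative names rather than the real HBase API:

```python
import time
from collections import defaultdict

# (rowkey, "family:qualifier") -> {timestamp: value}
cells = defaultdict(dict)

def put(rowkey, column, value, ts=None):
    # Default version number: the system timestamp at insertion time.
    ts = ts if ts is not None else time.time_ns()
    cells[(rowkey, column)][ts] = value

def get(rowkey, column, ts=None):
    versions = cells[(rowkey, column)]
    if ts is None:             # no explicit version: newest one wins
        ts = max(versions)
    return versions[ts]

put("row1", "cf:col", b"v1", ts=100)
put("row1", "cf:col", b"v2", ts=200)
assert get("row1", "cf:col") == b"v2"          # latest version
assert get("row1", "cf:col", ts=100) == b"v1"  # explicit older version
```

The point of the sketch is that multiple versions of the same cell coexist under different timestamps, exactly as the "Version Number" concept above describes.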
HBase Physical Model
Each column family is stored in a separate file on HDFS, and null values are not saved.
The key and version number are stored with each value in every column family;
HBase maintains a multi-level index for each value, namely: <key, column family, column name, timestamp>
Physical storage:
1. All rows in a table are arranged in lexicographic order of row key;
2. A table is split in the row direction into multiple regions;
3. Regions split by size: each table starts with a single region, and as data grows the region grows until it reaches a threshold, at which point it splits into two new regions; over time there are more and more regions;
4. The region is the smallest unit of distributed storage and load balancing in HBase; different regions are distributed to different RegionServers.
5. Although the region is the smallest unit of distribution, it is not the smallest unit of storage. A region consists of one or more Stores, each holding one column family; each Store consists of one MemStore and zero or more StoreFiles; a StoreFile is a wrapper around an HFile; the MemStore resides in memory, while StoreFiles are persisted on HDFS.
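Steps 1-3 above can be simulated with a toy model. The threshold and keys below are invented for illustration; real HBase splits by byte size, not row count:

```python
import bisect

THRESHOLD = 4    # toy split threshold: max rows per region
regions = [[]]   # the table starts with a single, empty region

def insert(key):
    # Rows stay in lexicographic order of row key; route the key to the
    # first region whose upper bound covers it (or the last region).
    for i, region in enumerate(regions):
        if i == len(regions) - 1 or key <= region[-1]:
            bisect.insort(region, key)
            if len(region) > THRESHOLD:       # split into two new regions
                mid = len(region) // 2
                regions[i:i+1] = [region[:mid], region[mid:]]
            return

for k in ["r%02d" % n for n in range(10)]:
    insert(k)

assert len(regions) > 1   # the growing table now spans several regions
# Regions partition the sorted key space without overlap:
assert all(regions[i][-1] <= regions[i + 1][0] for i in range(len(regions) - 1))
```

Each resulting region could then be assigned to a different RegionServer, which is what makes the region the unit of load balancing.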
HBase Architecture and Basic Components

HBase basic components:
Client
Includes interfaces for accessing HBase, and maintains a cache (e.g. of region location information) to speed up access to HBase
Master
Assigns regions to RegionServers
Performs load balancing across RegionServers
Detects failed RegionServers and reassigns the regions on them
Handles administrative operations: creating, deleting, modifying, and querying tables
Region Server
Maintains regions and handles IO requests for those regions
Splits regions that grow too large during operation
Zookeeper Functions
Elects a master, guaranteeing that there is exactly one master in the cluster at any time; the master and RegionServers register with Zookeeper when they start
Stores the addressing entry point for all regions
Monitors RegionServers going online and offline in real time, and notifies the Master
Stores the HBase schema and table metadata
By default, HBase manages the Zookeeper instances itself, e.g. starting and stopping Zookeeper
The introduction of Zookeeper means the Master is no longer a single point of failure
Write-ahead-log (WAL)
This mechanism is used for fault tolerance and recovery of data:
Each HRegionServer has one HLog object; HLog is the class that implements the write-ahead log. Every user write first records the data in the HLog before it is applied to the MemStore (the HLog file format is covered in a follow-up). HLog files roll periodically: a new file is created, and old files are deleted once their data has been persisted to StoreFiles. When an HRegionServer terminates unexpectedly, the HMaster learns of it through Zookeeper. The HMaster first processes the orphaned HLog files, splitting the log data by region and placing each part in the corresponding region's directory, and then reassigns the regions. When an HRegionServer picks up one of these regions during region loading, it finds historical HLog data that must be processed; it replays the HLog data into the MemStore and then flushes to StoreFiles, completing data recovery.
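The write-ahead idea can be sketched in a few lines of plain Python (this is a conceptual model, not HBase code): every edit is appended to the log before it reaches the in-memory store, so a crash that wipes the MemStore can be repaired by replaying the log.

```python
wal = []        # write-ahead log; survives the "crash" (on disk in real HBase)
memstore = {}   # in-memory store; lost on crash

def write(row, column, value):
    wal.append((row, column, value))   # 1. log first
    memstore[(row, column)] = value    # 2. then apply in memory

write("row1", "cf:a", b"x")
write("row1", "cf:b", b"y")

memstore = {}                          # simulate a RegionServer crash

for row, column, value in wal:         # recovery: replay the log in order
    memstore[(row, column)] = value

assert memstore[("row1", "cf:a")] == b"x"
assert memstore[("row1", "cf:b")] == b"y"
```

The ordering is the whole point: because the log entry is durable before the in-memory write, no acknowledged edit can be lost.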
Fault Tolerance in HBase
Master fault tolerance: Zookeeper elects a new Master
Without a Master, data reads still proceed as usual;
Without a Master, region splitting, load balancing, etc. cannot be carried out;
RegionServer fault tolerance: RegionServers report heartbeats to Zookeeper periodically; if a heartbeat does not arrive in time, the Master reassigns the regions on that RegionServer to other RegionServers, and the write-ahead log on the failed server is split by the Master and sent to the new RegionServers
Zookeeper fault tolerance: Zookeeper is a reliable service, typically configured with 3 or 5 Zookeeper instances
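The log-splitting step in RegionServer recovery amounts to grouping the dead server's log entries by region, so that each region's new server replays only its own edits. A minimal sketch (the region names and entries are invented for illustration):

```python
from collections import defaultdict

# The failed server's WAL interleaves edits for all regions it served.
failed_server_wal = [
    ("regionA", "row1", b"v1"),
    ("regionB", "row9", b"v2"),
    ("regionA", "row2", b"v3"),
]

# Split the log by region, preserving the original edit order.
logs_by_region = defaultdict(list)
for region, row, value in failed_server_wal:
    logs_by_region[region].append((row, value))

assert logs_by_region["regionA"] == [("row1", b"v1"), ("row2", b"v3")]
assert set(logs_by_region) == {"regionA", "regionB"}
```

Each per-region log would then be placed in that region's directory and replayed by whichever RegionServer the region is reassigned to.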
Region Location Process
Finding the RegionServer for a given row:
Zookeeper --> -ROOT- (a single region) --> .META. --> user table

-ROOT-: this table holds the region list of the .META. table; -ROOT- has only one region, and its location is recorded in Zookeeper.
.META.: this table holds the region lists of all user tables, together with the addresses of the RegionServers serving them.

HBase Use Cases
Storing large amounts of data (100s of TBs)
Need high write throughput
Need efficient random access (key lookups) within large data sets
Need to scale gracefully with data
For structured and semi-structured data
Don't need full RDBMS capabilities (cross-row/cross-table transactions, joins, etc.)
Large-scale data storage, high data volumes, concurrent operations
Require random reads and writes of data
Reads and writes are simple operations

HBase vs. HDFS
Both have good fault tolerance and scalability, and can scale to hundreds of nodes;
HDFS is suited to batch scenarios
Does not support random lookup of data
Not suitable for incremental data processing
Does not support data updates