HBase Basic Concepts

Source: Internet
Author: User
Original Blog Address: http://blog.csdn.net/woshiwanxin102213/article/details/17584043

Overview

HBase is a distributed storage system built on HDFS;
HBase is a typical key/value store, developed after the Google BigTable model.
HBase is an important member of the Apache Hadoop ecosystem and is mainly used to store large amounts of structured data.
Logically, HBase organizes data by table, row, and column.
Like Hadoop, HBase relies on scaling out: it increases computing and storage capacity by adding inexpensive commodity servers.
Characteristics of an HBase table
Big: a table can have billions of rows and millions of columns;
Schema-free: each row has a sortable primary key and any number of columns; columns can be added dynamically as needed, and different rows in the same table can have different columns;
Column-oriented: storage and permission control are per column (family), and column (family) data can be retrieved independently;
Sparse: empty (NULL) cells occupy no storage space, so tables can be designed to be very sparse;
Multiple versions of data: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was inserted;
Single data type: data in HBase is stored as uninterpreted byte strings and has no typing.
HBase Data Model

hbase Logical View


hbase basic concepts:
Rowkey: a byte array; it is the "primary key" of each record in the table and enables fast lookup, so rowkey design is very important.
Column Family: a named (string) group of one or more related columns.
Column: belongs to a column family, addressed as familyname:columnname; columns can be added dynamically per record.
Version Number: of type long; defaults to the system timestamp and can be customized by the user.
Value (Cell): a byte array.
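The logical model above can be sketched as a nested map. This is an illustrative toy with hypothetical names (`ToyTable`, `put`, `get`), not the real HBase client API; it only shows how a sparse, multi-versioned cell space behaves:

```python
import time

class ToyTable:
    """Toy model of the HBase logical data model (not the real API):
    rowkey -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self.rows = {}  # sparse: absent cells occupy no space

    def put(self, rowkey, family, qualifier, value, ts=None):
        # Default version number = timestamp at insertion time.
        ts = ts if ts is not None else int(time.time() * 1000)
        col = f"{family}:{qualifier}"  # columns can be added dynamically per row
        self.rows.setdefault(rowkey, {}).setdefault(col, {})[ts] = value

    def get(self, rowkey, family, qualifier):
        """Return the newest version of a cell, or None if the cell is empty."""
        versions = self.rows.get(rowkey, {}).get(f"{family}:{qualifier}")
        if not versions:
            return None
        return versions[max(versions)]  # highest timestamp wins

t = ToyTable()
t.put("row1", "info", "name", b"alice", ts=1)
t.put("row1", "info", "name", b"alicia", ts=2)  # second version of the same cell
print(t.get("row1", "info", "name"))  # newest version
print(t.get("row2", "info", "name"))  # empty cell: nothing stored, returns None
```

Note how the empty cell costs nothing: only the cells that were actually written occupy any space.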
hbase physical model: each column family is stored in a separate file on HDFS, and null values are not saved.
A copy of the key and version number is kept with each column family;
HBase maintains a multilevel index for each value, namely: <key, column family, column qualifier, timestamp>.
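The effect of that index can be sketched with plain tuples. This is an illustration of the ordering idea, not HBase internals: cells are addressed by <rowkey, family, qualifier, timestamp> and kept sorted so that rowkey, family, and qualifier sort ascending while timestamps sort newest-first:

```python
# Illustrative sketch (not HBase internals): cells addressed by
# <rowkey, column family, qualifier, timestamp>, kept fully sorted.
cells = [
    (b"row2", b"info", b"name", 5),
    (b"row1", b"info", b"age", 7),
    (b"row1", b"info", b"age", 9),   # newer version of the same cell
    (b"row1", b"data", b"blob", 3),
]

# Negating the timestamp makes newer versions sort before older ones,
# so the first match for a cell is its newest version.
cells.sort(key=lambda c: (c[0], c[1], c[2], -c[3]))

for c in cells:
    print(c)
```

With everything sorted this way, reading "the latest version of a cell" is just finding the first entry for that <rowkey, family, qualifier> prefix.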

Physical storage:
1. All rows in a table are arranged in lexicographic order of the row key.
2. A table is split in the row direction into multiple regions.
3. Regions are split by size: each table starts with only one region; as data grows, the region grows until it reaches a threshold, at which point it splits into two new regions, so over time there are more and more regions.
4. The region is the smallest unit of distributed storage and load balancing in HBase; different regions are distributed across different RegionServers.
5. Although the region is the smallest unit of distributed storage, it is not the smallest unit of physical storage. A region consists of one or more Stores, each of which holds one column family. Each Store in turn consists of one MemStore and zero or more StoreFiles; a StoreFile wraps an HFile. The MemStore lives in memory, while StoreFiles are stored on HDFS.
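Steps 1-3 can be simulated in a few lines. This is a toy sketch, not HBase's split logic: real HBase splits by region size in bytes, while here a hypothetical row-count threshold stands in for it:

```python
SPLIT_THRESHOLD = 4  # hypothetical row-count threshold; real HBase splits by size

def split_regions(regions):
    """Split any region whose row count exceeds the threshold into two halves."""
    out = []
    for region in regions:
        if len(region) > SPLIT_THRESHOLD:
            mid = len(region) // 2
            out.extend([region[:mid], region[mid:]])  # two new regions, still sorted
        else:
            out.append(region)
    return out

# All rows in a table are kept in lexicographic (byte) order of the rowkey;
# zero-padding the numeric part keeps byte order and numeric order in agreement.
rows = sorted(f"row-{i:03d}".encode() for i in range(10))

regions = [rows]                   # every table starts with a single region
regions = split_regions(regions)   # 10 rows exceed the threshold: two regions of 5
regions = split_regions(regions)   # each half still exceeds it: four regions
print([len(r) for r in regions])
```

Because each region covers a contiguous, sorted range of rowkeys, the regions can be handed to different RegionServers without overlap.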


hbase Architecture and basic components

hbase Basic Components Description:

Client

 Contains interfaces for accessing HBase and maintains a cache (e.g. of region location information) to speed up access to HBase

Master

 Assigns regions to RegionServers

 Load-balances across RegionServers

 Discovers failed RegionServers and reassigns their regions

 Handles administrative operations on tables: create, delete, query, and modify

Region Server

The RegionServer maintains regions and handles IO requests for those regions.

The RegionServer is responsible for splitting regions that grow too large during operation.

Zookeeper functions

 Guarantees by election that at any time there is only one active Master in the cluster; both the Master and the RegionServers register with Zookeeper on startup

 Stores the addressing entry point of all regions

 Monitors RegionServers going online and offline in real time, and notifies the Master in real time

 Stores the HBase schema and table metadata

 By default, HBase manages the Zookeeper instances itself, e.g. starting and stopping Zookeeper

The introduction of Zookeeper means the Master is no longer a single point of failure.


Write-Ahead Log (WAL)

This mechanism is used for fault tolerance and data recovery:

Each HRegionServer has an HLog object; HLog is a class that implements a write-ahead log. Each time the user writes data to the MemStore, a copy is also written to the HLog file (the HLog file format will be covered in a follow-up). The HLog file periodically rolls over to a new file and deletes old files whose data has already been persisted to StoreFiles. When an HRegionServer terminates unexpectedly, the HMaster learns of it through Zookeeper. The HMaster first processes the orphaned HLog files, splitting their log data by region and placing each portion in the corresponding region directory, then reassigns the regions. When an HRegionServer loads one of these regions, it finds a historical HLog to process, so it replays the data from the HLog into the MemStore and then flushes to StoreFiles, completing the data recovery.
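The write-then-apply discipline described above can be sketched as follows. This is a toy illustration of the WAL idea with hypothetical names, not HBase's HLog implementation; a Python list stands in for the durable log file:

```python
class ToyRegionServer:
    """Toy write-ahead-log sketch (not HBase's implementation): every edit is
    appended to the log before being applied to the in-memory store, so the
    MemStore can be rebuilt by replaying the log after a crash."""

    def __init__(self, wal):
        self.wal = wal          # shared, durable log (stand-in for an HLog file)
        self.memstore = {}      # in-memory store, lost on crash

    def put(self, key, value):
        self.wal.append((key, value))   # 1) write to the log first
        self.memstore[key] = value      # 2) then apply to the MemStore

    @classmethod
    def recover(cls, wal):
        """Replay the log into a fresh MemStore, as a new RegionServer would."""
        server = cls(wal)
        for key, value in wal:
            server.memstore[key] = value
        return server

wal = []
rs = ToyRegionServer(wal)
rs.put("row1", b"a")
rs.put("row2", b"b")

# Simulate a crash: the MemStore is gone, but the WAL survives (on HDFS).
recovered = ToyRegionServer.recover(wal)
print(recovered.memstore)
```

Because the log is appended before the MemStore is updated, no acknowledged write can be lost: anything missing from the rebuilt MemStore was never logged, and so was never acknowledged.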

fault tolerance of hbase
Master fault tolerance: Zookeeper selects a new Master.
 With no Master process running, data reads still proceed as usual;
 With no Master process running, region splitting, load balancing, and similar operations cannot be carried out.
RegionServer fault tolerance: each RegionServer periodically reports a heartbeat to Zookeeper; if the heartbeat does not arrive in time, the Master reassigns the regions on that RegionServer to other RegionServers, and the write-ahead log on the failed server is split by the Master and distributed to the new RegionServers.
Zookeeper fault tolerance: Zookeeper is a reliable service, typically configured with 3 or 5 Zookeeper instances.
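The RegionServer failover step can be sketched as a small reassignment function. This is a toy illustration with hypothetical names (`reassign_regions`, round-robin placement), not the Master's actual balancing algorithm:

```python
def reassign_regions(assignments, live_servers):
    """Toy sketch of Master failover handling (not HBase code): regions hosted
    on servers that stopped heartbeating are spread over the survivors."""
    new_assignments = {}
    orphaned = []
    for server, regions in assignments.items():
        if server in live_servers:
            new_assignments[server] = list(regions)
        else:
            orphaned.extend(regions)  # these regions need a new home
    # Round-robin the orphaned regions across the surviving servers.
    survivors = sorted(new_assignments)
    for i, region in enumerate(orphaned):
        new_assignments[survivors[i % len(survivors)]].append(region)
    return new_assignments

assignments = {"rs1": ["r1", "r2"], "rs2": ["r3"], "rs3": ["r4", "r5"]}
live = {"rs1", "rs3"}   # rs2 missed its Zookeeper heartbeat
print(reassign_regions(assignments, live))
```

In real HBase the Master additionally splits the failed server's WAL so each reassigned region can replay its own edits before serving reads.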
Region positioning process:

Finding the RegionServer:

Zookeeper --> -ROOT- (single region) --> .META. --> user table

The -ROOT- table contains the region list of the .META. table and has only one region; the location of the -ROOT- table is recorded in Zookeeper.

The .META. table contains the region lists of all user tables, together with the RegionServer address for each region.

hbase Use Cases

Storing large amounts of data (100s of TBs)
Need high write throughput
Need efficient random access (key lookups) within large data sets
Need to scale gracefully with data
For structured and semi-structured data
Don't need full RDBMS capabilities (cross-row/cross-table transactions, joins, etc.)

Large data storage, high data volume, concurrent operation

Require random read and write operations on data

Read and write access are very simple operations

HBase and HDFS compared: both have good fault tolerance and scalability and can scale out to hundreds of nodes.

HDFS is suited to batch scenarios:
 Random lookup of data is not supported
 Not suitable for incremental data processing
 Data updates are not supported



