Region
A region is the basic unit of data management in HBase. Moving, balancing, and splitting data are all performed at region granularity.
Regions store the actual user data, and HBase uses RegionServers to manage them.
Addressing Process
The general data addressing process is shown below:
Zookeeper                 hbase:meta table               Table
+-----------------+       +----------------------+       +--------------+
| hbase:meta      | ----> | Row per table region | ----> |              |
| location        |       +----------------------+       +--------------+
+-----------------+
1. ZooKeeper holds, in a znode, the location of the RegionServer that serves hbase:meta
2. The client obtains that RegionServer's address from ZooKeeper
3. The client reads the region information from hbase:meta and can then locate the data in the target region
4. The query results are returned to the client
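The lookup steps above can be sketched as a toy model. This is not the HBase client API: the dictionaries, the znode name "meta-location", and the host names are all hypothetical stand-ins, chosen only to show the order of the hops and how a row key is matched to the region whose start key precedes it.

```python
import bisect

# Toy stand-ins, NOT the HBase client API (all names and hosts are hypothetical).
ZOOKEEPER = {"meta-location": "rs-meta:16020"}

# hbase:meta modeled as one (start_key, region_server) row per region,
# kept sorted by start key. The first region has an empty start key.
META_ROWS = {
    "test1": [
        ("", "rs1:16020"),
        ("r3000000", "rs2:16020"),
        ("r6786520", "rs3:16020"),
    ],
}

def locate_region(table: str, row_key: str):
    """Return (meta_server, region_server) for the region holding row_key."""
    # Steps 1-2: ask ZooKeeper where hbase:meta is served.
    meta_server = ZOOKEEPER["meta-location"]
    # Step 3: read the table's region rows from hbase:meta.
    rows = META_ROWS[table]
    start_keys = [start for start, _ in rows]
    # The owning region is the one with the greatest start key <= row_key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return meta_server, rows[idx][1]
```

Calling locate_region("test1", "r5000000") walks the same path as the list: ZooKeeper first, then the meta rows, then the RegionServer that owns the key range.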
Region name
An HBase region name is made up of the following three parts:
TableName,StartKey,RegionID
The RegionID is generated by HBase, in the form TIMESTAMP.MD5
For example:
test1,r6786520,1456410376247.fc9bdcb4f88aec2e64b393fece99cf0e.
test1: table name
r6786520: start key
1456410376247.fc9bdcb4f88aec2e64b393fece99cf0e: RegionID
1456410376247: the timestamp, stored as a long
fc9bdcb4f88aec2e64b393fece99cf0e: ID generated with the MD5 hash algorithm
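As a small illustration of the layout above, the following sketch splits a region name into its parts. The function name parse_region_name is hypothetical, and the sketch assumes the name follows exactly the TableName,StartKey,TIMESTAMP.MD5. form shown in the example, with the timestamp in milliseconds.

```python
from datetime import datetime, timezone

def parse_region_name(name: str) -> dict:
    """Split 'table,startkey,timestamp.md5.' into its components."""
    # The table name comes first; the start key may itself contain commas,
    # so the RegionID is taken from the right-hand side.
    table, rest = name.split(",", 1)
    start_key, region_id = rest.rsplit(",", 1)
    # Drop the trailing '.' and separate the timestamp from the MD5 part.
    timestamp, md5_hex = region_id.rstrip(".").split(".", 1)
    return {
        "table": table,
        "start_key": start_key,
        "timestamp_ms": int(timestamp),
        # Assumes the timestamp is milliseconds since the Unix epoch.
        "created_utc": datetime.fromtimestamp(int(timestamp) / 1000, tz=timezone.utc),
        "md5": md5_hex,
    }

info = parse_region_name("test1,r6786520,1456410376247.fc9bdcb4f88aec2e64b393fece99cf0e.")
```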
Number of Regions
In general, if the number of regions is not explicitly specified through the HBase shell when a table is created, there are usually only 3 regions:
1. hbase:meta
2. hbase:namespace
3. The user table's region
Because the default maximum region size is 10 GB, in a small environment the data volume rarely reaches the split threshold quickly.
When the region count is specified explicitly, it can be chosen according to your own business trends so that peak load is handled. Combined with a well-designed rowkey, this distributes the data more evenly and gives the best performance for the cluster as a whole.
Open question:
1. When the region count is specified in advance, I do not understand how the start key of each region (the split point) is chosen. In an actual test with the region count set to 5, the start key ~ end key ranges were distributed as follows:
region1: -INF ~ 33333333
region2: 33333333 ~ 66666666
region3: 66666666 ~ 99999999
region4: 99999999 ~ cccccccc
region5: cccccccc ~ +INF
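One possible explanation, offered as an assumption rather than something confirmed in the test above: the observed boundaries are exactly what results from dividing the 8-digit hexadecimal key space (00000000 to FFFFFFFF) into five equal parts. As far as I know this is how HBase's HexStringSplit pre-split algorithm behaves, but whether that algorithm was the one used here is not stated. The function name hex_split_points is hypothetical.

```python
def hex_split_points(num_regions: int, width: int = 8):
    """Evenly divide the hex key space 00..0 to FF..F into num_regions parts.

    Returns the num_regions - 1 boundary keys as zero-padded lowercase hex.
    """
    max_key = int("F" * width, 16)
    return [
        format(max_key * i // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]

points = hex_split_points(5)
```

For five regions the four boundaries agree, ignoring letter case, with the start and end keys listed above, since FFFFFFFF is exactly 5 times 33333333 in hexadecimal.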
Rowkey
HBase distributes data and routes operations on it according to the rowkey. If the rowkey is poorly designed, the data ends up concentrated in a single region, which causes uneven load, heavier IO pressure, and higher latency, so the user experience degrades sharply. For this reason it is generally not recommended to build rowkeys directly from timestamps, letters, or similar mixed values. It is better to hash the key, for example with MD5, so that the data is distributed evenly.
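A minimal sketch of the hashing idea, assuming the natural key is a string and that an 8-character MD5 prefix is enough to spread writes. The function name salted_rowkey, the prefix length, and the underscore separator are illustrative choices, not anything HBase prescribes.

```python
import hashlib

def salted_rowkey(natural_key: str, prefix_len: int = 8) -> str:
    """Prefix the natural key with part of its MD5 digest.

    The hex prefix is effectively uniform, so keys that would otherwise be
    adjacent (for example consecutive timestamps) land in different ranges.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{natural_key}"

key_a = salted_rowkey("user0001_1456410376247")
key_b = salted_rowkey("user0001_1456410376248")
```

The prefix is derived from the key itself, so a point lookup can recompute it. The trade-off is that rows are no longer stored in natural-key order, which makes range scans over the original key harder.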