1. Introduction
HBase is a distributed, column-oriented, open-source database that grew out of the Google paper "Bigtable: A Distributed Storage System for Structured Data". It is an open-source implementation of Google Bigtable: it uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive data stored in HBase, and ZooKeeper as its coordination service.
2. Table Structure of HBase
HBase stores data in tables. A table is composed of rows and columns, and the columns are grouped into column families.
Row Key | Column-family1             | Column-family2              | Column-family3
        | Column1          | Column2 | Column1 | Column2 | Column3 | Column1
Key1    | t1:ABC, t2:GDXDF |         |         |         |         |
Key2    |                  |         |         |         |         |
Key3    |                  |         |         |         |         |
As shown in the table above, Key1, Key2, and Key3 are the unique row keys of three records, and Column-family1, Column-family2, and Column-family3 are three column families, each containing several columns. For example, Column-family1 consists of two columns, Column1 and Column2. The cell uniquely identified by row Key1 and column Column-family1:Column1 holds t1:ABC and t2:GDXDF, that is, two values, ABC and GDXDF, with different timestamps t1 and t2. HBase returns the most recent value to the requester.
The specific meanings of these terms are as follows:
(1) Row Key
As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access the rows of an HBase table:
(1.1) Access by a single row key
(1.2) Range scan over row keys
(1.3) Full table scan
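The three access patterns can be pictured with a small in-memory model. The Python sketch below is only an illustration: the SortedTable class and its method names are hypothetical and not part of any HBase API, and the rows are simply kept in a list sorted by key.

```python
import bisect


class SortedTable:
    """Toy table whose rows are kept sorted by row key (hypothetical, not an HBase API)."""

    def __init__(self, rows):
        # rows: dict mapping row key (str) -> row data
        self._rows = dict(rows)
        self._keys = sorted(self._rows)

    def get(self, key):
        """(1.1) Access by a single row key."""
        return self._rows.get(key)

    def range_scan(self, start, stop):
        """(1.2) Range scan: keys in [start, stop), located by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def full_scan(self):
        """(1.3) Full table scan, in row key order."""
        return [(k, self._rows[k]) for k in self._keys]


table = SortedTable({"key2": "b", "key1": "a", "key3": "c"})
single = table.get("key2")
ranged = table.range_scan("key1", "key3")
everything = table.full_scan()
```

Because the keys are sorted, the range scan only has to locate its two end points and read the rows in between; everything else about real HBase storage is left out of this sketch.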
A row key can be any string (the maximum length is 64 KB; in real applications it is typically 10-100 bytes). Internally, HBase stores the row key as a byte array.
Data is stored sorted by row key in lexicographic (byte) order. When designing keys, take full advantage of this sorted storage and place rows that are often read together next to each other (locality).
Note:
Sorting integers in lexicographic order gives 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,...,9,91,92,93,94,95,96,97,98,99. To preserve the natural order of integers, row keys must be left-padded with zeros.
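The effect of zero-padding is easy to check. The following Python sketch sorts the same ids as plain byte strings and as left-padded byte strings; the helper name pad_key and the width of 3 digits are arbitrary choices made for this example.

```python
def pad_key(number, width=3):
    """Build a fixed-width row key by left-padding the number with zeros."""
    return str(number).zfill(width).encode("ascii")


ids = [1, 2, 9, 10, 11, 100]

# Unpadded keys sort like dictionary words, not like numbers.
plain_keys = sorted(str(i).encode("ascii") for i in ids)

# Padded keys sort in the same order as the integers they encode.
padded_keys = sorted(pad_key(i) for i in ids)
```

The width has to be chosen large enough for the biggest id that will ever be stored, since changing it later would change the sort position of existing keys.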
Every read or write of a single row is an atomic operation (regardless of how many columns are read or written). This design decision makes it easier to reason about a program's behavior when concurrent updates are made to the same row.
(2) Column Family
Each column in an HBase table belongs to a column family. Column families are part of the table's schema (individual columns are not) and must be defined before the table is used. Column names are prefixed with the column family name. For example, courses:history and courses:math both belong to the courses column family.
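The difference between column families (fixed in the schema) and columns (created freely) can be sketched in a few lines of Python. ToyTable and its put method are hypothetical names used only for illustration; they do not mirror the HBase client API.

```python
class ToyTable:
    """Toy model: families are declared up front, qualifiers are not (illustration only)."""

    def __init__(self, families):
        self.families = set(families)  # the "schema"
        self.cells = {}                # (row, family, qualifier) -> value

    def put(self, row, column, value):
        # A column name has the form family:qualifier.
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row, family, qualifier)] = value


table = ToyTable(families=["courses"])
table.put("row1", "courses:history", "A")
table.put("row1", "courses:math", "B")  # new column, no schema change needed

try:
    table.put("row1", "hobbies:chess", "C")  # family was never declared
    rejected = False
except KeyError:
    rejected = True
```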
Access control and disk and memory usage accounting are performed at the column family level. In practice, controlling permissions per column family helps manage different kinds of applications: some applications may add new base data, some may read base data and create derived column families, and some may only view existing data (and possibly not even all of it, for privacy reasons).
(3) Cell
In HBase, a cell is the storage unit determined by a row and a column. A cell is uniquely identified by {row key, column (=<family> + <label>), version}. The data in a cell has no type and is stored as raw bytes.
(4) Timestamp
Each cell can hold multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when data is written, in which case it is the current system time in milliseconds, or it can be assigned explicitly by the client. An application that wants to avoid version conflicts must generate unique timestamps itself. Within a cell, versions are sorted in reverse chronological order, so the most recent data comes first.
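A minimal Python sketch of a versioned cell may help here. VersionedCell is a hypothetical class written for this explanation, not an HBase type; it only models the rules just described (millisecond timestamps, newest version first, explicit timestamps allowed).

```python
import time


class VersionedCell:
    """Toy cell that keeps several timestamped versions (illustration only)."""

    def __init__(self):
        self._versions = {}  # timestamp in ms -> value

    def put(self, value, ts=None):
        if ts is None:
            # No explicit timestamp: use the current time in milliseconds.
            ts = int(time.time() * 1000)
        # Writing twice with the same timestamp replaces the earlier value,
        # which is why clients that need every write kept must use unique timestamps.
        self._versions[ts] = value

    def get(self, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:max_versions]


cell = VersionedCell()
cell.put("ABC", ts=1)
cell.put("GDXDF", ts=2)
latest = cell.get()
both = cell.get(max_versions=2)
```

With the two writes above, a plain read sees only the value written at the later timestamp, which matches the Key1 example in section 2.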
To limit the management burden (both storage and indexing) of keeping too many versions, HBase provides two ways to reclaim old versions: keep only the last n versions of the data, or keep only versions written within a recent period (for example, the last seven days). Either can be set per column family.
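The two reclamation policies can be sketched as two small Python functions over a dict of timestamp to value. The function names and the exact boundary handling are assumptions made for this illustration, not HBase's implementation.

```python
def keep_last_n(versions, n):
    """Keep only the n most recent versions (versions: dict of timestamp -> value)."""
    newest = sorted(versions, reverse=True)[:n]
    return {ts: versions[ts] for ts in newest}


def keep_recent(versions, now_ms, ttl_ms):
    """Keep only versions written within the last ttl_ms milliseconds."""
    return {ts: value for ts, value in versions.items() if now_ms - ts <= ttl_ms}


versions = {1000: "a", 5000: "b", 9000: "c"}
by_count = keep_last_n(versions, 2)
by_age = keep_recent(versions, now_ms=10000, ttl_ms=6000)
```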
3. Basic Usage of the HBase Shell
HBase provides a shell for interacting with the user. Run the command hbase shell to enter the command interface, and run help to see the help information for the commands.
The following student score sheet, an example found on the web, is used to demonstrate HBase usage.
name | grade | course
     |       | math | art
Tom  | 5     | 97   | 87
Jim  | 4     | 89   | 80
Here grade is a column family with no columns of its own, while course is a column family with two columns, math and art. More columns, such as computer or physics, can be added to the course column family as needed.
(1) Create a table named scores with two column families, grade and course
hbase(main):001:0> create 'scores', 'grade', 'course'
Use the list command to see which tables currently exist in HBase, and the describe command to view a table's structure. (Remember that all table names and column names must be enclosed in quotes.)
(2) Insert values according to the designed table structure:
put 'scores', 'Tom', 'grade:', '5'
put 'scores', 'Tom', 'course:math', '97'
put 'scores', 'Tom', 'course:art', '87'
put 'scores', 'Jim', 'grade', '4'
put 'scores', 'Jim', 'course:math', '89'
put 'scores', 'Jim', 'course:art', '80'
This builds up the table. The structure is quite flexible: columns can be added freely within a column family, which is very convenient. If a column family has no columns under it, the colon can be omitted.
The put command is simple, with only one form:
hbase> put 't1', 'r1', 'c1', 'value', ts1
t1 is the table name, r1 the row key, c1 the column name, and value the cell value. ts1 is the timestamp, which is generally omitted.
(3) Query data by row key
get 'scores', 'Jim'
get 'scores', 'Jim', 'grade'
You may have noticed a pattern: an HBase shell command generally consists of the operation keyword followed by the table name, row key, and column name, in that order; any further conditions are added in curly braces.
The get command supports the following forms:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
hbase> get 't1', 'r1', 'c1'
hbase> get 't1', 'r1', 'c1', 'c2'
hbase> get 't1', 'r1', ['c1', 'c2']
(4) Scan all data
scan 'scores'
You can also specify modifiers: TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH, or COLUMNS. With no modifiers, as in the command above, all rows are displayed.
Examples are as follows:
hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter (123, 456))"}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
A filter can be specified in two ways:
A. Using a filter string; more information on this is available in the Filter Language document attached to the HBASE-4176 JIRA
B. Using the fully qualified package name of the filter.
There is also a CACHE_BLOCKS modifier that switches block caching for the scan. It is on by default (CACHE_BLOCKS => true) and can be turned off (CACHE_BLOCKS => false).
(5) Delete specified data
delete 'scores', 'Jim', 'grade'
deleteall 'scores', 'Jim'
The delete command has few variations; there is only one form, and it always names a column:
hbase> delete 't1', 'r1', 'c1', ts1
The deleteall command deletes an entire row, as in the second example above. Use it with caution!
To delete all the data in a table, use the truncate command. There is in fact no direct full-table delete command; truncate is a combination of three commands: disable, drop, and create.
(6) Modify table structure
disable 'scores'
alter 'scores', NAME => 'info'
enable 'scores'
The alter command is used as follows (depending on the HBase version, the table generally has to be disabled first):
A. Change or add a column family:
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
B. Delete a column family:
hbase> alter 't1', NAME => 'f1', METHOD => 'delete'
hbase> alter 't1', 'delete' => 'f1'
C. You can also modify table attributes such as MAX_FILESIZE, MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH:
hbase> alter 't1', METHOD => 'table_att', MAX_FILESIZE => '134217728'
D. You can add a coprocessor to a table:
hbase> alter 't1', METHOD => 'table_att', 'coprocessor' => 'arg1=1,arg2=2'
A table can be configured with multiple coprocessors; a sequence number is appended automatically to identify each one. The coprocessor attribute (a coprocessor can be thought of as a kind of filtering program) must follow this format:
[coprocessor jar file location] | class name | [priority] | [arguments]
E. Remove a table attribute or a coprocessor as follows:
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
F. You can perform multiple alter operations in one command:
hbase> alter 't1', {NAME => 'f1'}, {NAME => 'f2', METHOD => 'delete'}
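The coprocessor attribute format given under item D, [coprocessor jar file location] | class name | [priority] | [arguments], can be made concrete with a small parser. The Python sketch below is an illustration only: the function name is hypothetical, it assumes all three separators are present even when an optional field is empty, and the jar path and class name in the example are made-up values.

```python
def parse_coprocessor_spec(spec):
    """Split 'jar|class|priority|args' into its four fields (illustrative sketch)."""
    parts = [part.strip() for part in spec.split("|")]
    if len(parts) != 4:
        raise ValueError("expected four '|'-separated fields")
    jar, class_name, priority, arguments = parts
    if not class_name:
        # Only the class name is mandatory; the bracketed fields are optional.
        raise ValueError("class name is required")
    # Arguments are comma-separated key=value pairs, e.g. arg1=1,arg2=2.
    args = dict(item.split("=", 1) for item in arguments.split(",") if item)
    return {
        "jar": jar or None,
        "class_name": class_name,
        "priority": int(priority) if priority else None,
        "arguments": args,
    }


# Hypothetical example values, chosen only to exercise every field.
full = parse_coprocessor_spec("hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2")
minimal = parse_coprocessor_spec("|com.foo.FooRegionObserver||")
```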
(7) Count the number of rows:
hbase> count 't1'
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000
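count can take a long time on a large table (for very large tables, a MapReduce row-counting job can be run instead of the shell command). Scan caching is enabled for count by default, with a default cache size of 10 rows. The running count is reported every 1000 rows by default (INTERVAL).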
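How INTERVAL and CACHE interact can be pictured with a toy counter. In the Python sketch below, count_rows is a hypothetical function, not the shell's implementation: it assumes that each batch of cache rows costs one round trip to the server and that progress is recorded every interval rows.

```python
def count_rows(row_keys, interval=1000, cache=10):
    """Count rows, fetching them in batches of `cache` and noting progress every `interval` rows."""
    total = 0
    round_trips = 0
    progress = []
    for start in range(0, len(row_keys), cache):
        batch = row_keys[start:start + cache]  # one simulated fetch from the server
        round_trips += 1
        for _ in batch:
            total += 1
            if total % interval == 0:
                progress.append(total)  # the shell would print the current count here
    return total, round_trips, progress


rows = list(range(2500))
default_run = count_rows(rows)
big_cache_run = count_rows(rows, cache=1000)
```

Under these assumptions a larger CACHE reduces the number of round trips without changing the result, while INTERVAL only affects how often progress is reported.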
(8) Disable and enable operations
Many operations require the table to be taken offline first, such as the alter operation mentioned above and dropping a table. disable_all and enable_all can operate on multiple tables at once.
(9) Drop a table
First disable the table, then execute the drop command.
disable 't1'
drop 't1'
The above covers a number of commonly used commands in detail. The full set of HBase shell commands is listed below, divided into several command groups. The names give a rough idea of what each command does; use help "cmd" for detailed usage.
COMMAND GROUPS:
Group name: general
Commands: status, version
Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all,
enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
Group name: dml
Commands: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
Group name: tools
Commands: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact,
move, split, unassign, zk_dump
Group name: replication
Commands: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication,
stop_replication
Group name: security
Commands: grant, revoke, user_permission
4. HBase Shell Script
Because these are shell commands, you can also write all the HBase shell commands into a single file and have them executed in sequence, just like a Linux shell script. Put the commands in a file, then run the following command:
$ hbase shell test.hbaseshell
Convenient and easy to use.