1. Introduction
HBase is a distributed, column-oriented, open-source database derived from the Google paper "Bigtable: A Distributed Storage System for Structured Data". It is an open-source implementation of Google Bigtable that uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive amounts of data stored in HBase, and ZooKeeper as its coordination service.
2. The table structure of HBase
HBase stores data in tables. A table is made up of rows and columns, and the columns are grouped into column families.
Row Key | column-family1              | column-family2              | column-family3
        | column1          | column2  | column1 | column2 | column3 | column1
key1    | t1:abc, t2:gdxdf |          |         |         |         |
key2    |                  |          |         |         |         |
key3    |                  |          |         |         |         |
As shown in the table above, key1, key2, and key3 are the unique row keys of three records, and column-family1, column-family2, and column-family3 are three column families, each containing several columns. For example, column-family1 consists of two columns, column1 and column2. The entry t1:abc, t2:gdxdf is a cell uniquely identified by row key1 and column-family1:column1. This cell holds two values, abc and gdxdf, with different timestamps t1 and t2; HBase returns the most recent value to the requester.
The specific meanings of these terms are as follows:
(1) Row Key
As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:
(1.1) Access by a single row key
(1.2) Access by a range of row keys
(1.3) Full table scan
A row key can be any string (the maximum length is 64 KB; in practice it is usually 10 to 100 bytes). Inside HBase, row keys are stored as byte arrays.
When stored, data is sorted by the lexicographic (byte) order of the row key. When designing keys, take advantage of this sorted storage and place rows that are often read together next to each other (locality).
Note:
Sorting integers lexicographically gives 1, 10, 100, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, ..., 9, 91, 92, 93, 94, 95, 96, 97, 98, 99. To preserve the natural order of integers, row keys must be left-padded with zeros.
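The effect of zero-padding can be checked with a few lines of Python. This is only an illustrative sketch: the sample ids and the 6-digit key width are arbitrary assumptions, and the sort here is plain Python byte ordering, which mimics the byte-order comparison described above.

```python
# Row keys are compared byte by byte, so numeric keys sort like strings.
ids = [1, 2, 9, 10, 11, 100]

# Unpadded keys: byte order does not match numeric order.
plain_keys = sorted(str(i).encode() for i in ids)

# Left-padded keys (width 6 is an arbitrary choice for this sketch):
# byte order now matches numeric order.
padded_keys = sorted(str(i).zfill(6).encode() for i in ids)

print(plain_keys)   # [b'1', b'10', b'100', b'11', b'2', b'9']
print(padded_keys)  # [b'000001', b'000002', b'000009', b'000010', b'000011', b'000100']
```

The width has to be chosen large enough for the biggest id the table will ever hold, because changing it later means rewriting every row key.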
A single read or write of a row is atomic, no matter how many columns are read or written. This design decision makes it easy for users to reason about program behavior when the same row is updated concurrently.
(2) Column family
Each column in an HBase table belongs to a column family. Column families are part of the table schema (columns are not) and must be defined before the table is used. Column names are prefixed with the column family name; for example, courses:history and courses:math both belong to the courses family.
Access control and disk and memory usage accounting are performed at the column family level. In practice, permissions on column families help manage different kinds of applications: some applications may add new base data, some may read base data and create derived column families, and some may only view existing data (and perhaps not even all of it, for privacy reasons).
(3) Cell
The storage unit identified by a row and a column in HBase is called a cell. A cell is uniquely determined by {row key, column (= <family> + <label>), version}. The data in a cell has no type; it is stored as raw bytes.
(4) Timestamp
Each cell holds multiple versions of the same data, indexed by timestamp. A timestamp is a 64-bit integer. It can be assigned by HBase automatically when the data is written, in which case it is the current system time in milliseconds, or it can be assigned explicitly by the client. An application that needs to avoid version conflicts must generate unique timestamps itself. Within each cell, versions are sorted in reverse chronological order, so the most recent data comes first.
To reduce the management burden (storage and indexing) caused by too many versions, HBase provides two ways to reclaim old versions: keep only the last n versions of the data, or keep only the versions from a recent period (for example, the last seven days). Users can set either policy per column family.
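The versioning behavior described above can be modeled in a few lines of Python. This is a toy in-memory sketch, not how HBase stores data: the `Cell` class and its `max_versions` parameter are hypothetical names that stand in for the "keep the last n versions" policy.

```python
import time


class Cell:
    """Toy model of one cell: timestamp -> value, read newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # hypothetical per-family setting
        self.versions = {}                # timestamp in ms -> bytes

    def put(self, value, ts=None):
        # Without an explicit timestamp, use the current time in milliseconds.
        if ts is None:
            ts = int(time.time() * 1000)
        self.versions[ts] = value
        # Drop everything older than the newest max_versions entries.
        for old_ts in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[old_ts]

    def get(self, versions=1):
        # Newest version first, as described in the text.
        newest_first = sorted(self.versions.items(), reverse=True)
        return newest_first[:versions]


cell = Cell(max_versions=2)
cell.put(b"abc", ts=1)
cell.put(b"gdxdf", ts=2)
cell.put(b"xyz", ts=3)   # the version written at ts=1 is reclaimed here

latest = cell.get()
history = cell.get(versions=5)
print(latest)   # [(3, b'xyz')]
print(history)  # [(3, b'xyz'), (2, b'gdxdf')]
```

A time-based policy would look the same, except that `put` would discard timestamps older than a cutoff instead of counting versions.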
3. Basic usage of HBase shell
HBase provides a shell for interacting with the user. Run the command hbase shell to enter it; once inside, run help to see help for the commands.
The following example of an online student score table demonstrates how to use HBase.
name | grade | course
     |       | math | art
Tom  | 5     | 97   | 87
Jim  | 4     | 89   | 80
Here grade is a column family whose only column is itself, and course is a column family with two columns, math and art. We can add more columns to the course family as needed, such as computer or physics.
(1) Create a table named scores with two column families, grade and course
hbase(main):001:0> create 'scores', 'grade', 'course'
You can use the list command to see which tables exist in HBase, and the describe command to view a table's structure. (Remember that all table names and column names must be quoted.)
(2) Insert values according to the table design:
put 'scores', 'Tom', 'grade:', '5'
put 'scores', 'Tom', 'course:math', '97'
put 'scores', 'Tom', 'course:art', '87'
put 'scores', 'Jim', 'grade', '4'
put 'scores', 'Jim', 'course:math', '89'
put 'scores', 'Jim', 'course:art', '80'
The table is now populated. The structure is quite flexible: columns can be added freely inside a column family. If a column family has no columns under it, the colon can be omitted.
The put command is simple and has only one form:
hbase> put 't1', 'r1', 'c1', 'value', ts1
t1 is the table name, r1 is the row key, c1 is the column name, and value is the cell value. ts1 is the timestamp, which is usually omitted.
(3) Query data by row key
get 'scores', 'Jim'
get 'scores', 'Jim', 'grade'
As you may have noticed, HBase shell commands generally follow the same order: the command keyword, then the table name, row key, and column name; any further conditions are added in curly braces.
get can be used in the following ways:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
hbase> get 't1', 'r1', 'c1'
hbase> get 't1', 'r1', 'c1', 'c2'
hbase> get 't1', 'r1', ['c1', 'c2']
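To make the COLUMN, TIMERANGE, and VERSIONS options more concrete, here is a small Python model of how they narrow down a row. It is a sketch under assumptions, not the HBase implementation: the row is a plain dictionary, the function name `get` and its parameters are made up for illustration, and the time range is treated as half-open (start inclusive, end exclusive), which is my understanding of how HBase time ranges behave.

```python
def get(row, columns=None, timerange=None, versions=1):
    """Return {column: [(ts, value), ...]} with the newest versions first.

    row maps column name -> {timestamp: value}.
    """
    result = {}
    for column, cells in row.items():
        # COLUMN: skip columns that were not asked for.
        if columns is not None and column not in columns:
            continue
        stamps = sorted(cells, reverse=True)
        # TIMERANGE: assumed half-open interval [start, end).
        if timerange is not None:
            start, end = timerange
            stamps = [ts for ts in stamps if start <= ts < end]
        # VERSIONS: keep at most this many of the newest versions.
        if stamps:
            result[column] = [(ts, cells[ts]) for ts in stamps[:versions]]
    return result


row_r1 = {
    "c1": {10: b"old", 20: b"mid", 30: b"new"},
    "c2": {15: b"x"},
}

newest = get(row_r1)
ranged = get(row_r1, columns=["c1"], timerange=(10, 30), versions=4)
print(newest)  # {'c1': [(30, b'new')], 'c2': [(15, b'x')]}
print(ranged)  # {'c1': [(20, b'mid'), (10, b'old')]}
```

Note that asking for 4 versions returns only 2 in the second call, because the time range has already excluded the value written at timestamp 30.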
(4) Scan all data
scan 'scores'
You can also specify modifiers: TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH, or COLUMNS. With no modifiers, as in the example above, all rows are displayed.
Examples:
hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter (123, 456))"}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
Prefix (leftmost match) query by row key, which can also be used to count matching rows. The following example scans the rows whose key starts with r1:
scan 'wyk_temp', {FILTER => org.apache.hadoop.hbase.filter.PrefixFilter.new(org.apache.hadoop.hbase.util.Bytes.toBytes('r1'))}
There are two ways to specify a filter:
a. Using a filter string; more information is available in the Filter Language document attached to the HBASE-4176 JIRA.
b. Using the fully qualified class name of the filter.
There is also a CACHE_BLOCKS modifier that switches block caching for the scan on or off. It is on by default (CACHE_BLOCKS => true) and can be turned off (CACHE_BLOCKS => false).
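Because rows are stored in byte order of their keys, a prefix query like the r1 example above is equivalent to a range scan from the prefix itself up to the smallest key that no longer starts with it. The Python sketch below illustrates that idea on an in-memory sorted list; the helper names `prefix_stop_key` and `prefix_scan` are hypothetical, and this is not the code HBase runs.

```python
from bisect import bisect_left


def prefix_stop_key(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with prefix."""
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # in this sketch, empty means "no upper bound"
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_scan(sorted_keys, prefix):
    """Return the keys that start with prefix, using two binary searches."""
    start = bisect_left(sorted_keys, prefix)
    stop_key = prefix_stop_key(prefix)
    stop = bisect_left(sorted_keys, stop_key) if stop_key else len(sorted_keys)
    return sorted_keys[start:stop]


keys = sorted([b"r1", b"r10", b"r1a", b"r2", b"q9", b"r0"])
matches = prefix_scan(keys, b"r1")
print(matches)       # [b'r1', b'r10', b'r1a']
print(len(matches))  # 3
```

In shell terms, the same range would be expressed with the STARTROW and STOPROW modifiers listed above instead of a filter.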
(5) Delete the specified data
delete 'scores', 'Jim', 'grade'
deleteall 'scores', 'Jim'
The delete command takes a column and has only one form:
hbase> delete 't1', 'r1', 'c1', ts1
The deleteall command, used in the second example above, deletes an entire row; use it with caution.
To delete the contents of a whole table, use the truncate command. There is no direct full-table delete; truncate is a combination of the disable, drop, and create commands.
(6) Modify the table structure
disable 'scores'
alter 'scores', NAME => 'info'
enable 'scores'
The alter command is used as follows (if it fails on your version, disable the table first):
a. Change or add a column family:
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
b. Delete a column family:
hbase> alter 't1', NAME => 'f1', METHOD => 'delete'
hbase> alter 't1', 'delete' => 'f1'
c. Modify table attributes such as MAX_FILESIZE, MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH:
hbase> alter 't1', METHOD => 'table_att', MAX_FILESIZE => '134217728'
d. Add a table coprocessor:
hbase> alter 't1', METHOD => 'table_att', 'coprocessor' => 'hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2'
Multiple coprocessors can be configured on one table; a sequence number is appended automatically to identify each one. A coprocessor (roughly, a server-side filtering program) is specified in the following format:
[coprocessor jar file location] | class name | [priority] | [arguments]
e. Remove a table attribute or a coprocessor as follows:
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
f. Execute several alter operations at once:
hbase> alter 't1', {NAME => 'f1'}, {NAME => 'f2', METHOD => 'delete'}
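The coprocessor attribute format shown in item d, [jar location] | class name | [priority] | [arguments], can be unpacked with a short Python function. This is a toy parser written for illustration only; the function name and the returned dictionary keys are assumptions, and HBase's own parsing may differ in details such as validation.

```python
def parse_coprocessor_spec(spec: str) -> dict:
    """Split 'jar|class|priority|args' into its four fields.

    Only the class name is required; the other fields may be empty.
    """
    parts = spec.split("|")
    if len(parts) < 2 or not parts[1].strip():
        raise ValueError("class name is required")
    # Pad missing trailing fields so that unpacking always sees four values.
    parts += [""] * (4 - len(parts))
    jar, class_name, priority, args = (p.strip() for p in parts[:4])

    # Arguments are comma-separated key=value pairs.
    arguments = {}
    if args:
        for pair in args.split(","):
            key, _, value = pair.partition("=")
            arguments[key.strip()] = value.strip()

    return {
        "jar": jar or None,
        "class_name": class_name,
        "priority": int(priority) if priority else None,
        "arguments": arguments,
    }


spec = "hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2"
parsed = parse_coprocessor_spec(spec)
print(parsed["class_name"])  # com.foo.FooRegionObserver
print(parsed["priority"])    # 1001
print(parsed["arguments"])   # {'arg1': '1', 'arg2': '2'}
```

Parsing the value before passing it to alter is a cheap way to catch a missing class name or a malformed priority before the table is touched.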
(7) Count rows:
hbase> count 't1'
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000
count can take a long time on a large table; for very large tables a MapReduce row-count job can be used instead. The current count is displayed every 1000 rows by default (INTERVAL). Scan caching is enabled for count, with a default cache size of 10 rows (CACHE).
(8) Disable and enable
Many operations require the table to be taken offline first, such as the alter operation above and dropping a table. disable_all and enable_all operate on several tables at once.
(9) Drop a table
Disable the table before running the drop command.
drop 't1'
The commands above are only the most common ones. The complete list of HBase shell commands follows, divided into command groups; for detailed usage of any command, run help "cmd".
COMMAND GROUPS:
Group name: general
Commands: status, version
Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
Group name: dml
Commands: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
Group name: tools
Commands: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
Group name: replication
Commands: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
Group name: security
Commands: grant, revoke, user_permission
4. HBase shell scripts
Since these are shell commands, you can also write a series of HBase shell commands into a file and have them executed in order, just like a Linux shell script. Run the file with the following command:
$ hbase shell test.hbaseshell
It is easy to use.
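A script file for hbase shell does not have to be written by hand. As a rough sketch, the Python below generates the text of such a script for the scores example; the helper names `quote` and `build_script` are hypothetical, and the trailing exit line reflects my assumption that without it the shell may stay open after the script finishes.

```python
def quote(text: str) -> str:
    """Wrap text in single quotes, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_script(table, families, rows):
    """Return HBase shell script text: create, one put per cell, then scan."""
    lines = ["create " + ", ".join(quote(x) for x in [table, *families])]
    for row_key, columns in rows.items():
        for column, value in columns.items():
            fields = [table, row_key, column, value]
            lines.append("put " + ", ".join(quote(x) for x in fields))
    lines.append("scan " + quote(table))
    lines.append("exit")  # assumption: ends the shell session after the script
    return "\n".join(lines) + "\n"


rows = {
    "Tom": {"grade:": "5", "course:math": "97", "course:art": "87"},
    "Jim": {"grade:": "4", "course:math": "89", "course:art": "80"},
}
script = build_script("scores", ["grade", "course"], rows)
print(script)
```

Saving the returned text to a file such as test.hbaseshell and passing that file to hbase shell, as shown above, would then replay the same create and put commands used in section 3.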