1. Introduction
HBase is a distributed, column-oriented open-source database. It originated from Google's paper "Bigtable: A Distributed Storage System for Structured Data"; HBase is an open-source implementation of Google Bigtable. It uses Hadoop HDFS as its file storage system, Hadoop MapReduce to process the massive data stored in HBase, and ZooKeeper as a coordination service.
2. HBase table structure
HBase stores data in tables. A table consists of rows and columns, and the columns are grouped into column families.
| Row key | column-family1:column1 | column-family1:column2 | column-family2:column1 | column-family2:column2 | column-family2:column3 | column-family3:column1 |
|---------|------------------------|------------------------|------------------------|------------------------|------------------------|------------------------|
| key1    | t1:abc, t2:gdxdf       |                        |                        |                        |                        |                        |
| key2    |                        |                        |                        |                        |                        |                        |
| key3    |                        |                        |                        |                        |                        |                        |
As shown above, key1, key2, and key3 are the unique row key values of three records, and column-family1, column-family2, and column-family3 are three column families, each containing several columns. For example, column-family1 contains two columns named column1 and column2. t1:abc and t2:gdxdf belong to the cell uniquely identified by row key1 and column-family1:column1: the cell contains two values, abc and gdxdf, with different timestamps t1 and t2, and HBase returns the value with the latest timestamp to the requester.
The specific meanings of these terms are as follows:
(1) Row key
As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:
(1.1) access through a single row key
(1.2) scan a range of row keys
(1.3) full table scan
The row key can be any string (the maximum length is 64 KB; in practice, row keys are usually 10-100 bytes). In HBase, the row key is stored as a byte array.
Data is stored in lexicographic (byte) order of the row key. When designing row keys, make full use of this sort order and store rows that are frequently read together next to each other (locality).
Note:
Integers sorted lexicographically come out as 1, 10, 11, 12, ..., 19, 2, 20, 21, ..., 9, 91, 92, ..., 99. To maintain the natural order of integers, the row key must be left-padded with zeros.
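This padding rule is easy to verify with a small Python sketch (illustrative only, not HBase code), comparing unpadded and zero-padded integer keys under byte-wise sorting:

```python
# HBase sorts row keys lexicographically (byte order), so unpadded
# integers come back in "string order", not numeric order.
keys = [1, 9, 10, 11, 19, 21, 91, 99]

unpadded = sorted(str(k) for k in keys)
print(unpadded)   # ['1', '10', '11', '19', '21', '9', '91', '99']

# Left-padding with zeros restores the natural integer order.
padded = sorted(str(k).zfill(2) for k in keys)
print(padded)     # ['01', '09', '10', '11', '19', '21', '91', '99']
```

The same reasoning applies to any fixed-width encoding of numeric keys.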
A read/write operation on a row is atomic (no matter how many columns are read or written at a time). This design decision makes it easy for users to reason about the program's behavior when performing concurrent updates on the same row.
(2) Column family
Each column in an HBase table belongs to a column family. A column family is part of the table's schema (individual columns are not) and must be defined before the table is used. All column names are prefixed with their column family; for example, courses:history and courses:math both belong to the courses column family.
Access control and disk and memory usage statistics are all performed at the column-family level. In practice, permissions on column families help us manage different types of applications: we can allow some applications to add new basic data, some to read basic data and create derived column families, and some only to browse data (or not even that, for privacy reasons).
(3) Cell
In HBase, a storage unit identified by row and column is called a cell. It is uniquely identified by {row key, column (= <family>:<qualifier>), version}. The data in a cell has no type and is stored as raw bytes.
(4) Timestamp
Each cell can store multiple versions of the same data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when data is written, in which case it is the current system time in milliseconds, or it can be assigned explicitly by the client. To avoid version conflicts, the application must itself generate unique timestamps. Within each cell, versions are sorted in reverse chronological order, so the latest data comes first.
To avoid the management burden (storage and indexing) of too many data versions, HBase provides two version-recycling policies: keep the last n versions of the data, or keep the versions from a recent period (for example, the last seven days). Either can be set per column family.
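As a concept sketch (plain Python, not the HBase API), a cell can be modeled as a list of (timestamp, value) pairs kept newest-first and trimmed to the last n versions:

```python
# Toy model of HBase cell versioning: newest version first,
# trimmed to a per-column-family maximum (here n = 2).
MAX_VERSIONS = 2

def put(cell, timestamp, value, max_versions=MAX_VERSIONS):
    cell.append((timestamp, value))
    cell.sort(key=lambda tv: tv[0], reverse=True)  # reverse chronological order
    del cell[max_versions:]                        # recycle versions beyond n

cell = []
put(cell, 1, "abc")    # t1:abc
put(cell, 2, "gdxdf")  # t2:gdxdf
put(cell, 3, "xyz")

print(cell)        # [(3, 'xyz'), (2, 'gdxdf')] -- the t1 version was recycled
print(cell[0][1])  # a plain read returns the newest value: xyz
```

This mirrors the "keep the last n versions" policy; the time-window policy would instead drop pairs older than a cutoff timestamp.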
3. Basic usage of the HBase shell
HBase provides a shell terminal for user interaction. Run the hbase shell command to enter the command-line interface, and run help to view command help.
The usage of HBase is demonstrated below with a table of student scores:
| Name | grade | course:math | course:art |
|------|-------|-------------|------------|
| Tom  | 5     | 97          | 87         |
| Jim  | 4     | 89          | 80         |
Here, grade is a column family with only a single column, while course is a column family with two columns, math and art. Of course, we can create more columns under course as needed, such as computer and physics, simply by adding them to the course column family.
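Conceptually (a Python sketch, not how HBase physically stores data), each row maps fully qualified column names family:qualifier to values, so a cell is addressed by row key plus qualified column:

```python
# Sketch of the scores table: row key -> {"family:qualifier" -> value}.
scores = {
    "Tom": {"grade:": "5", "course:math": "97", "course:art": "87"},
    "Jim": {"grade:": "4", "course:math": "89", "course:art": "80"},
}

# A cell is addressed by (row key, family:qualifier):
print(scores["Tom"]["course:math"])  # 97

# All columns of the "course" family for one row:
course = {c: v for c, v in scores["Jim"].items() if c.startswith("course:")}
print(course)  # {'course:math': '89', 'course:art': '80'}
```

Adding a new column such as course:physics is just inserting a new key under a row, which is why the schema only has to declare the families, not the columns.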
(1) Create a table named scores with two column families, grade and course:
hbase(main):001:0> create 'scores', 'grade', 'course'
You can use the list command to view the tables in the current HBase instance and the describe command to view a table's structure. (Remember to quote all table names and column names.)
(2) Insert values based on the designed table structure:
put 'scores', 'Tom', 'grade:', '5'
put 'scores', 'Tom', 'course:math', '97'
put 'scores', 'Tom', 'course:art', '87'
put 'scores', 'Jim', 'grade', '4'
put 'scores', 'Jim', 'course:', '89'
put 'scores', 'Jim', 'course:', '80'
This makes the table structure very flexible: it is easy to add sub-columns within a column family. If a column family has no sub-columns, the trailing colon after its name is optional.
The put command is fairly simple and has only one form:
hbase> put 't1', 'r1', 'c1', 'value', ts1
Here t1 is the table name, r1 the row key, c1 the column name, and value the cell value; ts1 is the timestamp, which is usually omitted.
(3) Query data by row key:
get 'scores', 'Jim'
get 'scores', 'Jim', 'grade'
You may have noticed the pattern: an HBase shell operation is roughly the operation keyword followed by the table name, row key, and column name. Additional conditions are added in curly braces.
The get command takes the following forms:
hbase> get 't1', 'r1'
hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
hbase> get 't1', 'r1', {COLUMN => 'c1'}
hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMERANGE => [ts1, ts2], VERSIONS => 4}
hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1, VERSIONS => 4}
hbase> get 't1', 'r1', 'c1'
hbase> get 't1', 'r1', 'c1', 'c2'
hbase> get 't1', 'r1', ['c1', 'c2']
(4) Scan all data:
scan 'scores'
You can also specify modifiers: TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH, or COLUMNS. With no modifiers, as in the example above, all data rows are displayed.
Examples:
hbase> scan '.META.'
hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter (123, 456))"}
hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
A filter can be specified in two ways:
A. Using a filter string; more information on this is available in the filter language document attached to the HBASE-4176 JIRA.
B. Using the filter's fully qualified class name.
There is also a CACHE_BLOCKS modifier that toggles block caching for the scan. It is enabled by default (CACHE_BLOCKS => true) and can be disabled (CACHE_BLOCKS => false).
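To make the filter semantics above concrete, here is a toy Python sketch of what PrefixFilter does during a scan (an illustration of the behavior, not HBase's implementation):

```python
# Toy sketch of PrefixFilter semantics: a scan visits rows in sorted
# key order and keeps only rows whose key starts with the prefix.
rows = {"row1": "a", "row20": "b", "row21": "c", "other": "d"}

def scan_with_prefix(table, prefix):
    return {k: v for k, v in sorted(table.items()) if k.startswith(prefix)}

print(scan_with_prefix(rows, "row2"))  # {'row20': 'b', 'row21': 'c'}
```

Because row keys are stored in sorted order, a real prefix scan can also stop early once keys no longer match the prefix, rather than examining every row.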
(5) Delete specified data:
delete 'scores', 'Jim', 'grade'
delete 'scores', 'Jim'
The delete command likewise has only one form:
hbase> delete 't1', 'r1', 'c1', ts1
In addition, there is a deleteall command, which deletes an entire row. Use it with caution!
If you need to empty a whole table, use the truncate command. There is actually no direct whole-table delete command; truncate is a combination of the disable, drop, and create commands.
(6) Modify the table structure:
disable 'scores'
alter 'scores', NAME => 'info'
enable 'scores'
The alter command is used as follows (on versions where altering an enabled table fails, the table generally must be disabled first):
A. Change or add a column family:
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
B. Delete a column family:
hbase> alter 't1', NAME => 'f1', METHOD => 'delete'
hbase> alter 't1', 'delete' => 'f1'
C. You can also modify table attributes such as MAX_FILESIZE, MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH:
hbase> alter 't1', METHOD => 'table_att', MAX_FILESIZE => '123'
D. You can add a table coprocessor:
hbase> alter 't1', METHOD => 'table_att', 'coprocessor' => 'hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2'
Multiple coprocessors can be configured for a table, and a sequence number is automatically incremented to identify them. To load a coprocessor (a filter program), the specification must follow this rule:
[Coprocessor JAR file location] | class name | [Priority] | [arguments]
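The pipe-separated specification above can be illustrated with a small Python helper (a hypothetical parser written just to show the four fields, not HBase's actual code; it assumes all four fields are present):

```python
# Split a coprocessor spec "jar|class|priority|arguments" into its fields.
# Naive split for illustration; the optional fields are not specially handled.
def parse_coprocessor(spec):
    jar, cls, priority, args = spec.split("|")
    return {"jar": jar, "class": cls, "priority": priority, "args": args}

spec = "hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2"
print(parse_coprocessor(spec))
# {'jar': 'hdfs:///foo.jar', 'class': 'com.foo.FooRegionObserver',
#  'priority': '1001', 'args': 'arg1=1,arg2=2'}
```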
E. Remove a coprocessor or table attribute as follows:
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'MAX_FILESIZE'
hbase> alter 't1', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
F. Multiple alter operations can be executed at once:
hbase> alter 't1', {NAME => 'f1'}, {NAME => 'f2', METHOD => 'delete'}
(7) Count rows:
hbase> count 't1'
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000
Count is generally time-consuming; for large tables, a MapReduce job is used for counting. Scanned rows are cached, 10 rows by default (CACHE), and progress is reported every 1000 rows by default (INTERVAL).
(8) Disable and enable operations
Many operations require the table to be taken offline first, such as the alter operations mentioned above; deleting a table also requires this. disable_all and enable_all can operate on multiple tables at once.
(9) Delete a table
First disable the table, then run the drop command:
drop 't1'
The above is a detailed explanation of some common commands. The full set of HBase shell commands is listed below, divided into command groups. As you can see, the English command names are largely self-explanatory; use help "cmd" for detailed usage.
Command groups:
Group name: general
Commands: status, version
Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, is_disabled, is_enabled, list, show_filters
Group name: dml
Commands: count, delete, deleteall, get, get_counter, incr, put, scan, truncate
Group name: tools
Commands: assign, balance_switch, balancer, close_region, compact, flush, hlog_roll, major_compact, move, split, unassign, zk_dump
Group name: replication
Commands: add_peer, disable_peer, enable_peer, list_peers, remove_peer, start_replication, stop_replication
Group name: security
Commands: grant, revoke, user_permission
4. HBase shell scripts
Since these are shell commands, you can of course write all of your HBase shell commands into a file and execute them in sequence, just like a Linux shell script. Write the commands into a file, then run:
$ hbase shell test.hbaseshell
Easy to use.
HBase shell basics and common commands