HBase is a distributed, column-oriented, open-source database. Its design is based on Google's Bigtable paper, and it stores its data on the Hadoop HDFS file system. HBase differs from a typical relational database (RDBMS): it is a column-family-oriented store well suited to unstructured data.
The following assumes that Hadoop and HBase are already installed.
- I. Introduction to the HBase shell
The HBase shell is one of the interfaces for interacting with HBase; of course, you can also use other means such as the Java API. The basic HBase commands are listed below:
| Operation | Command expression | Note |
| --- | --- | --- |
| Create a table | create 'table_name', 'family1', 'family2', ..., 'familyN' | |
| Add a record | put 'table_name', 'rowkey', 'family:column', 'value' | |
| View a record | get 'table_name', 'rowkey' | Fetches a single row; this is the most commonly used HBase command. |
| Count the records in a table | count 'table_name' | This command is slow, and there is no faster built-in way to count rows. |
| Delete a record | delete 'table_name', 'rowkey', 'family_name:column' or deleteall 'table_name', 'rowkey' | The first form deletes a single column of a record; the second deletes the entire row. |
| Delete a table | 1. disable 'table_name' 2. drop 'table_name' | |
| View all records | scan 'table_name', {LIMIT => 10} | LIMIT => 10 returns only 10 records; without it, all records are returned. |
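Putting the basic commands together, a minimal shell session might look like the following sketch (the table, family, row, and column names are made up for illustration):

```
create 'test_table', 'info'
put 'test_table', 'row1', 'info:name', 'zhangsan'
get 'test_table', 'row1'
scan 'test_table', {LIMIT => 10}
count 'test_table'
delete 'test_table', 'row1', 'info:name'   # remove one column of the row
deleteall 'test_table', 'row1'             # remove the entire row
disable 'test_table'
drop 'test_table'
```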
The basic commands above are enough for everyday HBase operations. The shell attributes below play a significant role in later HBase work, mainly at table-creation time, so let's look at the following create attributes.

1. BLOOMFILTER, default NONE. Should a Bloom filter be used, and of which kind? It can be enabled separately for each column family; in the Java API this is HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL). NONE means no Bloom filtering. With ROW, the hash of the row key is added to the Bloom filter on every insert; with ROWCOL, the hash of row key + column family + column qualifier is added on every insert. Usage: create 'table', {NAME => 'info', BLOOMFILTER => 'ROW'}. Enabling a Bloom filter can skip unnecessary disk reads and thus reduce read latency.

2. VERSIONS, default 3. This is the number of versions of each value to retain. If our data does not need that many copies, is updated all the time, and old versions are of no value to us, setting this parameter to 1 saves 2/3 of the space. Usage: create 'table', {NAME => 'info', VERSIONS => 2}.

3. COMPRESSION, default NONE, i.e., no compression. This sets whether, and with which algorithm, the column family is compressed: create 'table', {NAME => 'info', COMPRESSION => 'SNAPPY'}. I suggest the Snappy algorithm. There are many compression benchmarks online; I reproduce one below as a reference (Snappy installation will be described in a separate chapter later). The table is a set of test data Google published a few years ago; my own Snappy tests came out close to the numbers listed. Before Snappy was released (Google open-sourced it in 2011), HBase used LZO to get the fastest compression/decompression with low CPU consumption; since its release, Snappy has been the recommended algorithm (see HBase: The Definitive Guide). You can run a more detailed comparison of LZO and Snappy on your actual workload and then choose.
| Algorithm | % Remaining | Encoding | Decoding |
| --- | --- | --- | --- |
| GZIP | 13.4% | 21 MB/s | 118 MB/s |
| LZO | 20.5% | 135 MB/s | 410 MB/s |
| Zippy/Snappy | 22.2% | 172 MB/s | 409 MB/s |
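For instance, the three attributes above can be combined in a single create statement; the following is a sketch with a made-up table name (note that SNAPPY only works once the native Snappy libraries are installed, as discussed above):

```
create 'my_table', {NAME => 'info', BLOOMFILTER => 'ROW', VERSIONS => 2, COMPRESSION => 'SNAPPY'}
```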
What if the table was created without compression and you want to add a compression algorithm later? HBase has another command for this: alter.
4. alter usage, for example to change the compression algorithm:
1. disable 'table'
2. alter 'table', {NAME => 'info', COMPRESSION => 'snappy'}
3. enable 'table'
Delete a column family:
disable 'table'
alter 'table', {NAME => 'info', METHOD => 'delete'}
enable 'table'
However, after these changes you may find that the table still takes up almost as much space as before. That is because the store files have not been rewritten yet; run major_compact 'table' to force a major compaction.

5. TTL, default 2147483647, i.e., Integer.MAX_VALUE, which is about 68 years. This parameter sets the time-to-live of the column family's data in seconds, counted from the time each cell is written, and lets you choose a retention period to match your requirements. Data older than the TTL is no longer visible in the table and is physically removed at the next major compaction, which we will cover in detail later. Note the interaction with MIN_VERSIONS (default '0'): with MIN_VERSIONS => '0', once the TTL expires all data in the family is deleted; if MIN_VERSIONS is nonzero, the newest MIN_VERSIONS versions are retained and everything older is deleted. For example, MIN_VERSIONS => '1' keeps only the latest version of the data. (A sketch of setting TTL at creation time follows point 8 below.)

6. describe 'table' shows the parameters the table was created with, or their default values.

7. disable_all 'toplist.*' — disable_all supports regular expressions and lists the matching tables before acting, for example:
toplist_a_total_1001
toplist_a_total_1002
toplist_a_total_1008
toplist_a_total_1009
toplist_a_total_1019
toplist_a_total_1035
...
Disable the above 25 tables (y/n)? — a confirmation prompt is shown before the tables are actually disabled.

8. drop_all is used in the same way as disable_all.
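As promised in point 5, here is a sketch of setting TTL at table creation; the table name, family name, and values are made up for illustration:

```
# TTL is in seconds: 30 days = 30 * 24 * 3600 = 2592000.
# MIN_VERSIONS => '1' keeps the newest version even after the TTL expires.
create 'event_log', {NAME => 'info', TTL => '2592000', MIN_VERSIONS => '1'}
describe 'event_log'
```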
9. Pre-splitting an HBase table, i.e., creating the region partitions manually. By default, a table is created with a single region; when data is imported, every HBase client writes to that one region until it grows large enough to split. One way to speed up bulk writes is to create a number of empty regions in advance, so that writes are load-balanced across the cluster according to the region boundaries. Usage:

create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

You can also use the RegionSplitter utility:

hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f info

The parameters are easy to read: test_table is the table name, HexStringSplit is the split algorithm, -c 10 is the number of regions, and -f info is the column family. This pre-splits the table into 10 regions, avoiding the time spent on automatic splitting once data reaches the store-file size limit; combined with a well-designed rowkey, it spreads the concurrent requests across regions (roughly) evenly and achieves the best I/O efficiency. Pre-splitting does, however, call for a large region file-size limit. The parameter to set is hbase.hregion.max.filesize, whose default is 10 GB, i.e., a single region may grow to 10 GB by default (the default moved from 256 MB to 1 GB to 10 GB across versions 0.90, 0.92, and 0.94.3). Note that when HBase is used as MapReduce input via TableInputFormat, each region gets exactly one map task; if a region holds far less than 10 GB of data, only one map is launched, which wastes resources. In that case you can either lower this parameter appropriately, or pre-allocate the regions and set hbase.hregion.max.filesize to a value that is unlikely to be reached, such as 1000 GB, splitting regions manually if it ever is.
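If the rowkey distribution is known in advance, the shell also accepts explicit split points instead of a split algorithm; a minimal sketch with made-up boundaries:

```
# Creates 4 regions: (-inf, '10'), ['10', '20'), ['20', '30'), ['30', +inf)
create 't2', 'f1', SPLITS => ['10', '20', '30']
```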
From: http://blog.csdn.net/zhouleilei/article/details/12654329