[Repost] HBase Features and Benefits

Source: Internet
Author: User
Tags: key-value store

From: http://blog.jobbole.com/83614/

HBase is a NoSQL database that runs on Hadoop as a distributed, scalable big data store. This means HBase can take advantage of the distributed storage of HDFS and benefit from Hadoop's MapReduce programming model. It can host very large tables, with billions of rows and millions of columns, on a cluster of commodity hardware. Beyond the advantages it inherits from Hadoop, HBase is a powerful database in its own right: it combines real-time queries, in the style of a key/value store, with offline or batch processing via MapReduce. In short, HBase lets you look up individual records in a huge dataset and also produce comprehensive analytical reports over it.

Google once faced a challenge: how can it provide real-time search results across the entire Internet? The answer, in essence, was to cache the Internet and define a new way to search that enormous cache quickly. To that end, Google defined the following technologies:

    • Google File System (GFS): a scalable distributed file system for large, distributed, data-intensive applications.
    • BigTable: a distributed storage system for managing structured data, designed to scale to petabytes of data across thousands of commodity servers.
    • MapReduce: a programming model, and an associated implementation, for processing and generating large datasets.

Shortly after Google published papers on these technologies, open source implementations appeared. In 2007, Mike Cafarella released the code for an open source implementation of BigTable, which he called HBase. HBase has since become a top-level Apache project and runs in production at Facebook, Twitter, and Adobe, to name just a few.

HBase is not a relational database, so it requires a different approach to data modeling. HBase defines a four-dimensional data model; here is the definition of each dimension:

    • Row key: each row has a unique row key; the row key has no data type and is treated internally as a byte array.
    • Column family: data within a row is organized into column families; every row in a table has the same column families, but rows need not have the same column qualifiers within a family. Under the hood, HBase stores each column family in its own data file, so column families must be defined up front, and changing them later is not easy.
    • Column qualifier: the actual columns inside a column family are called column qualifiers; you can think of a column qualifier as the column itself.
    • Version: each column can have a configurable number of versions, and you can access the data for a specific version of a column qualifier.
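The four dimensions above can be pictured as nested maps. The following is a conceptual sketch in Python, purely for illustration; it is not how HBase physically stores data:

```python
# Conceptual sketch of HBase's four-dimensional model as nested maps:
# row key -> column family -> column qualifier -> version (timestamp) -> value
table = {
    b"rowkey1": {                            # row key (a byte array)
        "info": {                            # column family
            "page": {                        # column qualifier
                1410374788088: b"/mypage",   # version (timestamp) -> value
            }
        }
    }
}

# To read a cell you need all four coordinates:
value = table[b"rowkey1"]["info"]["page"][1410374788088]
print(value)  # b'/mypage'
```

Note how reading a single cell requires all four coordinates, exactly as described in the text.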

Figure 1. HBase four-dimensional Data Model

As shown in Figure 1, a row key identifies a specific row, which consists of one or more column families; each column family contains one or more column qualifiers (labeled "columns" in Figure 1), and each column holds one or more versions. To fetch a specific piece of data, you need its row key, column family, column qualifier, and version. When designing an HBase data model, it helps to think about how the data will be retrieved. You can get data out of HBase in one of two ways:

    • A table scan by row key, or across a range of row keys.
    • Batch operations with MapReduce.

This dual-access approach is what makes HBase so powerful. Storing data in Hadoop typically means it is useful for offline or batch analysis (and it is especially well suited to batch analysis), but not for real-time access. HBase supports real-time access through its key/value store and batch analysis through MapReduce. Let's look at real-time access first: as a key/value store, the key is the row key and the value is the collection of column families, as Figure 2 shows.

Figure 2. HBase as a key/value Store

As you can see in Figure 2, the key is the row key we discussed, and the value is the collection of columns. You can retrieve a value by its key; in other words, you can "get" a row by its row key, or retrieve a series of rows between a given start and end row key, which is the table scan mentioned earlier. What you cannot do is query by a column's value in real time, which leads us to an important topic: row key design.

There are two reasons why the design of a row key is important:

    • Table scans operate on row keys, so the design of the row key controls how much real-time or direct access you can get out of HBase.
    • When HBase runs in a production environment, it runs on top of HDFS, and data is distributed across HDFS based on row keys. If all of your row keys begin with "user-", most of your data will likely end up on a single node (which defeats the purpose of distributing the data), so your row keys should vary enough to be spread across the entire deployment.

How you define your row keys depends on how you intend to access the rows. If you want to store data on a per-user basis, one strategy is to build the row key as a byte array: compute a hash of the user ID (MD5 or SHA-1, for example), then append the timestamp (as a long) to the hash. Using a hash has two benefits: (1) it disperses the values, so the data is distributed across the cluster, and (2) it guarantees that keys have a consistent length, which makes table scans easier.
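The hash-plus-timestamp strategy above can be sketched in a few lines of Python. This is an illustrative sketch, not part of any HBase client library; the function name and key layout are our own choices:

```python
import hashlib
import struct
import time

def make_row_key(user_id: str, ts_millis: int) -> bytes:
    """Build a row key as MD5(user_id) + big-endian 8-byte timestamp.

    The hash spreads rows across the cluster; the fixed-length layout
    keeps every key the same size, which simplifies table scans.
    (Illustrative sketch; the name and layout are hypothetical.)
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()  # 16 bytes
    return digest + struct.pack(">q", ts_millis)            # + 8 bytes = 24 bytes

key = make_row_key("user-42", int(time.time() * 1000))
print(len(key))  # every key is 24 bytes long
```

Because the hash comes first, rows for the same user sort together by time, while different users land in different regions of the key space.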

That is enough theory; the following sections show how to set up an HBase environment and how to use it from the command line.

You can download HBase from the Apache website. At the time of writing, the latest version is 0.98.5. The HBase team recommends installing HBase in a Unix/Linux environment; if you want to run it on Windows, you need to install Cygwin first and run HBase on top of it. After downloading, extract the archive to your hard drive. You also need a Java environment installed; if you do not have one, download it from the Oracle website. Add an environment variable named HBASE_HOME whose value is the root directory where you extracted HBase, then run the start-hbase.sh script in the bin folder. It writes its log files to the following directory:

$HBASE_HOME/logs/

You can enter the following URL in the browser to test whether the installation is correct:

http://localhost:60010

If the installation is correct, you should see the following interface.

Figure 3. HBase Management Screen

Let's start by working with HBase from the command line. Execute the following command in the HBase bin directory:

./hbase shell

You should see output similar to this:

HBase Shell; enter 'help&lt;RETURN&gt;' for list of supported commands.
Type "exit&lt;RETURN&gt;" to leave the HBase Shell
Version 0.98.5-hadoop2, rUnknown, Mon Aug  4 23:58:06 PDT 2014

hbase(main):001:0>

Create a table named PageViews with a column family named info:

hbase(main):002:0> create 'PageViews', 'info'
0 row(s) in 5.3160 seconds
=> Hbase::Table - PageViews

Each table must have at least one column family, so we created info. Now let's look at our table by executing the list command:

hbase(main):002:0> list
TABLE
PageViews
1 row(s) in 0.0350 seconds
=> ["PageViews"]

As you can see, the list command returns a table named PageViews. We can get more information about the table with the describe command:

hbase(main):003:0> describe 'PageViews'
DESCRIPTION                                                             ENABLED
 'PageViews', {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE',           true
 BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER',
 KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.0480 seconds

The describe command returns the details of the table, including its list of column families; here we created only one, info. Now let's add some data to the table. The following command adds a new row to the info family:

hbase(main):004:0> put 'PageViews', 'rowkey1', 'info:page', '/mypage'
0 row(s) in 0.0850 seconds

The put command inserts a new record with the row key rowkey1, setting the page column in the info family to the value /mypage. We can then use the get command to retrieve this record by its row key:

hbase(main):005:0> get 'PageViews', 'rowkey1'
COLUMN                        CELL
 info:page                    timestamp=1410374788088, value=/mypage
1 row(s) in 0.0250 seconds

You can see the column info:page, or more precisely the page qualifier in the info family, with a value of /mypage, along with a timestamp indicating when the record was inserted. Let's add another row before doing a table scan:

hbase(main):006:0> put 'PageViews', 'rowkey2', 'info:page', '/myotherpage'
0 row(s) in 0.0050 seconds

Now that we have two rows, let's retrieve all the records in the PageViews table:

hbase(main):007:0> scan 'PageViews'
ROW                           COLUMN+CELL
 rowkey1                      column=info:page, timestamp=1410374788088, value=/mypage
 rowkey2                      column=info:page, timestamp=1410374823590, value=/myotherpage
2 row(s) in 0.0350 seconds

As mentioned earlier, we cannot query by value, but we can scan the table. If you execute the scan command with no arguments, it returns every row in the table, which is most likely not what you want. Instead, you can limit the results by specifying a range of row keys. Let's insert a new record whose row key starts with 's':

hbase(main):012:0> put 'PageViews', 'srowkey2', 'info:page', '/myotherpage'

Now, to query only the records whose row keys fall between 'r' and 's', we can use the following construct:

hbase(main):014:0> scan 'PageViews', { STARTROW => 'r', ENDROW => 's' }
ROW                           COLUMN+CELL
 rowkey1                      column=info:page, timestamp=1410374788088, value=/mypage
 rowkey2                      column=info:page, timestamp=1410374823590, value=/myotherpage
2 row(s) in 0.0080 seconds

This scan returns only the records that sort before 's'. The comparison uses the full row key, so rowkey1 sorts after 'r' and is included. Note also that the scan result includes the STARTROW of the range but excludes the ENDROW. The ENDROW is optional: if we run the same query with only a STARTROW, we get every record whose row key sorts after 'r':

hbase(main):013:0> scan 'PageViews', { STARTROW => 'r' }
ROW                           COLUMN+CELL
 rowkey1                      column=info:page, timestamp=1410374788088, value=/mypage
 rowkey2                      column=info:page, timestamp=1410374823590, value=/myotherpage
 srowkey2                     column=info:page, timestamp=1410375975965, value=/myotherpage
3 row(s) in 0.0120 seconds
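The inclusive-start/exclusive-end range behavior shown in the two scans above can be mimicked over any sorted key list. The following Python sketch is an illustration of the semantics only, not the HBase client API:

```python
# HBase keeps rows sorted lexicographically by row key:
rows = ["rowkey1", "rowkey2", "srowkey2"]

def scan(rows, startrow=None, endrow=None):
    """Mimic an HBase scan over [startrow, endrow).

    The start row is inclusive and the end row is exclusive,
    matching the scan behavior demonstrated in the shell above.
    """
    return [r for r in sorted(rows)
            if (startrow is None or r >= startrow)
            and (endrow is None or r < endrow)]

print(scan(rows, "r", "s"))   # ['rowkey1', 'rowkey2']  (srowkey2 sorts after 's')
print(scan(rows, "r"))        # ['rowkey1', 'rowkey2', 'srowkey2']
```

Because the comparison is on the full key, 'rowkey1' sorts after 'r' and before 's', while 'srowkey2' sorts after 's' and falls outside the bounded range.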

HBase is a NoSQL database, commonly known as the Hadoop database. It is open source and based on Google's BigTable paper. HBase runs on top of HDFS, so it is highly scalable, and it supports Hadoop's MapReduce programming model. HBase has two access modes: random access via row keys, and offline or batch access via MapReduce.

This article described the features and benefits of HBase, briefly reviewed the key points of row key design, and showed how to set up an HBase environment locally and use commands to create a table, insert data, retrieve a specific row, and finally perform a scan.

In the next article, "Developing HBase with Java," we will introduce HBase's programming interface and show how to work with HBase from Java. In the last article in this series, "Using MapReduce for HBase Data Analysis," we will show how to use MapReduce for offline/batch processing.

