Learning HBase in Hadoop

HBase Introduction

HBase is a distributed, column-oriented, open-source database modeled on the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a sub-project of the Apache Hadoop project. HBase differs from general relational databases in that it is a database suited to storing unstructured data, and in that it uses a column-based rather than a row-based storage model.

HBase, the Hadoop database, is a highly reliable, high-performance, column-oriented, scalable distributed storage system. HBase can be used to build large-scale structured storage clusters on inexpensive commodity PC servers.

Pig and Hive also provide high-level language support for HBase, which makes statistical processing of data in HBase very simple. Sqoop provides a convenient RDBMS data import function for HBase, making it easy to migrate data from a traditional database into HBase.
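For instance, a Sqoop import from MySQL into HBase might look like the following sketch (the connection string, database, table, column family, and row key names are placeholders, not from the original text):

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username scott -P \
  --table orders \
  --hbase-table orders \
  --column-family cf \
  --hbase-row-key order_id \
  --hbase-create-table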

1. HBase Installation

(1) Three modes: standalone mode, pseudo-distributed mode, and fully distributed mode;

(2) Choice of Java version:

HBase Version | JDK 6 | JDK 7 | JDK 8
1.2 | Not supported | Yes | Yes
1.1 | Not supported | Yes | Running with JDK 8 will work but is not well tested
1.0 | Not supported | Yes | Running with JDK 8 will work but is not well tested
0.98 | Yes | Yes | Running with JDK 8 works but is not well tested. Building with JDK 8 would require removal of the deprecated remove() method of the PoolMap class and is under consideration. See HBASE-7608 for more information on JDK 8 support.

(3) The version of the Hadoop jar package bundled with HBase;

The Hadoop jar package shipped in the lib directory of HBase 0.98 is version 2.2; it is recommended to replace it with the Hadoop jar files actually used by the cluster (version 2.5.2 in this case) to avoid exceptions caused by a version mismatch;
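A minimal sketch of that replacement, assuming HBase is installed under /usr/local/hbase and Hadoop 2.5.2 under /usr/local/hadoop (both paths are assumptions; adjust for your cluster):

# inside the HBase installation directory
cd /usr/local/hbase/lib
# remove the bundled Hadoop 2.2 jars
rm hadoop-*2.2*.jar
# copy in the jars the running cluster actually uses (2.5.2 here)
cp /usr/local/hadoop/share/hadoop/{common,hdfs,mapreduce,yarn}/hadoop-*.jar .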

2. HBase's relationship to Hive

(1) What are the two, respectively?

Apache Hive is a data warehouse built on top of the Hadoop infrastructure. Hive lets you query data stored on HDFS using HQL, a SQL-like language that is ultimately translated into MapReduce jobs. Although Hive provides SQL query functionality, Hive cannot serve interactive queries, because it only executes batch jobs on Hadoop.
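For illustration (the table and columns are hypothetical), an HQL query like the following is compiled into MapReduce jobs rather than answered interactively:

SELECT page, COUNT(*) AS hits
FROM access_log
GROUP BY page
ORDER BY hits DESC
LIMIT 10;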

Apache HBase is a key/value store that runs on top of HDFS. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. HBase is partitioned into tables, and tables are further split into column families. Column families must be declared in the schema, and a column family groups a certain kind of columns (the columns themselves do not require a schema definition). For example, a "message" column family may contain the columns "to", "from", "date", "subject", and "body". Each key/value pair is called a cell in HBase, and each key consists of a row key, a column family, a column qualifier, and a timestamp. A row in HBase is a collection of key/value mappings uniquely identified by its row key. HBase leverages Hadoop's infrastructure to scale horizontally on common, commodity hardware.
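A small HBase shell sketch of this model, reusing the hypothetical "message" column family (the table and row key are made up for illustration):

hbase> create 'inbox', 'message'
hbase> put 'inbox', 'user1-0001', 'message:to', 'alice@example.com'
hbase> put 'inbox', 'user1-0001', 'message:subject', 'hello'
hbase> get 'inbox', 'user1-0001'

Each put writes one cell whose key is the combination of row key, column family, column qualifier, and timestamp.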

(2) Characteristics of both

Hive helps people who are familiar with SQL run MapReduce tasks. Because it is JDBC-compatible, it can also integrate with existing SQL tools. Running a Hive query can take a long time because, by default, it traverses all of the data in a table. Despite this shortcoming, the amount of data traversed can be controlled through Hive's partitioning mechanism. Partitioning allows a filtering query to run over data stored in separate folders and to touch only the data in the specified folders (partitions). This mechanism can be used, for example, to process only the files within a certain time range, as long as the file naming includes a time component (see the sketch below).
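A minimal HiveQL sketch of this partition pruning (the table and partition column are hypothetical):

CREATE TABLE logs (line STRING) PARTITIONED BY (dt STRING);

-- Only the folder for partition dt='2015-06-01' is traversed:
SELECT COUNT(*) FROM logs WHERE dt = '2015-06-01';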

HBase works by storing key/value pairs. It supports four main operations: add or update a row, view the cells within a range, get a specified row, and delete a specified row, column, or column version. Version information can be used to retrieve historical data (old versions of each row can be deleted, and the space then reclaimed through HBase compactions). Although HBase has tables, a schema is required only for tables and column families; columns need no schema. HBase tables also offer increment/counter functionality.
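The four operations in HBase shell form (a hypothetical table 't1' with column family 'f1'):

hbase> put 't1', 'r1', 'f1:c1', 'v1'                       # add or update a row
hbase> scan 't1', {STARTROW => 'r1', STOPROW => 'r9'}      # view cells within a range
hbase> get 't1', 'r1'                                      # get a specified row
hbase> get 't1', 'r1', {COLUMN => 'f1:c1', VERSIONS => 3}  # read historical versions
hbase> delete 't1', 'r1', 'f1:c1'                          # delete a cell of a row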

(3) Restrictions

Hive does not currently support update operations. Also, because Hive runs batch jobs on Hadoop, getting a query result takes a long time, usually minutes to hours. Hive requires a pre-defined schema to map files and directories to columns, and Hive is not ACID-compliant.

HBase queries are written in a language of their own that must be learned. SQL-like functionality can be obtained through Apache Phoenix, but at the cost of having to provide a schema. In addition, HBase is not fully ACID-compliant, although it does support certain ACID properties. Last but not least, running HBase requires ZooKeeper, a distributed coordination service that covers configuration management, metadata maintenance, and namespace services.
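A sketch of that Phoenix trade-off (the table and columns are hypothetical): a schema must be declared up front before SQL-like access becomes possible.

CREATE TABLE messages (id BIGINT PRIMARY KEY, subject VARCHAR, body VARCHAR);
UPSERT INTO messages VALUES (1, 'hello', 'first message');
SELECT subject FROM messages WHERE id = 1;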

(4) Application Scenarios

Hive is ideal for analytical queries over data accumulated during a period of time, for example computing trends or summarizing website logs. Hive should not be used for real-time queries, because it can take a long time to return results.

HBase is ideal for real-time queries over big data. Facebook uses HBase for its messages and for real-time analytics. It can also be used to count Facebook connections.

(5) Summary

Hive and HBase are two different Hadoop-based technologies: Hive is a SQL-like engine that runs MapReduce tasks, while HBase is a NoSQL key/value database on top of Hadoop. Of course, the two tools can be used together. Just as Google can be used for search and Facebook for social networking, Hive can be used for statistical queries and HBase for real-time queries, and data can be written from Hive into HBase and read back from HBase into Hive.
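For example, Hive can expose an HBase table through its HBase storage handler, so data can flow between the two (the Hive table name and column mapping here are hypothetical):

CREATE EXTERNAL TABLE hive_over_hbase (key STRING, val STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f1:c1')
TBLPROPERTIES ('hbase.table.name' = 't1');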

3. HBase Shell Practice

(0) Common commands under HBase shell

help

help 'create'

Several common commands: create, list, put, scan, get, alter.
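For instance, alter changes a column family's settings on an existing table, and describe shows the resulting schema (table 't1' is hypothetical):

hbase> alter 't1', {NAME => 'f1', VERSIONS => 3}
hbase> describe 't1'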

(1) Tips for creating tables (from the official help):

Create a table with namespace=ns1 and table qualifier=t1; the command is as follows:

hbase> create 'ns1:t1', {NAME => 'f1', VERSIONS => 5}

Create a table with namespace=default and table qualifier=t1; the command is as follows:

hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}

hbase> # The above in shorthand would be the following:

hbase> create 't1', 'f1', 'f2', 'f3'

hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}

hbase> create 't1', {NAME => 'f1', CONFIGURATION => {'hbase.hstore.blockingStoreFiles' => '10'}}

Table configuration options can be put at the end; examples are as follows:

hbase> create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30', '40']

hbase> create 't1', 'f1', SPLITS => ['10', '20', '30', '40']

hbase> create 't1', 'f1', SPLITS_FILE => 'splits.txt', OWNER => 'johndoe'

hbase> create 't1', {NAME => 'f1', VERSIONS => 5}, METADATA => {'mykey' => 'myvalue'}

hbase> # Optionally pre-split the table into NUMREGIONS, using
hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)

hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit', CONFIGURATION => {'hbase.hregion.scan.loadColumnFamiliesOnDemand' => 'true'}}

You can also keep a reference to the created table; the command is as follows:

hbase> t1 = create 't1', 'f1'

(2) Execute a script file

Edit a text file, sample_commands.txt:

create 'test', 'cf'

list 'test'

put 'test', 'row1', 'cf:a', 'value1'

put 'test', 'row2', 'cf:b', 'value2'

put 'test', 'row3', 'cf:c', 'value3'

put 'test', 'row4', 'cf:d', 'value4'

scan 'test'

get 'test', 'row1'

disable 'test'

enable 'test'

To execute the script file:

./hbase shell ./sample_commands.txt

4. HBase vs. MongoDB, Redis, and other NoSQL systems

NoSQL = Not Only SQL

HBase, MongoDB, and Redis are all NoSQL storage solutions. In actual project practice, the volume of data their deployments store and process runs from large to small, in that order. HBase is based on column storage and locates data by the three coordinates <key, family:qualifier, timestamp>. Because its qualifiers are dynamically extensible (no schema design is required, and any number of qualifiers can be stored), it is especially suitable for storing data with a sparse table structure (such as Internet web pages). HBase reads data only by a single key, by a key range, or by a full table scan, as sketched below.
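For instance (the table, column family, and timestamp are hypothetical):

hbase> get 'webtable', 'com.example.www', {COLUMN => 'anchor:homepage', TIMESTAMP => 1428859054123}
hbase> scan 'webtable', {STARTROW => 'com.example', STOPROW => 'com.examplf'}

The get addresses a single cell by its three coordinates; the scan reads a key range.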

MongoDB currently has a number of advantages over HBase for SQL-like operations; its secondary indexes support more complex lookups on collections than HBase can offer. Its BSON data structure makes processing document-oriented data more straightforward. MongoDB also supports MapReduce, but HBase is more tightly integrated with Hadoop: running MapReduce over MongoDB requires extra handling (MapReduce needs properties such as data sharding), so it is less direct than with HBase.

HBase and MongoDB are opposites in read/write performance: HBase writes faster than it performs random reads, while MongoDB's write performance appears weaker than its read performance.

Redis is an in-memory key/value system, and the data volume it handles is smaller than that of HBase and MongoDB.

1. A comparison of HBase, MongoDB, and Cassandra: http://www.jdon.com/46128

2. A NoSQL analysis and comparison of HBase, MongoDB, and Redis: http://blog.csdn.net/likika2012/article/details/38931345

3. MongoDB, Redis, and HBase are all NoSQL databases; their differences and positioning: http://www.zhihu.com/question/30219620

5. HBase vs. Oracle (column-oriented vs. row-oriented databases)

(1) The main differences

    • HBase is suitable for scenarios with large numbers of inserts and reads.
    • HBase's bottleneck is hard disk transfer speed; Oracle's bottleneck is hard disk seek time.

HBase essentially has only one operation: insert. An update inserts a new row with a new timestamp, and a delete inserts a row carrying a delete marker. HBase's main write pattern is to collect a batch of data in memory and then write it to disk in bulk, so its write speed depends mainly on the disk's sequential transfer speed. Oracle is different: it must frequently perform random reads and writes, so the disk head constantly seeks for data, and the bottleneck is disk seek time.
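This insert-only behavior can be observed in the HBase shell: a raw scan shows the old cell versions and delete markers that a normal scan hides (the table name is hypothetical):

hbase> put 't1', 'r1', 'f1:c1', 'v1'
hbase> put 't1', 'r1', 'f1:c1', 'v2'              # "update" = new cell with a newer timestamp
hbase> delete 't1', 'r1', 'f1:c1'                 # "delete" = insert a delete marker
hbase> scan 't1', {RAW => true, VERSIONS => 10}   # shows both versions plus the tombstone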

    • HBase is ideal for scenarios such as finding the top N records sorted by time.
    • The different index structures account for these differences in behavior.
    • Oracle can serve both OLTP and OLAP, but in some extreme cases (very heavy loads) it is not appropriate.

(2) The limitations of HBase:

    • It can only do simple key/value queries; complex SQL statistics are not possible.
    • Fast queries can be done only on the row key.

(3) Row-based storage in traditional databases

In a data analysis scenario, a column is often used as the query condition, and the returned result is often just a few columns rather than all of them. Row-oriented databases have poor I/O performance in this case. Oracle, for example, has large data files divided into blocks; rows are placed into each block one after another, packed together until the block is full (some space is reserved for future updates). The drawback of this structure is that when reading a single column, you cannot read just that column's data: you must read the entire block into memory and then extract the column values from it. In other words, to read the data of a few columns in a table, you have to read entire rows first. If those columns hold very little data, say only 100 MB out of 1 TB, reading 1 TB into memory to obtain 100 MB of data is obviously not cost-effective.

B+ tree index

The data access technique Oracle mainly uses is the B+ tree index:

Starting from the root node and descending through the branch nodes, you reach a leaf node, which records the position of the row corresponding to each key value.

Operations on the B+ tree:

B+ tree insertion: split nodes

B+ tree deletion: merge nodes

(4) Column-based storage

Data in the same column is packed together, for example within the same block. When a column needs to be read, only the relevant files or blocks are read into memory and the whole column is read out, so the I/O is much smaller.

The data in the same column has a similar format, so it compresses very well. This saves storage space as well as I/O, because the data is compressed and the amount of data read is therefore smaller.

Row-oriented databases are suitable for OLTP, whereas column-oriented databases are not suitable for OLTP.

BigTable's LSM (Log-Structured Merge) index

In HBase, the log is the data and the data is the log; they are one and the same. Why? Because an HBase update is an insert of a new row, and a delete is also an insert of a row carrying a delete marker; isn't that exactly a log?

In HBase there are MemStores and StoreFiles. In effect, each MemStore and each StoreFile is a B+ tree attached to a column family (a bit like Oracle's index-organized tables, where data and index are integrated). That is, the column family sits underneath with a B+ tree on top. When data is queried, the B+ tree in the MemStore is searched first; if the data is not found there, the StoreFiles are searched next.

If the data of a row is scattered across several column families, how do you find the row's data? You would have to search several B+ trees, which is less efficient. So try to keep each inserted row sparse with respect to column families, so that only one column family has values and the other column families are empty (see the sketch below).
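A minimal HBase shell sketch of that advice (the table, families, and rows are hypothetical): each row receives values in only one of its column families.

hbase> create 'profiles', 'base', 'extra'
hbase> put 'profiles', 'u1', 'base:name', 'alice'   # row u1 only touches the 'base' family
hbase> put 'profiles', 'u2', 'extra:tag', 'vip'     # row u2 only touches the 'extra' family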


