Talking about Hive vs. HBase

Last Update: 2014-12-25 Source: Internet

Author: User


For users who are new to big data, it can be hard to tell Hive and HBase apart. This article compares the two in terms of their definitions, characteristics, limitations, and application scenarios.

What is Hive?

Apache Hive is a data warehouse built on top of Hadoop (a distributed system infrastructure); note that it is not a database. Hive can be viewed as a programming interface for users: it does not store or process data itself, but relies on HDFS (Hadoop Distributed File System) for storage and on MapReduce (a programming model of map and reduce steps for parallel processing of large data sets) for computation. Hive provides a SQL-like language called HQL with rich query support for analyzing data stored in HDFS. HQL statements are compiled into MapReduce jobs, which then carry out the query. As a result, even users who are unfamiliar with MapReduce can easily query, summarize, and analyze data in a SQL-style language. MapReduce developers can plug in their own mappers and reducers so that Hive can support more complex data analysis.
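To make the compile step more concrete, here is a minimal Python sketch of how a query such as `SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id` could be split into a map phase and a reduce phase. It only illustrates the idea and is not Hive's actual query planner; the table `page_views`, the sample rows, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, url).
ROWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for user_id, _url in rows:
        yield user_id, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which implements COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_user(rows):
    """Run the three steps in order, like one small MapReduce job."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling `count_by_user(ROWS)` plays the role of running the query: the user only states what to count, and the map and reduce steps are derived from that statement.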

What is HBase?

Apache HBase is a NoSQL (not only SQL, a general term for non-relational databases) database system that runs on top of HDFS. Unlike Hive, HBase handles reads and writes itself, and it is a column-oriented database. HBase stores data in tables made up of rows and columns, with the columns grouped into column families. For example, a message table might contain the sender, recipient, date sent, subject, and message body. Each key-value pair in HBase is defined as a cell, whose key is composed of the row key, column family, column, and timestamp. Each row is a collection of key-value mappings identified by its row key. HBase is designed to scale horizontally: computing and storage capacity grow by adding low-cost commodity servers.
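The cell key described above can be pictured with a small Python sketch. It is an in-memory illustration only, not the HBase client API; the `CellKey` type, the family names `meta` and `content`, and the sample message are hypothetical.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    row_key: str
    family: str      # column family
    qualifier: str   # column name within the family
    timestamp: int   # version of the cell


# Hypothetical message table with two column families, "meta" and "content".
cells = {
    CellKey("msg-001", "meta", "sender", 1): "alice",
    CellKey("msg-001", "meta", "recipient", 1): "bob",
    CellKey("msg-001", "content", "body", 1): "See you at noon",
    CellKey("msg-001", "content", "body", 2): "See you at one",
}


def latest_row(cells, row_key):
    """Return {"family:qualifier": value} using the newest timestamp per column."""
    newest = {}
    for key, value in cells.items():
        if key.row_key != row_key:
            continue
        column = f"{key.family}:{key.qualifier}"
        if column not in newest or key.timestamp > newest[column][0]:
            newest[column] = (key.timestamp, value)
    return {column: value for column, (_ts, value) in newest.items()}
```

Here a row is simply every cell that shares one row key, and the timestamp decides which value of a column counts as the current one.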

Characteristics

Hive provides a JDBC interface, so it not only lets users with SQL knowledge run MapReduce operations indirectly, but also integrates with existing SQL-based tools. However, because a query reads the whole table by default, it tends to be slow. Partitioning can limit the amount of data a query has to read: a partitioned table stores its data in separate files, so a filter on the partition column reads only the matching files. For example, a query over log files can be restricted to certain dates, provided that the file names contain the date.
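The effect of partitioning can be sketched in Python by comparing how many rows each approach has to read. This is a simplified model and not Hive code; the `PARTITIONS` dictionary, the sample log lines, and the function names are hypothetical.

```python
# Hypothetical log table partitioned by date: each key stands for one
# partition holding its own files, and each list item is one log row.
PARTITIONS = {
    "2014-12-23": ["GET /home", "GET /cart"],
    "2014-12-24": ["GET /home"],
    "2014-12-25": ["GET /pay", "GET /home", "GET /cart"],
}


def full_scan(partitions, wanted_date):
    """Read every row and filter afterwards; cost grows with the whole table."""
    rows_read = 0
    matches = []
    for date, rows in partitions.items():
        for row in rows:
            rows_read += 1
            if date == wanted_date:
                matches.append(row)
    return matches, rows_read


def pruned_scan(partitions, wanted_date):
    """Read only the partition that matches the filter."""
    rows = partitions.get(wanted_date, [])
    return list(rows), len(rows)
```

Both functions return the same matching rows for a given date, but the pruned version reads only the rows of that one partition, which is the saving partitioning is meant to deliver.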

HBase stores data as key-value pairs. It provides four main data operations:

Put: add or update rows

Scan: retrieve a range of cells

Get: return the cells of a specific row

Delete: remove a row, a column, or a version of a column from a table

Version information can be used to retrieve the values a cell held before it was changed (HBase compaction can remove this version history to free up storage space).
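The four operations and the version history can be tried out with a toy in-memory table in Python. This is a sketch under stated assumptions and not the real HBase client API: the `ToyTable` class, its method signatures, the logical timestamps, and the `compact` method are all hypothetical simplifications.

```python
import itertools


class ToyTable:
    """In-memory stand-in for an HBase table; not the real client API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._rows = {}                   # row_key -> {column: [(ts, value), ...]}
        self._clock = itertools.count(1)  # hypothetical logical timestamps

    def put(self, row_key, column, value):
        """Add or update a cell; older values are kept as versions."""
        history = self._rows.setdefault(row_key, {}).setdefault(column, [])
        history.insert(0, (next(self._clock), value))  # newest first

    def get(self, row_key, column, versions=1):
        """Return up to `versions` values of one cell, newest first."""
        history = self._rows.get(row_key, {}).get(column, [])
        return [value for _ts, value in history[:versions]]

    def scan(self, start_row, stop_row):
        """Yield (row_key, latest values) for start_row <= row_key < stop_row."""
        for row_key in sorted(self._rows):
            if start_row <= row_key < stop_row:
                latest = {c: h[0][1] for c, h in self._rows[row_key].items()}
                yield row_key, latest

    def delete(self, row_key, column=None):
        """Delete a whole row, or a single column when one is given."""
        columns = self._rows.get(row_key)
        if columns is None:
            return
        if column is not None:
            columns.pop(column, None)
        if column is None or not columns:
            del self._rows[row_key]

    def compact(self):
        """Drop versions beyond max_versions to free space."""
        for columns in self._rows.values():
            for history in columns.values():
                del history[self.max_versions:]
```

A short usage example: after `t = ToyTable()` and two calls to `t.put("msg-001", "content:body", ...)`, `t.get("msg-001", "content:body", versions=2)` returns both values, newest first, until `t.compact()` trims the history to `max_versions`.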

Limitations

Hive does not support ordinary row-level SQL statements such as inserting, updating, or deleting individual records, because its operations apply to entire tables. For the same reason, a query can take minutes or even hours to run. In addition, the translation of a query into MapReduce must follow predefined conversion rules.

HBase queries are written in its own operating language rather than standard SQL, which takes some learning to master. In addition, running HBase requires ZooKeeper, a reliable coordination service for large distributed systems that provides configuration maintenance, naming, distributed synchronization, group services, and so on.

Application examples

Hive is suited to querying large volumes of static data, such as logs: for example, records of user purchasing behavior or website visit trails. It is not suited to online, real-time queries.

HBase works well for online, real-time queries over big data. For example, Facebook uses it for real-time analysis of the messages sent between users.

Summary

Hive and HBase are two different technologies built on Hadoop. Hive is a SQL-like programming interface that runs MapReduce jobs, while HBase is a non-relational database. Given their different strengths, using the two together can be complementary. For example, Hive can process static offline data while HBase serves online real-time queries; combining the two result sets gives data that is both complete and up to date, which provides good support for further business analysis.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
