Talking about Hive vs. HBase

Source: Internet
Author: User
Keywords Talking about can for example live

For users who have just come into contact with large data, it is difficult to distinguish between hive and hbase. This paper will try to analyze it from the aspects of its definition, characteristic, limitation and application scene.

What is Hive?

The Apache hive is a data warehouse at the top of the Hadoop (Distributed system infrastructure), noting that this is not a database. Hive can be viewed as a user programming interface that does not store and compute data itself; it relies on HDFs (Hadoop Distributed File System) and MapReduce (a programming model, mapping and simplification; For large data parallel operations). Its operations on HDFs are similar to the sql-named HQL, which provides a rich SQL query to analyze the data stored in HDFs, hql after compiling to MapReduce job, and then querying the contents of the analysis through its own SQL; So, even if you're unfamiliar with MapReduce Users can also easily use SQL language to query, summarize, and analyze data. MapReduce developers can use their own mapper and reducer as plug-ins to support hive for more complex data analysis.

What is HBase?

Apache HBase is a database system of NoSQL (=not only SQL, a generic database) that runs on top of HDFs. Unlike Hive,hbase, which has the ability to read and write, is a column-oriented database. HBase stores data as a table consisting of rows and columns, divided into several columns (row accessibility). For example, a message list contains the sender, recipient, Date sent, message header, and message content. Each pair of key values in HBase is defined as a cell in which the key is composed of row-key (row keys), column clusters, columns, and timestamps. In HBase, each row represents a combination of key-value mappings identified by a row key. The hbase target relies on horizontal scaling to increase computing and storage capabilities by increasing the availability of Low-cost commercial servers.

Characteristics

JDBC-compliant hive not only allows users with SQL knowledge to perform mapreduce operations indirectly, but it also incorporates the current sql-based operational tools. However, because the default data reads are all-table traversal, the time consuming is inevitably relatively large. However, the different hive partitioning method can limit the amount of data being read by traversal. The hive partition allows filtering of data stored on separate files and returns filtered data. For example, log file access for dates, provided that the file name of the class contains date information.

HBase stores data in the form of key-value pairs. It includes 4 main data manipulation methods:

Add or update rows of data

Scan to get a range of cells

Returns the corresponding cells for a specific data row

Deletes a data row/column from a datasheet, or a column's descriptive information

Column information can be used to get values before data changes (the HBase compression policy allows you to remove the column information history to free up storage space).

Limit

Hive does not support regular SQL UPDATE statements, such as data inserts, updates, deletes. Because its operations on the data are for the entire datasheet. This feature also makes it possible to calculate data queries in minutes or even hours. In addition, its mapreduce conversion process must conform to predefined conversion rules.

HBase's data query is a set of their own SQL-like operating language, which requires a certain degree of learning to master. In addition, to run the hbase,zookeeper is required. Zookeeper is a reliable coordination system for large distributed systems, including configuration maintenance, name service, distributed synchronization, group service, etc.

Application examples

Hive is suitable for large data, static data query, such as blog. For example: User consumption behavior record, website visit footprint and so on. However, it is not suitable for on-line real-time online query.

HBase can work in large data online real-time queries. For example, Facebook uses its online real-time analysis of messages sent between users.

Summary

Hive and HBase are based on different technologies on Hadoop. Hive is a kind of SQL programming interface that can perform MapReduce job, HBase is a kind of non relational database structure. Combined with their own characteristics, the use of each other may be able to receive a complementary effect. For example: Using hive to process static off-line data, using hbase to conduct online real-time query, and then integrate the result set between the two, so that the data is complete and young, for further business analysis to provide good support.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.