Hadoop White Paper (4): Introduction to the Hive Data Warehouse

Last Update:2015-03-17 Source: Internet

Author: User

Hive is a data warehouse infrastructure built on Hadoop. It provides:

• A set of tools for convenient data extraction, transformation, and loading (ETL).

• A mechanism for users to impose structure on their data.

• The ability to query and analyze massive amounts of data stored in Hadoop.

The basic feature of Hive is that it uses HDFS for data storage and the MapReduce framework for data processing. In essence, Hive is a compiler: it translates user operations (queries or ETL) into MapReduce jobs and relies on the MapReduce framework to run those jobs against the massive amounts of data on HDFS.
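As a rough illustration of that compilation step, the sketch below simulates in plain Python how a grouped count might be broken into map, shuffle, and reduce phases. It is not Hive's actual planner; the query in the comment, the `page_views` rows, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical query being simulated (not taken from the article):
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;


def map_phase(rows):
    """Emit a (key, 1) pair for every input row, as a mapper would."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_count(rows):
    """Chain the three phases to answer the GROUP BY / COUNT(*) query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    page_views = [
        {"user": "a", "country": "CN"},
        {"user": "b", "country": "US"},
        {"user": "c", "country": "CN"},
    ]
    print(run_group_count(page_views))
```

On a real cluster the map and reduce phases would run as many parallel tasks on different machines; here they run in a single process only to make the data flow visible.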

Hive is designed as a batch processing system. Because it uses the MapReduce framework to process data, it incurs significant overhead in submitting and scheduling MapReduce jobs. Even for small datasets (a few hundred megabytes), latency is on the order of minutes. Its biggest advantage is that latency grows only linearly with dataset size.

Hive defines a simple SQL-like query language, HiveQL, that makes querying easy for users familiar with SQL. At the same time, HiveQL allows programmers familiar with the MapReduce framework to plug custom mapper and reducer scripts into a query, extending Hive's built-in functionality to perform more complex analysis.
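A custom script of this kind is typically wired in through HiveQL's `TRANSFORM ... USING` clause, which streams rows to the script on standard input as tab-separated text and reads tab-separated rows back from standard output. A minimal sketch of such a script follows; the table, the column names, and the file name `normalize_country.py` are hypothetical.

```python
import sys

# Hypothetical HiveQL that would invoke this script:
#   ADD FILE normalize_country.py;
#   SELECT TRANSFORM (user_id, country)
#          USING 'python normalize_country.py'
#          AS (user_id, country)
#   FROM page_views;


def transform_line(line):
    """Upper-case the country column of one tab-separated, two-column row."""
    user_id, country = line.rstrip("\n").split("\t")
    return f"{user_id}\t{country.strip().upper()}"


def main():
    """Read rows from stdin and write transformed rows to stdout."""
    for line in sys.stdin:
        if line.strip():  # skip blank lines
            sys.stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Because the script only reads and writes text streams, it can be tried locally by piping a small tab-separated file into it before it is used in a query.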

Hive Features

High-performance query and analysis system for massive data

Hive queries are executed through the MapReduce framework, and MapReduce itself is designed for high-performance processing of massive data. Hive can therefore handle massive amounts of data efficiently.

In addition, Hive applies many optimizations when translating HiveQL into MapReduce to ensure that the resulting MapReduce jobs are efficient. In practice, Hive can efficiently handle terabytes or even petabytes of data.

SQL-like query language

HiveQL is very similar to SQL, so a user who is familiar with SQL can easily use Hive for complex queries without training.

Flexible extensibility of HiveQL

Beyond the capabilities built into HiveQL, users can define their own data types, write custom mapper and reducer scripts in any language, and define custom functions (ordinary functions and aggregate functions). This makes HiveQL highly extensible, and users can take advantage of that extensibility to implement very complex queries.
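To give a feel for what a custom aggregate function has to do, here is a conceptual sketch in Python. It is an analogy only: real Hive functions are written against Hive's Java interfaces, and the class and method names below are hypothetical, loosely modeled on the usual pattern of building partial results in parallel tasks and merging them afterwards.

```python
class MeanAggregate:
    """Hypothetical average aggregate kept as a (sum, count) partial state."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        """Fold one input value into the state; NULLs (None) are skipped."""
        if value is not None:
            self.total += value
            self.count += 1

    def terminate_partial(self):
        """Return the partial state that one task hands on for merging."""
        return (self.total, self.count)

    def merge(self, partial):
        """Combine a partial state produced by another task."""
        total, count = partial
        self.total += total
        self.count += count

    def terminate(self):
        """Produce the final result, or None if no values were seen."""
        return self.total / self.count if self.count else None


if __name__ == "__main__":
    # Two tasks aggregate separate slices of the data.
    task_a, task_b = MeanAggregate(), MeanAggregate()
    for value in (1.0, 2.0, None):
        task_a.iterate(value)
    for value in (3.0, 6.0):
        task_b.iterate(value)

    # A final step merges the partial states and emits the result.
    final = MeanAggregate()
    final.merge(task_a.terminate_partial())
    final.merge(task_b.terminate_partial())
    print(final.terminate())
```

Keeping the state as a sum and a count, rather than a running average, is what allows partial results from different tasks to be merged correctly.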

High scalability and fault tolerance

Hive has no execution engine of its own; user queries are executed through the MapReduce framework. Because the MapReduce framework is itself highly scalable (computing capacity grows linearly with the number of machines in the Hadoop cluster) and highly fault tolerant, Hive inherits both characteristics.

Fully compatible with other Hadoop products

Hive does not store user data itself; it accesses the data through interfaces. This lets Hive support a variety of data sources and data formats. For example, it can process multiple file formats on HDFS (TextFile, SequenceFile, and so on) as well as data in HBase. Users can also implement their own drivers to add new data sources and data formats. An ideal usage pattern is to keep data in HBase for real-time access and use Hive to analyze that data.
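The idea of adding data formats behind a common interface can be sketched in a few lines of Python. This is not Hive's API, whose extension points are Java interfaces; the registry, the format names, and the sample records below are all hypothetical and only illustrate the plug-in pattern.

```python
import json

# Hypothetical registry mapping a format name to a record parser.
DESERIALIZERS = {}


def register_format(name):
    """Decorator that registers a parser under the given format name."""
    def decorator(func):
        DESERIALIZERS[name] = func
        return func
    return decorator


@register_format("delimited_text")
def parse_delimited(record, columns):
    """Split a tab-delimited line and pair the fields with column names."""
    return dict(zip(columns, record.split("\t")))


@register_format("json")
def parse_json(record, columns):
    """Parse a JSON object and keep only the declared columns."""
    data = json.loads(record)
    return {name: data.get(name) for name in columns}


def read_rows(records, fmt, columns):
    """Turn raw records into rows using whichever parser is registered."""
    parse = DESERIALIZERS[fmt]
    return [parse(record, columns) for record in records]


if __name__ == "__main__":
    columns = ["id", "country"]
    print(read_rows(["1\tCN", "2\tUS"], "delimited_text", columns))
    print(read_rows(['{"id": 3, "country": "DE"}'], "json", columns))
```

The query side only ever sees rows with named columns, so supporting a new format means registering one more parser rather than changing the code that consumes the rows.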

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
