Hadoop White Paper (4): Introduction to Data Warehouse Hive

Source: Internet
Author: User

Hive is a data warehouse infrastructure built on Hadoop. It provides:

• A set of tools for convenient data extraction, transformation, and loading (ETL).

• A mechanism for users to impose structure on their data.

• The ability for users to query and analyze massive amounts of data stored in Hadoop.

A basic feature of Hive is that it uses HDFS for data storage and the MapReduce framework for data processing. In essence, Hive is a compiler: it translates user operations (queries or ETL) into MapReduce tasks, which the MapReduce framework then runs against the massive amounts of data on HDFS.
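To make the compiler view concrete, here is a minimal Python sketch of how a grouped count, such as a hypothetical SELECT page, COUNT(*) FROM visits GROUP BY page, breaks down into map, shuffle, and reduce phases. The table, its columns, and all function names are invented for illustration. This is not how Hive builds its plans internally; it only shows the shape of the work a MapReduce job performs.

```python
from collections import defaultdict

# Hypothetical rows of a "visits" table: (user, page).
ROWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the GROUP BY column becomes the key."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group's values; COUNT(*) is a sum of ones."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    """Chain the three phases the way a single MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(ROWS))
```

In a real cluster the map and reduce phases run in parallel on many machines, and the shuffle moves data across the network; the sketch runs everything in one process.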

Hive is designed as a batch processing system. Because it uses the MapReduce framework to process data, it incurs significant overhead for task submission and scheduling. Even for small datasets (a few hundred megabytes), latency is on the order of minutes. Its biggest advantage is that latency grows only linearly with dataset size.
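The trade-off can be pictured with a toy latency model: a fixed job overhead plus a term proportional to the amount of data. The constants below are invented purely for illustration and are not measurements of any Hive deployment.

```python
def estimated_latency_seconds(data_bytes, startup_seconds=30.0, bytes_per_second=100e6):
    """Toy model: fixed job overhead plus time proportional to data size.

    Both default constants are hypothetical, chosen only to show the shape
    of the curve, not to describe a real cluster.
    """
    return startup_seconds + data_bytes / bytes_per_second


if __name__ == "__main__":
    for size in (100e6, 1e9, 10e9):
        latency = estimated_latency_seconds(size)
        print(f"{size / 1e9:6.1f} GB -> {latency:7.1f} s")
```

In this model the fixed overhead dominates for small inputs, which is why small queries still feel slow, while each additional gigabyte adds the same amount of time.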

Hive defines a simple SQL-like query language, HiveQL, that makes querying easy for users familiar with SQL. HiveQL also lets programmers familiar with the MapReduce framework plug custom mapper and reducer scripts into a query, extending Hive's built-in functionality to perform more complex analysis.
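As an illustration of a custom script, HiveQL's TRANSFORM ... USING clause streams rows to an external program as tab-separated lines on standard input and reads tab-separated lines back from standard output. The sketch below is a possible mapper-style script under that convention. The file name extract_domain.py, the visits table, and its user_id and url columns are hypothetical.

```python
# Hypothetical use from HiveQL (names are assumptions, not from the article):
#   ADD FILE extract_domain.py;
#   SELECT TRANSFORM(user_id, url)
#     USING 'python extract_domain.py'
#     AS (user_id, domain)
#   FROM visits;

# Sample input in the tab-separated form a streaming script would receive.
SAMPLE_ROWS = [
    "alice\thttp://example.com/a/b\n",
    "bob\thttps://shop.example.org/cart\n",
]


def transform_line(line):
    """Turn one tab-separated row (user_id, url) into (user_id, domain)."""
    user_id, url = line.rstrip("\n").split("\t")
    # Drop the scheme if present, then keep everything before the first "/".
    domain = url.split("://", 1)[-1].split("/", 1)[0]
    return f"{user_id}\t{domain}"


def transform_lines(lines):
    """Transform every non-empty input line, yielding output rows."""
    for line in lines:
        if line.strip():
            yield transform_line(line)


if __name__ == "__main__":
    # In a real streaming script, pass sys.stdin instead of SAMPLE_ROWS.
    for row in transform_lines(SAMPLE_ROWS):
        print(row)
```

Because the contract is only lines in and lines out, the same approach works in any language that can read standard input.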

Hive Features

High-performance query and analysis system for massive data

Hive queries are executed through the MapReduce framework, which is itself designed for high-performance processing of massive data, so Hive can handle massive amounts of data efficiently.

Hive also applies many optimizations when translating HiveQL into MapReduce, to ensure that the resulting MapReduce tasks are efficient. In practice, Hive can efficiently handle terabytes or even petabytes of data.

SQL-like query language

HiveQL is very similar to SQL, so users who are familiar with SQL can readily use Hive for complex queries without additional training.

Flexible extensibility of HiveQL

Beyond the capabilities built into HiveQL, users can define their own data types, write custom mapper and reducer scripts in any language, and define custom functions (normal functions and aggregate functions), among other things. This makes HiveQL highly extensible, and users can rely on this extensibility to implement very complex queries.
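The difference between a normal function and an aggregate function can be sketched as follows. Hive's own user-defined functions are typically written in Java; this Python version only illustrates the contract each kind must satisfy, and every name in it is hypothetical. The aggregate keeps a mergeable partial state because, in a distributed job, different machines each see only part of the rows.

```python
def normalize_email(value):
    """Normal (scalar) function: one input value in, one output value out."""
    return None if value is None else value.strip().lower()


class MeanAggregate:
    """Aggregate function: many rows in, one value out.

    Partial state is kept as (total, count) so that results computed on
    different machines can be merged before the final answer is produced.
    """

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        """Fold one input row into the partial state, skipping NULLs."""
        if value is not None:
            self.total += value
            self.count += 1

    def partial(self):
        """Return the partial state to ship to another aggregator."""
        return self.total, self.count

    def merge(self, other_partial):
        """Combine another aggregator's partial state into this one."""
        total, count = other_partial
        self.total += total
        self.count += count

    def terminate(self):
        """Produce the final mean, or None if no rows were seen."""
        return self.total / self.count if self.count else None


if __name__ == "__main__":
    # Two aggregators each see part of the data, then their states merge.
    left, right = MeanAggregate(), MeanAggregate()
    for v in (1.0, 2.0, None):
        left.iterate(v)
    for v in (3.0, 6.0):
        right.iterate(v)
    left.merge(right.partial())
    print(normalize_email("  Alice@Example.COM "), left.terminate())
```

Keeping the sum and count separately, rather than a running mean, is what makes the merge step exact.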

High scalability and fault tolerance

Hive itself has no execution mechanism; user queries are executed through the MapReduce framework. Because the MapReduce framework is highly scalable (computing capacity grows linearly as machines are added to the Hadoop cluster) and highly fault tolerant, Hive inherits these characteristics.

Fully compatible with other Hadoop products

Hive does not store user data itself; it accesses user data through an interface. This lets Hive support a variety of data sources and data formats. For example, it can process multiple file formats on HDFS (TextFile, SequenceFile, and others) and can also process data in HBase. Users can also implement their own drivers to add new data sources and data formats. One ideal usage pattern is to keep data in HBase for real-time access and use Hive to analyze that same data.
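The idea of reaching data through an interface, rather than owning the storage, can be sketched as a pair of interchangeable row decoders. The names below (RowDecoder, DelimitedTextDecoder, JsonLinesDecoder, count_where) are hypothetical and are not Hive classes; they only show why the query layer need not care how the bytes are laid out.

```python
import json
from typing import Iterable, Iterator, Protocol


class RowDecoder(Protocol):
    """Anything that can turn raw lines into rows (dicts keyed by column)."""

    def decode(self, lines: Iterable[str]) -> Iterator[dict]: ...


class DelimitedTextDecoder:
    """Decode delimiter-separated text using a fixed list of column names."""

    def __init__(self, columns, delimiter="\t"):
        self.columns = list(columns)
        self.delimiter = delimiter

    def decode(self, lines):
        for line in lines:
            fields = line.rstrip("\n").split(self.delimiter)
            yield dict(zip(self.columns, fields))


class JsonLinesDecoder:
    """Decode one JSON object per line."""

    def decode(self, lines):
        for line in lines:
            yield json.loads(line)


def count_where(decoder: RowDecoder, lines, column, value):
    """A format-agnostic query: count rows whose column equals value."""
    return sum(1 for row in decoder.decode(lines) if row.get(column) == value)


if __name__ == "__main__":
    text = ["alice\t/home\n", "bob\t/cart\n"]
    jsonl = ['{"user": "alice", "page": "/home"}', '{"user": "bob", "page": "/home"}']
    print(count_where(DelimitedTextDecoder(["user", "page"]), text, "page", "/home"))
    print(count_where(JsonLinesDecoder(), jsonl, "page", "/home"))
```

Adding a new format then means writing one more decoder, with no change to the query function.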
