Hive is a data warehouse infrastructure built on Hadoop. It provides:
• A convenient set of tools for extracting, transforming, and loading data (ETL).
• A mechanism for imposing structure on the data.
• The ability to query and analyze massive datasets stored in Hadoop.
The basic design of Hive is to use HDFS for data storage and the MapReduce framework for data processing. Essentially, Hive is a compiler that translates a user's operations (queries or ETL) into MapReduce jobs, then uses the MapReduce framework to execute those jobs against the massive data stored on HDFS.
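As a small sketch of this compilation step, Hive's EXPLAIN statement prints the plan (including the map/reduce stages) that a query is translated into. The table and columns below are hypothetical:

```sql
-- Hypothetical table: page_views(user_id STRING, url STRING, ts BIGINT)
-- EXPLAIN shows the plan Hive generates; a GROUP BY like this one
-- typically compiles into a single map/reduce stage.
EXPLAIN
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url;
```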
Hive is designed as a batch-processing system. Because it processes data through the MapReduce framework, it carries the overhead of MapReduce job submission and scheduling; even for small datasets, latency is measured in minutes. Its biggest advantage, however, is that latency grows only linearly with the size of the dataset.
Hive defines a simple SQL-like query language, HiveQL, which makes it easy for users familiar with SQL to run queries. At the same time, HiveQL allows programmers familiar with the MapReduce framework to plug custom mapper and reducer scripts into their queries, extending Hive's built-in functionality to perform more complex analysis.
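Custom scripts are plugged in through HiveQL's TRANSFORM clause. In this sketch, the script name and the input table are hypothetical:

```sql
-- Hypothetical: parse_url.py reads tab-separated rows from stdin and
-- emits (host, path) pairs; table raw_logs(line STRING) is assumed.
ADD FILE parse_url.py;

SELECT TRANSFORM (line)
  USING 'python parse_url.py'
  AS (host STRING, path STRING)
FROM raw_logs;
```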
Hive Features
High-performance query and analysis of massive data
Because Hive queries are executed through the MapReduce framework, and MapReduce itself is designed for high-performance processing of massive data, Hive can handle massive datasets efficiently.
At the same time, Hive applies many optimizations when translating HiveQL into MapReduce jobs, ensuring that the generated jobs are efficient. In practice, Hive can efficiently process terabytes or even petabytes of data.
SQL-like query language
HiveQL is very similar to SQL, so users who are familiar with SQL can easily use Hive for complex queries without special training.
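For instance, a typical aggregation reads just like standard SQL. The table and columns here are hypothetical:

```sql
-- Hypothetical table: sales(region STRING, amount DOUBLE, year INT)
SELECT region, SUM(amount) AS total
FROM sales
WHERE year = 2013
GROUP BY region
ORDER BY total DESC
LIMIT 10;
```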
Flexible extensibility of HiveQL
Beyond the capabilities HiveQL provides out of the box, users can define their own data types, write custom mapper and reducer scripts in any language, and define custom functions (both ordinary and aggregate functions). This gives HiveQL great extensibility, which users can exploit to implement very complex queries.
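A custom function, once compiled against Hive's UDF interface, is registered and called from HiveQL. The jar, class, and table names below are hypothetical:

```sql
-- Hypothetical: my_udfs.jar contains com.example.Lower3, a custom
-- function built against Hive's UDF interface.
ADD JAR my_udfs.jar;
CREATE TEMPORARY FUNCTION lower3 AS 'com.example.Lower3';

-- Once registered, the UDF is used like any built-in function.
SELECT lower3(name) FROM users;
```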
High scalability and fault tolerance
Hive itself has no execution engine; user queries are executed through the MapReduce framework. Because the MapReduce framework is highly scalable (computational power grows linearly with the number of machines in the Hadoop cluster) and highly fault tolerant, Hive inherits these characteristics.
Fully compatible with other Hadoop products
Rather than storing user data itself, Hive accesses it through an interface layer. This lets Hive support a variety of data sources and data formats. For example, it can process multiple file formats on HDFS (TextFile, SequenceFile, etc.) and can also process data stored in HBase. Users can even implement their own drivers to add new data sources and formats. One attractive application model is to use HBase for real-time data access while using Hive to analyze the data stored in HBase.
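Both ideas show up directly in table DDL: the storage format is declared per table, and an HBase-backed table is declared through Hive's HBase storage handler. Table names and the column mapping below are hypothetical:

```sql
-- SequenceFile is one of the built-in HDFS file formats Hive supports.
CREATE TABLE events (id BIGINT, payload STRING)
STORED AS SEQUENCEFILE;

-- Hypothetical HBase-backed table; 'cf:val' maps a Hive column to an
-- HBase column family:qualifier, so Hive queries read HBase directly.
CREATE EXTERNAL TABLE hbase_events (key STRING, val STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "events");
```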