Basic Hive learning documents and tutorials
Abstract:
Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extracting, transforming, and loading data (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called QL (HQL), which lets users familiar with SQL query the data. At the same time, the language allows developers familiar with MapReduce to plug in custom mappers and reducers to handle complex analysis tasks that the built-in capabilities cannot express.
Directory:
HIVE Structure
HIVE Metadata Database
DERBY
Mysql
HIVE Data Storage
Other HIVE operations
Basic HIVE operations
Create Table
Alter Table
Insert
Inserting data into HiveTables from queries
Writing data into filesystem from queries
Hive Select
Group By
Order / Sort
Hive Join
HIVE parameter settings
HIVE UDF
Basic functions
UDTF
Explode
HIVE MAPREDUCE
Notes for using HIVE
Insert
Optimization
HADOOP computing framework features
Common Optimization Methods
Full sorting
Example 1
Example 2
JOIN
JOIN Principle
Map Join
HIVE FAQ
Common Reference Path
1. HIVE Structure
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for extracting, transforming, and loading data (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language, QL (HQL), that lets users familiar with SQL query the data, while also allowing developers familiar with MapReduce to plug in custom mappers and reducers for complex analysis tasks that the built-in capabilities cannot express.
1.1 HIVE Architecture
The Hive structure can be divided into the following parts:
· User interfaces: including CLI, Client, and WUI
· Metadata store, usually a relational database such as MySQL or Derby
· Interpreter, compiler, optimizer, and executor
· Hadoop: Uses HDFS for storage and MapReduce for computing
1. There are three main user interfaces: CLI, Client, and WUI. The most commonly used is the CLI; starting the CLI also starts a local copy of Hive. The Client is the Hive client, which connects to a Hive Server; when starting in Client mode, you must specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.
2. Hive stores its metadata in a database such as MySQL or Derby. The metadata includes the table name, the table's columns and partitions with their attributes, the table's properties (for example, whether it is an external table), and the directory where the table's data is stored (see the sketch after this list).
3. The interpreter, compiler, and optimizer perform lexical analysis, syntax analysis, compilation, optimization, and query plan generation for an HQL statement. The generated query plan is stored in HDFS and later executed by MapReduce.
4. Hive data is stored in HDFS, and most queries are executed as MapReduce jobs (with exceptions such as select * from tbl, which does not generate a MapReduce task), as illustrated in the sketch below.
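
The HiveQL sketch below illustrates points 2 and 4. The table name pokes is hypothetical (borrowed from the Hive getting-started examples), and the behavior described in the comments assumes a classic Hive setup in which only trivial queries bypass MapReduce.

    -- Point 2: inspect the metadata Hive keeps for a table: its columns, partitions,
    -- table type (MANAGED_TABLE or EXTERNAL_TABLE), and the HDFS data directory.
    SHOW TABLES;
    DESCRIBE FORMATTED pokes;
    SHOW PARTITIONS pokes;   -- only valid if the table is partitioned

    -- Point 4: a bare "select *" is answered by reading the table's files in HDFS
    -- directly, so no MapReduce job is launched.
    SELECT * FROM pokes LIMIT 10;

    -- An aggregation must be computed, so Hive compiles it into a MapReduce job.
    SELECT count(1) FROM pokes;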
1.2 The Relationship between Hive and Hadoop
Hive is built on top of Hadoop:
· Hive handles the interpretation and optimization of HQL query statements and generates the query plans (see the EXPLAIN sketch after this list).
· All data is stored in Hadoop.
· The query plan is converted to MapReduce tasks and executed in Hadoop (some queries do not have MR tasks, such as select * from table)
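
A convenient way to look at the generated plan is the EXPLAIN statement; pokes is again a hypothetical table name.

    -- EXPLAIN prints the stages of the compiled query plan; the aggregation below
    -- shows a map-reduce stage, while a bare "select *" typically shows only a
    -- fetch stage and no MapReduce work.
    EXPLAIN SELECT count(1) FROM pokes;
    EXPLAIN SELECT * FROM pokes;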
Hadoop and Hive both use UTF-8 encoding.
1.3 Similarities and Differences between Hive and Conventional Relational Databases
                         Hive           RDBMS
Query language           HQL            SQL
Data storage             HDFS           Raw device or local FS
Index                    None           Yes
Execution                MapReduce      Executor
Execution latency        High           Low
Data processing scale    Large          Small
1. Query language. SQL is widely used in data warehouses, so HQL was designed as a SQL-like query language. Developers familiar with SQL can easily pick up Hive for development.
2. Data storage location. Hive is built on Hadoop, and all Hive data is stored in HDFS. Databases store their data on block devices or in local file systems.
3. Data format. Hive does not define a special data format; the user specifies it. A custom data format requires three things: a column separator (usually a space, "\t", or "\001"), a row separator ("\n"), and a way to read the file data (Hive provides three default file formats: TextFile, SequenceFile, and RCFile). During loading, the user's data does not need to be converted into any Hive-defined format; Hive does not modify the data itself, it only copies or moves the files into the corresponding HDFS directory (see the CREATE TABLE / LOAD DATA sketch after this list). In databases, by contrast, each database has its own storage engine and defines its own data format, and all data must be stored according to that organization, so loading data into a database is time-consuming.
4. Data update. Hive is designed for data warehouse applications, whose content is written once and read many times, so Hive does not support rewriting or updating data; all data is fixed at load time (the sketch after this list shows the typical overwrite pattern). Data in a database, by contrast, usually needs frequent modification, so rows are added with INSERT ... VALUES and modified with UPDATE ... SET.
5. Index. As mentioned above, Hive does not process or even scan the data during loading, so it does not build indexes over any keys in the data. When Hive needs to access particular values that satisfy a condition, it must brute-force scan the whole data set, so access latency is high. Thanks to MapReduce, however, Hive can access the data in parallel, so even without indexes it still has an advantage when accessing very large data volumes. In a database, indexes are usually built on one or more columns, so access to small amounts of data under specific conditions can be very efficient with low latency. This high data-access latency is why Hive is not suitable for online data queries.
6. Execution. In Hive, most queries are executed through the MapReduce framework provided by Hadoop (again, queries like select * from tbl do not require MapReduce). Databases usually have their own execution engine.
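
The sketch below ties points 3 and 4 together. The table raw_logs, its columns, and the HDFS paths are made-up names for illustration, but the statements use standard Hive syntax.

    -- Point 3: the user declares the data format; Hive stores the files as they are.
    CREATE TABLE raw_logs (
      ts     STRING,
      userid STRING,
      url    STRING
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    -- Loading only moves (or copies) the file into the table's HDFS directory,
    -- without converting or rewriting its contents.
    LOAD DATA INPATH '/incoming/logs/2014-01-01.tsv' INTO TABLE raw_logs;

    -- Point 4: there is no UPDATE ... SET; corrected or derived data is produced
    -- by loading again or by overwriting a table (or partition) from a query.
    CREATE TABLE cleaned_logs LIKE raw_logs;
    INSERT OVERWRITE TABLE cleaned_logs
    SELECT ts, userid, lower(url) FROM raw_logs;

Every statement above either moves files or writes out a complete new data set, which matches the write-once, read-many pattern described in point 4.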