Basic Hive learning documents and tutorials
Abstract:
Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extracting, transforming, and loading data (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called QL (HQL), which lets users familiar with SQL query the data. At the same time, the language allows developers familiar with MapReduce to plug in custom mappers and reducers to handle complex analysis tasks that the built-in capabilities cannot express.
Directory:
HIVE Structure
HIVE Metadata Database
DERBY
Mysql
HIVE Data Storage
Other HIVE operations
Basic HIVE operations
Create Table
Alter Table
Insert
Inserting data into HiveTables from queries
Writing data into filesystem from queries
Hive Select
Group By
Order / Sort
Hive Join
HIVE parameter settings
HIVE UDF
Basic functions
UDTF
Explode
HIVE MAPREDUCE
Notes for using HIVE
Insert
Optimization
HADOOP computing framework features
Common Optimization Methods
Full sorting
Example 1
Example 2
JOIN
JOIN Principle
Map Join
HIVE FAQ
Common Reference Path
1. HIVE Structure
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for extracting, transforming, and loading data (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language, QL (HQL), that lets users familiar with SQL query the data, while also allowing developers familiar with MapReduce to plug in custom mappers and reducers for complex analysis tasks that the built-in capabilities cannot express.
1.1 HIVE Architecture
The Hive structure can be divided into the following parts:
· User interfaces: including CLI, Client, and WUI
· Metadata store, usually a relational database such as MySQL or Derby
· Interpreter, compiler, optimizer, and executor
· Hadoop: Uses HDFS for storage and MapReduce for computing
1. There are three main user interfaces: CLI, Client, and WUI. The most commonly used is the CLI; starting the CLI also starts a local copy of Hive. The Client is the Hive client, which connects to a Hive Server; when starting in Client mode, you must specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.
2. Hive stores its metadata in a database such as MySQL or Derby. The metadata includes the table name, the table's columns and partitions with their attributes, the table's properties (for example, whether it is an external table), and the directory where the table's data is stored (see the sketch after this list).
3. The interpreter, compiler, and optimizer perform lexical analysis, syntax analysis, compilation, optimization, and query plan generation for an HQL statement. The generated query plan is stored in HDFS and later executed by MapReduce.
4. Hive data is stored in HDFS, and most queries are executed as MapReduce jobs (with exceptions such as select * from tbl, which does not generate a MapReduce task), as illustrated in the sketch below.
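
The HiveQL sketch below illustrates points 2 and 4. The table name pokes is hypothetical (borrowed from the Hive getting-started examples), and the behavior described in the comments assumes a classic Hive setup in which only trivial queries bypass MapReduce.

    -- Point 2: inspect the metadata Hive keeps for a table: its columns, partitions,
    -- table type (MANAGED_TABLE or EXTERNAL_TABLE), and the HDFS data directory.
    SHOW TABLES;
    DESCRIBE FORMATTED pokes;
    SHOW PARTITIONS pokes;   -- only valid if the table is partitioned

    -- Point 4: a bare "select *" is answered by reading the table's files in HDFS
    -- directly, so no MapReduce job is launched.
    SELECT * FROM pokes LIMIT 10;

    -- An aggregation must be computed, so Hive compiles it into a MapReduce job.
    SELECT count(1) FROM pokes;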
1.2 The Relationship between Hive and Hadoop
Hive is built on top of Hadoop:
· Hive handles the interpretation and optimization of HQL query statements and generates the query plans (see the EXPLAIN sketch after this list).
· All data is stored in Hadoop.
· The query plan is converted to MapReduce tasks and executed in Hadoop (some queries do not have MR tasks, such as select * from table)
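
A convenient way to look at the generated plan is the EXPLAIN statement; pokes is again a hypothetical table name.

    -- EXPLAIN prints the stages of the compiled query plan; the aggregation below
    -- shows a map-reduce stage, while a bare "select *" typically shows only a
    -- fetch stage and no MapReduce work.
    EXPLAIN SELECT count(1) FROM pokes;
    EXPLAIN SELECT * FROM pokes;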
Hadoop and Hive both use UTF-8 encoding.
1.3 Similarities and Differences between Hive and Conventional Relational Databases
                         Hive           RDBMS
Query language           HQL            SQL
Data storage             HDFS           Raw device or local FS
Index                    None           Yes
Execution                MapReduce      Executor
Execution latency        High           Low
Data processing scale    Large          Small
1. Query language. SQL is widely used in data warehouses, so HQL was designed as a SQL-like query language. Developers familiar with SQL can easily pick up Hive for development.
2. Data storage location. Hive is built on Hadoop, and all Hive data is stored in HDFS. Databases store their data on block devices or in local file systems.
3. Data format. Hive does not define a special data format; the user specifies it. A custom data format requires three things: a column separator (usually a space, "\t", or "\001"), a row separator ("\n"), and a way to read the file data (Hive provides three default file formats: TextFile, SequenceFile, and RCFile). During loading, the user's data does not need to be converted into any Hive-defined format; Hive does not modify the data itself, it only copies or moves the files into the corresponding HDFS directory (see the CREATE TABLE / LOAD DATA sketch after this list). In databases, by contrast, each database has its own storage engine and defines its own data format, and all data must be stored according to that organization, so loading data into a database is time-consuming.
4. Data update. Hive is designed for data warehouse applications, whose content is written once and read many times, so Hive does not support rewriting or updating data; all data is fixed at load time (the sketch after this list shows the typical overwrite pattern). Data in a database, by contrast, usually needs frequent modification, so rows are added with INSERT ... VALUES and modified with UPDATE ... SET.
5. Index. As mentioned above, Hive does not process or even scan the data during loading, so it does not build indexes over any keys in the data. When Hive needs to access particular values that satisfy a condition, it must brute-force scan the whole data set, so access latency is high. Thanks to MapReduce, however, Hive can access the data in parallel, so even without indexes it still has an advantage when accessing very large data volumes. In a database, indexes are usually built on one or more columns, so access to small amounts of data under specific conditions can be very efficient with low latency. This high data-access latency is why Hive is not suitable for online data queries.
6. Execution. In Hive, most queries are executed through the MapReduce framework provided by Hadoop (again, queries like select * from tbl do not require MapReduce). Databases usually have their own execution engine.
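
The sketch below ties points 3 and 4 together. The table raw_logs, its columns, and the HDFS paths are made-up names for illustration, but the statements use standard Hive syntax.

    -- Point 3: the user declares the data format; Hive stores the files as they are.
    CREATE TABLE raw_logs (
      ts     STRING,
      userid STRING,
      url    STRING
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    -- Loading only moves (or copies) the file into the table's HDFS directory,
    -- without converting or rewriting its contents.
    LOAD DATA INPATH '/incoming/logs/2014-01-01.tsv' INTO TABLE raw_logs;

    -- Point 4: there is no UPDATE ... SET; corrected or derived data is produced
    -- by loading again or by overwriting a table (or partition) from a query.
    CREATE TABLE cleaned_logs LIKE raw_logs;
    INSERT OVERWRITE TABLE cleaned_logs
    SELECT ts, userid, lower(url) FROM raw_logs;

Every statement above either moves files or writes out a complete new data set, which matches the write-once, read-many pattern described in point 4.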