The understanding of Hive


1. Hive Introduction

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a range of tools for extract-transform-load (ETL) work, and a mechanism to store, query, and analyze large-scale data kept in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users who are familiar with SQL query the data. The language also lets developers who know MapReduce plug in custom mappers and reducers for complex analytical work that the built-in operators cannot express.
First of all, what is Hive? (The figure from the original article is omitted here.)


Hive is a data warehouse built on Hadoop's HDFS and MapReduce, used to manage and query structured and unstructured data.

    • Using HQL as the query interface
    • Using HDFs as the underlying storage
    • Using MapReduce as the execution layer
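
As a sketch of the HQL query interface described above (the table and columns are hypothetical, used only for illustration):

```sql
-- Count daily views per URL; Hive compiles this into MapReduce jobs.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2010-08-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```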

A typical Hive deployment (figure omitted in this copy): several Hive servers are run for high availability (HA), with HAProxy in front as the proxy.

1.1. Structure Description

The structure of Hive can be divided into the following sections:

    • User interface: includes CLI, Client, and WUI
    • Metadata storage: usually kept in a relational database such as MySQL or Derby
    • Interpreter, compiler, optimizer, and executor
    • Hadoop: uses HDFS for storage and MapReduce for computation

1. There are three main user interfaces: CLI, Client, and WUI. The most common is the CLI; starting the CLI also starts a Hive copy. The Client is Hive's client, which connects to a Hive Server; when starting in Client mode, you must indicate the node where the Hive Server runs and start Hive Server on that node. The WUI accesses Hive through a browser.
2. Hive stores its metadata in a database such as MySQL or Derby. The metadata includes the table name, the table's columns and partitions and their properties, the table's own properties (such as whether it is an external table), the directory where the table's data resides, and so on.
3. The interpreter, compiler, and optimizer carry an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed as MapReduce jobs.
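
The compilation pipeline can be inspected with EXPLAIN, which prints the generated plan (the stages and their map/reduce operator trees) instead of running the query; the table name below is hypothetical:

```sql
-- Show the query plan Hive generates for a simple aggregation.
EXPLAIN
SELECT url, COUNT(*) FROM page_views GROUP BY url;
```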

1.2. The similarities and differences between Hive and a common DB

                     Hive                            RDBMS
Query language       HQL                             SQL
Data storage         HDFS                            Local file system / raw device
Index                Supported since version 1.0.0   Mature, complex indexes
Execution            MapReduce                       Its own executor
Execution latency    High                            Low
Data size            Large (massive)                 Small
1.3. Meta-data

Hive stores metadata in an RDBMS, typically MySQL or Derby. Because Derby supports only a single client connection at a time, MySQL is generally used to store the metadata.

1.4. Data storage

First, Hive has no dedicated data storage format and does not index the data, so users can organize tables in Hive quite freely: simply declare the column and row delimiters when creating the table, and Hive can parse the data.
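For example (hypothetical table, tab-separated data files), the delimiters are declared at creation time and Hive parses the files accordingly:

```sql
CREATE TABLE app (
  userid BIGINT,
  url    STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'   -- column separator in the data files
  LINES TERMINATED BY '\n'    -- row separator
STORED AS TEXTFILE;
```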
Second, all data in Hive is stored in HDFS, and Hive has the following data models: Table, External Table, Partition, and Bucket.
1. A Table in Hive is conceptually similar to a table in a database, and each Table has a corresponding directory in HDFS that stores its data. For example, a table app has the HDFS path /warehouse/app, where /warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml; all table data (not including External Tables) is stored under this directory.
After Hive is installed, a Hive-owned folder such as /user/hive/warehouse/ is created on HDFS. If we create a database in Hive, a subdirectory such as /user/hive/warehouse/xxx.db is produced under the warehouse directory; if we then create a table in that database, a further subdirectory such as /user/hive/warehouse/xxx.db/yyyyyy appears.
2. A Partition corresponds to a dense index on the partition column in a database, but Partitions in Hive are organized differently. In Hive, each Partition of a table corresponds to a subdirectory under the table's directory, and all of that Partition's data is stored there. For example, if the app table has two partition columns, dt and ctry, then the HDFS subdirectory for dt = 20100801, ctry = US is /warehouse/app/dt=20100801/ctry=US, and the subdirectory for dt = 20100801, ctry = CA is /warehouse/app/dt=20100801/ctry=CA.
This is how Hive divides data: it branches on the values of one partition column, each value corresponding to one directory, then branches further on the next partition column, producing deeper directories.
If a table is created with partitions, partition-labelled subdirectories such as /user/hive/warehouse/xxx.db/yyyyyy/date=20180521 appear under the table directory; the files inside hold the actual rows, with fields separated by the chosen delimiter.
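
The partition layout above can be sketched in HiveQL (table and column names are the same hypothetical ones used in the examples):

```sql
-- Partition columns become directory levels, not stored columns.
CREATE TABLE app (userid BIGINT, url STRING)
PARTITIONED BY (dt STRING, ctry STRING);

-- Creates the directory .../app/dt=20100801/ctry=US on HDFS.
ALTER TABLE app ADD PARTITION (dt = '20100801', ctry = 'US');
```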
3. Buckets compute a hash over a specified column and split the data by hash value so that work can be parallelized; each Bucket corresponds to one file. For example, scattering the user column into 32 buckets: first hash the value of the user column; the HDFS directory for hash value 0 is /warehouse/app/dt=20100801/ctry=US/part-00000, and the directory for hash value 20 is /warehouse/app/dt=20100801/ctry=US/part-00020.
If Buckets are specified, date=20180521 is not a file but a directory name; below it, files distinguished by the hash of a column's value are produced, such as /user/hive/warehouse/xxx.db/yyyyyy/date=20180521/part-00000, and those files hold the actual rows.
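
A bucketed version of the hypothetical table could be declared like this; the SET below applies to older Hive versions, where bucketing was not enforced by default:

```sql
-- 32 buckets on the userid column: each bucket is one file
-- (part-00000 ... part-00031) under each partition directory.
CREATE TABLE app (userid BIGINT, url STRING)
PARTITIONED BY (dt STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Ask Hive to honor the declared bucket count when inserting.
SET hive.enforce.bucketing = true;
```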
4. An External Table points to data that already exists in HDFS and can also have Partitions. Its metadata is organized the same way as a Table's, but the actual data is stored quite differently.

A Table (internal table) has both a creation process and a data-loading process (the two can be completed in one statement). While loading, the actual data is moved into the data warehouse directory, and subsequent access happens directly within that directory. When the table is deleted, its data and metadata are deleted together.
An External Table has only one process: loading the data and creating the table happen at the same time (CREATE EXTERNAL TABLE ... LOCATION ...). The actual data is stored in the HDFS path given after LOCATION and is not moved into the data warehouse directory. When an External Table is deleted, only the Hive metadata is removed; the corresponding files on HDFS are not deleted.
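
A minimal sketch of the external-table form (table name and path are hypothetical):

```sql
-- Metadata is created, but the data stays at the given LOCATION;
-- DROP TABLE later removes only the metadata, not the files.
CREATE EXTERNAL TABLE app_ext (userid BIGINT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/app_ext';
```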

