1 Architecture Components of Hive
- There are three main user interfaces: the CLI, the Client, and the WUI. The CLI is the most common; when it starts, it launches a local copy of Hive at the same time. The Client is the Hive client and connects to a Hive Server; when starting in client mode, you must specify the node where Hive Server runs and start Hive Server on that node. The WUI accesses Hive through a browser.
- Hive stores metadata in a database such as MySQL or Derby. The metadata includes table names, the columns and partitions of each table and their attributes, table properties (such as whether a table is external), the directory where each table's data resides, and so on.
- The interpreter, compiler, and optimizer take an HQL query statement through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed via MapReduce.
- Hive data is stored in HDFS, and most queries and computations are carried out by MapReduce (queries that simply read data, such as SELECT * FROM tbl, do not generate MapReduce jobs).
2 Modes for Connecting to the Metastore Database
2.1 Single-User Mode
This mode connects to an in-memory Derby database and is typically used for unit tests.
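As a sketch, single-user mode is selected through the JDBC connection properties in hive-site.xml. The property names below are standard Hive metastore settings; the database name `metastore_db` is the conventional default, shown here for illustration:

```xml
<!-- hive-site.xml: embedded (single-user) Derby metastore -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
```

Because embedded Derby allows only one active connection, this mode is unsuitable for shared deployments.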
2.2 Multi-user mode
In this mode, Hive connects to a database over the network; this is the most frequently used mode.
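A minimal hive-site.xml sketch for multi-user mode with MySQL as the metastore backend; the property names are standard Hive settings, while the host, database name, user, and password are placeholders for illustration:

```xml
<!-- hive-site.xml: metastore in a networked MySQL database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```

The MySQL JDBC driver jar must also be placed on Hive's classpath (typically in Hive's lib directory).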
2.3 Remote Server Mode
To let non-Java clients access the metastore, a Metastore Server is started on the server side, and clients use the Thrift protocol to access the metastore through the Metastore Server.
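On the client side, remote mode only requires pointing Hive at the Metastore Server's Thrift endpoint. The property name is the standard Hive setting; the host name is a placeholder, and 9083 is the conventional default metastore port:

```xml
<!-- hive-site.xml on the client: point at a remote Metastore Server -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
```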
3 Data Model of hive
For data storage, Hive has no dedicated storage format and does not build indexes on the data. Users can organize tables in Hive quite freely: simply declare the column and row separators when creating a table, and Hive can parse the data.
All data in Hive is stored in HDFS, and the storage structure consists primarily of databases, files, tables, and views.
Hive contains the following data models: internal tables (Table), external tables (External Table), partitions (Partition), and buckets (Bucket). By default Hive loads plain text files directly, and it also supports SequenceFile and RCFile.
3.1 Hive Database
A database is similar to a database in a traditional DBMS; its metadata is actually recorded in the third-party metastore database. Simple example:
hive> create database test_database;
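Beyond creation, a database is typically made current before working with its tables. A short sketch, where test_database follows the example above:

```sql
-- create the database if absent, switch to it, and list databases
CREATE DATABASE IF NOT EXISTS test_database;
USE test_database;
SHOW DATABASES;
-- dropping fails if the database still contains tables
DROP DATABASE IF EXISTS test_database;
```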
3.2 Internal Tables
The internal tables of Hive are conceptually similar to tables in a database. Each table has a corresponding directory in HDFS that stores its data. For example, a table pvs has the HDFS path /wh/pvs, where /wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All table data (excluding external tables) is stored in this directory. When an internal table is deleted, both its metadata and its data are deleted.
Simple example of an internal table:
Create a data file: test_inner_table.txt
Create the table: create table test_inner_table (key string)
Load the data: LOAD DATA LOCAL INPATH 'filepath' INTO TABLE test_inner_table
Query the data: select * from test_inner_table; select count(*) from test_inner_table
Drop the table: drop table test_inner_table
3.3 External Tables
An external table points to data that already exists in HDFS, and it can have partitions. It is organized the same way as an internal table in the metadata, but the actual data storage differs considerably. For an internal table, table creation and data loading can be done independently or in the same statement; during loading, the actual data is moved into the data warehouse directory, and subsequent access reads directly from that directory. When the internal table is deleted, its data and metadata are deleted together. An external table, by contrast, involves only one step: creating the table and loading the data happen simultaneously (CREATE EXTERNAL TABLE ... LOCATION ...). The actual data is stored in the HDFS path given after LOCATION and is not moved to the data warehouse directory. When an external table is deleted, only the link to the data is removed; the data itself remains.
Simple example of an external table:
Create a data file: test_external_table.txt
Create the table: create external table test_external_table (key string)
Load the data: LOAD DATA INPATH 'filepath' INTO TABLE test_external_table
Query the data: select * from test_external_table; select count(*) from test_external_table
Drop the table: drop table test_external_table
3.4 Partitioning
A partition corresponds to a dense index on the partition column in a database, but partitions in Hive are organized differently. In Hive, a partition of a table corresponds to a subdirectory under the table's directory, and all of the partition's data is stored in that directory.
For example, if the pvs table contains the two partitions ds and ctry, then the HDFS subdirectory corresponding to ds = 20090801, ctry = US is /wh/pvs/ds=20090801/ctry=US, and the subdirectory corresponding to ds = 20090801, ctry = CA is /wh/pvs/ds=20090801/ctry=CA.
Partition Table Simple example:
Create a data file: test_partition_table.txt
Create the table: create table test_partition_table (key string) partitioned by (dt string)
Load the data: load data inpath 'filepath' into table test_partition_table partition (dt='2006')
Query the data: select * from test_partition_table; select count(*) from test_partition_table
Drop the table: drop table test_partition_table
3.5 Buckets
Buckets decompose table data further into different files using a hash algorithm: Hive computes a hash over the specified column and slices the data according to the hash value, so that each bucket corresponds to one file and the buckets can be processed in parallel.
For example, to scatter the user column into 32 buckets, Hive first computes the hash of the user column's value. The HDFS directory corresponding to a hash value of 0 is /wh/pvs/ds=20090801/ctry=US/part-00000, and the directory for a hash value of 20 is /wh/pvs/ds=20090801/ctry=US/part-00020. Bucketing is a good choice if you want to run many map tasks in parallel.
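A common use of bucketing is efficient sampling: because rows are already split by hash, Hive can read a single bucket instead of scanning the whole table. A sketch, reusing the pvs table and user column from the example above:

```sql
-- read only the 3rd of 32 hash buckets on the user column
SELECT * FROM pvs TABLESAMPLE(BUCKET 3 OUT OF 32 ON user);
```

When the table is already clustered by the sampled column into a matching bucket count, this touches only one file per partition rather than the full data set.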
A simple example of a bucket:
Create a data file: test_bucket_table.txt
Create the table: create table test_bucket_table (key string) clustered by (key) into 20 buckets
Enable bucketed loading: set hive.enforce.bucketing = true;
Load the data: LOAD DATA INPATH 'filepath' INTO TABLE test_bucket_table
Query the data: select * from test_bucket_table
View of 3.6 Hive
Views are similar to views in traditional databases. A Hive view is read-only and is based on its base tables; its definition is fixed when it is created, so later additions of data to the base tables do not change how the view is defined.
If you do not specify columns for the view, the view's columns are generated from the SELECT statement.
Example:
create view test_view as select * from test
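To make the column naming explicit, a view can also declare its own column names over the SELECT. A small sketch, where the table test and its key column follow the example above:

```sql
-- the view exposes the base column key under the explicit name k
CREATE VIEW test_view_named (k) AS SELECT key FROM test;
SELECT k FROM test_view_named;
DROP VIEW test_view_named;
```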