Hive (II): Hive Structure

Last Update: 2015-03-17 Source: Internet

Author: User

The structure of Hive, as shown in the diagram, is mainly divided into the following parts:

User interface: CLI, Client, and WUI.

Metadata store: usually a relational database such as MySQL or Derby.

Interpreter, compiler, optimizer, and executor.

Hadoop: HDFS for storage and MapReduce for computation.

There are three main user interfaces: CLI, Client, and WUI. The CLI is the most commonly used; starting the CLI also starts a copy of Hive. The Client is Hive's client, through which the user connects to Hive Server. When starting in Client mode, you must specify the node where Hive Server runs and start Hive Server on that node. The WUI accesses Hive through a browser.

Hive stores metadata in a database such as MySQL or Derby. The metadata includes table names, each table's columns and partitions and their attributes, table properties (such as whether the table is external), the directory where the table's data resides, and so on.

The interpreter, compiler, and optimizer take an HQL query through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and is later executed through MapReduce calls.

Hive data is stored in HDFS, and most queries are carried out by MapReduce (queries that only select *, such as SELECT * FROM tbl, do not generate MapReduce tasks).

Hive Metadata Storage

Hive stores metadata in an RDBMS. There are three modes for connecting to the database:

Single User Mode: connects to a Derby database; generally used for unit tests.

Multi User Mode: connects to a database over the network; this is the most commonly used mode.

Remote Server Mode: used when non-Java clients need to access the metadata database. A MetaStoreServer is started on the server side, and clients access the metadata database through the MetaStoreServer using the Thrift protocol.

Hive Data Storage

First, Hive has no dedicated data storage format and does not index the data. Users can organize Hive tables very freely: when creating a table, they only need to tell Hive the column separator and row separator of the data, and Hive can then parse it.
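
To make the role of the two separators concrete, here is a minimal Python sketch of splitting raw text into rows and columns. It is only an illustration, not Hive's parser: the function name parse_rows and the sample data are hypothetical, and the default separators (Ctrl-A for columns, newline for rows) are assumed to match Hive's usual defaults for plain text tables.

```python
from typing import List


def parse_rows(raw: str, field_sep: str = "\x01", row_sep: str = "\n") -> List[List[str]]:
    """Split raw text into rows, then split each row into column values.

    Hypothetical helper for illustration; the defaults are assumed to
    mirror the usual text-table delimiters (Ctrl-A and newline).
    """
    rows = [row for row in raw.split(row_sep) if row]  # skip empty trailing rows
    return [row.split(field_sep) for row in rows]


# Made-up sample: comma as the column separator, newline as the row separator.
sample = "1,alice,US\n2,bob,CA\n"
for columns in parse_rows(sample, field_sep=","):
    print(columns)
```

Because the separators are the only thing the table definition has to supply, the same file can be read under a different layout simply by declaring different separators.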

Second, all data in Hive is stored in HDFS. Hive includes the following data models: Table, External Table, Partition, and Bucket.

A Table in Hive is conceptually similar to a table in a database. Each Table has a corresponding directory in Hive that stores its data. For example, the table pvs has the HDFS path /wh/pvs, where /wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All Table data (excluding External Table data) is stored in this directory.

A Partition corresponds to a dense index on the partition column in a database, but partitions are organized differently in Hive. In Hive, each Partition of a table corresponds to a subdirectory of the table's directory, and all of the Partition's data is stored in that subdirectory. For example, if the pvs table has two partition columns, ds and ctry, then the HDFS subdirectory for ds = 20090801, ctry = US is /wh/pvs/ds=20090801/ctry=US, and the HDFS subdirectory for ds = 20090801, ctry = CA is /wh/pvs/ds=20090801/ctry=CA.

Buckets compute a hash on a specified column and split the data by hash value so that it can be processed in parallel; each Bucket corresponds to one file. For example, to spread the user column across 32 buckets, the hash of the user column's value is computed first. The HDFS file for hash value 0 is /wh/pvs/ds=20090801/ctry=US/part-00000, and the HDFS file for hash value 20 is /wh/pvs/ds=20090801/ctry=US/part-00020.

An External Table points to data that already exists in HDFS and can have Partitions. Its metadata is organized in the same way as a Table's, but the actual data is stored quite differently.

For a Table, creation and data loading are two steps (which can be done in a single statement). During loading, the actual data is moved into the data warehouse directory, and later access to the data goes directly to that directory. When the table is deleted, both its data and its metadata are deleted.
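
The directory layout described above for tables, partitions, and buckets can be sketched in a few lines of Python. This is a rough model under stated assumptions: the function names are hypothetical, /wh is the warehouse directory from the article's example, and CRC32 is used as a stand-in hash since the article does not say which hash function Hive applies.

```python
import zlib
from typing import Sequence, Tuple

WAREHOUSE_DIR = "/wh"  # example value of hive.metastore.warehouse.dir


def table_dir(table: str, warehouse: str = WAREHOUSE_DIR) -> str:
    """Directory that holds the data of a (non-external) table."""
    return f"{warehouse}/{table}"


def partition_dir(table: str, partition: Sequence[Tuple[str, str]],
                  warehouse: str = WAREHOUSE_DIR) -> str:
    """One subdirectory level per partition column, in the order given."""
    levels = "/".join(f"{column}={value}" for column, value in partition)
    return f"{table_dir(table, warehouse)}/{levels}"


def bucket_index(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket number.

    CRC32 is a stand-in chosen for this sketch, not Hive's own hash.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def bucket_file(partition_path: str, index: int) -> str:
    """File for one bucket: part- followed by the zero-padded bucket number."""
    return f"{partition_path}/part-{index:05d}"


us = partition_dir("pvs", [("ds", "20090801"), ("ctry", "US")])
print(us)                   # /wh/pvs/ds=20090801/ctry=US
print(bucket_file(us, 20))  # /wh/pvs/ds=20090801/ctry=US/part-00020
```

A row whose user value is, say, "alice" would land in bucket_file(us, bucket_index("alice", 32)); which of the 32 files that is depends entirely on the hash function in use.
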
An External Table has only one step: creating the table and loading the data happen together (CREATE EXTERNAL TABLE ... LOCATION). The actual data stays in the HDFS path given after LOCATION and is not moved into the data warehouse directory. When an External Table is deleted, only the metadata is deleted; the data itself stays where it is in HDFS.
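
The difference between the two kinds of table can be modeled with a toy in-memory warehouse. The sketch below is not Hive code: ToyWarehouse, TableMeta, and the example paths are hypothetical names introduced for illustration, the "HDFS" is just a set of paths, and the behavior on dropping an External Table (metadata removed, data left untouched) reflects Hive's usual behavior.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class TableMeta:
    """Minimal metadata record: where the data lives and whether it is external."""
    name: str
    location: str
    external: bool


class ToyWarehouse:
    """Hypothetical stand-in for the metastore plus HDFS, for illustration only."""

    def __init__(self, warehouse_dir: str = "/wh") -> None:
        self.warehouse_dir = warehouse_dir
        self.metastore: Dict[str, TableMeta] = {}
        self.hdfs: Set[str] = set()  # paths that currently hold data

    def create_table(self, name: str) -> None:
        # A regular Table always lives under the warehouse directory.
        location = f"{self.warehouse_dir}/{name}"
        self.metastore[name] = TableMeta(name, location, external=False)

    def load_data(self, name: str, source_path: str) -> None:
        # Loading moves the data into the table's warehouse directory.
        self.hdfs.discard(source_path)
        self.hdfs.add(self.metastore[name].location)

    def create_external_table(self, name: str, location: str) -> None:
        # Only metadata is recorded; the data stays at LOCATION.
        self.metastore[name] = TableMeta(name, location, external=True)

    def drop_table(self, name: str) -> None:
        meta = self.metastore.pop(name)
        if not meta.external:
            self.hdfs.discard(meta.location)  # regular Table: data goes too


wh = ToyWarehouse()
wh.hdfs.update({"/staging/pvs", "/data/logs"})  # made-up pre-existing paths

wh.create_table("pvs")
wh.load_data("pvs", "/staging/pvs")
wh.create_external_table("logs", "/data/logs")

wh.drop_table("pvs")
wh.drop_table("logs")
print(sorted(wh.hdfs))  # only the external table's data path remains
```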
