Reproduced Hive structure

Source: Internet
Author: User
Tags: metadata

Reprinted from http://www.csdn.net/article/2010-11-28/282616

Hive Architecture

The structure of Hive

Hive is mainly divided into the following parts:

User interfaces, including the CLI, Client, and WUI.

Metadata store, usually a relational database such as MySQL or Derby.

Interpreter, compiler, optimizer, and executor.

Hadoop: HDFS for storage and MapReduce for computation.

There are three main user interfaces: the CLI, the Client, and the WUI. The most common is the CLI; starting the CLI also starts a local copy of Hive. The Client is the Hive client and connects to the Hive Server; when starting in Client mode, you must specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.

Hive stores metadata in a database such as MySQL or Derby. Metadata in Hive includes the names of tables, the columns and partitions of a table and their properties, table properties (whether it is an external table, etc.), the directory where the table's data resides, and so on.
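As an illustration, this metadata can be inspected from the CLI with standard HiveQL (the table name `pvs` anticipates the example used later in this article):

```sql
-- Prints the metadata discussed above: columns, partition keys,
-- the HDFS location of the table's data, and whether the table
-- is a MANAGED_TABLE or an EXTERNAL_TABLE.
DESCRIBE FORMATTED pvs;
```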

The interpreter, compiler, and optimizer carry an HQL statement through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and is subsequently executed by MapReduce.

Hive data is stored in HDFS, and most queries are executed by MapReduce (queries that merely scan the table, such as SELECT * FROM tbl, do not generate MapReduce jobs).

Hive Metadata Store

Hive stores metadata in an RDBMS and has three modes for connecting to the database:

Single User mode: connects to an in-memory Derby database; typically used for unit tests.

Multi User mode: connects to a database over the network; this is the most frequently used mode.

Remote Server mode: for non-Java clients accessing the metadata database, a MetaStoreServer is started on the server side, and clients access the metadata database through the MetaStoreServer using the Thrift protocol.
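The connection mode is selected through properties in hive-site.xml. A minimal sketch for the common Multi User (MySQL) case — the host names, database name, and credentials below are placeholders, not values from this article:

```xml
<!-- hive-site.xml: metastore backed by a networked MySQL database.
     Host, port, database name, and credentials are illustrative. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>
<!-- Remote Server mode instead points clients at a Thrift MetaStoreServer:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>
-->
```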

Data storage for Hive

First, Hive has no dedicated data storage format and does not index the data, so users can organize tables in Hive quite freely: simply declare the column delimiters and row delimiters of the data when creating the table, and Hive can parse the data.
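For example, a sketch of declaring the delimiters at table-creation time (the table and column names here are illustrative, not from the article):

```sql
-- Tell Hive how to split the raw files: columns separated by tabs,
-- rows separated by newlines. Hive needs nothing more to parse them.
CREATE TABLE pvs_raw (
  userid STRING,
  url    STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```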

Second, all data in Hive is stored in HDFS, and Hive contains the following data models: Table, External Table, Partition, and Bucket.

A Table in Hive is conceptually similar to a table in a database, and each Table has a corresponding directory in HDFS where its data is stored. For example, a table pvs has the HDFS path /wh/pvs, where wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All Table data (excluding External Table data) is stored in this directory.

A Partition corresponds to a dense index on the partition columns in a database, but Partitions in Hive are organized differently. In Hive, a Partition of a table corresponds to a subdirectory under the table's directory, and all data of a Partition is stored in that directory. For example, if the pvs table contains the two partition columns ds and ctry, then the HDFS subdirectory for ds = 20090801, ctry = us is /wh/pvs/ds=20090801/ctry=us, and the HDFS subdirectory for ds = 20090801, ctry = ca is /wh/pvs/ds=20090801/ctry=ca.
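A sketch of the DDL behind this layout, continuing the pvs example (the non-partition column names are illustrative):

```sql
-- Partition columns ds and ctry become directory levels in HDFS,
-- not columns stored inside the data files.
CREATE TABLE pvs (
  userid STRING,
  url    STRING
)
PARTITIONED BY (ds STRING, ctry STRING);

-- Loading into a partition populates /wh/pvs/ds=20090801/ctry=us
LOAD DATA INPATH '/tmp/pvs_us.txt'
INTO TABLE pvs PARTITION (ds = '20090801', ctry = 'us');
```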

A Bucket computes a hash over a specified column and splits the data according to the hash value, so that the Buckets can be processed in parallel; each Bucket corresponds to one file. For example, spreading the user column across 32 Buckets first computes the hash of the user column's value: the HDFS file for hash value 0 is /wh/pvs/ds=20090801/ctry=us/part-00000, and the file for hash value 20 is /wh/pvs/ds=20090801/ctry=us/part-00020.
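A sketch of the corresponding DDL (the table and column names are illustrative; the article's "user" column is rendered here as userid):

```sql
-- Older Hive versions need this so inserts actually populate buckets.
SET hive.enforce.bucketing = true;

-- Hash the userid column into 32 buckets; each bucket becomes one file
-- (part-00000 ... part-00031) inside each partition directory.
CREATE TABLE pvs_bucketed (
  userid STRING,
  url    STRING
)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;
```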

An External Table points to data that already exists in HDFS, and it too can have Partitions. It is organized the same way as a Table in the metadata, but the actual data is stored quite differently.

For a Table, creation and data loading are two processes (both can be completed in the same statement). While loading data, the actual data is moved into the data warehouse directory, and subsequent access to the data happens directly in the data warehouse directory. When a table is deleted, the table's data and metadata are deleted together.

An External Table involves only one process: loading the data and creating the table happen at the same time (CREATE EXTERNAL TABLE ... LOCATION). The actual data is stored in the HDFS path given after LOCATION and is not moved to the data warehouse directory.
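A sketch of that single statement (the HDFS path and column names are illustrative):

```sql
-- The data stays where it already lives in HDFS; dropping this table
-- removes only the metadata, leaving the files in place.
CREATE EXTERNAL TABLE pvs_ext (
  userid STRING,
  url    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/pvs';
```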

