Reprinted from http://www.csdn.net/article/2010-11-28/282616
Hive Architecture
The structure of Hive
Hive consists mainly of the following parts:
User interfaces, including the CLI, Client, and WUI.
Metadata store, usually kept in a relational database such as MySQL or Derby.
Interpreter, compiler, optimizer, and executor.
Hadoop: HDFS for storage, MapReduce for computation.
There are three main user interfaces: the CLI, the Client, and the WUI. The most common is the CLI; starting the CLI also starts a local copy of Hive. The Client is the Hive client, through which a user connects to the Hive Server; to start in Client mode, you must specify the node on which the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.
Hive stores its metadata in a database such as MySQL or Derby. The metadata in Hive includes the table's name, its columns and partitions and their properties, the table's own properties (whether it is an external table, and so on), and the directory where the table's data resides.
The interpreter, compiler, and optimizer take an HQL query through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and later executed by MapReduce calls.
Hive's data is stored in HDFS, and most queries are executed by MapReduce (queries that merely scan a table, such as SELECT * FROM tbl, do not generate MapReduce jobs).
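As a quick illustration (a minimal sketch; tbl is the hypothetical table from the sentence above), a bare scan is served directly from the table's files, while an aggregate compiles to a MapReduce job; EXPLAIN prints the plan Hive generates either way:

    -- Plain scan: Hive can read the table's files directly, no MapReduce job.
    SELECT * FROM tbl;

    -- Aggregation: Hive compiles this into a MapReduce job.
    SELECT COUNT(*) FROM tbl;

    -- EXPLAIN prints the generated query plan without executing it.
    EXPLAIN SELECT COUNT(*) FROM tbl;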
Hive metadata store
Hive stores its metadata in an RDBMS, and there are three modes for connecting to the database:
Single User Mode: connects to an in-memory Derby database; typically used for unit tests.
Multi User Mode: connects to a database over the network; this is the most frequently used mode.
Remote Server Mode: for access to the metastore database from non-Java clients; a MetaStoreServer is started on the server side, and clients reach the metastore database through the MetaStoreServer via the Thrift protocol.
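As a hedged sketch of how these modes are typically selected in hive-site.xml (the host names, port, and database name below are illustrative placeholders, not from the original article):

    <!-- Multi User Mode: point the metastore at a networked MySQL database. -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://db-host:3306/hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>

    <!-- Remote Server Mode: clients reach a MetaStoreServer over Thrift. -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host:9083</value>
    </property>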
Data storage for Hive
First, Hive has no dedicated data storage format and builds no indexes on the data. Users can organize tables in Hive quite freely: they only need to declare the column and row separators of their data when creating a table, and Hive can then parse the data.
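For instance (a minimal sketch; the table name and separators are illustrative, not from the original), declaring the separators at creation time is all Hive needs to parse plain text files:

    -- Hive only needs to know how the text is delimited; it builds no index
    -- and imposes no storage format of its own.
    CREATE TABLE pv_log (userid INT, url STRING, view_time STRING)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;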
Second, all Hive data is stored in HDFS, and Hive provides the following data models: Table, External Table, Partition, and Bucket.
A Table in Hive is conceptually similar to a table in a database; each Table has a corresponding directory in HDFS where its data is stored. For example, the table pvs has the HDFS path /wh/pvs, where wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All Table data (External Table data excluded) is stored in this directory.
A Partition loosely corresponds to a dense index on the partition columns in a database, but Partitions in Hive are organized differently. In Hive, each Partition of a table corresponds to a subdirectory under the table's directory, and all of the Partition's data is stored there. For example, if the pvs table has the two partition columns ds and ctry, then the HDFS subdirectory for ds = 20090801, ctry = us is /wh/pvs/ds=20090801/ctry=us, and the subdirectory for ds = 20090801, ctry = ca is /wh/pvs/ds=20090801/ctry=ca.
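A minimal sketch of how such a pvs table could be declared and populated (the column list and input path are assumptions for illustration):

    -- Partition columns are not stored in the data files; each partition
    -- value pair becomes a subdirectory under /wh/pvs.
    CREATE TABLE pvs (userid INT, url STRING)
    PARTITIONED BY (ds STRING, ctry STRING);

    -- Loading into a partition places the files under
    -- /wh/pvs/ds=20090801/ctry=us.
    LOAD DATA INPATH '/tmp/pv_us.txt'
    INTO TABLE pvs PARTITION (ds = '20090801', ctry = 'us');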
A Bucket slices the data by computing a hash of a specified column, with each Bucket stored as one file, which enables parallel processing. For example, spreading the user column across 32 buckets first computes the hash of the user column's value; the HDFS file for hash value 0 is /wh/pvs/ds=20090801/ctry=us/part-00000, and the file for hash value 20 is /wh/pvs/ds=20090801/ctry=us/part-00020.
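Continuing the sketch, bucketing is declared with CLUSTERED BY; the userid column below stands in for the article's user column, and the hive.enforce.bucketing setting (needed in older Hive releases) makes inserts actually write one file per bucket:

    -- Hash the userid column into 32 buckets, one file per bucket.
    CREATE TABLE pvs_bucketed (userid INT, url STRING)
    PARTITIONED BY (ds STRING, ctry STRING)
    CLUSTERED BY (userid) INTO 32 BUCKETS;

    -- In older Hive releases this setting is needed so that an INSERT
    -- really produces 32 output files (part-00000 ... part-00031).
    SET hive.enforce.bucketing = true;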
An External Table points to data that already exists in HDFS and can also have Partitions. It is organized in the metadata exactly like a Table, but the actual data is stored very differently.
For a Table, creation and data loading (the two can be done in a single statement) move the actual data into the data warehouse directory during loading; data access afterwards happens directly in that directory. When the table is dropped, its data and metadata are deleted together.
For an External Table there is only one step: loading the data and creating the table happen at the same time (CREATE EXTERNAL TABLE ... LOCATION). The actual data remains in the HDFS path given after LOCATION and is not moved into the data warehouse directory; when an external table is dropped, only the metadata is removed.
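A minimal sketch of the External Table variant (the path and columns are illustrative):

    -- The data stays where it is; Hive only records the location
    -- in its metastore.
    CREATE EXTERNAL TABLE pvs_ext (userid INT, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/existing/pvs';

    -- Dropping an external table deletes the metadata but leaves
    -- the files under /data/existing/pvs untouched.
    DROP TABLE pvs_ext;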