The structure of Hive, as shown in the diagram, is mainly divided into the following parts:
User interface, including the CLI, Client, and WUI.
Metadata store, typically a relational database such as MySQL or Derby.
Interpreter, compiler, optimizer, and executor.
Hadoop: data is stored in HDFS and computation is done with MapReduce.

There are three main user interfaces: CLI, Client, and WUI. The most common is the CLI; when the CLI starts, a copy of Hive is started along with it. The Client is the Hive client, which connects the user to the Hive Server; when starting in Client mode, you need to specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.

Hive stores metadata in a database such as MySQL or Derby. The metadata in Hive includes the name of the table, the table's columns and partitions and their attributes, the table's properties (whether it is an external table, etc.), the directory where the table's data is stored, and so on.

The interpreter, compiler, and optimizer carry an HQL query statement through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.

Hive data is stored in HDFS, and most queries are completed by MapReduce (queries such as SELECT * FROM tbl are the exception and do not generate MapReduce tasks).
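As a minimal illustration of that last point, the HiveQL sketch below contrasts a query Hive can answer by reading files directly with one that compiles into a MapReduce job. The pvs table name is taken from the examples later in this article; the ctry column is an assumption for illustration:

    -- Served by a direct file read; no MapReduce job is generated.
    SELECT * FROM pvs;

    -- Requires aggregation, so Hive compiles it into a MapReduce job.
    SELECT ctry, COUNT(1) FROM pvs GROUP BY ctry;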
Hive Metadata Storage

Hive stores metadata in an RDBMS. There are three modes for connecting to the database:
Single User Mode: connects to an in-process Derby database and is generally used for unit tests.

Multi User Mode: connects to a database over the network; this is the most frequently used pattern.

Remote Server Mode: for non-Java clients accessing the metadata database. A MetaStoreServer is started on the server side, and clients access the metadata database through the MetaStoreServer using the Thrift protocol.
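As a rough sketch, the three modes differ mainly in a few hive-site.xml properties. The values below are placeholders for illustration, not settings taken from this article:

    <!-- Single User Mode: embedded Derby (a typical default URL) -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>

    <!-- Multi User Mode: point the same property at a networked database,
         e.g. jdbc:mysql://<host>/<db>, with the matching driver name. -->

    <!-- Remote Server Mode: clients talk to a MetaStoreServer over Thrift. -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host:9083</value>
    </property>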
Hive Data Storage

First, Hive has no dedicated data storage format and does not build indexes on the data. Users can organize tables in Hive very freely: when creating a table, you only need to tell Hive the column separator and row separator used in the data, and Hive can parse it (see the sketch below).
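For example, a minimal HiveQL sketch of declaring the separators at table-creation time (the table and column names are assumptions for illustration):

    -- Tell Hive how the raw file is delimited; the data itself is not converted.
    CREATE TABLE pvs_raw (
      userid STRING,
      url    STRING
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'   -- column separator
      LINES TERMINATED BY '\n';   -- row separator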
Second, all data in Hive is stored in HDFS, and Hive contains the following data models: Table, External Table, Partition, and Bucket.
A Table in Hive is conceptually similar to a table in a database, and each Table has a corresponding directory in HDFS that stores its data. For example, for a table pvs, the path in HDFS is /wh/pvs, where /wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All Table data (excluding External Table data) is saved under this directory.

A Partition corresponds to a dense index on the Partition column in a database, but Partitions in Hive are organized differently. In Hive, each Partition of a table corresponds to a subdirectory, and all of a Partition's data is stored in that directory. For example, if the pvs table contains the two Partition columns ds and ctry, then the HDFS subdirectory for ds = 20090801, ctry = US is /wh/pvs/ds=20090801/ctry=US, and the HDFS subdirectory for ds = 20090801, ctry = CA is /wh/pvs/ds=20090801/ctry=CA.

Buckets compute a hash on a specified column and split the data according to the hash value, so that it can be processed in parallel; each Bucket corresponds to one file. If the user column is dispersed into 32 Buckets, the hash is first computed on the value of the user column; the HDFS file for hash value 0 is /wh/pvs/ds=20090801/ctry=US/part-00000, and the HDFS file for hash value 20 is /wh/pvs/ds=20090801/ctry=US/part-00020.

An External Table points to data that already exists in HDFS, and it can also have Partitions. It is organized the same way as a Table in its metadata, but the actual data storage is quite different. For a Table, creation and data loading are separate steps (which can be done in the same statement); the actual data is moved into the data warehouse directory during loading, and subsequent access reads it directly from that directory. When a Table is deleted, its data and metadata are deleted together. An External Table involves only one step: loading the data and creating the table happen at the same time (CREATE EXTERNAL TABLE ... LOCATION); the actual data stays in the HDFS path specified after LOCATION and is not moved into the data warehouse directory. When an External Table is deleted, only the metadata is removed; the data itself is left in place.
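Putting these models together, a hedged HiveQL sketch of the pvs example above. Only the partition columns ds and ctry, the user bucketing column, and the 32-bucket count come from the text; the other column names and the external data path are assumptions for illustration:

    -- Managed table: partitioned by ds and ctry, bucketed on user into 32 buckets.
    CREATE TABLE pvs (
      `user` STRING,
      url    STRING
    )
    PARTITIONED BY (ds STRING, ctry STRING)
    CLUSTERED BY (`user`) INTO 32 BUCKETS;

    -- External table: Hive records only metadata; the data stays at the
    -- LOCATION path and survives a DROP TABLE.
    CREATE EXTERNAL TABLE pvs_ext (
      `user` STRING,
      url    STRING
    )
    PARTITIONED BY (ds STRING, ctry STRING)
    LOCATION '/existing/hdfs/path/pvs';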