Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for extracting, transforming, and loading data (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language, called QL, that lets users familiar with SQL query the data. As a data warehouse, Hive's data management can be described, by level of usage, in three aspects: metadata storage, data storage, and data exchange.
(1) Metadata storage
Hive stores metadata in an RDBMS. There are three modes for connecting to the database:
Single-user mode: connects to a Derby database and is generally used for unit testing.
Multi-user mode: connects to a database over the network. This is the most commonly used mode.
Remote server mode: used when non-Java clients need to access the metadata database. A MetaStoreServer is started on the server side, and clients reach the metadata database through it using the Thrift protocol.
(2) Data storage
First, Hive has no dedicated data storage format and does not build indexes on the data. Users can organize Hive tables very freely: it is enough to tell Hive the column and row delimiters of the data when the table is created, and Hive can then parse the data.
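The idea that delimiters alone are enough to interpret a file can be sketched as follows. This is a minimal illustration in Python, not Hive's actual parsing code: the function name parse_rows and the sample data are hypothetical. The default field delimiter shown (Ctrl-A, \x01) is Hive's default for text files, and newline is its row delimiter.

```python
def parse_rows(raw, field_delim="\x01", row_delim="\n"):
    """Split raw text into rows, then split each row into column values."""
    rows = []
    for line in raw.split(row_delim):
        if line == "":
            continue  # skip the empty piece left by a trailing row delimiter
        rows.append(line.split(field_delim))
    return rows


# Hypothetical sample: two rows, three columns, tab-delimited.
sample = "20090801\tUS\talice\n20090801\tCA\tbob\n"
rows = parse_rows(sample, field_delim="\t")
print(rows)
```

The file itself carries no schema; the same bytes would be read as different columns if a different delimiter were declared.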
Second, all Hive data is stored in HDFS. Hive has four data models: Table, External Table, Partition, and Bucket.
A Table in Hive is conceptually similar to a table in a database, and each Table has a corresponding directory in which Hive stores its data. For example, the data of a table pvs is stored in HDFS under /wh/pvs, where /wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All Table data (excluding External Tables) is saved under this directory.
A Partition corresponds to a dense index on the partition columns in a database, but Partitions in Hive are organized differently. In Hive, each Partition of a table corresponds to a directory under the table's directory, and all of the Partition's data is stored in that directory. For example, if the pvs table has two partition columns, ds and city, then the HDFS subdirectory for ds = 20090801, city = US is /wh/pvs/ds=20090801/city=US, and the subdirectory for ds = 20090801, city = CA is /wh/pvs/ds=20090801/city=CA.
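The mapping from partition values to a directory can be sketched in a few lines of Python. The helper partition_path is a hypothetical name used only for illustration; it simply joins the warehouse directory, the table name, and one column=value component per partition column, in order.

```python
def partition_path(warehouse_dir, table, partition_spec):
    """Build the directory for one partition.

    partition_spec is an ordered list of (column, value) pairs,
    outermost partition column first.
    """
    parts = [warehouse_dir.rstrip("/"), table]
    parts += [f"{col}={val}" for col, val in partition_spec]
    return "/".join(parts)


us_dir = partition_path("/wh", "pvs", [("ds", "20090801"), ("city", "US")])
ca_dir = partition_path("/wh", "pvs", [("ds", "20090801"), ("city", "CA")])
print(us_dir)
print(ca_dir)
```

Because the partition values are encoded in the path, a query that filters on ds or city only needs to read the matching directories.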
Buckets are computed by hashing a specified column and splitting the data according to the hash value, which makes parallel processing easier. Each Bucket corresponds to one file. For example, to spread the user column across 32 Buckets, Hive first computes the hash of each user value. The HDFS file for hash value 0 is /wh/pvs/ds=20090801/city=US/part-00000, and the file for hash value 20 is /wh/pvs/ds=20090801/city=US/part-00020.
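A rough sketch of bucket assignment, in Python, is shown below. The hash function here is an assumption: it mimics Java's String.hashCode as a stand-in and is not claimed to match Hive's hashing for every column type. The function names are hypothetical. The point is only the shape of the computation: hash the column value, take it modulo the number of buckets, and map the result to a part-NNNNN file.

```python
def java_string_hash(s):
    """Stand-in hash (assumption): Java-style String.hashCode, kept to 32 bits."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_index(value, num_buckets=32):
    """Map a column value to a bucket number in [0, num_buckets)."""
    # Mask off the sign bit so the result is never negative.
    return (java_string_hash(value) & 0x7FFFFFFF) % num_buckets


def bucket_file(partition_dir, index):
    """Name of the file that holds a given bucket inside a partition directory."""
    return f"{partition_dir}/part-{index:05d}"


partition_dir = "/wh/pvs/ds=20090801/city=US"
print(bucket_file(partition_dir, 0))
print(bucket_file(partition_dir, 20))
print(bucket_file(partition_dir, bucket_index("alice")))
```

Rows with the same user value always land in the same file, which is what allows the buckets to be processed independently.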
An External Table points to data that already exists in HDFS, and it can have Partitions. Its metadata is organized in the same way as a Table's, but the actual data is stored quite differently.
For a Table, creation and data loading are two steps (although both can be done in one statement). During loading, the actual data is moved into the data warehouse directory, and later access to the data happens directly in that directory. When the table is dropped, its data and metadata are deleted together.
An External Table involves only one step, because loading the data and creating the table happen at the same time. The actual data stays in the HDFS path given after LOCATION and is not moved into the data warehouse directory. When an External Table is dropped, only its metadata is deleted; the data is left in place.
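The difference between the two kinds of table can be illustrated with a toy in-memory model in Python. Everything in it is hypothetical: the class ToyWarehouse, its methods, and the sample paths are invented for illustration, a dictionary stands in for HDFS, and another stands in for the metastore. It only mirrors the behaviour described above: loading moves data into the table's directory, dropping a regular table removes data and metadata, and dropping an external table removes metadata only.

```python
class ToyWarehouse:
    """Toy model: `files` stands in for HDFS, `metadata` for the metastore."""

    def __init__(self, warehouse_dir="/wh"):
        self.warehouse_dir = warehouse_dir
        self.files = {}     # path -> contents
        self.metadata = {}  # table name -> {"location": str, "external": bool}

    def create_table(self, name, external=False, location=None):
        if external:
            if location is None:
                raise ValueError("an external table needs a location")
        else:
            # Regular tables live under the warehouse directory.
            location = f"{self.warehouse_dir}/{name}"
        self.metadata[name] = {"location": location, "external": external}

    def load(self, name, src_path):
        """Move a file into the table's directory and return its new path."""
        location = self.metadata[name]["location"]
        dest = f"{location}/{src_path.rsplit('/', 1)[-1]}"
        self.files[dest] = self.files.pop(src_path)
        return dest

    def drop_table(self, name):
        table = self.metadata.pop(name)
        if not table["external"]:
            # Only regular tables lose their data on drop.
            prefix = table["location"] + "/"
            for path in [p for p in self.files if p.startswith(prefix)]:
                del self.files[path]


wh = ToyWarehouse()
wh.files["/tmp/pvs.txt"] = "rows for the regular table"
wh.files["/data/ext/pvs.txt"] = "rows for the external table"

wh.create_table("pvs")
moved_to = wh.load("pvs", "/tmp/pvs.txt")  # data moves under /wh/pvs
wh.create_table("pvs_ext", external=True, location="/data/ext")

wh.drop_table("pvs")      # removes metadata and /wh/pvs/pvs.txt
wh.drop_table("pvs_ext")  # removes metadata only
print(sorted(wh.files))
```

After both drops, only the file under the external location remains, which is why external tables suit data that other tools also read.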
(3) Data exchange
Data exchange mainly involves the following components (Figure 1-5):
User interface: includes the client, the web interface, and the database interface.
Metadata storage: typically a relational database, such as MySQL or Derby.
Interpreter, compiler, optimizer, and executor.
Hadoop: uses HDFS for storage and MapReduce for computation.
There are three main user interfaces: the client, the database interface, and the web interface. The client is the most commonly used. It is Hive's client program, through which the user connects to Hive Server. When starting in client mode, you need to specify the node where Hive Server runs and start Hive Server on that node. The web interface provides access to Hive through a browser.
Hive stores its metadata in a database such as MySQL or Derby. The metadata includes table names, each table's columns and partitions and their properties, table properties (for example, whether the table is external), the directory where the table's data resides, and so on.
The interpreter, compiler, and optimizer take an HQL query from lexical analysis and parsing through compilation and optimization to the generation of a query plan. The generated query plan is stored in HDFS and is subsequently executed by MapReduce.
Hive data is stored in HDFS, and most queries are carried out by MapReduce. A plain select-all query such as SELECT * FROM tbl does not generate a MapReduce task.
Hadoop's data management has been introduced here through its distributed file system HDFS, the distributed database HBase, and the data warehouse tool Hive. Each uses its own data definitions and architecture to manage data at every level, from the macroscopic to the microscopic, and together they accomplish large-scale data storage and task processing on the Hadoop platform.