Hive Summary (IX): Hive Architecture

1. Hive architecture and basic composition
The following is the architecture diagram for Hive.
Figure 1.1 Hive architecture
The architecture of Hive can be divided into the following parts:
(1) There are three main user interfaces: CLI, Client, and WUI. The most commonly used is the CLI; when the CLI starts, it launches a copy of Hive at the same time. The Client is Hive's client, through which the user connects to the Hive Server; when starting client mode, you need to specify the node where the Hive Server runs and start Hive Server on that node. The WUI accesses Hive through a browser.
(2) Hive stores its metadata in a database, such as MySQL or Derby. The metadata in Hive includes the table name, the table's columns and partitions and their attributes, the table's properties (whether it is an external table, etc.), the directory where the table's data resides, and so on.
(3) The interpreter, compiler, and optimizer carry an HQL query statement through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.
(4) Hive data is stored in HDFS, and most queries and computations are performed by MapReduce (queries containing only *, such as SELECT * FROM tbl, do not generate MapReduce tasks).
Hive stores its metadata in an RDBMS, and there are three modes for connecting to that database:
(1) Single-user mode. This mode connects to an in-memory database, Derby, and is typically used for unit tests.
Figure 2.1 Single-user mode
(2) Multi-user mode. Hive connects to a database (such as MySQL) over the network; this is the most frequently used mode.

Figure 2.2 Multi-user mode
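Multi-user mode is typically configured in hive-site.xml. A minimal sketch, assuming a MySQL metastore (the host name, database name, and credentials below are placeholders):

    <!-- hive-site.xml: metastore connection for multi-user mode (a sketch) -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://mysql_host:3306/hive_meta?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive_user</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive_password</value>
    </property>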

(3) Remote server mode. For non-Java clients to access the metastore database, a MetaStoreServer is started on the server side, and clients use the Thrift protocol to access the metastore database through the MetaStoreServer.
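A rough sketch of remote server mode (metastore_host is a placeholder; 9083 is the conventional default metastore port):

    # On the server side, start the metastore service:
    hive --service metastore

    # On each client, point hive.metastore.uris at that server in hive-site.xml:
    # <property>
    #   <name>hive.metastore.uris</name>
    #   <value>thrift://metastore_host:9083</value>
    # </property>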
For data storage, Hive has no dedicated data storage format and builds no indexes on the data. Users can organize tables in Hive very freely: you simply tell Hive the column and row separators of your data when creating the table, and Hive can parse the data. All Hive data is stored in HDFS, and the storage structure consists primarily of databases, files, tables, and views. Hive contains the following data models: internal tables (Table), external tables (External Table), partitions (Partition), and buckets (Bucket). By default Hive loads plain text files directly, and it also supports SequenceFile and RCFile.
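For example, a table over tab-separated text can be declared like this (a sketch; the table and column names are illustrative):

    CREATE TABLE page_views (
      user_id STRING,
      url     STRING
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'   -- column separator
      LINES TERMINATED BY '\n'    -- row separator
    STORED AS TEXTFILE;           -- or SEQUENCEFILE / RCFILE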
The data model for hive is described below:
(1) Database. A database in Hive is similar to a database in a traditional system; in the metastore it actually corresponds to a table in the third-party database. A simple example from the command line: hive> CREATE DATABASE test_database;
(2) Internal table. Hive's internal tables are conceptually similar to tables in a database. Every table has a corresponding directory in HDFS that stores its data. For example, a table pvs has the HDFS path /wh/pvs, where /wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml; all table data (excluding external tables) is stored in this directory. When you delete an internal table, both the metadata and the data are deleted. A simple example of an internal table:
Create a data file: test_inner_table.txt
Create the table: CREATE TABLE test_inner_table (key string);
Load data: LOAD DATA LOCAL INPATH 'filepath' INTO TABLE test_inner_table;
View data: SELECT * FROM test_inner_table; SELECT COUNT(*) FROM test_inner_table;
Delete the table: DROP TABLE test_inner_table;
(3) External table. An external table points to data that already exists in HDFS, and it can also have partitions. It is organized the same way as an internal table in the metadata, but the actual data storage differs considerably. For an internal table, table creation and data loading can be done independently or in the same statement; during loading the actual data is moved into the data warehouse directory, subsequent data access happens directly in that directory, and deleting the table deletes both the data and the metadata. An external table involves only one step: loading the data and creating the table are done simultaneously (CREATE EXTERNAL TABLE ... LOCATION); the actual data is stored in the HDFS path given after LOCATION and is not moved to the data warehouse directory. When you delete an external table, only the link (the metadata) is deleted.
Simple example of an external table:
Create a data file: test_external_table.txt
Create the table: CREATE EXTERNAL TABLE test_external_table (key string);
Load data: LOAD DATA INPATH 'filepath' INTO TABLE test_external_table;
View data: SELECT * FROM test_external_table; SELECT COUNT(*) FROM test_external_table;
Delete the table: DROP TABLE test_external_table;

(4) Partition. A partition corresponds to a dense index on the partition column in a database, but partitions in Hive are organized differently. In Hive, each partition of a table corresponds to a subdirectory under the table's directory, and all of a partition's data is stored in that directory. For example, if the pvs table contains the two partition columns ds and ctry, the HDFS subdirectory for ds = 20090801, ctry = US is /wh/pvs/ds=20090801/ctry=US, and the subdirectory for ds = 20090801, ctry = CA is /wh/pvs/ds=20090801/ctry=CA. A simple example of a partitioned table:
Create a data file: test_partition_table.txt
Create the table: CREATE TABLE test_partition_table (key string) PARTITIONED BY (dt string);
Load data: LOAD DATA INPATH 'filepath' INTO TABLE test_partition_table PARTITION (dt='2006');
View data: SELECT * FROM test_partition_table; SELECT COUNT(*) FROM test_partition_table;
Delete the table: DROP TABLE test_partition_table;
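Because each partition is just a subdirectory, a filter on the partition column lets Hive read only the matching directory rather than the whole table; partitions can also be listed and dropped individually. A sketch based on the table above:

    -- Reads only the dt=2006 subdirectory, not the whole table:
    SELECT key FROM test_partition_table WHERE dt = '2006';

    SHOW PARTITIONS test_partition_table;
    ALTER TABLE test_partition_table DROP PARTITION (dt = '2006');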
(5) Bucket. Buckets further decompose a table column, via a hash algorithm, into different files for storage. Hive computes a hash over the specified column and splits the data according to the hash value, with each bucket corresponding to one file, so the buckets can be processed in parallel. For example, to scatter the user column across 32 buckets, Hive first computes the hash of the user column's value; the HDFS file for hash value 0 is /wh/pvs/ds=20090801/ctry=US/part-00000, and the HDFS file for hash value 20 is /wh/pvs/ds=20090801/ctry=US/part-00020. Bucketing is a good choice when you want to run many map tasks in parallel.
A simple example of a bucketed table (the bucket count of 32 follows the example above):
Create a data file: test_bucket_table.txt
Create the table: CREATE TABLE test_bucket_table (key string) CLUSTERED BY (key) INTO 32 BUCKETS;
Load data: SET hive.enforce.bucketing = true; LOAD DATA INPATH 'filepath' INTO TABLE test_bucket_table;
View data: SELECT * FROM test_bucket_table;
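Note that LOAD DATA copies files as-is and does not re-hash rows into buckets. A sketch of actually populating the buckets, reusing test_inner_table as an assumed source table, is to enable bucketing enforcement and insert from a query; a bucketed table can then be sampled one bucket at a time:

    SET hive.enforce.bucketing = true;
    INSERT OVERWRITE TABLE test_bucket_table
    SELECT key FROM test_inner_table;

    -- Sample a single bucket out of the 32:
    SELECT * FROM test_bucket_table TABLESAMPLE (BUCKET 1 OUT OF 32 ON key);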

(6) View. Views in Hive are similar to views in traditional databases, with these characteristics:
• A view is read-only. It is defined over its base tables; changes to a base table (such as newly added data) do not affect the view's definition.
• If you do not specify columns for the view, its columns are derived from the SELECT statement.
Example: CREATE VIEW test_view AS SELECT * FROM test;
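A view's columns can also be named explicitly rather than derived from the SELECT list. A sketch, assuming the base table test from the example above:

    CREATE VIEW test_view (v_key) AS SELECT key FROM test;
    SELECT * FROM test_view;
    DROP VIEW test_view;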
2. The execution principle of Hive

Figure 2.1 How Hive is executed
Hive is built on top of Hadoop:
(1) The interpretation, optimization, and plan generation for HQL query statements are done by Hive itself.
(2) All data is stored in Hadoop.
(3) Query plans are converted into MapReduce tasks and executed in Hadoop (some queries have no MapReduce task, such as SELECT * FROM table).
(4) Both Hadoop and Hive use UTF-8 encoding.
The Hive compiler converts a Hive QL statement into a tree of operators. The operator is Hive's smallest processing unit; each operator represents an HDFS operation or a MapReduce job. An operator is a processing step defined by Hive, declared in Hive's Java source as:
protected List<Operator<? extends Serializable>> childOperators;
protected List<Operator<? extends Serializable>> parentOperators;
protected boolean done; // initialized to false
All operators together form the operator graph, and Hive handles operations such as limit, group by, and join based on these graph relationships.
Figure 2.2 The operators of Hive QL
The operators are as follows:
TableScanOperator: scans the data of a Hive table
ReduceSinkOperator: creates the <key, value> pairs that are sent to the reducer side
JoinOperator: joins two streams of data
SelectOperator: selects the output columns
FileSinkOperator: assembles the result data and writes it out to a file
FilterOperator: filters the input data
GroupByOperator: GROUP BY statements
MapJoinOperator: /*+ MAPJOIN(t) */ hints
LimitOperator: LIMIT statements
UnionOperator: UNION statements
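A convenient way to see this operator tree is EXPLAIN, which prints the plan Hive builds for a statement. For a filtered aggregation like the sketch below (reusing test_inner_table from the earlier examples), the printed plan contains operators along the lines of the table scan, filter, group-by, reduce-sink, and file-sink operators listed above:

    EXPLAIN
    SELECT key, COUNT(*)
    FROM test_inner_table
    WHERE key IS NOT NULL
    GROUP BY key;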
Hive executes MapReduce tasks through ExecMapper and ExecReducer. MapReduce can run in two modes: local mode and distributed mode. The Hive compiler consists of:


Figure 2.3 The composition of the Hive compiler
The compilation process is as follows:

Figure 2.4 Hive QL Compilation process



3. Similarities and differences between Hive and databases
Because Hive uses the SQL-like query language HQL, it is easy to mistake Hive for a database. In fact, apart from having a similar query language, Hive has nothing in common with databases in terms of structure. Databases can be used for online applications, but Hive is designed for the data warehouse; being clear about this helps in understanding Hive's characteristics from an application perspective.
The following table compares Hive with an RDBMS:

                      Hive                      RDBMS
Query language        HQL                       SQL
Data storage          HDFS                      Raw device or local FS
Data format           User defined              System determined
Data updates          Not supported             Supported
Indexes               None                      Yes
Execution             MapReduce                 Executor
Execution latency     High                      Low
Data scale            Large                     Small
Scalability           High                      Low

(1) Query language. Because SQL is widely used in data warehouses, the SQL-like query language HQL was designed specifically for Hive's characteristics. Developers familiar with SQL can easily use Hive for development.
(2) Data storage location. Hive is built on top of Hadoop, and all Hive data is stored in HDFS. A database, by contrast, can store its data on a block device or a local file system.
(3) Data format. Hive defines no specific data format; the format is specified by the user, and a user-defined format involves three attributes: the column delimiter (usually a space, '\t', or '\001'), the row delimiter ('\n'), and the method for reading file data (Hive defaults to three file formats: TextFile, SequenceFile, and RCFile). Because loading requires no conversion from the user's data format to a Hive-defined format, Hive does not modify the data itself during loading; it simply copies or moves the file into the appropriate HDFS directory. In a database, different systems have different storage engines and define their own data formats, and all data is stored in a particular organization, so loading data into a database can be time-consuming.
(4) Data updates. Because Hive is designed for data-warehouse applications, whose content is written once and read many times, Hive does not support rewriting or modifying data in place; all data is determined when it is loaded. Data in a database is modified frequently, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
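A sketch of the contrast: where an RDBMS would change individual rows with UPDATE ... SET, classic Hive can only rewrite data wholesale, for example by reloading a file or overwriting a table with the result of a query:

    -- RDBMS style (not supported by classic Hive):
    --   UPDATE test_inner_table SET key = 'k2' WHERE key = 'k1';

    -- Hive style: rewrite the table contents in bulk instead.
    INSERT OVERWRITE TABLE test_inner_table
    SELECT key FROM test_inner_table WHERE key IS NOT NULL;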
(5) Indexes. As noted earlier, Hive does no processing of the data during loading, not even a scan, and therefore builds no indexes on any keys in the data. To access particular values satisfying a condition, Hive must brute-force scan the entire dataset, so access latency is high. Thanks to MapReduce, however, Hive can access the data in parallel, so even without indexes Hive still shows its advantage when accessing large amounts of data. A database is usually indexed on one or a few columns, so it can access small amounts of data under specific conditions with high efficiency and low latency. This high access latency is what makes Hive unsuitable for online data queries.
(6) Execution. Most queries in Hive are executed through Hadoop's MapReduce (queries like SELECT * FROM tbl require no MapReduce). A database usually has its own execution engine.
(7) Execution latency. As mentioned before, Hive must scan the whole table when querying data because there is no index, so its latency is high.
