First, Hive Overview and Architecture
1. What is Hive?
(1) Open-sourced by Facebook, originally built to compute statistics over massive structured log data
(2) A data warehouse built on top of Hadoop
(3) Hive defines a SQL-like query language, HQL (very similar to MySQL's SQL, with its own extensions)
(4) Typically used for offline data processing (with MapReduce)
(5) Can be thought of as an HQL-to-MapReduce language translator
(6) The underlying layer supports a number of different execution engines (MapReduce by default; Tez and Spark are also supported)
(7) Supports different compression formats, storage formats, and user-defined functions
(8) A database or table in Hive is a directory/folder in HDFS and its data is stored as files there; the metadata can be kept in any relational database (for example MySQL, SQL Server, or Oracle; Derby by default), while the data itself is stored in HDFS
2. The architecture of Hive
The image above is a Hive architecture diagram; it consists of the following parts.
User interface: Client
CLI (command-line shell): the hive command line is the Hive 1 client; HiveServer2 provides the newer Beeline client
JDBC/ODBC: access Hive from Java in the traditional database JDBC style, just as we access MySQL from Java via JDBC
WebUI: access Hive from a browser
Metadata: Metastore
Hive stores its metadata in a database (the Metastore), which can be any relational database. The metadata includes: the table name, the database the table belongs to, the table's owner, column/partition fields, the table type, and the table's data directory.
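To make this concrete, here is a minimal sketch (illustrative Python using sqlite3; the table and column names are simplified stand-ins, not the metastore's real schema) of the kind of relational rows a metastore holds:

```python
import sqlite3

# In-memory stand-in for the metastore database (MySQL in our setup).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified tables loosely inspired by a metastore layout; the real
# Hive metastore schema is considerably more involved.
cur.execute("""CREATE TABLE tbls (
    tbl_name TEXT, db_name TEXT, owner TEXT,
    tbl_type TEXT, location TEXT)""")
cur.execute("CREATE TABLE cols (tbl_name TEXT, col_name TEXT, col_type TEXT)")

# Registering one Hive table produces metadata rows like these
# (the access_log table and its columns are made-up examples):
cur.execute("INSERT INTO tbls VALUES (?,?,?,?,?)",
            ("access_log", "default", "hadoop",
             "MANAGED_TABLE", "hdfs:///user/hive/warehouse/access_log"))
cur.executemany("INSERT INTO cols VALUES (?,?,?)",
                [("access_log", "ip", "string"),
                 ("access_log", "ts", "bigint")])

# A query like DESCRIBE access_log is answered from the metastore alone,
# without touching the data files in HDFS.
row = cur.execute(
    "SELECT location FROM tbls WHERE tbl_name='access_log'").fetchone()
print(row[0])  # hdfs:///user/hive/warehouse/access_log
```

This is also why the metadata lives in MySQL while the data stays in HDFS: metadata rows are small and updated frequently, which suits a relational database.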
Driver
Includes: parser, compiler, optimizer, executor
Takes an HQL query through lexical analysis, parsing, compilation, optimization, and query-plan generation; the resulting query plan is stored in HDFS and subsequently executed by calling MapReduce.
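The "HQL-to-MapReduce translator" idea can be sketched in a few lines (illustrative Python; the sample rows and the status column are made up). A query such as SELECT status, COUNT(*) FROM logs GROUP BY status compiles to a map phase that emits (status, 1) pairs and a reduce phase that sums them:

```python
from collections import defaultdict

# Made-up input rows, standing in for lines of a log table in HDFS.
logs = [{"status": 200}, {"status": 404}, {"status": 200}, {"status": 500}]

# Map phase: emit one (key, 1) pair per row, as the generated MR job would.
mapped = [(row["status"], 1) for row in logs]

# Shuffle + reduce phase: group the pairs by key and sum the counts.
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value

print(dict(counts))  # {200: 2, 404: 1, 500: 1}
```

A real Hive plan adds splits, serialization, and shuffle across the cluster, but the map/reduce shape of a GROUP BY is exactly this.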
Hadoop
Uses HDFS for data storage, runs on YARN, and computes with MapReduce
3. Hive deployment architecture
1. Test environment
Hive is a data warehouse built on top of Hadoop. In practice Hadoop is invariably a cluster, but Hive itself only needs to be installed on a single server. Metadata is stored in Derby by default, but Derby is single-user and inconvenient to use, so we recommend installing a MySQL database dedicated to storing Hive's metadata.
2. Production environment
The production environment differs from the test environment in that MySQL runs with a primary and a standby: when one MySQL service goes down, a failover mechanism automatically switches to the other. The benefit is fault tolerance: it ensures high availability and prevents the loss of Hive metadata.
4. The difference from a relational database
Advantages and disadvantages of Hive
Advantages:
Easy to learn and simple to use; much simpler than programming with MapReduce directly
Good for offline data processing, such as log analysis and analysis of massive structured data
Built on Hadoop, so it is easy to scale; supports user-defined functions (UDFs)
Disadvantages:
Hive's execution latency is high, so it is not suitable for real-time queries
Hive's advantage lies in handling big data sets; it has no advantage for small data sets
Second, Environment Building
Metadata in Hive is stored in Derby by default. Derby is single-user and inconvenient, so we use MySQL to store Hive's metadata instead. The metadata includes the table name, the table's columns and partitions and their attributes, table properties (whether it is an external table, etc.), the table's data directory, and so on. Since Hive metadata must be updated and modified constantly, while files in HDFS are written once and read many times, Hive metadata clearly cannot be stored in HDFS.
1. Install MySQL
Installing MySQL directly with yum -y install will not get you a recent version of MySQL; you first need to install the RPM package that provides a yum repository for the MySQL 5.7-series Community Edition. Download it from the MySQL website:
MySQL official website: https://dev.mysql.com/downloads/file/?id=470281
MySQL Installation Tutorial official website: https://dev.mysql.com/doc/mysql-yum-repo-quick-guide/en/
After downloading, use Xshell to upload to Linux, and enter the following command to install
$ sudo rpm -Uvh mysql57-community-release-el7-11.noarch.rpm
After the installation is complete, review the list of MySQL versions the repository provides.
To install MySQL, it is enough to install mysql-community-server; yum automatically detects and installs the dependent packages
$ sudo yum -y install mysql-community-server
After the installation is complete, start MySQL
$ sudo systemctl start mysqld
After MySQL is installed, an initial password is generated; view it with the following command
$ sudo grep 'temporary password' /var/log/mysqld.log
Sign in with the password you just looked up
$ mysql -uroot -p
Change Password
Note that this version of MySQL enforces stronger password security: the password must contain at least one digit, one uppercase letter, one lowercase letter, and one special character, and must be at least 8 characters long.
ALTER USER 'root'@'localhost' IDENTIFIED BY 'your new password';
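Before running ALTER USER, a candidate password can be sanity-checked against the rules described above (illustrative Python; this mimics the documented policy, it is not MySQL's actual validate_password plugin):

```python
import re

def meets_policy(pw: str) -> bool:
    """At least 8 chars, with a digit, an uppercase letter,
    a lowercase letter, and a special character."""
    return (len(pw) >= 8
            and re.search(r"[0-9]", pw) is not None
            and re.search(r"[A-Z]", pw) is not None
            and re.search(r"[a-z]", pw) is not None
            and re.search(r"[^0-9A-Za-z]", pw) is not None)

print(meets_policy("P@ssw0rd"))  # True
print(meets_policy("password"))  # False: no digit, uppercase, or special char
```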
After the change is complete, type exit to leave the MySQL shell, then log in again with the new password.
2. Install Hive
Download Hive's installation package from the official website, address: http://apache.fayea.com/hive/
The version downloaded here is 2.3.0; after downloading, upload it to Linux and extract it with the following command
$ tar -zxvf apache-hive-2.3.0-bin.tar.gz
Configure the hive-site.xml file
Go into Hive's conf directory and copy hive-default.xml.template to hive-site.xml
$ cd /home/hadoop/hive/conf
$ cp hive-default.xml.template hive-site.xml
Create a temporary directory for Hive
$ mkdir /home/hadoop/hive/tmp
hive-site.xml is configured as follows; refer to the Hive website for more information
<configuration>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/hadoop/hive/tmp/hive</value>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/home/hadoop/hive/tmp/${hive.session.id}_resources</value>
    <description>Directory where Hive stores downloaded resources</description>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/home/hadoop/hive/tmp/hive</value>
    <description>Directory where Hive stores structured logs</description>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/home/hadoop/hive/tmp/hive/operation_logs</value>
    <description>Directory for Hive operation logs</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
    <description>Directory where Hive stores the execution plans for the different map/reduce stages and their intermediate output</description>
  </property>
  <property>
    <name>hive.scratch.dir.permission</name>
    <value>700</value>
    <description>Permissions Hive uses when accessing Hadoop file directories</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>p@ssw0rd</value>
    <description>Password for the metadata store database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>User name for the metadata store database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true&amp;characterEncoding=utf8</value>
    <description>URL of the metadata store database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver for the metadata store database</description>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
    <description>Turn on auto-creation of the schema (not recommended in production)</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <!-- Hive transaction configuration: start -->
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>
  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>
  <property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
  </property>
  <property>
    <name>hive.in.test</name>
    <value>true</value>
  </property>
  <!-- Hive transaction configuration: end -->
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
</configuration>
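A common source of startup failures is a typo in hive-site.xml, such as an unescaped & in the JDBC URL or an unclosed tag. A quick well-formedness check (illustrative Python; the snippet below is a fragment of the configuration above, inlined so the check is self-contained):

```python
import xml.etree.ElementTree as ET

# A fragment like the ConnectionURL property above; note that the '&'
# in the URL must be written as '&amp;' inside XML.
snippet = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true&amp;characterEncoding=utf8</value>
  </property>
</configuration>"""

# ET.fromstring raises ParseError on malformed XML (e.g. a bare '&').
root = ET.fromstring(snippet)
props = {p.findtext("name"): p.findtext("value")
         for p in root.iter("property")}
print(props["javax.jdo.option.ConnectionURL"])
```

The same parse can be pointed at the real file with ET.parse("hive-site.xml") after editing it.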
Note: the MySQL JDBC driver jar must be copied into the lib directory of the Hive installation before starting; otherwise the following error is reported at startup
DataStoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH
The Hive version installed here is 2.3.0, which supports both Hive 1 and Hive 2; they are started differently, as follows
Hive 1
$ hive --service metastore   # run at first startup; the metastore service starts and binds automatically
$ hive
Hive 2
$ hiveserver2
$ beeline -u jdbc:hive2://   # run afterwards, in another session
After a successful startup, you can see that the database hive_metadata has been created (as configured in hive-site.xml via the javax.jdo.option.ConnectionURL parameter)
View the database hive_metadata
After Hive starts successfully, the following files are generated in Hive's temporary directory
$ cd /home/hadoop/hive/tmp/
Type hive (Hive 1) to enter the client and view all the databases; the default database is named default.