First, Hive Overview and Architecture
1. What is Hive?
(1) Open-sourced by Facebook, originally built to compute statistics over massive structured log data
(2) A data warehouse built on top of Hadoop
(3) Hive defines a SQL-like query language, HQL (very similar to MySQL's SQL, with its own extensions)
(4) Typically used for offline data processing (with MapReduce)
(5) Can be thought of as an HQL-to-MapReduce language translator
(6) The underlying layer supports a number of different execution engines (MapReduce by default; Tez and Spark are also supported)
(7) Supports different compression formats, storage formats, and user-defined functions
(8) A database or table in Hive is a directory/folder in HDFS and its data is stored as files there; the metadata can be kept in any relational database (for example MySQL, SQL Server, or Oracle; Derby by default), while the data itself is stored in HDFS
2. The architecture of Hive
The image above is a Hive architecture diagram; it consists of the following parts.
User interface: Client
CLI (command-line shell): the hive command line is the Hive 1 client; HiveServer2 provides the newer Beeline client
JDBC/ODBC: access Hive from Java in the traditional database JDBC style, just as we access MySQL from Java via JDBC
WebUI: access Hive from a browser
Metadata: Metastore
Hive stores its metadata in a database (the Metastore), which can be any relational database. The metadata includes: the table name, the database the table belongs to, the table's owner, column/partition fields, the table type, and the table's data directory.
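To make this concrete, here is a minimal sketch (illustrative Python using sqlite3; the table and column names are simplified stand-ins, not the metastore's real schema) of the kind of relational rows a metastore holds:

```python
import sqlite3

# In-memory stand-in for the metastore database (MySQL in our setup).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified tables loosely inspired by a metastore layout; the real
# Hive metastore schema is considerably more involved.
cur.execute("""CREATE TABLE tbls (
    tbl_name TEXT, db_name TEXT, owner TEXT,
    tbl_type TEXT, location TEXT)""")
cur.execute("CREATE TABLE cols (tbl_name TEXT, col_name TEXT, col_type TEXT)")

# Registering one Hive table produces metadata rows like these
# (the access_log table and its columns are made-up examples):
cur.execute("INSERT INTO tbls VALUES (?,?,?,?,?)",
            ("access_log", "default", "hadoop",
             "MANAGED_TABLE", "hdfs:///user/hive/warehouse/access_log"))
cur.executemany("INSERT INTO cols VALUES (?,?,?)",
                [("access_log", "ip", "string"),
                 ("access_log", "ts", "bigint")])

# A query like DESCRIBE access_log is answered from the metastore alone,
# without touching the data files in HDFS.
row = cur.execute(
    "SELECT location FROM tbls WHERE tbl_name='access_log'").fetchone()
print(row[0])  # hdfs:///user/hive/warehouse/access_log
```

This is also why the metadata lives in MySQL while the data stays in HDFS: metadata rows are small and updated frequently, which suits a relational database.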
Driver
Includes: parser, compiler, optimizer, executor
Takes an HQL query through lexical analysis, parsing, compilation, optimization, and query-plan generation; the resulting query plan is stored in HDFS and subsequently executed by calling MapReduce.
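The "HQL-to-MapReduce translator" idea can be sketched in a few lines (illustrative Python; the sample rows and the status column are made up). A query such as SELECT status, COUNT(*) FROM logs GROUP BY status compiles to a map phase that emits (status, 1) pairs and a reduce phase that sums them:

```python
from collections import defaultdict

# Made-up input rows, standing in for lines of a log table in HDFS.
logs = [{"status": 200}, {"status": 404}, {"status": 200}, {"status": 500}]

# Map phase: emit one (key, 1) pair per row, as the generated MR job would.
mapped = [(row["status"], 1) for row in logs]

# Shuffle + reduce phase: group the pairs by key and sum the counts.
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value

print(dict(counts))  # {200: 2, 404: 1, 500: 1}
```

A real Hive plan adds splits, serialization, and shuffle across the cluster, but the map/reduce shape of a GROUP BY is exactly this.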
Hadoop
Uses HDFS for data storage, runs on YARN, and computes with MapReduce
3. Hive deployment architecture
1. Test environment
Hive is a data warehouse built on top of Hadoop. In practice Hadoop is invariably a cluster, but Hive itself only needs to be installed on a single server. Metadata is stored in Derby by default, but Derby is single-user and inconvenient to use, so we recommend installing a MySQL database dedicated to storing Hive's metadata.
2. Production environment
The production environment differs from the test environment in that MySQL runs with a primary and a standby: when one MySQL service goes down, a failover mechanism automatically switches to the other. The benefit is fault tolerance: it ensures high availability and prevents the loss of Hive metadata.
4. The difference from a relational database
Advantages and disadvantages of Hive
Advantages:
Easy to learn and simple to use; much simpler than programming with MapReduce directly
Good for offline data processing, such as log analysis and analysis of massive structured data
Built on Hadoop, so it is easy to scale; supports user-defined functions (UDFs)
Disadvantages:
Hive's execution latency is high, so it is not suitable for real-time queries
Hive's advantage lies in handling big data sets; it has no advantage for small data sets
Second, Environment Building
Metadata in Hive is stored in Derby by default. Derby is single-user and inconvenient, so we use MySQL to store Hive's metadata instead. The metadata includes the table name, the table's columns and partitions and their attributes, table properties (whether it is an external table, etc.), the table's data directory, and so on. Since Hive metadata must be updated and modified constantly, while files in HDFS are written once and read many times, Hive metadata clearly cannot be stored in HDFS.
1. Install MySQL
Installing MySQL directly with yum -y install will not get you a recent version of MySQL; you first need to install the RPM package that provides a yum repository for the MySQL 5.7-series Community Edition. Download it from the MySQL website:
MySQL official website: https://dev.mysql.com/downloads/file/?id=470281
MySQL Installation Tutorial official website: https://dev.mysql.com/doc/mysql-yum-repo-quick-guide/en/
After downloading, use Xshell to upload to Linux, and enter the following command to install
$ sudo rpm -Uvh mysql57-community-release-el7-11.noarch.rpm
After the installation is complete, review the list of MySQL versions the repository provides.
To install MySQL, it is enough to install mysql-community-server; yum automatically detects and installs the dependent packages
$ sudo yum -y install mysql-community-server
After the installation is complete, start MySQL
$ sudo systemctl start mysqld
After MySQL is installed, an initial password is generated; view it with the following command
$ sudo grep 'temporary password' /var/log/mysqld.log
Sign in with the password you just looked up
$ mysql -uroot -p
Change Password
Note that this version of MySQL enforces stronger password security: the password must contain at least one digit, one uppercase letter, one lowercase letter, and one special character, and must be at least 8 characters long.
ALTER USER 'root'@'localhost' IDENTIFIED BY 'your new password';
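Before running ALTER USER, a candidate password can be sanity-checked against the rules described above (illustrative Python; this mimics the documented policy, it is not MySQL's actual validate_password plugin):

```python
import re

def meets_policy(pw: str) -> bool:
    """At least 8 chars, with a digit, an uppercase letter,
    a lowercase letter, and a special character."""
    return (len(pw) >= 8
            and re.search(r"[0-9]", pw) is not None
            and re.search(r"[A-Z]", pw) is not None
            and re.search(r"[a-z]", pw) is not None
            and re.search(r"[^0-9A-Za-z]", pw) is not None)

print(meets_policy("P@ssw0rd"))  # True
print(meets_policy("password"))  # False: no digit, uppercase, or special char
```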
After the change is complete, type exit to leave the MySQL shell, then log in again with the new password.
2. Install Hive
Download Hive's installation package from the official website, address: http://apache.fayea.com/hive/
The version downloaded here is 2.3.0; after downloading, upload it to Linux and extract it with the following command
$ tar -zxvf apache-hive-2.3.0-bin.tar.gz
Configure the hive-site.xml file
Go into Hive's conf directory and copy hive-default.xml.template to hive-site.xml
$ cd /home/hadoop/hive/conf
$ cp hive-default.xml.template hive-site.xml
Create a temporary directory for Hive
$ mkdir /home/hadoop/hive/tmp
hive-site.xml is configured as follows; refer to the Hive website for more information
<configuration>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/hadoop/hive/tmp/hive</value>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/home/hadoop/hive/tmp/${hive.session.id}_resources</value>
    <description>Directory where Hive stores downloaded resources</description>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/home/hadoop/hive/tmp/hive</value>
    <description>Directory where Hive stores structured logs</description>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/home/hadoop/hive/tmp/hive/operation_logs</value>
    <description>Directory for Hive operation logs</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
    <description>Directory where Hive stores the execution plans for the different map/reduce stages and their intermediate output</description>
  </property>
  <property>
    <name>hive.scratch.dir.permission</name>
    <value>700</value>
    <description>Permissions Hive uses when accessing Hadoop file directories</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>p@ssw0rd</value>
    <description>Password for the metadata store database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>User name for the metadata store database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true&amp;characterEncoding=utf8</value>
    <description>URL of the metadata store database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver for the metadata store database</description>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
    <description>Turn on auto-creation of the schema (not recommended in production)</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <!-- Hive transaction configuration: start -->
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>
  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>
  <property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
  </property>
  <property>
    <name>hive.in.test</name>
    <value>true</value>
  </property>
  <!-- Hive transaction configuration: end -->
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
</configuration>
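A common source of startup failures is a typo in hive-site.xml, such as an unescaped & in the JDBC URL or an unclosed tag. A quick well-formedness check (illustrative Python; the snippet below is a fragment of the configuration above, inlined so the check is self-contained):

```python
import xml.etree.ElementTree as ET

# A fragment like the ConnectionURL property above; note that the '&'
# in the URL must be written as '&amp;' inside XML.
snippet = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true&amp;characterEncoding=utf8</value>
  </property>
</configuration>"""

# ET.fromstring raises ParseError on malformed XML (e.g. a bare '&').
root = ET.fromstring(snippet)
props = {p.findtext("name"): p.findtext("value")
         for p in root.iter("property")}
print(props["javax.jdo.option.ConnectionURL"])
```

The same parse can be pointed at the real file with ET.parse("hive-site.xml") after editing it.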
Note: the MySQL JDBC driver jar must be copied into the lib directory of the Hive installation before starting; otherwise the following error is reported at startup
DataStoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH
The Hive version installed here is 2.3.0, which supports both Hive 1 and Hive 2; they are started differently, as follows
Hive 1
$ hive --service metastore   # run at first startup; the metastore service starts and binds automatically
$ hive
Hive 2
$ hiveserver2
$ beeline -u jdbc:hive2://   # run afterwards, in another session
After a successful startup, you can see that the database hive_metadata has been created (as configured in hive-site.xml via the javax.jdo.option.ConnectionURL parameter)
View the database hive_metadata
After Hive starts successfully, the following files are generated in Hive's temporary directory
$ cd /home/hadoop/hive/tmp/
Type hive (Hive 1) to enter the client and view all the databases; the default database is named default.