1, About Hive
Hive is a Hadoop-based data warehouse platform that makes ETL work straightforward. It defines a SQL-like query language, HQL, and translates user-written HQL statements into MapReduce jobs that run on Hadoop.
Hive is a data warehouse framework that Facebook open-sourced in August 2008. Its goals are similar to Pig's, but it offers mechanisms that Pig does not currently support, such as a richer type system, a query language closer to SQL, and persistent table/partition metadata.
Original article: http://blog.csdn.net/freewebsys/article/details/47617975 (do not reprint without the author's permission).
Home page:
http://hive.apache.org/
2, Installation
First, install Hadoop:
https://hadoop.apache.org/
Download the tar.gz archive and extract it directly. The latest version at the time of writing is 2.7.1.
tar -zxvf hadoop-2.7.1.tar.gz
mv hadoop-2.7.1 hadoop
Then download Hive:
http://hive.apache.org/downloads.html
It can also be extracted directly. The latest version at the time of writing is 1.2.1.
tar -zxvf apache-hive-1.2.1-bin.tar.gz
mv apache-hive-1.2.1-bin apache-hive
Set the environment variables:
export JAVA_HOME=/usr/java/default
export CLASS_PATH=$JAVA_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/data/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
export HIVE_HOME=/data/apache-hive
export PATH=$HIVE_HOME/bin:$PATH
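Before starting Hive, it can help to confirm that the exported variables took effect in the current shell. The following is a minimal sketch and not part of the original setup: the helper name check_on_path is hypothetical, and it only checks that each command can be found through PATH.

```shell
#!/bin/sh
# check_on_path is a hypothetical helper (not from the original post).
# For each command name given, report whether the shell can find it on PATH.
check_on_path() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found at $(command -v "$tool")"
    else
      echo "$tool: NOT found on PATH"
    fi
  done
}

# The three tools that the exported variables are meant to expose.
check_on_path java hadoop hive
```

If all three are found, running `hadoop version` and `hive --version` should then print the installed versions.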
3, Start Hive and create a table
Official Hive wiki: https://cwiki.apache.org/confluence/display/Hive/Home
With the environment variables configured, Hive can be started directly. This is a local setup: Hive depends only on Hadoop and needs only the Hadoop environment variables.
Create a table; the syntax is very similar to MySQL.
References:
http://www.uml.org.cn/yunjisuan/201409235.asp
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
# hive
Logging initialized using configuration in jar:file:/data/apache-hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 1.284 seconds, Fetched: 1 row(s)
hive> use default;
OK
Time taken: 0.064 seconds
hive> show tables;
OK
Time taken: 0.051 seconds
hive> CREATE TABLE user_info (uid INT, name STRING)
    > PARTITIONED BY (create_date STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 0.09 seconds
You may run into problems when creating a table in Apache Hive:
FAILED: ParseException line 5:2 Failed to recognize predicate 'date'. Failed rule: 'identifier' in column specification
This error indicates a reserved-keyword conflict: reserved words such as date and user cannot be used directly as column names.
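If a column really has to be called date or user, Hive accepts reserved words as identifiers when they are quoted with backticks. The sketch below writes such a statement to a script file; the table name event_log and the file name create_event_log.hql are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Write a small HiveQL script. The quoted 'EOF' keeps the backticks literal,
# so the shell does not try to run them as command substitution.
# event_log is a hypothetical table name used only for illustration.
cat > create_event_log.hql <<'EOF'
-- user and date are reserved words, so they are quoted with backticks.
CREATE TABLE IF NOT EXISTS event_log (
  uid INT,
  `user` STRING,
  `date` STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
EOF

# Run it with the Hive CLI once Hive is on PATH:
#   hive -f create_event_log.hql
```

Renaming the column, as the user_info table does with create_date, avoids the quoting altogether and is usually the simpler choice.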
When the table's storage format is SEQUENCEFILE and plain-text data is loaded into it, Hive reports that the file format is wrong:
Failed with exception Wrong file format. Please check the file's format.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
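A commonly used way around this error is to load the text file into a TEXTFILE table first and then copy the rows into the SEQUENCEFILE table with INSERT ... SELECT, so that Hive performs the format conversion. This is a sketch of that approach rather than something taken from the original post: it assumes the text-format user_info table already holds data for the partition being copied, and the target table name user_info_seq is hypothetical.

```shell
#!/bin/sh
# Write the two statements to a script file; nothing is executed here.
cat > copy_to_sequencefile.hql <<'EOF'
-- Target table stored as SequenceFile (user_info_seq is a hypothetical name).
CREATE TABLE IF NOT EXISTS user_info_seq (uid INT, name STRING)
PARTITIONED BY (create_date STRING)
STORED AS SEQUENCEFILE;

-- Copy one partition from the text-format table. Hive rewrites the rows
-- in the storage format of the target table.
INSERT OVERWRITE TABLE user_info_seq PARTITION (create_date='20150801')
SELECT uid, name FROM user_info WHERE create_date='20150801';
EOF

# Run it with the Hive CLI once the source partition has been loaded:
#   hive -f copy_to_sequencefile.hql
```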
4, Import data
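Hive is not designed for row-at-a-time INSERT or UPDATE statements (newer releases add them only with restrictions, such as transactional tables for UPDATE). Data is normally loaded into an existing table in bulk with the LOAD DATA statement.
Once imported, the data should be treated as immutable, because HDFS, Hadoop's file system, is designed for write-once, read-many access.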
Create two data files:
/data/user_info_data1.txt
121,zhangsan1
122,zhangsan2
123,zhangsan3
/data/user_info_data2.txt
124,zhangsan4
125,zhangsan5
126,zhangsan6
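The two files can be created by hand or with a short script such as the sketch below. DATA_DIR is an assumption introduced here so the script also runs without write access to /data; set DATA_DIR=/data to reproduce the exact paths used in this article.

```shell
#!/bin/sh
# DATA_DIR is an assumption for this sketch; set DATA_DIR=/data to match
# the paths used in this article (that usually needs write access to /data).
DATA_DIR="${DATA_DIR:-/tmp/hive_demo}"
mkdir -p "$DATA_DIR"

# One record per line, fields separated by a comma (uid,name).
cat > "$DATA_DIR/user_info_data1.txt" <<'EOF'
121,zhangsan1
122,zhangsan2
123,zhangsan3
EOF

cat > "$DATA_DIR/user_info_data2.txt" <<'EOF'
124,zhangsan4
125,zhangsan5
126,zhangsan6
EOF

# Show the size of each file in bytes.
wc -c "$DATA_DIR"/user_info_data1.txt "$DATA_DIR"/user_info_data2.txt
```

Each file is 42 bytes (three 14-byte lines, newline included), which matches the total size of 42 that Hive reports for each partition in the load output.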
Data import: load the two files into two separate partitions.
hive> LOAD DATA LOCAL INPATH '/data/user_info_data1.txt' OVERWRITE INTO TABLE user_info PARTITION (create_date='20150801');
Loading data to table default.user_info partition (create_date=20150801)
Partition default.user_info{create_date=20150801} stats: [numFiles=1, numRows=0, totalSize=42, rawDataSize=0]
OK
Time taken: 0.762 seconds
hive> LOAD DATA LOCAL INPATH '/data/user_info_data2.txt' OVERWRITE INTO TABLE user_info PARTITION (create_date='20150802');
Loading data to table default.user_info partition (create_date=20150802)
Partition default.user_info{create_date=20150802} stats: [numFiles=1, numRows=0, totalSize=42, rawDataSize=0]
OK
Time taken: 0.403 seconds
5, Query
The data can now be queried directly.
hive> select * from user_info where create_date=20150801;
OK
121 zhangsan1 20150801
122 zhangsan2 20150801
123 zhangsan3 20150801
Time taken: 0.099 seconds, Fetched: 3 row(s)
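Because create_date is the partition column, filtering on it lets Hive read only the matching partition instead of the whole table. The same column is also convenient for simple statistics. A minimal sketch, assuming the user_info table and the two partitions loaded earlier; the file name count_by_date.hql is arbitrary:

```shell
#!/bin/sh
# Write an aggregate query to a script file; nothing is executed here.
cat > count_by_date.hql <<'EOF'
-- Count the rows in each partition of user_info.
SELECT create_date, COUNT(*) AS user_count
FROM user_info
GROUP BY create_date;
EOF

# Run it with the Hive CLI once the data has been loaded:
#   hive -f count_by_date.hql
```

With the sample files from this article, each of the two partitions should contain three rows.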
For more query functions, see the reference for Hive's built-in functions and user-defined functions (UDFs):
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
6, Summary
Hive is very convenient for offline data statistics, since the data is not modified once it has been loaded.
Hive's syntax is very similar to MySQL's, and because queries run as Hadoop jobs, statistics and multi-table joins over large data sets can be spread across the cluster instead of being limited by a single machine.
One small problem remains unresolved for now: data can only be imported directly as TEXTFILE, not as a compressed file type.
This problem is described in more detail at:
http://blog.163.com/[email protected]/blog/static/6797953420128118227663/
Hadoop (1): Installing Hadoop & Hive on CentOS