DDL operations
Create a table
hive> CREATE TABLE pokes (foo INT, bar STRING);
Create a table partitioned by the column ds
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Show all tables
hive> SHOW TABLES;
Show tables whose names match a regular expression
hive> SHOW TABLES '.*s';
Add a column to the table
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
Add a column with a column comment
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
Change table name
hive> ALTER TABLE events RENAME TO 3koobecaf;
Drop a table
hive> DROP TABLE pokes;
DML operations
Load the data in the file into the table
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Load local data and specify partition information
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2017-08-15');
Load HDFS data and specify partition information
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2017-08-15');
The command above loads data from an HDFS file or directory into the table. Note that loading from HDFS moves the file or directory rather than copying it, so the operation is almost instantaneous.
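Because the load is a move, the source path no longer contains the file afterwards. As a quick check (a minimal sketch; Hive's dfs command runs the equivalent hadoop fs command, and the path is the one from the example above):
hive> dfs -ls /user/myname/;
hive> SHOW PARTITIONS invites;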
SQL operations
Query a partition
hive> SELECT a.foo FROM invites a WHERE a.ds = '<date>';
Output query data to a directory
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds = '<date>';
Output query results to local directory
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes;
More examples: insert query results into tables, HDFS directories, and local directories
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds = '<date>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
Insert aggregated results from one table into another table
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, COUNT(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, COUNT(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
Join
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
Insert data from one table into multiple tables and directories in a single query
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION (ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
Stream data through an external script
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2017-08-09';
This streams the data in the map phase through the script /bin/cat (much like Hadoop streaming). Streaming can likewise be used on the reduce side (see the Hive tutorial for examples).
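As a minimal sketch of reduce-side streaming, adapted from the Hive tutorial (the trivial script /bin/cat is reused here, and the events table is assumed to accept the two output columns):
FROM (
  FROM invites a
  MAP a.foo, a.bar
  USING '/bin/cat'
  AS oof, rab
  CLUSTER BY oof) m
INSERT OVERWRITE TABLE events
REDUCE m.oof, m.rab
USING '/bin/cat'
AS oof, rab;
CLUSTER BY sends all rows with the same key to the same reduce-side invocation of the script.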
Actual Example
Create a table
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Download the sample data file and decompress it.
wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
tar xvzf ml-data.tar__0.gz
Load data to a table
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
Count the total number of rows
SELECT COUNT(1) FROM u_data;
Perform some complex data analysis
Create a file named weekday_mapper.py, which converts each record's unix timestamp into a day of the week.
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # derive the day of the week from the unix timestamp
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
Use the mapping script
-- create a table whose fields are delimited by tabs
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
-- make the Python script available to the query (it is shipped to the task nodes)
ADD FILE weekday_mapper.py;
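To confirm the script was registered (optional; LIST FILES simply prints the resources added to the current session):
LIST FILES;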
Split data by week
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;
Processing Apache weblog data
Parse Apache web logs with a regular-expression SerDe so that the required fields are extracted into table columns.
ADD JAR ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
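Once a log file has been loaded into apachelog (not shown above), a simple aggregation such as the following sketch summarizes requests per HTTP status code:
SELECT status, COUNT(1) AS requests
FROM apachelog
GROUP BY status;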