Alex's Hadoop Beginner Tutorial: Lesson 10, Hive Introduction


Installing Hive

Unlike many tutorials that open with concepts, I prefer to get things installed first and then introduce the concepts through examples. So let's install Hive.

First confirm that the appropriate yum repository is installed. If not, install the CDH yum repository as described in this tutorial: http://blog.csdn.net/nsrainbow/article/details/36629339


What Is Hive

Hive gives you a way to query data with SQL. But it's best not to use Hive for real-time queries: under the hood Hive translates each SQL statement into one or more MapReduce jobs, so it is very slow. The official documentation says Hive suits high-latency scenarios, and it is resource-hungry besides.

As a quick example, you can query like this:

hive> select * from h_employee;
OK
1	1	peter
2	2	paul
Time taken: 9.289 seconds, Fetched: 2 row(s)

This h_employee is not necessarily a real database table (in fact, as we'll see later, it is an external table mapped onto HBase).

metastore

Every table created in Hive is a metastore table. These tables do not actually store the data; they define a mapping between the real data and Hive, much like a table's meta information in a traditional database, hence the name metastore. For the actual storage, there are four table types you can define:

  • Internal table (the default)
  • Partitioned table
  • Bucketed table
  • External table
For example, here is a statement that creates an internal table:
CREATE TABLE worker(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

This statement creates an internal table named worker. Internal is the default type, so no storage mode needs to be written out. The data is stored with a comma as the field delimiter.
Data types supported in CREATE TABLE

Basic types:
tinyint / smallint / int / bigint
float / double
boolean
string

Complex types:
Array / Map / Struct

There is no date / datetime type.
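For instance, here is a minimal sketch of a table mixing basic and complex types (the table and column names are made up for illustration):

CREATE TABLE employee_profile(
    id INT,
    name STRING,
    skills ARRAY<STRING>,
    scores MAP<STRING, INT>,
    address STRUCT<city:STRING, zip:STRING>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY ';'
    MAP KEYS TERMINATED BY ':';

The three TERMINATED BY clauses tell Hive how to split fields, how to split the items inside an array or map, and how to split a map entry into key and value within each line of the data file.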

Where do the created tables live? Under /user/hive/warehouse. You can check their locations through hdfs:
$ hdfs dfs -ls /user/hive/warehouse
Found 11 items
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee2
drwxrwxrwt   - wlsuser  supergroup          0 2014-12-04 17:21 /user/hive/warehouse/h_employee_export
drwxrwxrwt   - root     supergroup          0 2014-08-18 09:20 /user/hive/warehouse/h_http_access_logs
drwxrwxrwt   - root     supergroup          0 2014-06-30 10:15 /user/hive/warehouse/hbase_apache_access_log
drwxrwxrwt   - username supergroup          0 2014-06-27 17:48 /user/hive/warehouse/hbase_table_1
drwxrwxrwt   - username supergroup          0 2014-06-30 09:21 /user/hive/warehouse/hbase_table_2
drwxrwxrwt   - username supergroup          0 2014-06-30 09:43 /user/hive/warehouse/hive_apache_accesslog
drwxrwxrwt   - root     supergroup          0 2014-12-02 15:12 /user/hive/warehouse/hive_employee

Each folder corresponds to one metastore table.

Hive Table Types

Using Internal Tables
CREATE TABLE workers(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

This statement creates an internal table called workers with comma as the delimiter; '\054' is the octal ASCII code for the comma.
We can run show tables; to see how many tables there are. Many Hive statements actually mimic MySQL, so when you don't know the syntax, the MySQL statement will usually just work. The odd one out is limit, which we'll get to later.
hive> show tables;
OK
h_employee
h_employee2
h_employee_export
h_http_access_logs
hive_employee
workers
Time taken: 0.371 seconds, Fetched: 6 row(s)
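For instance, describe also works much as it does in MySQL. The output below is a sketch of what to expect, not captured from the original session:

hive> describe workers;
OK
id                  	int
name                	string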


After creating the table, let's try inserting a few rows. Be aware that Hive does not support single-row insert statements; inserts must be done in bulk, so don't expect a statement like insert into workers values (1,'jack') to work. Hive supports two ways of getting data in:
  • loading data from a file
  • inserting rows read from another table (insert from select); see the sketch after this list
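A rough sketch of the second route, assuming a hypothetical target table workers_copy with the same schema as workers:

hive> CREATE TABLE workers_copy(id INT, name STRING)
    >     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
hive> INSERT OVERWRITE TABLE workers_copy
    >     SELECT id, name FROM workers;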
Here I'll take the read-from-file route. First create a file called workers.csv:
$ cat workers.csv
1,jack
2,terry
3,michael

Use LOAD DATA to import it into the Hive table:
hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;
Copying data from file:/home/alex/workers.csv
Copying file: file:/home/alex/workers.csv
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0]
OK
Time taken: 0.655 seconds

Note
  • Don't drop the LOCAL keyword: LOAD DATA LOCAL INPATH reads the source file from your local disk, while LOAD DATA INPATH reads it from HDFS (see the sketch after this list).
  • Adding OVERWRITE empties the table before the import, e.g. LOAD DATA LOCAL INPATH '/home/alex/workers.csv' OVERWRITE INTO TABLE workers;
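A minimal sketch of the HDFS variant (the /tmp/workers.csv path is just an illustration). One thing worth knowing: loading from HDFS moves the source file into the warehouse directory rather than copying it:

$ hdfs dfs -put /home/alex/workers.csv /tmp/workers.csv
hive> LOAD DATA INPATH '/tmp/workers.csv' INTO TABLE workers;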
Query the data:
hive> select * from workers;
OK
1	jack
2	terry
3	michael
Time taken: 0.177 seconds, Fetched: 3 row(s)

Let's see how the internal table stores the data after the import:
# hdfs dfs -ls /user/hive/warehouse/workers/
Found 1 items
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv

So it just copies the file in, untouched! That's how crude it is! As an experiment, let's add another file, workers2.txt (I deliberately changed the extension; Hive doesn't actually care about extensions):
# cat workers2.txt
4,peter
5,kate
6,ted

Import it:
hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers;
Copying data from file:/home/alex/workers2.txt
Copying file: file:/home/alex/workers2.txt
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 46, raw_data_size: 0]
OK
Time taken: 0.79 seconds

Check the storage layout again:
# hdfs dfs -ls /user/hive/warehouse/workers/
Found 2 items
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
-rwxrwxrwt   2 root supergroup         21 2014-12-08 15:29 /user/hive/warehouse/workers/workers2.txt

There is now an extra workers2.txt. Query with SQL again:
hive> select * from workers;
OK
1	jack
2	terry
3	michael
4	peter
5	kate
6	ted
Time taken: 0.144 seconds, Fetched: 6 row(s)

Partitioned Tables

Partitioned tables exist to speed up queries. Say you have a huge amount of data, but your use case is building daily reports from it: you can partition by day, and when you build the report for 2014-05-05 you only need to load that one day's data. Let's create a partitioned table and take a look:
create table partition_employee(id int, name string)
partitioned by(daytime string)
row format delimited fields TERMINATED BY '\054';

Notice that the partition key is declared separately; it is not one of the regular columns. Let's first create two test data files, one for each day:
# cat 2014-05-05
22,kitty
33,lily
# cat 2014-05-06
14,sami
45,micky

Import them into the partitioned table:
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05');
Copying data from file:/home/alex/2014-05-05
Copying file: file:/home/alex/2014-05-05
Loading data to table default.partition_employee partition (daytime=2014-05-05)
Partition default.partition_employee{daytime=2014-05-05} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
OK
Time taken: 1.154 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06');
Copying data from file:/home/alex/2014-05-06
Copying file: file:/home/alex/2014-05-06
Loading data to table default.partition_employee partition (daytime=2014-05-06)
Partition default.partition_employee{daytime=2014-05-06} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 42, raw_data_size: 0]
OK
Time taken: 0.763 seconds

At import time, the partition clause specifies the target partition.
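As an aside, show partitions lists the partitions a table has; the output below is a sketch of what to expect:

hive> show partitions partition_employee;
OK
daytime=2014-05-05
daytime=2014-05-06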
At query time, you can filter on the partition:
hive> select * from partition_employee where daytime='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
Time taken: 0.173 seconds, Fetched: 2 row(s)

There is nothing special about my query syntax; Hive decides on its own whether the where clause involves a partition column. You can even use comparison operators such as > and <:
hive> select * from partition_employee where daytime>='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
14	sami	2014-05-06
45	micky	2014-05-06
Time taken: 0.273 seconds, Fetched: 4 row(s)

Let's look at the storage layout:
# hdfs dfs -ls /user/hive/warehouse/partition_employee
Found 2 items
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-05
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-06

Now let's try a two-level partitioned table:
create table p_student(id int, name string)
partitioned by(daytime string, country string)
row format delimited fields TERMINATED BY '\054';

Prepare some data files:
# cat 2014-09-09-CN
1,tammy
2,eric
# cat 2014-09-10-CN
3,paul
4,jolly
# cat 2014-09-10-EN
44,ivan
66,billy

Import them into Hive:
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN');
Copying data from file:/home/alex/2014-09-09-CN
Copying file: file:/home/alex/2014-09-09-CN
Loading data to table default.p_student partition (daytime=2014-09-09, country=CN)
Partition default.p_student{daytime=2014-09-09, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
OK
Time taken: 0.736 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN');
Copying data from file:/home/alex/2014-09-10-CN
Copying file: file:/home/alex/2014-09-10-CN
Loading data to table default.p_student partition (daytime=2014-09-10, country=CN)
Partition default.p_student{daytime=2014-09-10, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 38, raw_data_size: 0]
OK
Time taken: 0.691 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN');
Copying data from file:/home/alex/2014-09-10-EN
Copying file: file:/home/alex/2014-09-10-EN
Loading data to table default.p_student partition (daytime=2014-09-10, country=EN)
Partition default.p_student{daytime=2014-09-10, country=EN} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 59, raw_data_size: 0]
OK
Time taken: 0.622 seconds

Check the storage layout:
# hdfs dfs -ls /user/hive/warehouse/p_student
Found 2 items
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-10
# hdfs dfs -ls /user/hive/warehouse/p_student/daytime=2014-09-09
Found 1 items
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN

Query the data:
hive> select * from p_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.228 seconds, Fetched: 6 row(s)
hive> select * from p_student where daytime='2014-09-10' and country='EN';
OK
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.224 seconds, Fetched: 2 row(s)

Bucketed Tables

A bucketed table hashes some column's value to scatter rows across different "buckets". Westerners have a habit of setting out several buckets with different labels when sorting things, which is how this table type got its name. Bucketed tables are used mainly for sampling and analysis.
The following example is copied straight from the official tutorial. Partitioning and bucketing can be combined, so this example uses both features at once:
CREATE TABLE b_student(id INT, name STRING)
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(id) SORTED BY(name) INTO 4 BUCKETS
row format delimited fields TERMINATED BY '\054';


This means rows are hashed on the id column to pick a bucket, and sorted by name within each bucket. I won't repeat the data preparation and import steps here (a sketch follows the query below); this is the data after import:
hive> select * from b_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
34	allen	2014-09-11	EN
Time taken: 0.727 seconds, Fetched: 5 row(s)
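For reference, a bucketed table is typically populated with an insert-from-select while bucketing enforcement is switched on. A minimal sketch, assuming the p_student table from earlier as the source:

hive> set hive.enforce.bucketing = true;
hive> INSERT OVERWRITE TABLE b_student PARTITION(dt='2014-09-09', country='CN')
    >     SELECT id, name FROM p_student WHERE daytime='2014-09-09' AND country='CN';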

Sample one bucket's worth of data out of the four buckets:
hive> select * from b_student tablesample(bucket 1 out of 4 on id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1406097234796_0041
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-08 17:35:56,995 Stage-1 map = 0%,  reduce = 0%
2014-12-08 17:36:06,783 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
2014-12-08 17:36:07,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_1406097234796_0041
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.9 sec   HDFS Read: 482 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
4	jolly	2014-09-10	CN
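In tablesample(bucket x out of y on id), y divides the buckets into groups and x picks which group, so you can also sample, say, half the table. A sketch, not from the original session:

hive> select * from b_student tablesample(bucket 1 out of 2 on id);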

External Tables

An external table is one whose storage is not managed by Hive itself; for example, the data can live in HBase, with Hive only providing a mapping. I'll use HBase as the example.
First create an HBase table called employee:
hbase(main):005:0> create 'employee','info'
0 row(s) in 0.4740 seconds

=> Hbase::Table - employee
hbase(main):006:0> put 'employee',1,'info:id',1
0 row(s) in 0.2080 seconds

hbase(main):008:0> scan 'employee'
ROW                 COLUMN+CELL
 1                  column=info:id, timestamp=1417591291730, value=1
1 row(s) in 0.0610 seconds

hbase(main):009:0> put 'employee',1,'info:name','peter'
0 row(s) in 0.0220 seconds

hbase(main):010:0> scan 'employee'
ROW                 COLUMN+CELL
 1                  column=info:id, timestamp=1417591291730, value=1
 1                  column=info:name, timestamp=1417591321072, value=peter
1 row(s) in 0.0450 seconds

hbase(main):011:0> put 'employee',2,'info:id',2
0 row(s) in 0.0370 seconds

hbase(main):012:0> put 'employee',2,'info:name','paul'
0 row(s) in 0.0180 seconds

hbase(main):013:0> scan 'employee'
ROW                 COLUMN+CELL
 1                  column=info:id, timestamp=1417591291730, value=1
 1                  column=info:name, timestamp=1417591321072, value=peter
 2                  column=info:id, timestamp=1417591500179, value=2
 2                  column=info:name, timestamp=1417591512075, value=paul
2 row(s) in 0.0440 seconds

Then create an external table in Hive that maps onto it:
hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name")
    > TBLPROPERTIES ("hbase.table.name" = "employee");
OK
Time taken: 0.324 seconds
hive> select * from h_employee;
OK
1	1	peter
2	2	paul
Time taken: 1.129 seconds, Fetched: 2 row(s)
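One property worth knowing, although it is not shown in the original session: dropping an external table removes only Hive's metadata, while the underlying data, here the HBase table, is left untouched:

hive> DROP TABLE h_employee;   -- removes the mapping only; the HBase 'employee' table survives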

Query Syntax

For the full syntax, see the official manual: https://cwiki.apache.org/confluence/display/Hive/Tutorial  I'll only point out a few oddities.

Limiting the number of rows

To show x rows, it's still limit, for example:
hive> select * from h_employee limit 1
    > ;
OK
1	1	peter
Time taken: 0.284 seconds, Fetched: 1 row(s)
But a starting point, i.e. an offset, is not supported.
Class dismissed!



