Hello World on Impala

Last updated: 2014-06-30 Source: Internet


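Cloudera's official Impala Tutorial explains some basic Impala operations, but its steps do not flow continuously from one to the next. This article takes selected examples from the Impala Tutorial and walks through one complete example from scratch: creating tables, loading data, and querying data. It is meant as an entry-level tutorial; by following the steps here you can say "Hello World" to Impala.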

This article assumes you already have a working Impala installation. For environment setup, see: Installing Hive, HBase, Impala, Spark, and other services on CDH5

Create the cloudera user and group

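The examples in the Impala Tutorial log in with the username cloudera, but the Cloudera Manager 5.0.2 installation does not automatically create a cloudera user on the host nodes (for example, h1.worker.com). To stay consistent with the examples in the Impala Tutorial, the cloudera user and group must be created manually.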

Log in to the host node (for example, h1.worker.com) as root and first check whether a cloudera user exists by running the following command:

# cat /etc/passwd | grep cloudera
cloudera-scm:x:496:493:Cloudera Manager:/var/run/cloudera-scm-server:/sbin/nologin

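The output above lists only the cloudera-scm account, so the cloudera user does not exist. If it already exists, skip the user-creation steps below.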
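One caveat with the grep check above is that grep matches substrings, which is why the cloudera-scm account shows up. As a minimal sketch of an exact-match check, the following Python compares only the username field of passwd-style lines. The function name `has_user` is hypothetical, and the sample text is the single line printed above.

```python
def has_user(passwd_text, username):
    """Return True if passwd-style text has an entry whose first field is username."""
    for line in passwd_text.splitlines():
        # Each passwd entry is colon-separated; the first field is the login name.
        if line.split(":", 1)[0] == username:
            return True
    return False


# Sample taken from the grep output shown in this article.
sample = "cloudera-scm:x:496:493:Cloudera Manager:/var/run/cloudera-scm-server:/sbin/nologin"
print(has_user(sample, "cloudera"))      # False
print(has_user(sample, "cloudera-scm"))  # True
```

On a real host the text would come from reading /etc/passwd; the sketch keeps the input as a string so that it can be tried anywhere.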

Create the cloudera user and group, and set the password to cloudera:

# groupadd cloudera
# useradd -g cloudera cloudera
# passwd cloudera
Changing password for user cloudera.
New password:
BAD PASSWORD: it is based on a dictionary word
Retype new password:
passwd: all authentication tokens updated successfully.

Create the /user/cloudera directory on HDFS

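We need to create the /user/cloudera directory on HDFS and change its owner to cloudera. Only the HDFS superuser has permission to do this. The HDFS superuser is the user that runs the NameNode process; loosely speaking, if you started the NameNode, you are the superuser. In an environment installed through Cloudera Manager 5, the superuser name is hdfs.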

Switch to the HDFS superuser and check whether the /user/cloudera directory exists; create it if it does not.

# su - hdfs
-bash-4.1$ hdfs dfs -ls /user
Found 7 items
drwx------   - hdfs   supergroup          0 2014-06-26 08:44 /user/hdfs
drwxrwxrwx   - mapred hadoop              0 2014-06-20 10:10 /user/history
drwxrwxr-t   - hive   hive                0 2014-06-20 10:13 /user/hive
drwxrwxr-x   - impala impala              0 2014-06-20 10:18 /user/impala
drwxrwxr-x   - oozie  oozie               0 2014-06-20 10:15 /user/oozie
drwxr-x--x   - spark  spark               0 2014-06-20 10:08 /user/spark
drwxrwxr-x   - sqoop2 sqoop               0 2014-06-20 10:16 /user/sqoop2
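The "check whether the directory exists" step can also be done on captured output rather than by eye. The following is a minimal sketch, assuming the listing has the eight-column layout shown above and that paths contain no spaces; the function name `listed_paths` is hypothetical.

```python
def listed_paths(ls_output):
    """Extract the path column from `hdfs dfs -ls` style output."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Entry lines start with a permission string such as drwxr-xr-x and have
        # eight whitespace-separated fields; the path is the last one.
        if len(fields) >= 8 and fields[0][0] in "d-":
            paths.append(fields[-1])
    return paths


# A shortened copy of the listing shown in this article.
listing = """Found 7 items
drwx------   - hdfs   supergroup          0 2014-06-26 08:44 /user/hdfs
drwxrwxrwx   - mapred hadoop              0 2014-06-20 10:10 /user/history
drwxrwxr-x   - impala impala              0 2014-06-20 10:18 /user/impala
"""
print("/user/cloudera" in listed_paths(listing))  # False
```

Because /user/cloudera is absent from the listing, the directory still has to be created.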

Create the /user/cloudera directory on HDFS and set its owner and group to cloudera:

-bash-4.1$ hdfs dfs -mkdir -p /user/cloudera
-bash-4.1$ hdfs dfs -chown cloudera:cloudera /user/cloudera
-bash-4.1$ hdfs dfs -ls /user
Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2014-06-26 09:05 /user/cloudera
drwx------   - hdfs     supergroup          0 2014-06-26 08:44 /user/hdfs
drwxrwxrwx   - mapred   hadoop              0 2014-06-20 10:10 /user/history
drwxrwxr-t   - hive     hive                0 2014-06-20 10:13 /user/hive
drwxrwxr-x   - impala   impala              0 2014-06-20 10:18 /user/impala
drwxrwxr-x   - oozie    oozie               0 2014-06-20 10:15 /user/oozie
drwxr-x--x   - spark    spark               0 2014-06-20 10:08 /user/spark
drwxrwxr-x   - sqoop2   sqoop               0 2014-06-20 10:16 /user/sqoop2

With the steps above complete, the prerequisites for running the Impala Tutorial examples are in place.

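Create HDFS directories for the table data

This section shows how to create some very small tables, suitable for first-time users experimenting with Impala SQL features. TAB1 and TAB2 load their data from files in HDFS. You can put any data you want to query into HDFS. To begin, create one or more subdirectories under your HDFS user directory. The data for each table lives in its own subdirectory. This example uses the -p option of mkdir so that any missing parent directories are created automatically.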

# su - cloudera
$ whoami
cloudera
$ hdfs dfs -ls /user
Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2014-06-26 09:05 /user/cloudera
drwx------   - hdfs     supergroup          0 2014-06-26 08:44 /user/hdfs
drwxrwxrwx   - mapred   hadoop              0 2014-06-20 10:10 /user/history
drwxrwxr-t   - hive     hive                0 2014-06-20 10:13 /user/hive
drwxrwxr-x   - impala   impala              0 2014-06-20 10:18 /user/impala
drwxrwxr-x   - oozie    oozie               0 2014-06-20 10:15 /user/oozie
drwxr-x--x   - spark    spark               0 2014-06-20 10:08 /user/spark
drwxrwxr-x   - sqoop2   sqoop               0 2014-06-20 10:16 /user/sqoop2
$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 /user/cloudera/sample_data/tab2

These steps create the directories that hold the data for tables TAB1 and TAB2.

Put the CSV files into the HDFS directories

Copy the following two .csv files to your local file system.

tab1.csv:

1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33

tab2.csv:

1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
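Instead of typing the two files by hand, they can be generated. The following is a minimal sketch that writes the rows listed above into a directory of your choice; the function name `write_sample_files` and the temporary target directory are assumptions made for illustration, and the article itself uses /home/cloudera/testdata.

```python
import os
import tempfile

TAB1_ROWS = [
    "1,true,123.123,2012-10-24 08:55:00",
    "2,false,1243.5,2012-10-25 13:40:00",
    "3,false,24453.325,2008-08-22 09:33:21.123",
    "4,false,243423.325,2007-05-12 22:32:21.33454",
    "5,true,243.325,1953-04-22 09:11:33",
]
TAB2_ROWS = [
    "1,true,12789.123",
    "2,false,1243.5",
    "3,false,24453.325",
    "4,false,2423.3254",
    "5,true,243.325",
    "60,false,243565423.325",
    "70,true,243.325",
    "80,false,243423.325",
    "90,true,243.325",
]


def write_sample_files(directory):
    """Write tab1.csv and tab2.csv into directory and return their paths."""
    paths = []
    for name, rows in (("tab1.csv", TAB1_ROWS), ("tab2.csv", TAB2_ROWS)):
        path = os.path.join(directory, name)
        # One row per line, with a trailing newline after the last row.
        with open(path, "w", newline="\n") as handle:
            handle.write("\n".join(rows) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical target: a fresh temporary directory.
    for written in write_sample_files(tempfile.mkdtemp(prefix="testdata_")):
        print(written)
```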

Run the following commands to put the two .csv files into their separate HDFS directories:

$ pwd
/home/cloudera/testdata
$ ll
total 8
-rw-rw-r--. 1 cloudera cloudera 193 Jun 27 08:33 tab1.csv
-rw-rw-r--. 1 cloudera cloudera 158 Jun 27 08:34 tab2.csv
$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
$ hdfs dfs -ls /user/cloudera/sample_data/tab1
Found 1 items
-rw-r--r--   3 cloudera cloudera        193 2014-06-27 08:35 /user/cloudera/sample_data/tab1/tab1.csv
$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
$ hdfs dfs -ls /user/cloudera/sample_data/tab2
Found 1 items
-rw-r--r--   3 cloudera cloudera        158 2014-06-27 08:36 /user/cloudera/sample_data/tab2/tab2.csv

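The name of each data file does not matter. In fact, when Impala first examines the contents of a data directory, it treats every file in the directory as a data file for the table, regardless of how many files there are or what they are named.
To find out which directories are available in your HDFS file system and what permissions the various directories and files have, run hdfs dfs -ls / and keep running -ls down the directory tree you see.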

Create the tables and load the data

Use the impala-shell command to create the tables, either interactively or through a SQL script.

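The following example creates three tables. The columns in each table use different data types, such as BOOLEAN or INT. The example also includes clauses that describe how the data is formatted, for example columns separated by commas, so that the data can be read from .csv files. Because the .csv files containing the data are already stored in the HDFS directory tree, we specify for the tables that read them the path of the directory containing the corresponding .csv file. Impala treats all the data in all the files under these directories as the table's data.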

The table_setup.sql file contains the following:

DROP TABLE IF EXISTS tab1;
-- The EXTERNAL clause means the data is located outside the central location for Impala data files
-- and is preserved when the associated Impala table is dropped. We expect the data to already
-- exist in the directory specified by the LOCATION clause.
CREATE EXTERNAL TABLE tab1
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';

DROP TABLE IF EXISTS tab2;
-- TAB2 is an external table, similar to TAB1.
CREATE EXTERNAL TABLE tab2
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2';

DROP TABLE IF EXISTS tab3;
-- Leaving out the EXTERNAL clause means the data will be managed
-- in the central Impala data directory tree. Rather than reading
-- existing data files when the table is created, we load the
-- data after creating the table.
CREATE TABLE tab3
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   month INT,
   day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
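To see how the four columns declared for tab1 line up with the comma-separated fields of tab1.csv, a local sanity check can help before loading. The following is a minimal sketch in Python; it is not how Impala itself parses the files, and the function name `parse_tab1_row` is hypothetical.

```python
from datetime import datetime


def parse_tab1_row(line):
    """Split one tab1.csv line into (id, col_1, col_2, col_3) Python values."""
    id_text, bool_text, double_text, ts_text = line.strip().split(",")
    # The sample timestamps appear both with and without fractional seconds.
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in ts_text else "%Y-%m-%d %H:%M:%S"
    return (
        int(id_text),                     # id INT
        bool_text == "true",              # col_1 BOOLEAN
        float(double_text),               # col_2 DOUBLE
        datetime.strptime(ts_text, fmt),  # col_3 TIMESTAMP
    )


print(parse_tab1_row("3,false,24453.325,2008-08-22 09:33:21.123"))
```

A row with the wrong number of fields raises ValueError at the unpacking step, which is a cheap way to catch a malformed file early.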

Run the table_setup.sql script with:
impala-shell -i 172.16.230.152 -f table_setup.sql
The session looks like this:

$ pwd
/home/cloudera/testdata
$ ll
total 12
-rw-rw-r--. 1 cloudera cloudera  193 Jun 27 08:33 tab1.csv
-rw-rw-r--. 1 cloudera cloudera  158 Jun 27 08:34 tab2.csv
-rw-rw-r--. 1 cloudera cloudera 1106 Jun 27 08:49 table_setup.sql
$ impala-shell -i 172.16.230.152 -f table_setup.sql
Starting Impala Shell without Kerberos authentication
Connected to 172.16.230.152:21000
Server version: impalad version 1.3.1-cdh5 RELEASE (build )
......
Returned 0 row(s) in 0.28s
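The impala-shell invocation above can also be driven from a script. The following is a minimal sketch that builds the argument list for the -i and -f options used in this article, plus impala-shell's -q option for passing a single query, which this article does not use. The function names are hypothetical, and actually running the command requires impala-shell on the PATH and a reachable impalad.

```python
import subprocess


def impala_shell_command(host, script=None, query=None):
    """Build the impala-shell argument list: -i host plus either -f script or -q query."""
    if (script is None) == (query is None):
        raise ValueError("pass exactly one of script or query")
    command = ["impala-shell", "-i", host]
    if script is not None:
        command += ["-f", script]
    else:
        command += ["-q", query]
    return command


def run_impala_shell(host, script=None, query=None):
    """Run impala-shell; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(impala_shell_command(host, script=script, query=query), check=True)


print(impala_shell_command("172.16.230.152", script="table_setup.sql"))
```

Passing the arguments as a list rather than one shell string avoids quoting problems when a query contains spaces or quotes.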

View the Impala table structure

Log in to impala-shell and run the following commands:
show tables;
describe tab1;
The session looks like this:

$ impala-shell -i 172.16.230.152
Starting Impala Shell without Kerberos authentication
Connected to 172.16.230.152:21000
Server version: impalad version 1.3.1-cdh5 RELEASE (build )
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.3.1-cdh5 () built on Mon Jun  9 09:30:26 PDT 2014)
[172.16.230.152:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| tab1 |
| tab2 |
| tab3 |
+------+
Returned 3 row(s) in 0.01s
[172.16.230.152:21000] > describe tab1;
Query: describe tab1
+-------+-----------+---------+
| name  | type      | comment |
+-------+-----------+---------+
| id    | int       |         |
| col_1 | boolean   |         |
| col_2 | double    |         |
| col_3 | timestamp |         |
+-------+-----------+---------+
Returned 4 row(s) in 6.85s
[172.16.230.152:21000] > quit;
Goodbye

Query the Impala tables

Log in to impala-shell and run the following SQL statements:

SELECT * FROM tab1;

SELECT * FROM tab2 LIMIT 5;

SELECT tab2.*
FROM tab2,
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
FROM tab2, tab1
WHERE tab1.id = tab2.id
GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;

The session looks like this:

$ impala-shell -i 172.16.230.152
Starting Impala Shell without Kerberos authentication
Connected to 172.16.230.152:21000
Server version: impalad version 1.3.1-cdh5 RELEASE (build )
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.3.1-cdh5 () built on Mon Jun  9 09:30:26 PDT 2014)
[172.16.230.152:21000] > SELECT * FROM tab1;
Query: select * FROM tab1
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Returned 5 row(s) in 2.39s
[172.16.230.152:21000] > SELECT * FROM tab2 LIMIT 5;
Query: select * FROM tab2 LIMIT 5
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
| 5  | true  | 243.325   |
+----+-------+-----------+
Returned 5 row(s) in 1.30s
[172.16.230.152:21000] > SELECT tab2.*
                       > FROM tab2,
                       > (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
                       >  FROM tab2, tab1
                       >  WHERE tab1.id = tab2.id
                       >  GROUP BY col_1) subquery1
                       > WHERE subquery1.max_col2 = tab2.col_2;
Query: select tab2.* FROM tab2, (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 FROM tab2, tab1 WHERE tab1.id = tab2.id GROUP BY col_1) subquery1 WHERE subquery1.max_col2 = tab2.col_2
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 3  | false | 24453.325 |
+----+-------+-----------+
Returned 2 row(s) in 1.02s
[172.16.230.152:21000] > quit;
Goodbye
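The third query joins tab1 and tab2 on id, takes the largest tab2.col_2 within each tab1.col_1 group, and returns the tab2 rows holding those maxima. As a minimal sketch for cross-checking that logic without a cluster, the same sample rows can be loaded into an in-memory SQLite database from Python. This is SQLite, not Impala: booleans are stored as 1 and 0, col_3 is kept as text, and the GROUP BY column is written as tab1.col_1 because SQLite rejects the unqualified col_1 as ambiguous. The function name `rows_with_group_max` is hypothetical.

```python
import sqlite3

TAB1 = [
    (1, 1, 123.123, "2012-10-24 08:55:00"),
    (2, 0, 1243.5, "2012-10-25 13:40:00"),
    (3, 0, 24453.325, "2008-08-22 09:33:21.123"),
    (4, 0, 243423.325, "2007-05-12 22:32:21.33454"),
    (5, 1, 243.325, "1953-04-22 09:11:33"),
]
TAB2 = [
    (1, 1, 12789.123), (2, 0, 1243.5), (3, 0, 24453.325),
    (4, 0, 2423.3254), (5, 1, 243.325), (60, 0, 243565423.325),
    (70, 1, 243.325), (80, 0, 243423.325), (90, 1, 243.325),
]
QUERY = """
SELECT tab2.*
FROM tab2,
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
 FROM tab2, tab1
 WHERE tab1.id = tab2.id
 GROUP BY tab1.col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2
"""


def rows_with_group_max():
    """Load the sample rows into SQLite and return the query result sorted by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tab1 (id INTEGER, col_1 INTEGER, col_2 REAL, col_3 TEXT)")
    conn.execute("CREATE TABLE tab2 (id INTEGER, col_1 INTEGER, col_2 REAL)")
    conn.executemany("INSERT INTO tab1 VALUES (?, ?, ?, ?)", TAB1)
    conn.executemany("INSERT INTO tab2 VALUES (?, ?, ?)", TAB2)
    rows = sorted(conn.execute(QUERY).fetchall())
    conn.close()
    return rows


print(rows_with_group_max())  # [(1, 1, 12789.123), (3, 0, 24453.325)]
```

The two rows, ids 1 and 3, agree with the impala-shell output above.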

Conclusion:

This article walked through a basic example of using Impala as an introductory guide. For more examples, see: Impala Tutorial

This article used impala-shell in several ways; for details, see Using the Impala Shell (impala-shell Command)

Original work. When reposting, please credit the source: http://blog.csdn.net/yangzhaohui168/article/details/35340387
