Hello world on Impala

Source: Internet
Author: User
Tags: hdfs dfs, sqoop

The official Cloudera Impala tutorial explains some basic Impala operations, but the steps before and after each operation lack coherence. This article selects some examples from the Impala tutorial and provides a complete walkthrough from scratch: creating tables, loading data, and querying data. It is an entry-level guide, a "Hello World" for Impala.

This article assumes that you have already installed the Impala environment. For environment setup, see "Install Hive, HBase, Impala, Spark, and other services on CDH5".

Create the cloudera user and group

The login username in the Impala tutorial is cloudera, but Cloudera Manager 5.0.2 does not automatically create a cloudera user on the host node (for example, h1.worker.com). To be consistent with the examples in the Impala tutorial, you need to create the cloudera user and group manually.

Log on to the host node (for example, h1.worker.com) as the root user, and first check whether the cloudera user exists by executing the following command:

[[email protected] home]# cat /etc/passwd | grep cloudera
cloudera-scm:x:496:493:Cloudera Manager:/var/run/cloudera-scm-server:/sbin/nologin
The output shows only the cloudera-scm service account, so the cloudera user does not exist. If it already exists, you do not need to follow the steps below to create it.
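If you prefer a scriptable check, the grep above can be wrapped in a small function. This sketch is my addition, not part of the Impala tutorial; getent queries the same passwd database:

```shell
# Return success (exit 0) if the named user exists in the passwd database.
user_exists() {
  getent passwd "$1" > /dev/null
}

if user_exists cloudera; then
  echo "cloudera user already exists; skip creation"
else
  echo "cloudera user not found; create it as shown below"
fi
```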

Create a cloudera user and group, and set the password to cloudera:

[[email protected] home]# groupadd cloudera
[[email protected] home]# useradd -g cloudera cloudera
[[email protected] home]# passwd cloudera
Changing password for user cloudera.
New password:
BAD PASSWORD: it is based on a dictionary word
Retype new password:
passwd: all authentication tokens updated successfully.

Create a /user/cloudera folder on HDFS

We need to create a new /user/cloudera folder on HDFS and change its owner to cloudera, which requires HDFS superuser permission. The HDFS superuser is the user that runs the NameNode process; broadly speaking, whoever starts the NameNode is a superuser. In an environment installed through Cloudera Manager 5, the superuser name is hdfs.

Switch to the HDFS superuser and check whether the /user/cloudera folder exists. If not, create it.

[[email protected] home]# su - hdfs
-bash-4.1$ hdfs dfs -ls /user
Found 7 items
drwx------   - hdfs   supergroup          0 2014-06-26 08:44 /user/hdfs
drwxrwxrwx   - mapred hadoop              0 2014-06-20 10:10 /user/history
drwxrwxr-t   - hive   hive                0 2014-06-20 10:13 /user/hive
drwxrwxr-x   - impala impala              0 2014-06-20 10:18 /user/impala
drwxrwxr-x   - oozie  oozie               0 2014-06-20 10:15 /user/oozie
drwxr-x--x   - spark  spark               0 2014-06-20 10:08 /user/spark
drwxrwxr-x   - sqoop2 sqoop               0 2014-06-20 10:16 /user/sqoop2
Create the /user/cloudera directory on HDFS and set the directory owner and group to cloudera.
-bash-4.1$ hdfs dfs -mkdir -p /user/cloudera
-bash-4.1$ hdfs dfs -chown cloudera:cloudera /user/cloudera
-bash-4.1$ hdfs dfs -ls /user
Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2014-06-26 09:05 /user/cloudera
drwx------   - hdfs     supergroup          0 2014-06-26 08:44 /user/hdfs
drwxrwxrwx   - mapred   hadoop              0 2014-06-20 10:10 /user/history
drwxrwxr-t   - hive     hive                0 2014-06-20 10:13 /user/hive
drwxrwxr-x   - impala   impala              0 2014-06-20 10:18 /user/impala
drwxrwxr-x   - oozie    oozie               0 2014-06-20 10:15 /user/oozie
drwxr-x--x   - spark    spark               0 2014-06-20 10:08 /user/spark
drwxrwxr-x   - sqoop2   sqoop               0 2014-06-20 10:16 /user/sqoop2

The above operations satisfy the prerequisites for running the Impala tutorial examples.

This section describes how to create directories on HDFS for loading table data; it is suitable for first-time users experimenting with Impala SQL. The tables tab1 and tab2 load their data from HDFS files, so put the data you want to query into HDFS. To start, create one or more subdirectories under your HDFS user directory, with each table's data stored in its own subdirectory. This example uses the -p option of mkdir, which automatically creates any missing parent directories.
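The -p behavior is the same as for the local mkdir command. A quick local demonstration (using an arbitrary /tmp path rather than HDFS):

```shell
# -p creates any missing parent directories and succeeds if they already exist,
# so both subdirectories can be created in a single command.
mkdir -p /tmp/sample_data/tab1 /tmp/sample_data/tab2
ls /tmp/sample_data
```

Without -p, mkdir would fail here because the parent directory /tmp/sample_data does not exist yet.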

[[email protected] ~]# su - cloudera
[[email protected] ~]$ whoami
cloudera
[[email protected] ~]$ hdfs dfs -ls /user
Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2014-06-26 09:05 /user/cloudera
drwx------   - hdfs     supergroup          0 2014-06-26 08:44 /user/hdfs
drwxrwxrwx   - mapred   hadoop              0 2014-06-20 10:10 /user/history
drwxrwxr-t   - hive     hive                0 2014-06-20 10:13 /user/hive
drwxrwxr-x   - impala   impala              0 2014-06-20 10:18 /user/impala
drwxrwxr-x   - oozie    oozie               0 2014-06-20 10:15 /user/oozie
drwxr-x--x   - spark    spark               0 2014-06-20 10:08 /user/spark
drwxrwxr-x   - sqoop2   sqoop               0 2014-06-20 10:16 /user/sqoop2
[[email protected] ~]$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 /user/cloudera/sample_data/tab2
[[email protected] ~]$

The preceding operations create the directories that will hold the tab1 and tab2 table data.

Save the CSV data files to the local file system.

tab1.csv:

1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
tab2.csv:

1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
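Any editor works for creating these files; as one possibility (an assumption on my part, not from the tutorial), they can be written with here-documents into a ~/testdata working directory matching the one used in the next step:

```shell
# Create the two sample data files in a local working directory.
mkdir -p "$HOME/testdata"
cat > "$HOME/testdata/tab1.csv" <<'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF
cat > "$HOME/testdata/tab2.csv" <<'EOF'
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
EOF
# Quick sanity check: 5 rows in tab1.csv, 9 rows in tab2.csv.
wc -l "$HOME/testdata/tab1.csv" "$HOME/testdata/tab2.csv"
```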

Run the following commands to put the two .csv files into their separate HDFS directories:

[[email protected] testdata]$ pwd
/home/cloudera/testdata
[[email protected] testdata]$ ll
total 8
-rw-rw-r--. 1 cloudera cloudera 193 Jun 27 08:33 tab1.csv
-rw-rw-r--. 1 cloudera cloudera 158 Jun 27 08:34 tab2.csv
[[email protected] testdata]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
[[email protected] testdata]$ hdfs dfs -ls /user/cloudera/sample_data/tab1
Found 1 items
-rw-r--r--   3 cloudera cloudera        193 2014-06-27 08:35 /user/cloudera/sample_data/tab1/tab1.csv
[[email protected] testdata]$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
[[email protected] testdata]$ hdfs dfs -ls /user/cloudera/sample_data/tab2
Found 1 items
-rw-r--r--   3 cloudera cloudera        158 2014-06-27 08:36 /user/cloudera/sample_data/tab2/tab2.csv
[[email protected] testdata]$
The names of the data files are not important. When Impala first examines the contents of a data directory, it treats all files in the directory as data files for the table, regardless of how many there are or what they are named.
To learn which directories exist in your HDFS file system and what permissions different directories and files have, run hdfs dfs -ls / and keep running -ls on the directories you see as you work down the tree.

Create a table and load data

You can create tables with the impala-shell command either interactively or from a SQL script.

The following example shows how to create three tables. The columns in each table use different data types, such as BOOLEAN and INT. The example also includes clauses describing the data format, for example that columns are separated by commas, so that data can be imported from .csv files. We already have .csv files stored in the HDFS directory tree, and the LOCATION clause points each table at the directory containing the corresponding .csv file. Impala treats all the data in all the files in those directories as the table's data.

The table_setup.sql file contains the following content:

DROP TABLE IF EXISTS tab1;
-- The EXTERNAL clause means the data is located outside the central location for Impala data files
-- and is preserved when the associated Impala table is dropped. We expect the data to already
-- exist in the directory specified by the LOCATION clause.
CREATE EXTERNAL TABLE tab1
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';

DROP TABLE IF EXISTS tab2;
-- TAB2 is an external table, similar to TAB1.
CREATE EXTERNAL TABLE tab2
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab2';

DROP TABLE IF EXISTS tab3;
-- Leaving out the EXTERNAL clause means the data will be managed
-- in the central Impala data directory tree. Rather than reading
-- existing data files when the table is created, we load the
-- data after creating the table.
CREATE TABLE tab3
(
   id INT,
   col_1 BOOLEAN,
   col_2 DOUBLE,
   month INT,
   day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Run the table_setup.sql script with the following command:

impala-shell -i 172.16.230.152 -f table_setup.sql
The procedure is as follows:

[[email protected] testdata]$ pwd
/home/cloudera/testdata
[[email protected] testdata]$ ll
total 12
-rw-rw-r--. 1 cloudera cloudera  193 Jun 27 08:33 tab1.csv
-rw-rw-r--. 1 cloudera cloudera  158 Jun 27 08:34 tab2.csv
-rw-rw-r--. 1 cloudera cloudera 1106 Jun 27 08:49 table_setup.sql
[[email protected] testdata]$ impala-shell -i 172.16.230.152 -f table_setup.sql
Starting Impala Shell without Kerberos authentication
Connected to 172.16.230.152:21000
Server version: impalad version 1.3.1-cdh5 RELEASE (build )
......
Returned 0 row(s) in 0.28s
[[email protected] testdata]$

To view the Impala table structure, log on to impala-shell and run the following commands:

show tables;
describe tab1;
The procedure is as follows:
[[email protected] testdata]$ impala-shell -i 172.16.230.152
Starting Impala Shell without Kerberos authentication
Connected to 172.16.230.152:21000
Server version: impalad version 1.3.1-cdh5 RELEASE (build )
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.3.1-cdh5 () built on Mon Jun  9 09:30:26 PDT 2014)
[172.16.230.152:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| tab1 |
| tab2 |
| tab3 |
+------+
Returned 3 row(s) in 0.01s
[172.16.230.152:21000] > describe tab1;
Query: describe tab1
+-------+-----------+---------+
| name  | type      | comment |
+-------+-----------+---------+
| id    | int       |         |
| col_1 | boolean   |         |
| col_2 | double    |         |
| col_3 | timestamp |         |
+-------+-----------+---------+
Returned 4 row(s) in 6.85s
[172.16.230.152:21000] > quit;
Goodbye
[[email protected] testdata]$

Query Impala tables

Log on to impala-shell and execute the following SQL statements:

SELECT * FROM tab1;

SELECT * FROM tab2 LIMIT 5;

SELECT tab2.*
FROM tab2,
  (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
   FROM tab2, tab1
   WHERE tab1.id = tab2.id
   GROUP BY col_1) subquery1
WHERE subquery1.max_col2 = tab2.col_2;
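The nested query above joins tab1 and tab2 on id, finds the maximum col_2 for each col_1 group, and then returns the tab2 rows holding those maxima. To make that logic concrete without a cluster, here is a local sketch with awk over the same rows (illustration only; the /tmp paths are arbitrary and this is not how Impala executes the query):

```shell
# Same rows as loaded into the two tables above.
cat > /tmp/tab1.csv <<'EOF'
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
EOF
cat > /tmp/tab2.csv <<'EOF'
1,true,12789.123
2,false,1243.5
3,false,24453.325
4,false,2423.3254
5,true,243.325
60,false,243565423.325
70,true,243.325
80,false,243423.325
90,true,243.325
EOF
# Pass 1: collect tab1 ids (the join keys).
# Pass 2: over tab2, track MAX(col_2) per col_1 group for rows that join.
# Pass 3: over tab2 again, print rows whose col_2 equals its group's maximum.
result=$(awk -F, '
  FNR == 1 { pass++ }
  pass == 1 { id[$1] = 1; next }
  pass == 2 { if (($1 in id) && (!($2 in mx) || $3 + 0 > mx[$2] + 0)) mx[$2] = $3; next }
  pass == 3 { if (($2 in mx) && $3 == mx[$2]) print }
' /tmp/tab1.csv /tmp/tab2.csv /tmp/tab2.csv)
echo "$result"
```

This prints the rows for ids 1 and 3, matching the two-row result returned by Impala below.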

The procedure is as follows:

[[email protected] testdata]$ impala-shell -i 172.16.230.152
Starting Impala Shell without Kerberos authentication
Connected to 172.16.230.152:21000
Server version: impalad version 1.3.1-cdh5 RELEASE (build )
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.3.1-cdh5 () built on Mon Jun  9 09:30:26 PDT 2014)
[172.16.230.152:21000] > SELECT * FROM tab1;
Query: select * FROM tab1
+----+-------+------------+-------------------------------+
| id | col_1 | col_2      | col_3                         |
+----+-------+------------+-------------------------------+
| 1  | true  | 123.123    | 2012-10-24 08:55:00           |
| 2  | false | 1243.5     | 2012-10-25 13:40:00           |
| 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
| 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
| 5  | true  | 243.325    | 1953-04-22 09:11:33           |
+----+-------+------------+-------------------------------+
Returned 5 row(s) in 2.39s
[172.16.230.152:21000] > SELECT * FROM tab2 LIMIT 5;
Query: select * FROM tab2 LIMIT 5
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 2  | false | 1243.5    |
| 3  | false | 24453.325 |
| 4  | false | 2423.3254 |
| 5  | true  | 243.325   |
+----+-------+-----------+
Returned 5 row(s) in 1.30s
[172.16.230.152:21000] > SELECT tab2.*
                       > FROM tab2,
                       > (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
                       >  FROM tab2, tab1
                       >  WHERE tab1.id = tab2.id
                       >  GROUP BY col_1) subquery1
                       > WHERE subquery1.max_col2 = tab2.col_2;
Query: select tab2.* FROM tab2, (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 FROM tab2, tab1 WHERE tab1.id = tab2.id GROUP BY col_1) subquery1 WHERE subquery1.max_col2 = tab2.col_2
+----+-------+-----------+
| id | col_1 | col_2     |
+----+-------+-----------+
| 1  | true  | 12789.123 |
| 3  | false | 24453.325 |
+----+-------+-----------+
Returned 2 row(s) in 1.02s
[172.16.230.152:21000] > quit;
Goodbye
[[email protected] testdata]$


Conclusion:

This article walks through a basic Impala example as an entry-level guide. For more examples, see the Impala tutorial.

This article uses many impala-shell commands. For more information, see "Using the Impala Shell (the impala-shell command)".

Original work; when reprinting, please indicate the source: http://blog.csdn.net/yangzhaohui168/article/details/35340387



