Several data import methods of hive

Last Update:2015-11-02 Source: Internet

Author: User

Tags hadoop fs

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Several data import methods of hive

Today's topic is a summary of several common data import methods for hive, which I summarize in four ways:
(1), import data from the local file system to the hive table;
(2), import data from HDFs to hive table;
(3), the corresponding data from other tables are queried and imported into the hive table;
(4), when the table is created by querying the corresponding records from other tables and inserted into the table created.
I will actually do the import of every kind of data, because the sheer text makes people look dull, and it's very abstract to learn. All right, get to work!

I. Importing data from the local file system to the Hive table
Create a table in hive first, as follows:

Hive> CREATE TABLE Wyp
> (id int, name string,
> Age int, tel string)
> ROW FORMAT Delimited
> Fields TERMINATED by ' \ t '
> STORED as Textfile;
Ok
Time taken:2.832 seconds

This table is very simple, only four fields, the specific meaning I will not explain. The local file system has a/home/wyp/wyp.txt file with the following contents:

[email protected] ~]$ cat Wyp.txt
1 WYP 25 13188888888888
2 Test 30 13888888888888
3 ZS 34 899314121

The data columns in the Wyp.txt file are separated by \ t, and the data inside the file can be imported into the WYP table using the following statement, as follows:

hive> Load Data local inpath ' wyp.txt ' into table wyp;
Copying data from File:/home/wyp/wyp.txt
Copying File:file:/home/wyp/wyp.txt
Loading Data to Table DEFAULT.WYP
Table DEFAULT.WYP Stats:
[num_partitions:0, Num_files:1, num_rows:0, total_size:67]
Ok
Time taken:5.967 seconds

This will wyp.txt inside the content into the WYP table, you can go to the WYP table data directory under the view, such as the following command:

Hive> Dfs-ls/user/hive/warehouse/wyp;
Found 1 Items
-rw-r--r--3 wyp supergroup 2014-02-19 18:23/hive/warehouse/wyp/wyp.txt

The data was actually imported into the WYP table.

Unlike the relational database we are familiar with, Hive does not now support the text form of a set of records directly within the INSERT statement, meaning that hive does not support insert into .... The statement in the form of values.

Ii. importing data into the Hive table on HDFs
In the process of importing data from the local file system into a hive table, the data is temporarily copied to a directory in HDFs (typically copied to the HDFs home directory where the user is being uploaded, such as/home/wyp/), and then the data is moved from that temporary directory (note that This is about moving, not copying! ) into the data directory of the corresponding hive table. In that case, Hive certainly supports moving data directly from a directory on HDFs to the data directory of the corresponding hive table, assuming the following file is/home/wyp/add.txt, as follows:

[Email protected]/home/q/hadoop-2.2.0]$ Bin/hadoop fs-cat/home/wyp/add.txt
5 WYP1 23 131212121212
6 WYP2 24 134535353535
7 WYP3 25 132453535353
8 WYP4 26 154243434355

Above is the need to insert data content, this file is stored in the HDFs/HOME/WYP directory (and the one mentioned in the different, the file mentioned in the one is stored on the local file system), we can use the following command to import the contents of this file into the Hive table, the following:

hive> load Data inpath '/home/wyp/add.txt ' into table wyp;
Loading Data to Table DEFAULT.WYP
Table DEFAULT.WYP Stats:
[num_partitions:0, Num_files:2, num_rows:0, total_size:215]
Ok
Time taken:0.47 seconds

Hive> select * from WYP;
Ok
5 WYP1 23 131212121212
6 WYP2 24 134535353535
7 WYP3 25 132453535353
8 WYP4 26 154243434355
1 WYP 25 13188888888888
2 Test 30 13888888888888
3 ZS 34 899314121
Time taken:0.096 seconds, Fetched:7 row (s)

From the above execution results we can see that the data is indeed imported into the WYP table! Please note that the load data inpath '/home/wyp/add.txt ' into table wyp; there is no local word, this is the difference from one.

Third, query the corresponding data from other tables and import into the hive table
Suppose that there is a test table in hive, and the table statement looks like this:

Hive> CREATE TABLE Test (
> ID int, name string
>, tel String)
> Partitioned by
> (age int)
> ROW FORMAT Delimited
> Fields TERMINATED by ' \ t '
> STORED as Textfile;
Ok
Time taken:0.261 seconds

Similar to the WYP table statement, but the test table with age as the partition field (about what is the partition field, see the blog of the "Hive data storage Mode" in the introduction, the detailed introduction of this blog will be introduced in the next time, please follow this blog!) ）。 The following statement inserts the query results from the WYP table into the test table:

hive> INSERT INTO table test
> Partition (age= ' 25 ')
> select ID, Name, tel
> from WYP;
#####################################################################
Here's a bunch of mapreduce task information, omitted here
#####################################################################
Total MapReduce CPU time spent:1 seconds 310 msec
Ok
Time taken:19.125 seconds

Hive> select * from test;
Ok
5 WYP1 131212121212 25
6 WYP2 134535353535 25
7 WYP3 132453535353 25
8 WYP4 154243434355 25
1 Wyp 13188888888888 25
2 Test 13888888888888 25
3 ZS 899314121 25
Time taken:0.126 seconds, Fetched:7 row (s)

From the above output, we can see that the query from the WYP table has been successfully inserted into the test table! If the partition field does not exist in the target table (test), you can remove the partition (age= ' 25 ') statement. Of course, we can also specify the partition dynamically in the SELECT statement by using the partition value:

Hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT INTO table test
> Partition (age)
> select ID, Name,
> Tel, age
> from WYP;
#####################################################################
Here's a bunch of mapreduce task information, omitted here
#####################################################################
Total MapReduce CPU time spent:1 seconds 510 msec
Ok
Time taken:17.712 seconds

Hive> select * from test;
Ok
5 Wyp1 131212121212 23
6 WYP2 134535353535 24
7 WYP3 132453535353 25
1 Wyp 13188888888888 25
8 WYP4 154243434355 26
2 Test 13888888888888 30
3 Zs 899314121 34
Time taken:0.399 seconds, Fetched:7 row (s)

This method is called dynamic partition insertion, but it is closed by default in hive, so it is necessary to set Hive.exec.dynamic.partition.mode to nonstrict before use. Of course, Hive also supports the Insert overwrite method of inserting data, literally we can see that overwrite is the meaning of coverage, yes, when the execution of this statement, the data in the corresponding data directory will be overwritten! Insert into does not, however, pay attention to the difference between the two. Examples are as follows:

hive> Insert Overwrite table test
> PARTITION (age)
> select ID, Name, tel, age
> from WYP;

What's even more gratifying is that hive also supports multiple table insertions, what does that mean? In Hive, we can turn the INSERT statement upside down and put the from at the top, and its execution is the same as the following:

Hive> Show CREATE TABLE test3;
Ok
CREATE TABLE Test3 (
ID int,
Name string)
Time taken:0.277 seconds, fetched:18 row (s)

Hive> from WYP
> INSERT INTO Table Test
> Partition (age)
> select ID, Name, tel, age
> INSERT INTO Table test3
> select ID, Name
> Where age>25;

Hive> select * from Test3;
Ok
8 WYP4
2 Test
3 ZS
Time taken:4.308 seconds, Fetched:3 row (s)

You can use multiple insert clauses in the same query, and the advantage is that we only need to scan the source table to generate multiple disjoint outputs. This is cool!

Iv. when creating a table, by querying the corresponding record from another table and inserting it into the created table
In the actual situation, the output of the table may be too much, not suitable for display on the console, this time, the hive query output directly exists in a new table is very convenient, we call this case ctas (create table. As Select) is as follows:

Hive> CREATE TABLE Test4
> As
> select ID, Name, tel
> from WYP;

Hive> select * from Test4;
Ok
5 Wyp1 131212121212
6 WYP2 134535353535
7 Wyp3 132453535353
8 WYP4 154243434355
1 Wyp 13188888888888
2 Test 13888888888888
3 Zs 899314121
Time taken:0.089 seconds, Fetched:7 row (s)

The data is inserted into the Test4 table, and the CTAs operation is atomic, so if the select query fails for some reason, the new table will not be created!
　

Reprinted from Past memory (http://www.iteblog.com/)

Several data import methods of hive

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More