Hive Data Import: data is stored in the Hadoop Distributed File System (HDFS), and importing data into a Hive table simply moves the data into the directory where the table is located!


Transferred from: http://blog.csdn.net/lifuxiangcaohui/article/details/40588929

Hive is built on the Hadoop Distributed File System, and its data is stored in HDFS. Hive has no special data storage format of its own and builds no indexes on the data; you only tell Hive the column and row separators when creating a table, and Hive can then parse the data. So importing data into a Hive table simply moves the data into the directory where the table is located! Below are several common ways to import data into Hive.
Here are the four common methods:
(1) importing data from the local file system into a Hive table;
(2) importing data from HDFS into a Hive table;
(3) querying the corresponding data from another table and importing it into a Hive table;
(4) creating a table and, in the same statement, populating it with records queried from another table.

Hive provides several built-in file storage formats: TextFile (the default), SequenceFile, RCFile, and so on. Users can also support custom input and output file formats by implementing the corresponding interfaces.

In practical applications, TextFile is rarely used because it is uncompressed, so the disk and parsing overhead are very high. SequenceFile is a binary format stored as key-value pairs that supports record-level and block-level compression. RCFile is a hybrid of row and column storage (both TextFile and SequenceFile are row-oriented formats): it guarantees that all fields of a record land in the same HDFS block, while data within each block is stored column by column. In general, row storage has the advantage for OLTP, and column storage has the advantage for OLAP; for aggregation operations in particular, which touch only a few columns, the cost with column storage is much lower than with row storage. Although RCFile does not always win for single-column operations, its high compression ratio does reduce file size, so RCFile is a frequent choice in practice, and many data platforms have likewise settled on RCFile for their file storage format.
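To make this concrete, here is a minimal sketch (the table and column names are made up for illustration): the storage format is chosen with the STORED AS clause at table-creation time.

    hive> CREATE TABLE logs_text (id int, msg string)     -- hypothetical table
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        > STORED AS TEXTFILE;   -- plain text: simple, but no compression
    hive> CREATE TABLE logs_rc (id int, msg string)       -- hypothetical table
        > STORED AS RCFILE;     -- row-columnar storage with good compression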


I. Importing data from the local file system into a Hive table

First, create a table in Hive as follows:
    hive> CREATE TABLE wyp
        > (id int, name string,
        > age int, tel string)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY '\t'
        > STORED AS TEXTFILE;
    OK
    Time taken: 2.832 seconds
This table is very simple, with only four fields whose meanings need no explanation. Suppose the local file system has a file /home/wyp/wyp.txt with the following contents:
    [email protected] ~]$ cat wyp.txt
    1       wyp     25      13188888888888
    2       test    30      13888888888888
    3       zs      34      899314121
The data columns in wyp.txt are separated by '\t'. The data in this file can be imported into the wyp table with the following statement:
    hive> LOAD DATA LOCAL INPATH 'wyp.txt' INTO TABLE wyp;
    Copying data from file:/home/wyp/wyp.txt
    Copying file: file:/home/wyp/wyp.txt
    Loading data to table default.wyp
    Table default.wyp stats:
    [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 67]
    OK
    Time taken: 5.967 seconds
This loads the contents of wyp.txt into the wyp table. You can verify this by looking at the table's data directory, for example with the following command:

    hive> dfs -ls /user/hive/warehouse/wyp;
    Found 1 items
    -rw-r--r--   3 wyp supergroup  2014-02-19 18:23 /user/hive/warehouse/wyp/wyp.txt
It is important to note:

Unlike the relational databases we are familiar with, Hive does not currently support placing a set of literal records directly inside an INSERT statement; that is, Hive does not support statements of the form INSERT INTO ... VALUES (...).
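As a workaround, a handful of literal records can be written to a local file and loaded from there. A sketch (the row values and the /tmp path are hypothetical):

    $ printf '4\tlisi\t28\t13512345678\n' > /tmp/one_row.txt   # one tab-separated row
    hive> LOAD DATA LOCAL INPATH '/tmp/one_row.txt' INTO TABLE wyp;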

II. Importing data from HDFS into a Hive table

When data is imported from the local file system into a Hive table, it is first temporarily copied to a directory in HDFS (typically the HDFS home directory of the uploading user, such as /home/wyp/), and then moved (note: moved, not copied!) from that temporary directory into the data directory of the corresponding Hive table. Given that, Hive naturally also supports moving data directly from a directory on HDFS into the data directory of the corresponding Hive table. Suppose there is a file /home/wyp/add.txt in HDFS with the following contents:
    [email protected] /home/q/hadoop-2.2.0]$ bin/hadoop fs -cat /home/wyp/add.txt
    5       wyp1    23      131212121212
    6       wyp2    24      134535353535
    7       wyp3    25      132453535353
    8       wyp4    26      154243434355
Above is the data to be inserted. This file is stored in the HDFS directory /home/wyp (unlike the file in section I, which was stored on the local file system). We can import the contents of this file into a Hive table with the following command:

    hive> LOAD DATA INPATH '/home/wyp/add.txt' INTO TABLE wyp;
    Loading data to table default.wyp
    Table default.wyp stats:
    [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 215]
    OK
    Time taken: 0.47 seconds
    hive> select * from wyp;
    OK
    5       wyp1    23      131212121212
    6       wyp2    24      134535353535
    7       wyp3    25      132453535353
    8       wyp4    26      154243434355
    1       wyp     25      13188888888888
    2       test    30      13888888888888
    3       zs      34      899314121
    Time taken: 0.096 seconds, Fetched: 7 row(s)
From the execution results above we can see that the data was indeed imported into the wyp table! Please note that the statement LOAD DATA INPATH '/home/wyp/add.txt' INTO TABLE wyp; contains no LOCAL keyword; this is the difference from section I.
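Since this form of load moves the file rather than copying it, a quick check should confirm it. A sketch assuming the paths used above (not verified output):

    $ bin/hadoop fs -ls /home/wyp                  # add.txt is gone: it was moved, not copied
    $ bin/hadoop fs -ls /user/hive/warehouse/wyp   # wyp.txt and add.txt both live here now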

III. Querying the corresponding data from another table and importing it into a Hive table

Suppose there is a test table in Hive, created with the following statement:

    hive> CREATE TABLE test (
        > id int, name string
        > , tel string)
        > PARTITIONED BY
        > (age int)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY '\t'
        > STORED AS TEXTFILE;
    OK
    Time taken: 0.261 seconds
This is similar to the wyp table's CREATE statement, except that the test table uses age as a partition column. A note on partitions: in Hive, each partition of a table corresponds to a directory under the table's directory, and all of a partition's data is stored in that directory. For example, if the wyp table had two partition columns, dt and city, then the directory corresponding to dt=20131218, city=BJ would be /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to that partition would be stored in this directory.
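As an illustration of this layout (the age=30 partition here is hypothetical), listing the test table's directory after a couple of partition inserts would show one subdirectory per partition:

    $ bin/hadoop fs -ls /user/hive/warehouse/test
    # each partition appears as a subdirectory, e.g.:
    #   /user/hive/warehouse/test/age=25
    #   /user/hive/warehouse/test/age=30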

The following statement inserts the query results from the wyp table into the test table:
    hive> INSERT INTO TABLE test
        > PARTITION (age='25')
        > select id, name, tel
        > from wyp;
    #####################################################################
    A bunch of MapReduce job information, omitted here
    #####################################################################
    Total MapReduce CPU Time Spent: 1 seconds 310 msec
    OK
    Time taken: 19.125 seconds
    hive> select * from test;
    OK
    5       wyp1    131212121212    25
    6       wyp2    134535353535    25
    7       wyp3    132453535353    25
    8       wyp4    154243434355    25
    1       wyp     13188888888888  25
    2       test    13888888888888  25
    3       zs      899314121       25
    Time taken: 0.126 seconds, Fetched: 7 row(s)
A note on why it is done this way: as mentioned earlier, the traditional INSERT INTO table VALUES (field1, field2, ...) form of inserting data is not supported by Hive, so INSERT ... SELECT is used instead.

From the output above we can see that the data queried from the wyp table was successfully inserted into the test table! If the target table (test) had no partition column, the PARTITION (age='25') clause could simply be dropped. We can also have the partition value determined dynamically from the SELECT statement:
    hive> set hive.exec.dynamic.partition.mode=nonstrict;
    hive> INSERT INTO TABLE test
        > PARTITION (age)
        > select id, name,
        > tel, age
        > from wyp;
    #####################################################################
    A bunch of MapReduce job information, omitted here
    #####################################################################
    Total MapReduce CPU Time Spent: 1 seconds 510 msec
    OK
    Time taken: 17.712 seconds
    hive> select * from test;
    OK
    5       wyp1    131212121212    23
    6       wyp2    134535353535    24
    7       wyp3    132453535353    25
    1       wyp     13188888888888  25
    8       wyp4    154243434355    26
    2       test    13888888888888  30
    3       zs      899314121       34
    Time taken: 0.399 seconds, Fetched: 7 row(s)
This method is called dynamic-partition insert, but it is disabled by default in Hive, so hive.exec.dynamic.partition.mode must be set to nonstrict before it is used. Hive also supports inserting data with INSERT OVERWRITE; as the word suggests, overwrite means that when the statement executes, the data already in the corresponding data directory is overwritten, whereas INSERT INTO appends. Pay attention to the difference between the two. An example:

    hive> INSERT OVERWRITE TABLE test
        > PARTITION (age)
        > select id, name, tel, age
        > from wyp;
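To see the difference between the two on a single partition, consider this sketch (the WHERE filter is added here purely for illustration): running the OVERWRITE form twice leaves one copy of the rows in the age=25 partition, while running the INTO form twice leaves two copies.

    hive> INSERT OVERWRITE TABLE test
        > PARTITION (age='25')
        > select id, name, tel from wyp where age = 25;   -- replaces the partition's data
    hive> INSERT INTO TABLE test
        > PARTITION (age='25')
        > select id, name, tel from wyp where age = 25;   -- appends to the partition's data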
Even more gratifying, Hive also supports multi-table inserts. What does that mean? In Hive, we can turn the INSERT statement around and put the FROM clause at the top; the statement executes just the same. For example:
    hive> SHOW CREATE TABLE test3;
    OK
    CREATE TABLE test3(
      id int,
      name string)
    Time taken: 0.277 seconds, Fetched: 18 row(s)
    hive> from wyp
        > INSERT INTO TABLE test
        > PARTITION (age)
        > select id, name, tel, age
        > INSERT INTO TABLE test3
        > select id, name
        > where age > 25;
    hive> select * from test3;
    OK
    8       wyp4
    2       test
    3       zs
    Time taken: 4.308 seconds, Fetched: 3 row(s)
Multiple INSERT clauses can be used in the same query, and the advantage is that the source table needs to be scanned only once to produce several disjoint outputs. This is cool!

IV. Creating a table by querying the corresponding records from another table and inserting them into the newly created table

In practice, a query may produce too much output to be suitable for display on the console. In that case, it is very convenient to store Hive's query output directly in a new table. This situation is called CTAS (CREATE TABLE ... AS SELECT), as follows:

    hive> CREATE TABLE test4
        > AS
        > select id, name, tel
        > from wyp;
    hive> select * from test4;
    OK
    5       wyp1    131212121212
    6       wyp2    134535353535
    7       wyp3    132453535353
    8       wyp4    154243434355
    1       wyp     13188888888888
    2       test    13888888888888
    3       zs      899314121
    Time taken: 0.089 seconds, Fetched: 7 row(s)

The data is inserted into the test4 table. The CTAS operation is atomic, so if the SELECT query fails for some reason, the new table will not be created!
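CTAS can also be combined with the storage formats discussed earlier. As a sketch (test5 is a hypothetical table name), the new table can be created directly in RCFile format:

    hive> CREATE TABLE test5
        > STORED AS RCFILE
        > AS
        > select id, name, tel
        > from wyp;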

