Hive partition table creation, classification

Source: Internet
Author: User
Tags hadoop fs

First, partition table creation and description

Partition must be created when the table is defined

A, single partition build Table statement: CREATE TABLE day_table (id int, content String) partitioned by (DT string), single partition table, partitioned by day, Id,content,dt three columns exist in the table structure.

Differentiate with DT for folders

b, two-partition statement: CREATE TABLE day_hour_table (id int, content String) partitioned by (DT string, hour string), double-partitioned table, partitioned by day and hour, in table structure New addition of DT and hour two columns.

Start with the DT folder, then the hour sub-folder to differentiate

Add partition Table Syntax (table created, add partition on this basis): ALTER table table_name Add
Partition_spec [location ' Location1 ']
Partition_spec [location ' Location2 '] ...

ALTER TABLE day_table ADD
PARTITION (dt= ' 2008-08-08 ', hour= ' 08 ')
Location '/path/pv1.txt '

Delete partition syntax : ALTER TABLE table_name DROP
Partition_spec, Partition_spec,...

The user can delete the partition with ALTER TABLE drop PARTITION. The metadata and data for the partition will be deleted together. Example: ALTER TABLE day_hour_table DROP PARTITION (dt= ' 2008-08-08 ', hour= ' 09 ');

data load into partition table syntax : Load data [LOCAL] inpath ' filepath ' [OVERWRITE] into table tablename [PARTITION (Partcol1=val1, part Col2=val2 ...)]

Example: LOAD DATA inpath '/user/pv.txt ' into TABLE day_hour_table PARTITION (dt= ' 2008-08-08 ', hour= ' 08 '); Load data local inpath '/user/hua/* ' into table Day_hour partition (dt= ' 2010-07-07 '); No transformations are made to the data when the data is loaded into the table. The load operation simply copies the data to the location corresponding to the Hive table. Automatically create a directory under the table when data is loaded
Partition-based query statement: SELECT day_table.* from day_table WHERE day_table.dt>= ' 2008-08-08 ';

View partition Statement : hive> show Partitions day_hour_table; OK dt=2008-08-08/hour=08 dt=2008-08-08/hour=09 dt=2008-08-09/hour=09
second, static and dynamic partition table

Partition is a mechanism provided by hive: The user determines how data is stored by specifying one or more partition keys to optimize the data query
A table can specify multiple partition keys, each of which exists as a folder in hive.

1, static partitioning (static partition):
Edit file:/home/work/data/test3.txt; /home/work/data/test4.txt;
$ cat/home/work/data/test3.txt
1,zxm
2,ljz
3,cds
4,mac
5,android
6,symbian
7,wp

$ cat/home/work/data/test4.txt
8,zxm
9,ljz
10,cds
11,mac
12,android
13,symbian
14,wp

Build table:
Hive> CREATE TABLE student_tmp (ID INT, name STRING)
> Partitioned by (Academy string, class string)
> Row format delimited fields terminated by ', ';
Ok
Time taken:6.505 seconds
Id,name are real columns, partition columns Academy and class are pseudo-columns

Load data: (Here the direct load data into partition, before hive 0.6, must first create a good partition, data to import)
hive> Load Data local inpath '/home/work/data/test3.txt ' into table student_tmp partition (academy= ' computer ', class= ' 034 ');
Copying data from File:/home/work/data/test3.txt
Copying File:file:/home/work/data/test3.txt
Loading data to table DEFAULT.STUDENT_TMP partition (Academy=computer, class=034)
Ok
Time taken:0.898 seconds
hive> Load Data local inpath '/home/work/data/test3.txt ' into table student_tmp partition (academy= ' physics ', class= ' 0 34 ');
Copying data from File:/home/work/data/test3.txt
Copying File:file:/home/work/data/test3.txt
Loading data to table DEFAULT.STUDENT_TMP partition (Academy=physics, class=034)
Ok
Time taken:0.256 seconds

To view the hive file structure:
$ Hadoop fs-ls/user/hive/warehouse/student_tmp/
Found 2 Items
Drwxr-xr-x-work supergroup 0 2012-07-30 18:47/user/hive/warehouse/student_tmp/academy=computer
Drwxr-xr-x-work supergroup 0 2012-07-30 19:00/user/hive/warehouse/student_tmp/academy=physics
$ Hadoop Fs-ls/user/hive/warehouse/student_tmp/academy=computer
Found 1 Items
Drwxr-xr-x-work supergroup 0 2012-07-30 18:47/user/hive/warehouse/student_tmp/academy=computer/class=034

Query data:
Hive> SELECT * from student_tmp where academy= ' physics ';
Ok
1 ZXM Physics 034
2 LJZ Physics 034
3 CDs Physics 034
4 mac Physics 034
5 Android Physics 034
6 Symbian Physics 034
7 WP Physics 034
Time taken:0.139 seconds

The above is an example of static partition, where static partition is the partition of the user-specified data, specifying partition (academy= ' computer ', class= ' 034 ') at load data ;
The static partition is often used as an example of using processing time as a partition key.
However, we often encounter the need to insert a large amount of data into a partitioned table, and partition before inserting a data destination, we need dynamic partition.
Use dynamic partitioning to set the Hive.exec.dynamic.partition parameter value to True.
You can set a partial column as dynamic partition, for example: partition (academy= ' computer ', Class);
You can also set all columns that are listed as dynamic partition, such as partition (Academy, Class);
When you set all columns that are listed as dynamic partition, you need to set hive.exec.dynamic.partition.mode=nonstrict
It should be noted that the primary partition is a dynamic partition column, and the secondary partition is static partition column is not allowed, such as partition (Academy, class= ' 034 '); is not allowed
2. Dynamic partitioning (partition):
Build table
Hive> CREATE TABLE student (ID INT, name STRING)
> Partitioned by (Academy string, class string)
> Row format delimited fields terminated by ', ';
Ok
Time taken:0.393 seconds

Setting parameters
Hive> set hive.exec.dynamic.partition.mode=nonstrict;
Hive> set hive.exec.dynamic.partition=true;

Import data:
hive> Insert Overwrite table student partition (Academy, Class)
> select Id,name,academy,class from Student_tmp
> Where class= ' 034 ';
Total MapReduce jobs = 2
.........
Ok
Time taken:29.616 seconds

Query data:

Hive> SELECT * FROM student where academy= ' physics ';
Ok
1 ZXM Physics 034
2 LJZ Physics 034
3 CDs Physics 034
4 mac Physics 034
5 Android Physics 034
6 Symbian Physics 034
7 WP Physics 034
Time taken:0.165 seconds

To view files:
$ Hadoop fs-ls/user/hive/warehouse/student/
Found 2 Items
Drwxr-xr-x-work supergroup 0 2012-07-30 19:22/user/hive/warehouse/student/academy=computer
Drwxr-xr-x-work supergroup 0 2012-07-30 19:22/user/hive/warehouse/student/academy=physics

3, Summary:
Hive partition is a means of reducing the size of data processing when querying operations by splitting data into different partition into different files.
For example, in a Hive select query, if no partition is built, the entire table content is scanned, which is a huge amount of computation. If we do partition in the corresponding dimension, the scale of processing data can be greatly reduced.
|
4, with partition related parameters:
hive.exec.dynamic.partition (default false): Set to True allows dynamic partition
Hive.exec.dynamic.partition.mode (default Strick): Sets the dynamic partition mode (nostrict allows all partition columns to be dynamic partition, Strict not allowed)
hive.exec.max.dynamic.partitions.pernode (default 100): The maximum number of partitions that each MapReduce job allows to create, If this number is exceeded, the error
hive.exec.max.dynamic.partitions (default 1000): Maximum number of partitions allowed to be created by a DML statement
Hive.exec.max.created.files (default 100000): The maximum number of files that are allowed to be created by all MapReduce jobs

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.