static partitions and dynamic partitioning in hive

Last Update:2018-07-24 Source: Internet

Author: User

Tags hadoop fs

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The partition table created in hive has no complex partition type (range partition, list partition, hash partition, mixed partition, etc.). A partitioned column is also not an actual field in a table, but one or more pseudo-columns. This means that the information and data of the partition column are not actually saved in the data file of the table.
The following statement creates a simple partition table:

CREATE TABLE Partition_test
(member_id string,
Name string
)
Partitioned by (
Stat_date String,
Province string)
ROW FORMAT delimited TERMINATED by ', ';

In this example, the Stat_date and province two fields are created as partition columns. Typically, you need to pre-create the partition before you can use it, for example:

ALTER TABLE Partition_test Add partition (stat_date= ' 20110728 ', province= ' Zhejiang ');

This creates a partition. At this point we will see that hive has created a corresponding folder in the HDFs store:

$ Hadoop fs-ls/user/hive/warehouse/partition_test/stat_date=20110728
Found 1 Items
Drwxr-xr-x-admin supergroup 0 2011-07-29 09:53/user/hive/warehouse/partition_test/stat_date=20110728/province= Zhejiang

Each partition will have a separate folder, and the following are all the data files for that partition. In this example stat_date is the main level, province is the sub-level, all stat_date= ' 20110728 ', and province different partitions will be in/user/hive/warehouse/partition_test/ stat_date=20110728 below, and stat_date different partitions will be under/user/hive/warehouse/partition_test/, such as:

$ Hadoop fs-ls/user/hive/warehouse/partition_test/
Found 2 Items
Drwxr-xr-x-admin supergroup 0 2011-07-28 19:46/user/hive/warehouse/partition_test/stat_date=20110526
Drwxr-xr-x-admin supergroup 0 2011-07-29 09:53/user/hive/warehouse/partition_test/stat_date=20110728

Note that because the value of the partition column is to be converted to the folder's storage path, if the value of the partition column contains a special value, such as '% ', ': ', '/', ' # ', it will be escaped using the% plus 2-byte ASCII code, such as:

Hive> ALTER TABLE Partition_test add partition (stat_date= ' 2011/07/28 ', province= ' Zhejiang ');
Ok
Time taken:4.644 seconds

$hadoop fs-ls/user/hive/warehouse/partition_test/
Found 3 Items
Drwxr-xr-x-admin supergroup 0 2011-07-29 10:06/user/hive/warehouse/partition_test/stat_date=2011% 2F07% 2F28
Drwxr-xr-x-admin supergroup 0 2011-07-28 19:46/user/hive/warehouse/partition_test/stat_date=20110526
Drwxr-xr-x-admin supergroup 0 2011-07-29 09:53/user/hive/warehouse/partition_test/stat_date=20110728

I am using a secondary non-partitioned table Partition_test_input to prepare the data to be inserted into the partition_test:

hive> desc partition_test_input;
Ok
Stat_date string
member_id string
Name string
Province string

Hive> select * from Partition_test_input;
Ok
20110526 1 Liujiannan Liaoning
20110526 2 Wangchaoqun Hubei
20110728 3 xuhongxing Sichuan
20110728 4 Zhudaoyong Henan
20110728 5 Zhouchengyu Heilongjiang

Then I insert the data into the Partition_test partition:

hive> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province= ' Henan ') select Member_id,name From Partition_test_input where stat_date= ' 20110728 ' and province= ' Henan ';
Total MapReduce jobs = 2
...
1 Rows loaded to Partition_test
Ok

You can also insert data into multiple partitions at the same time, partitions that do not exist after version 0.7 are created automatically, and the official document of the previous version of 0.6 says that the partition must be pre-created:

Hive>
> From Partition_test_input
> Insert Overwrite table partition_test partition (stat_date= ' 20110526 ', province= ' Liaoning ')
> select Member_id,name where stat_date= ' 20110526 ' and province= ' Liaoning '
> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province= ' Sichuan ')
> select Member_id,name where stat_date= ' 20110728 ' and province= ' Sichuan '
> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province= ' Heilongjiang ')
> select Member_id,name where stat_date= ' 20110728 ' and province= ' Heilongjiang ';
Total MapReduce jobs = 4
...
3 Rows loaded to Partition_test
Ok

In particular, it is important to note that in other databases, when inserting data into a partitioned table, the system verifies that the data conforms to that partition, and if it does not, it will error. In hive, what data is inserted into a partition is completely controlled by the person, because the partition key is a pseudo-column and is not actually stored in the file, such as:

hive> Insert Overwrite table partition_test partition (stat_date= ' 20110527 ', province= ' Liaoning ') Select member_id, Name from Partition_test_input;
Total MapReduce jobs = 2
...
5 Rows loaded to Partition_test
Ok

Hive> SELECT * from partition_test where stat_date= ' 20110527 ' and province= ' Liaoning ';
Ok
1 Liujiannan 20110527 Liaoning
2 Wangchaoqun 20110527 Liaoning
3 xuhongxing 20110527 Liaoning
4 Zhudaoyong 20110527 Liaoning
5 Zhouchengyu 20110527 Liaoning

You can see that the 5 data in Partition_test_input have different stat_date and province, but in the insert to partition (stat_date= ' 20110527 ', province= ' Liaoning ') After this partition, the 5 data stat_date and province all become the same, because the data in these two columns is read from the folder name, not actually read from the data file:

$ Hadoop Fs-cat/user/hive/warehouse/partition_test/stat_date=20110527/province=liaoning/000000_0
1,liujiannan
2,wangchaoqun
3,xuhongxing
4,zhudaoyong
5,zhouchengyu

The following describes the dynamic partition, because the above method to insert data into the partition table, if the source data volume is large, then for a partition to write an insert, very cumbersome. And in the previous version, you had to manually create all the partitions before you could insert them, which is even more troublesome, and you must first know what data in the source data is available to create the partition.
Using dynamic partitioning can be a good solution to these problems. Dynamic partitioning can be automatically matched to the corresponding partition based on the data obtained from the query.
Use dynamic partitioning to first set the Hive.exec.dynamic.partition parameter value to True, the default value is False, which is not allowed:

Hive> set hive.exec.dynamic.partition;
Hive.exec.dynamic.partition=false
Hive> set hive.exec.dynamic.partition=true;
Hive> set hive.exec.dynamic.partition;
Hive.exec.dynamic.partition=true

The use of dynamic partitioning is very simple, suppose I want to stat_date= ' 20110728 ' This partition below the data, as for province inserted into which sub-partition to let the database itself to judge, that can be written:

hive> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province)
> select member_id,name,province from Partition_test_input where stat_date= ' 20110728 ';
Total MapReduce jobs = 2
...
3 Rows loaded to Partition_test
Ok

Stat_date is called a static partition column, and province is called a dynamic partition column. The dynamic partitioning columns need to be written out in the order of partitioning in the SELECT clause, and the static partitioning columns are not written out. So all the data of stat_date= ' 20110728 ' will be inserted into/user/hive/warehouse/partition_test/stat_date=20110728/according to the different province Under the different sub-folders, if the source data corresponding to the province sub-partition does not exist, it is automatically created, very convenient, and avoids manual control of the insertion of data and the mapping of the partition is a potential risk.

Note that dynamic partitioning does not allow the primary partition to take dynamic columns while the secondary partition takes a static column, which causes all primary partitions to create partitions that are defined by the secondary partition static column:

hive> Insert Overwrite table partition_test partition (stat_date,province= ' Liaoning ')
> select member_id,name,province from Partition_test_input where province= ' Liaoning ';
Failed:error in semantic analysis:line 1:48 Dynamic partition cannot being the parent of a static partition ' Liaoning '

Dynamic partitioning can allow all partition columns to be dynamic partitioning columns, but first set a parameter hive.exec.dynamic.partition.mode:

Hive> set Hive.exec.dynamic.partition.mode;
Hive.exec.dynamic.partition.mode=strict

Its default value is Strick, that is, the partitioning column is not allowed to be all dynamic, this is to prevent users may originally intended to be only in the sub-partition dynamic partition, but inadvertently forget to specify a value for the primary partition column, which will cause a DML statement to create a large number of new partitions (corresponding to a large number of new folders) Impact on system performance.
So we're going to set:
Hive> set Hive.exec.dynamic.partition.mode=nostrick;

3 more parameters are introduced:
Hive.exec.max.dynamic.partitions.pernode (default 100): The maximum number of partitions allowed to be created per MapReduce job, and an error if this number is exceeded
Hive.exec.max.dynamic.partitions (default 1000): Maximum number of all partitions that a DML statement allows to create
Hive.exec.max.created.files (default 100000): Maximum number of files that are allowed to be created by all mapreduce jobs

When the amount of source table data is large, the data generated in a single mapreduce job can be scattered on the partition column, for example, for example, the following table uses 3 maps:
1
1
1
2
2
2
3
3
3

If the data is distributed in this way, then each mapreduce only needs to create 1 partitions:

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More