Hive does not have the complex partition types found in other databases (range, list, hash, and composite partitions). Partition columns are not actual fields in the table but one or more pseudo-columns: the partition column values are not stored in the table's data files.
The following statement creates a simple partition table:
create table partition_test
(member_id string,
name string
)
partitioned by (
stat_date string,
province string)
row format delimited fields terminated by ',';
In this example, the stat_date and province fields are created as partition columns. Generally, you must create a partition before using it. For example:
alter table partition_test add partition (stat_date='20110728', province='Zhejiang');
This creates a partition. We can see that Hive created a corresponding folder in HDFS:
$ hadoop fs -ls /user/hive/warehouse/partition_test/stat_date=20110728
Found 1 items
drwxr-xr-x - admin supergroup 0 /user/hive/warehouse/partition_test/stat_date=20110728/province=Zhejiang
Each partition has its own folder, and all of the partition's data files are stored under it. In this example stat_date is the first-level partition and province is the second-level partition: all the province partitions with stat_date='20110728' are under /user/hive/warehouse/partition_test/stat_date=20110728, while the different stat_date partitions are directly under /user/hive/warehouse/partition_test, for example:
$ hadoop fs -ls /user/hive/warehouse/partition_test/
Found 2 items
drwxr-xr-x - admin supergroup 0 /user/hive/warehouse/partition_test/stat_date=20110526
drwxr-xr-x - admin supergroup 0 2011-07-29 09:53 /user/hive/warehouse/partition_test/stat_date=20110728
Note: because a partition column's value becomes part of the folder's storage path, special characters in the value such as '%', ':', '/', '#' are escaped as '%' followed by the two hexadecimal digits of the character's ASCII code, for example:
hive> alter table partition_test add partition (stat_date='2011/07/28', province='Zhejiang');
OK
Time taken: 4.644 seconds
$ hadoop fs -ls /user/hive/warehouse/partition_test/
Found 3 items
drwxr-xr-x - admin supergroup 0 /user/hive/warehouse/partition_test/stat_date=2011%2f07%2f28
drwxr-xr-x - admin supergroup 0 /user/hive/warehouse/partition_test/stat_date=20110526
drwxr-xr-x - admin supergroup 0 2011-07-29 09:53 /user/hive/warehouse/partition_test/stat_date=20110728
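Although the folder name is escaped, the partition value itself is kept unescaped in the metastore, so queries still use the original value. As a quick check (a sketch, not part of the walkthrough above), this partition can be referenced like this:
hive> select * from partition_test where stat_date='2011/07/28' and province='Zhejiang';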
Next I use a non-partitioned table, partition_test_input, as the source for inserting data into partition_test:
hive> desc partition_test_input;
OK
stat_date string
member_id string
name string
province string
hive> select * from partition_test_input;
OK
20110526 1 liujiannan Liaoning
20110526 2 wangchaoqun Hubei
20110728 3 xuhongxing Sichuan
20110728 4 zhudaoyong Henan
20110728 5 zhouchengyu Heilongjiang
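For reference, a source table like this could be created and populated roughly as follows (a minimal sketch; the field delimiter and local file path are assumptions for illustration only):
create table partition_test_input
(stat_date string,
member_id string,
name string,
province string)
row format delimited fields terminated by ',';
load data local inpath '/home/admin/partition_test_input.txt' into table partition_test_input;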
Then I insert data into one partition of partition_test:
hive> insert overwrite table partition_test partition (stat_date='20110728', province='Henan') select member_id, name from partition_test_input where stat_date='20110728' and province='Henan';
Total mapreduce jobs = 2
...
1 rows loaded to partition_test
OK
You can also insert data into multiple partitions at the same time. From version 0.7 onward, partitions that do not exist are created automatically; for versions before 0.7, the official documentation says the partitions must be created in advance:
hive>
> from partition_test_input
> insert overwrite table partition_test partition (stat_date='20110526', province='Liaoning')
> select member_id, name where stat_date='20110526' and province='Liaoning'
> insert overwrite table partition_test partition (stat_date='20110728', province='Sichuan')
> select member_id, name where stat_date='20110728' and province='Sichuan'
> insert overwrite table partition_test partition (stat_date='20110728', province='Heilongjiang')
> select member_id, name where stat_date='20110728' and province='Heilongjiang';
Total mapreduce jobs = 4
...
3 rows loaded to partition_test
OK
Note that in other databases, when you insert data into a partitioned table, the system checks whether each row belongs to the target partition and rejects rows that do not match. In Hive, which data is inserted into which partition is entirely controlled by the user, because the partition key is a pseudo-column that is not actually stored in the data file. For example:
hive> insert overwrite table partition_test partition (stat_date='20110527', province='Liaoning') select member_id, name from partition_test_input;
Total mapreduce jobs = 2
...
5 rows loaded to partition_test
OK
hive> select * from partition_test where stat_date='20110527' and province='Liaoning';
OK
1 liujiannan 20110527 Liaoning
2 wangchaoqun 20110527 Liaoning
3 xuhongxing 20110527 Liaoning
4 zhudaoyong 20110527 Liaoning
5 zhouchengyu 20110527 Liaoning
We can see that the five rows in partition_test_input originally had different stat_date and province values, but after being inserted into the partition (stat_date='20110527', province='Liaoning'), all five rows show the same stat_date and province, because those two columns are read from the folder name rather than from the data file:
$ hadoop fs -cat /user/hive/warehouse/partition_test/stat_date=20110527/province=Liaoning/000000_0
1, liujiannan
2, wangchaoqun
3, xuhongxing
4, zhudaoyong
5, zhouchengyu
Next let's look at dynamic partitions. With the method above, if the source data volume is large, writing a separate insert statement for every partition is very tedious. In earlier versions it was even more troublesome: all partitions had to be created manually before the insert, which means you first had to know which values exist in the source data.
Dynamic partitioning solves these problems: rows are automatically routed to the corresponding partitions based on the query results.
To use dynamic partitions, you must first set hive.exec.dynamic.partition to true. Its default value is false, meaning dynamic partitions are not allowed:
hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=false
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true
Dynamic partitions are very simple to use. Suppose I want to insert data into the stat_date='20110728' partition and let Hive decide which province sub-partition each row goes to. I can write it like this:
hive> insert overwrite table partition_test partition (stat_date='20110728', province)
> select member_id, name, province from partition_test_input where stat_date='20110728';
Total mapreduce jobs = 2
...
3 rows loaded to partition_test
OK
Here stat_date is called a static partition column and province a dynamic partition column. In the select clause, the dynamic partition columns must be listed last, in the same order as in the partition clause, while the static partition columns are not selected at all. In this way, all rows with stat_date='20110728' are inserted into different sub-folders under /user/hive/warehouse/partition_test/stat_date=20110728/ according to their province value; if a province sub-partition from the source data does not exist yet, it is created automatically. This is very convenient and avoids the risk of manually controlling the mapping between the inserted data and the partitions.
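As a quick check, show partitions lists the partitions that now exist, one line per partition in the form stat_date=.../province=...:
hive> show partitions partition_test;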
Note: dynamic partitioning does not allow the primary partition column to be dynamic while the secondary partition column is static, because that would require every primary partition to create the sub-partition defined by the static secondary value:
hive> insert overwrite table partition_test partition (stat_date, province='Liaoning')
> select member_id, name, stat_date from partition_test_input where province='Liaoning';
FAILED: Error in semantic analysis: Dynamic partition cannot be the parent of a static partition 'Liaoning'
Dynamic partitioning does allow all partition columns to be dynamic, but you must first change hive.exec.dynamic.partition.mode:
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
Its default value is strict, which does not allow all partition columns to be dynamic. This protects against the case where a user means to create sub-partitions dynamically but forgets to specify a value for the primary partition column, causing a single DML statement to create a huge number of new partitions (and correspondingly a huge number of new folders) in a short time and hurting system performance.
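For example, under strict mode an insert in which both partition columns are dynamic is rejected at compile time (a sketch; the exact error text may vary by version):
hive> insert overwrite table partition_test partition (stat_date, province)
> select member_id, name, stat_date, province from partition_test_input;
FAILED: Error in semantic analysis: Dynamic partition strict mode requires at least one static partition column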
So we need to set:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Next, we will introduce three parameters:
hive.exec.max.dynamic.partitions.pernode (default 100): the maximum number of dynamic partitions each mapper or reducer is allowed to create; exceeding it raises an error.
hive.exec.max.dynamic.partitions (default 1000): the maximum number of dynamic partitions a single DML statement is allowed to create.
hive.exec.max.created.files (default 100000): the maximum number of files all the mappers and reducers of a job are allowed to create.
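If a load legitimately needs more partitions or files than these defaults allow, the limits can be raised with set, in the same way as the other parameters above (the values below are only illustrative):
hive> set hive.exec.max.dynamic.partitions.pernode=1000;
hive> set hive.exec.max.dynamic.partitions=10000;
hive> set hive.exec.max.created.files=150000;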
When the source table is large, the data handled by a single MapReduce job may be spread out over many values of the partition columns. For example, suppose the following data is processed by three map tasks:
1
1
1
2
2
2
3
3
3
If the data is distributed like this, each map task only needs to create one partition:
| 1
Map1 --> | 1
| 1
| 2
Map2 --> | 2
| 2
| 3
Map3 --> | 3
| 3
However, if the data is distributed as follows, each map task has to create three partitions:
| 1
Map1 --> | 2
| 3
| 1
Map2 --> | 2
| 3
| 1
Map3 --> | 2
| 3
The following is an example of an error:
hive> set hive.exec.max.dynamic.partitions.pernode=4;
hive> insert overwrite table partition_test partition (stat_date, province)
> select member_id, name, stat_date, province from partition_test_input;
Total mapreduce jobs = 1
...
[Fatal Error] Operator FS_4 (id=4): Number of dynamic partitions exceeded hive.exec.max.dynamic.partitions.pernode. Killing the job.
Ended job = job_201%251641_0083 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
To make rows with the same partition column values land in the same task as much as possible, so that each task creates as few new folders as possible, we can use distribute by to group rows with the same partition column values together:
hive> insert overwrite table partition_test partition (stat_date, province)
> select member_id, name, stat_date, province from partition_test_input distribute by stat_date, province;
Total mapreduce jobs = 1
...
18 rows loaded to partition_test
OK