The partition table created in hive has no complex partition type (range partition, list partition, hash partition, mixed partition, etc.). A partitioned column is also not an actual field in a table, but one or more pseudo-columns. This means that the information and data of the partition column are not actually saved in the data file of the table.
The following statement creates a simple partition table:
CREATE TABLE Partition_test
(member_id string,
Name string
)
Partitioned by (
Stat_date String,
Province string)
ROW FORMAT delimited TERMINATED by ', ';
In this example, the Stat_date and province two fields are created as partition columns. Typically, you need to pre-create the partition before you can use it, for example:
Each partition will have a separate folder, and the following are all the data files for that partition. In this example stat_date is the main level, province is the sub-level, all stat_date= ' 20110728 ', and province different partitions will be in/user/hive/warehouse/partition_test/ stat_date=20110728 below, and stat_date different partitions will be under/user/hive/warehouse/partition_test/, such as:
Note that because the value of the partition column is to be converted to the folder's storage path, if the value of the partition column contains a special value, such as '% ', ': ', '/', ' # ', it will be escaped using the% plus 2-byte ASCII code, such as:
Hive> ALTER TABLE Partition_test add partition (stat_date= ' 2011/07/28 ', province= ' Zhejiang ');
Ok
Time taken:4.644 seconds
Then I insert the data into the Partition_test partition:
hive> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province= ' Henan ') select Member_id,name From Partition_test_input where stat_date= ' 20110728 ' and province= ' Henan ';
Total MapReduce jobs = 2
...
1 Rows loaded to Partition_test
Ok
You can also insert data into multiple partitions at the same time, partitions that do not exist after version 0.7 are created automatically, and the official document of the previous version of 0.6 says that the partition must be pre-created:
Hive>
> From Partition_test_input
> Insert Overwrite table partition_test partition (stat_date= ' 20110526 ', province= ' Liaoning ')
> select Member_id,name where stat_date= ' 20110526 ' and province= ' Liaoning '
> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province= ' Sichuan ')
> select Member_id,name where stat_date= ' 20110728 ' and province= ' Sichuan '
> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province= ' Heilongjiang ')
> select Member_id,name where stat_date= ' 20110728 ' and province= ' Heilongjiang ';
Total MapReduce jobs = 4
...
3 Rows loaded to Partition_test
Ok
In particular, it is important to note that in other databases, when inserting data into a partitioned table, the system verifies that the data conforms to that partition, and if it does not, it will error. In hive, what data is inserted into a partition is completely controlled by the person, because the partition key is a pseudo-column and is not actually stored in the file, such as:
hive> Insert Overwrite table partition_test partition (stat_date= ' 20110527 ', province= ' Liaoning ') Select member_id, Name from Partition_test_input;
Total MapReduce jobs = 2
...
5 Rows loaded to Partition_test
Ok
Hive> SELECT * from partition_test where stat_date= ' 20110527 ' and province= ' Liaoning ';
Ok
1 Liujiannan 20110527 Liaoning
2 Wangchaoqun 20110527 Liaoning
3 xuhongxing 20110527 Liaoning
4 Zhudaoyong 20110527 Liaoning
5 Zhouchengyu 20110527 Liaoning
You can see that the 5 data in Partition_test_input have different stat_date and province, but in the insert to partition (stat_date= ' 20110527 ', province= ' Liaoning ') After this partition, the 5 data stat_date and province all become the same, because the data in these two columns is read from the folder name, not actually read from the data file:
The following describes the dynamic partition, because the above method to insert data into the partition table, if the source data volume is large, then for a partition to write an insert, very cumbersome. And in the previous version, you had to manually create all the partitions before you could insert them, which is even more troublesome, and you must first know what data in the source data is available to create the partition.
Using dynamic partitioning can be a good solution to these problems. Dynamic partitioning can be automatically matched to the corresponding partition based on the data obtained from the query.
Use dynamic partitioning to first set the Hive.exec.dynamic.partition parameter value to True, the default value is False, which is not allowed:
Hive> set hive.exec.dynamic.partition;
Hive.exec.dynamic.partition=false
Hive> set hive.exec.dynamic.partition=true;
Hive> set hive.exec.dynamic.partition;
Hive.exec.dynamic.partition=true
The use of dynamic partitioning is very simple, suppose I want to stat_date= ' 20110728 ' This partition below the data, as for province inserted into which sub-partition to let the database itself to judge, that can be written:
hive> Insert Overwrite table partition_test partition (stat_date= ' 20110728 ', province)
> select member_id,name,province from Partition_test_input where stat_date= ' 20110728 ';
Total MapReduce jobs = 2
...
3 Rows loaded to Partition_test
Ok
Stat_date is called a static partition column, and province is called a dynamic partition column. The dynamic partitioning columns need to be written out in the order of partitioning in the SELECT clause, and the static partitioning columns are not written out. So all the data of stat_date= ' 20110728 ' will be inserted into/user/hive/warehouse/partition_test/stat_date=20110728/according to the different province Under the different sub-folders, if the source data corresponding to the province sub-partition does not exist, it is automatically created, very convenient, and avoids manual control of the insertion of data and the mapping of the partition is a potential risk.
Note that dynamic partitioning does not allow the primary partition to take dynamic columns while the secondary partition takes a static column, which causes all primary partitions to create partitions that are defined by the secondary partition static column:
hive> Insert Overwrite table partition_test partition (stat_date,province= ' Liaoning ')
> select member_id,name,province from Partition_test_input where province= ' Liaoning ';
Failed:error in semantic analysis:line 1:48 Dynamic partition cannot being the parent of a static partition ' Liaoning '
Dynamic partitioning can allow all partition columns to be dynamic partitioning columns, but first set a parameter hive.exec.dynamic.partition.mode:
Hive> set Hive.exec.dynamic.partition.mode;
Hive.exec.dynamic.partition.mode=strict
Its default value is Strick, that is, the partitioning column is not allowed to be all dynamic, this is to prevent users may originally intended to be only in the sub-partition dynamic partition, but inadvertently forget to specify a value for the primary partition column, which will cause a DML statement to create a large number of new partitions (corresponding to a large number of new folders) Impact on system performance.
So we're going to set:
Hive> set Hive.exec.dynamic.partition.mode=nostrick;
3 more parameters are introduced:
Hive.exec.max.dynamic.partitions.pernode (default 100): The maximum number of partitions allowed to be created per MapReduce job, and an error if this number is exceeded
Hive.exec.max.dynamic.partitions (default 1000): Maximum number of all partitions that a DML statement allows to create
Hive.exec.max.created.files (default 100000): Maximum number of files that are allowed to be created by all mapreduce jobs
When the amount of source table data is large, the data generated in a single mapreduce job can be scattered on the partition column, for example, for example, the following table uses 3 maps:
1
1
1
2
2
2
3
3
3
If the data is distributed in this way, then each mapreduce only needs to create 1 partitions:
|
Map1 to |
|
The
MAP2----
The
|3
MAP3-|3
|3
But if the data is distributed as follows, the first MapReduce will create 3 partitions:
|
MAP1----
|3
|
MAP2----
|3
|
MAP3----
|3
An example of an error is given below:
Hive> set hive.exec.max.dynamic.partitions.pernode=4;
hive> Insert Overwrite table partition_test partition (stat_date,province)
> select member_id,name,stat_date,province from Partition_test_input distribute by stat_date,province;
Total MapReduce jobs = 1
...
[Fatal Error] Operator Fs_4 (id=4): Number of dynamic partitions exceeded Hive.exec.max.dynamic.partitions.pernode. Killing the job.
Ended Job = job_201107251641_0083 with errors
Failed:execution Error, return code 2 from Org.apache.hadoop.hive.ql.exec.MapRedTask
In order for the partition column to have the same value as possible in the same mapreduce, so that each mapreduce can produce as little as possible a new folder, you can use the Distribute by function, the partition column value of the same data together:
hive> Insert Overwrite table partition_test partition (stat_date,province)
> select member_id,name,stat_date,province from Partition_test_input distribute by stat_date,province;
Total MapReduce jobs = 1
...
Rows Loaded to Partition_test
Ok
The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion;
products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the
content of the page makes you feel confusing, please write us an email, we will handle the problem
within 5 days after receiving your email.
If you find any instances of plagiarism from the community, please send an email to:
info-contact@alibabacloud.com
and provide relevant evidence. A staff member will contact you within 5 working days.