Deep understanding of hive partitioning, partitioning is just a logical structure

Source: Internet
Author: User

On the internet there is a partition about the use of hive to explain the better, reproduced:

I. BACKGROUND

1, in the Hive Select query will generally scan the entire table content, will consume a lot of time to do unnecessary work. Sometimes you only need to scan a subset of the data in the table, so the partition concept is introduced when the table is built.

2. The partition table refers to the partition space of the partition specified when the table was created.

3, if you need to create a partitioned table, you need to call the CREATE table when the optional parameters partitioned by, see table creation syntax structure.

Second, technical details

1, a table can have one or more partitions, each partition as a folder in the form of a separate table folder directory.

2, table and column names are case-insensitive.

3, the partition is in the form of a field in the table structure exists, through the describe Table command can be viewed to the existence of a field, but the field does not hold the actual data content, is only the representation of the partition.

4. Build the syntax of the table (see partitioned by parameter for partitioning):

CREATE [EXTERNAL] TABLE [IF not EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT Table_comment] [Partitioned by (Col_name data_type [COMMENT col_comment], ...)] [CLUSTERED by (Col_name, col_name, ...) [SORTED by (Col_name [asc| DESC], ...)] into num_buckets buckets] [ROW FORMAT Row_format] [STORED as File_format] [location Hdfs_path]

5, the partition table is divided into 2 kinds, one is a single partition, that is, in the Table folder directory is only a folder directory. Another is multi-partition, the table folder under the Multi-folder nested mode.

A, single partition build Table statement: CREATE TABLE day_table (id int, content String) partitioned by (DT string), single partition table, partitioned by day, Id,content,dt three columns exist in the table structure.

b, two-partition statement: CREATE TABLE day_hour_table (id int, content String) partitioned by (DT string, hour string), double-partitioned table, partitioned by day and hour, new in table structure Added DT and hour two columns.

Table folder directory (multi-partition Table):


6. Add Partition table syntax (table created, add partition on this basis):

ALTER TABLE table_name ADD partition_spec [location ' Location1 '] partition_spec [location ' Location2 '] ... partition_s PEC:: PARTITION (Partition_col = partition_col_value, Partition_col = Partiton_col_value, ...)

Users can use ALTER table add PARTITION to add partitions to a table. Quoted when the partition name is a string. Cases:

ALTER TABLE day_table ADD PARTITION (dt= ' 2008-08-08 ', hour= ') location '/path/pv1.txt ' PARTITION (dt= ' 2008-08-08 ', hou R= ' ") location '/path/pv2.txt ';

7. Delete Partition syntax:

ALTER TABLE table_name DROP partition_spec, Partition_spec,...

The user can delete the partition with ALTER TABLE drop PARTITION. The metadata and data for the partition will be deleted together. Cases:

ALTER TABLE day_hour_table DROP PARTITION (dt= ' 2008-08-08 ', hour= ' 09 ');

8. The data is loaded into the partition table syntax:

LOAD DATA [LOCAL] inpath ' filepath ' [OVERWRITE] into TABLE tablename [PARTITION (Partcol1=val1, Partcol2=val2 ...)]

Cases:

LOAD DATA inpath '/user/pv.txt ' into TABLE day_hour_table PARTITION (dt= ' 2008-08-08 ', hour= ' 08 '); LOAD DATA local inpath '/user/hua/* ' into TABLE day_hour partition (dt= ' 2010-07-07 ');

When the data is loaded into the table, no transformations are made to the data. The load operation simply copies the data to the location corresponding to the Hive table. When the data is loaded, a directory is automatically created under the table, and the file is stored under that partition.

9. Partition-based query statements:

SELECT day_table.* from day_table WHERE day_table.dt>= ' 2008-08-08 ';

10. View the partition statement:

Hive> Show Partitions day_hour_table; OK dt=2008-08-08/hour=08 dt=2008-08-08/hour=09 dt=2008-08-09/hour=09

Iii. Summary

1. In Hive, one of the Partition in the table corresponds to a directory under the table, and all Partition data is stored in the directory of the most-word set.

2, generally partition is a supplementary query, narrow the scope of the query, speed up the retrieval of data and data in accordance with certain specifications and conditions for management.

——————————————————————————————————————

Operations on partition in hive:
Hive> CREATE TABLE MP (a string) partitioned by (b string, c string);
Ok
Time taken:0.044 seconds
Hive> ALTER TABLE MP add partition (b= ' 1 ', c= ' 1 ');
Ok
Time taken:0.079 seconds
Hive> ALTER TABLE MP add partition (b= ' 1 ', c= ' 2 ');
Ok
Time taken:0.052 seconds
Hive> ALTER TABLE MP add partition (b= ' 2 ', c= ' 2 ');
Ok
Time taken:0.056 seconds
Hive> Show Partitions MP;
Ok
B=1/c=1
b=1/c=2
b=2/c=2
Time taken:0.046 seconds
Hive> explain extended ALTER TABLE MP drop partition (b= ' 1 ');
Ok
ABSTRACT SYNTAX TREE:
(Tok_altertable_dropparts MP (Tok_partspec (tok_partval B ' 1 ')))

STAGE DEPENDENCIES:
Stage-0 is a root stage

STAGE plans:
stage:stage-0
Drop Table Operator:
Drop Table
Table:mp


Time taken:0.048 seconds
Hive> ALTER TABLE MP drop partition (b= ' 1 ');
Failed:error in metadata:table was partitioned but partition spec was not specified or tab: {b=1}
Failed:execution Error, return code 1 from Org.apache.hadoop.hive.ql.exec.DDLTask
Hive> Show Partitions MP;
Ok
B=1/c=1
b=1/c=2
b=2/c=2
Time taken:0.044 seconds
Hive> ALTER TABLE MP add partition (b= ' 1 ', c = ' 3 ') partition (b= ' 1 ', c= ' 4 ');
Ok
Time taken:0.168 seconds
Hive> Show Partitions MP;
Ok
B=1/c=1
b=1/c=2
B=1/c=3
B=1/c=4
b=2/c=2
B=2/c=3
Time taken:0.066 seconds
Hive>insert Overwrite table MP partition (b= ' 1 ', c= ' 1 ') select CNT from Tmp_et3;

Hive>alter Table MP Add columns (Newcol string);

Location-specific directory structure
Hive> ALTER TABLE ALTER2 add partition (insertdate= ' 2008-01-01 ') location ' 2008/01/01 ';

Hive> ALTER TABLE ALTER2 add partition (insertdate= ' 2008-01-02 ') location ' 2008/01/02 ';

Deep understanding of hive partitioning, partitioning is just a logical structure

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.