partitions and buckets of hive tables

Source: Internet
Author: User

1.Hive partition Table

When Hive uses a SELECT statement to query, it generally scans the entire table and consumes a lot of time to do unnecessary work. Hive can specify the partition space when creating the table, so it can improve the query efficiency when the query is made.

syntax for creating partitioned tables:

[Java]View PlainCopy
    1. CREATE TABLE TableName (
    2. Name string
    3. ) partitioned by (Key,type ...);


[Java]View PlainCopy
  1. drop table if exists employees;
  2. CREATE TABLE if not EXISTS employees (
  3. Name String,
  4. Salary Float,
  5. subordinate Array<string>
  6. Deductions map<string,float>
  7. Address struct<street:string,city:string,num:int>
  8. ) partitioned by (Date_time String,type string)
  9. Row format delimited fields terminated by ' \ t '
  10. Collection items terminated by ', '
  11. Map keys terminated by ': '
  12. Lines terminated by ' \ n '
  13. Stored as Textfile
  14. Location '/hive/inner ';

Attached: The above statement indicates that the Date_time and type two partitions are also called two partitions when the table is built, and a partition is called a single partition, and when the above statement is executed, we can see that the result of the table is more than two fields of the partition.

[Java]View PlainCopy
    1. DESC employees;

The results are as follows:

Note: The performance in the file system is date_time as a folder, and the type is a date_time subfolder.

Inserting data into a partitioned table (to specify a partition)

[Java]View PlainCopy
    1. hive> Load Data local inpath '/usr/local/src/employee_data ' into table employees partition (Date_time='  2015-01_24 ', type=' userInfo ');
    2. Copying data from File:/usr/local/src/employee_data
    3. Copying File:file:/usr/local/src/employee_data
    4. Loading data to Table default.employees partition (date_time=2015-01_24, type=userinfo)
    5. Ok
    6. Time taken: 0.22 seconds
    7. Hive>

After the data is inserted, it appears in the file system as:

Note: From here we can find that the type partition exists as a subfolder.

To add a partition:

[Java]View PlainCopy
    1. ALTER TABLE employees add if not exists partition (date_time=' 2088-08-18 ', type=' liaozhongmin ');

Note: We can add the partition first and then add the data to the corresponding partition.

To view partitions:

[Java]View PlainCopy
    1. Show partitions employees;

Attached: Employees here represents the table name.

Delete the unwanted partitions

[Java]View PlainCopy
    1. ALTER TABLE Employees drop if exists partition (date_time=' 2015-01_24 ', type=' userInfo ');

To view the partitions again:

2.Hive Bucket Table

For each table or partition, hive can be further organized into buckets, which means that the buckets are more granular data range divisions. Hive is a bucket for a column. Hive uses a hash of the column values and then divides the number of buckets to determine which bucket the record is stored in. The benefit of the buckets is that higher query processing efficiency can be achieved. Make sampling more efficient.


[Java]View PlainCopy
    1. CREATE TABLE Bucketed_user (
    2. ID int,
    3. Name string
    4. )
    5. Clustered by (ID) sorted by (name) into 4 buckets
    6. Row format delimited fields terminated by ' \ t '
    7. stored as textfile;

We use the user ID to determine how to divide the bucket (hive uses a hash of the value and divides the result by the number of buckets to take the remainder)

Another problem to note is that when using the bucket table we have to open the bucket table:

[Java]View PlainCopy
    1. Set hive.enforce.bucketing = true;

Now we will query the table employees in name and salary and insert it into this table:

[Java]View PlainCopy
    1. Insert Overwrite table Bucketed_user select Salary,name from Employees;

We can view the inserted data through the query statement:

The data is represented in the file as follows, divided into four buckets:

When querying from a bucket table, hive calculates the data stored in buckets based on the field of the bucket, and then goes directly to the corresponding bucket to fetch the data, which improves efficiency.

Partition and bucket of hive table

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.