Hive Learning Notes (iii)

Last Update:2015-08-30 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

partitions and Buckets:

Hive organizes the table into constituent areas, a mechanism that is roughly divided according to the values of the partitioned columns. Using partitions can speed up the query of data shards.
A table or partition can be further divided into buckets, which provide additional data structures to make it more efficient for query processing. For example, by dividing buckets by user ID, we can quickly calculate user-based queries on random samples of all user collections.

Partition:
A partition is defined with a partitioned by clause when the table is created. The clause needs to define a list of columns.

create table logs(ts String,line String )PARTITIONED BY(dt String,country String);

To display the specified partition value when data is loaded into a partitioned table

load data local inpath ‘input/hive/partitions/file1‘into  table logs partition (dt=‘20150824‘,country=‘CN‘) ;

You can use ALTER TABLE to add and remove partitions after the table.

how to know which partitions are in the table :

show partitions  logs;

In the SELECT statement, the partition column is used in normal mode, and hive crops the input to scan only the relevant partitions.

select ts,dt,line from logs   where  country=‘GB‘;

Returns the value of the DT partition column, which is read from the directory because they do not exist in the data file.

barrels:
How does hive divide buckets?
Use the CLUSTERED by clause to divide the number of buckets used to columns buckets:

create table b_user (id int ,name String)CLUSTERED BY (id) into  4 buckets;

The data in the bucket is sorted according to one or more columns, and each bucket connection becomes an efficient merge sort, increasing the efficiency of the map connection:

create table b_user(id int,name string)CLUSTRED  BY  (id) SORTED BY (id ASC) INTO 4   BUCKETS;

This way, hive does not check that the buckets in the data file are consistent with the buckets in the table definition. The two are inconsistent, and only errors and undefined results can be found at query time.

To populate a bucket table with members, you need to set the Hive.enforce.bucketing property to true so that hive knows to create buckets with the number defined by the table.

insert overwrite  table b_user select * from users;

Each bucket is a file in the table (or partition) directory, and a job produces the same number of buckets and reduce tasks.

A table can be sampled with the TABLESAMPLE clause, which is scoped to a portion of the bucket instead of using the entire table.

select * from b_userTABLESAMPLE(BUCKET 1 out of 4 ON id );

Such a query would return half of the buckets:

select * from b_userTABLESAMPLE(BUCKET 1 out of 2 ON id );

The sample bucket table is a very efficient operation, and the query simply reads and TABLESAMPLE clauses that match the bucket. Use the rand () function to sample a table that is not divided into buckets, even if only a small subset of the samples need to be read, and the entire input dataset is scanned.

select  *  from usersTABLESAMPLE (BUCKET　1 out of 4 on rand());

Summary:

Partitioning is a mechanism for roughly dividing the values of partitioned columns, which can speed up queries. Buckets are partitioned on the basis of tables or partitions, providing more efficient query processing. Learn the definition of partitions and buckets, and understand their mechanisms.

Hive Learning Notes (iii)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hive Learning Notes (iii)

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support