Hive Learning Notes (iii)

Source: Internet
Author: User

partitions and Buckets:

Hive organizes the table into constituent areas, a mechanism that is roughly divided according to the values of the partitioned columns. Using partitions can speed up the query of data shards.
A table or partition can be further divided into buckets, which provide additional data structures to make it more efficient for query processing. For example, by dividing buckets by user ID, we can quickly calculate user-based queries on random samples of all user collections.

Partition:
A partition is defined with a partitioned by clause when the table is created. The clause needs to define a list of columns.

create table logs(ts String,line String )PARTITIONED BY(dt String,country String);

To display the specified partition value when data is loaded into a partitioned table

load data local inpath ‘input/hive/partitions/file1‘into  table logs partition (dt=‘20150824‘,country=‘CN‘) ;

You can use ALTER TABLE to add and remove partitions after the table.

how to know which partitions are in the table :

show partitions  logs;

In the SELECT statement, the partition column is used in normal mode, and hive crops the input to scan only the relevant partitions.

select ts,dt,line from logs   where  country=‘GB‘;  

Returns the value of the DT partition column, which is read from the directory because they do not exist in the data file.

barrels:
How does hive divide buckets?
Use the CLUSTERED by clause to divide the number of buckets used to columns buckets:

create table b_user (id int ,name String)CLUSTERED BY (id) into  4 buckets;

The data in the bucket is sorted according to one or more columns, and each bucket connection becomes an efficient merge sort, increasing the efficiency of the map connection:

create table b_user(id int,name string)CLUSTRED  BY  (id) SORTED BY (id ASC) INTO 4   BUCKETS;

This way, hive does not check that the buckets in the data file are consistent with the buckets in the table definition. The two are inconsistent, and only errors and undefined results can be found at query time.

To populate a bucket table with members, you need to set the Hive.enforce.bucketing property to true so that hive knows to create buckets with the number defined by the table.

insert overwrite  table b_user select * from users;

Each bucket is a file in the table (or partition) directory, and a job produces the same number of buckets and reduce tasks.

A table can be sampled with the TABLESAMPLE clause, which is scoped to a portion of the bucket instead of using the entire table.

select * from b_userTABLESAMPLE(BUCKET 1 out of 4 ON id );

Such a query would return half of the buckets:

select * from b_userTABLESAMPLE(BUCKET 1 out of 2 ON id );

The sample bucket table is a very efficient operation, and the query simply reads and TABLESAMPLE clauses that match the bucket. Use the rand () function to sample a table that is not divided into buckets, even if only a small subset of the samples need to be read, and the entire input dataset is scanned.

select  *  from usersTABLESAMPLE (BUCKET 1 out of 4 on rand());
Summary:

Partitioning is a mechanism for roughly dividing the values of partitioned columns, which can speed up queries. Buckets are partitioned on the basis of tables or partitions, providing more efficient query processing. Learn the definition of partitions and buckets, and understand their mechanisms.

Copyright NOTICE: This article for Bo Master original article, without Bo Master permission not reproduced.

Hive Learning Notes (iii)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.