Hive bucket~

Last Update:2018-07-25 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The table in hive can be split into partition,table and partition can be sorted by ' sort by ' through ' CLUSTERED by ' further dividing the data in Bucket,bucket.
Bucket main function:
1. Data sampling
2. Improve the efficiency of some query operations, such as Mapside join
It is important to note that clustered by and sorted by do not affect the import of data, which means that users must themselves be responsible for how the data is imported, including the buckets and sorting of the data.
' Set hive.enforce.bucketing = True ' can automatically control the number of the previous round of reduce to fit the number of buckets, of course, Users can also set the mapred.reduce.tasks to fit the bucket number, it is recommended to use ' Set hive.enforce.bucketing = True '

Example:
Build the Temp table student_tmp and import the data:
hive> desc student_tmp;
Ok
ID int
Age int
Name string
Stat_date string
Time taken:0.106 seconds
Hive> select * from Student_tmp;
Ok
1 ZXM 20120801
2 LJZ 20120801
3 CDs 20120801
4 Mac 20120801
5 Android 20120801
6 Symbian 20120801
7 WP 20120801
Time taken:0.123 seconds

Build Student Table:
Hive>create table student (id int, age int, name STRING)
>partitioned by (Stat_date STRING)
> Clustered by (ID) sorted by (age) into 2 buckets
>row format delimited fields terminated by ', ';

Set Environment variables:
> Set hive.enforce.bucketing = true;

Insert data:

>from student_tmp
>insert Overwrite table student partition (stat_date= "20120802")
>select id,age,name where stat_date= "20120801" sort by age;

To view the file directory:
$ Hadoop fs-ls/user/hive/warehouse/studentstat_date=20120802/
Found 2 Items
-rw-r--r--1 work supergroup 2012-07-31 19:52/user/hive/warehouse/student/stat_date=20120802/000000_0
-rw-r--r--1 work supergroup 2012-07-31 19:52/user/hive/warehouse/student/stat_date=20120802/000001_0

View Sampling data: hive> SELECT * FROM Student tablesample (buckets 1 out of 2 on ID);
Total MapReduce jobs = 1
Launching Job 1 out of 1
.......
Ok
4 Mac 20120802
2 LJZ 20120802
6 Symbian 20120802
Time taken:20.608 seconds

Tablesample is a sample statement, Syntax: Tablesample (BUCKET x out of Y)

Y must be a multiple or a factor of the total number of buckets in the table. Hive determines the proportion of samples based on the size of Y. For example, table has a total of 64 parts, when y=32, extract (64/32=) 2 buckets of data, when y=128, extract (64/128=) 1/2 buckets of data. x indicates from which bucket to start the extraction. For example, table Total bucket is 32,tablesample (bucket 3 out of 16), representing a total of 2 buckets (32/16=) of data, respectively, 3rd buckets and (3+16=) 19 buckets of data.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hive bucket~

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support