Hive bucket~

Source: Internet
Author: User
The table in hive can be split into partition,table and partition can be sorted by ' sort by ' through ' CLUSTERED by ' further dividing the data in Bucket,bucket.
Bucket main function:
1. Data sampling
2. Improve the efficiency of some query operations, such as Mapside join
It is important to note that clustered by and sorted by do not affect the import of data, which means that users must themselves be responsible for how the data is imported, including the buckets and sorting of the data.
' Set hive.enforce.bucketing = True ' can automatically control the number of the previous round of reduce to fit the number of buckets, of course, Users can also set the mapred.reduce.tasks to fit the bucket number, it is recommended to use ' Set hive.enforce.bucketing = True '

Example:
Build the Temp table student_tmp and import the data:
hive> desc student_tmp;
Ok
ID int
Age int
Name string
Stat_date string
Time taken:0.106 seconds
Hive> select * from Student_tmp;
Ok
1 ZXM 20120801
2 LJZ 20120801
3 CDs 20120801
4 Mac 20120801
5 Android 20120801
6 Symbian 20120801
7 WP 20120801
Time taken:0.123 seconds

Build Student Table:
Hive>create table student (id int, age int, name STRING)
>partitioned by (Stat_date STRING)
> Clustered by (ID) sorted by (age) into 2 buckets
>row format delimited fields terminated by ', ';

Set Environment variables:
> Set hive.enforce.bucketing = true;

Insert data:

>from student_tmp
>insert Overwrite table student partition (stat_date= "20120802")
>select id,age,name where stat_date= "20120801" sort by age;

To view the file directory:
$ Hadoop fs-ls/user/hive/warehouse/studentstat_date=20120802/
Found 2 Items
-rw-r--r--1 work supergroup 2012-07-31 19:52/user/hive/warehouse/student/stat_date=20120802/000000_0
-rw-r--r--1 work supergroup 2012-07-31 19:52/user/hive/warehouse/student/stat_date=20120802/000001_0

View Sampling data: hive> SELECT * FROM Student tablesample (buckets 1 out of 2 on ID);
Total MapReduce jobs = 1
Launching Job 1 out of 1
.......
Ok
4 Mac 20120802
2 LJZ 20120802
6 Symbian 20120802
Time taken:20.608 seconds

Tablesample is a sample statement, Syntax: Tablesample (BUCKET x out of Y)

Y must be a multiple or a factor of the total number of buckets in the table. Hive determines the proportion of samples based on the size of Y. For example, table has a total of 64 parts, when y=32, extract (64/32=) 2 buckets of data, when y=128, extract (64/128=) 1/2 buckets of data. x indicates from which bucket to start the extraction. For example, table Total bucket is 32,tablesample (bucket 3 out of 16), representing a total of 2 buckets (32/16=) of data, respectively, 3rd buckets and (3+16=) 19 buckets of data.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.