The table in hive can be split into partition,table and partition can be sorted by ' sort by ' through ' CLUSTERED by ' further dividing the data in Bucket,bucket.
Bucket main function:
1. Data sampling
2. Improve the efficiency of some query operations, such as Mapside join
It is important to note that clustered by and sorted by do not affect the import of data, which means that users must themselves be responsible for how the data is imported, including the buckets and sorting of the data.
' Set hive.enforce.bucketing = True ' can automatically control the number of the previous round of reduce to fit the number of buckets, of course, Users can also set the mapred.reduce.tasks to fit the bucket number, it is recommended to use ' Set hive.enforce.bucketing = True '
Example:
Build the Temp table student_tmp and import the data:
hive> desc student_tmp;
Ok
ID int
Age int
Name string
Stat_date string
Time taken:0.106 seconds
Hive> select * from Student_tmp;
Ok
1 ZXM 20120801
2 LJZ 20120801
3 CDs 20120801
4 Mac 20120801
5 Android 20120801
6 Symbian 20120801
7 WP 20120801
Time taken:0.123 seconds
Build Student Table:
Hive>create table student (id int, age int, name STRING)
>partitioned by (Stat_date STRING)
> Clustered by (ID) sorted by (age) into 2 buckets
>row format delimited fields terminated by ', ';
Set Environment variables:
>
Set hive.enforce.bucketing = true;
Insert data:
>from student_tmp
>insert Overwrite table student partition (stat_date= "20120802")
>select id,age,name where stat_date= "20120801" sort by age;
To view the file directory:
$ Hadoop fs-ls/user/hive/warehouse/studentstat_date=20120802/
Found 2 Items
-rw-r--r--1 work supergroup 2012-07-31 19:52/user/hive/warehouse/student/stat_date=20120802/000000_0
-rw-r--r--1 work supergroup 2012-07-31 19:52/user/hive/warehouse/student/stat_date=20120802/000001_0
View Sampling data: hive> SELECT * FROM Student tablesample (buckets 1 out of 2 on ID);
Total MapReduce jobs = 1
Launching Job 1 out of 1
.......
Ok
4 Mac 20120802
2 LJZ 20120802
6 Symbian 20120802
Time taken:20.608 seconds
Tablesample is a sample statement, Syntax: Tablesample (BUCKET x out of Y)
Y must be a multiple or a factor of the total number of buckets in the table. Hive determines the proportion of samples based on the size of Y. For example, table has a total of 64 parts, when y=32, extract (64/32=) 2 buckets of data, when y=128, extract (64/128=) 1/2 buckets of data. x indicates from which bucket to start the extraction. For example, table Total bucket is 32,tablesample (bucket 3 out of 16), representing a total of 2 buckets (32/16=) of data, respectively, 3rd buckets and (3+16=) 19 buckets of data.