Hive Bucket Table

Bucketing hashes the values of a specified column of a table (or partition) into a fixed number of buckets, which enables efficient sampling.

Sampling over an entire table is inherently slow, because every row must still be read. If a table is already bucketed on a column, you can sample just the bucket with a given ordinal, which greatly reduces the amount of data scanned.
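The mapping of rows to buckets can be sketched in Python. This is a conceptual illustration only: Python's built-in `hash` stands in for Hive's per-type hash function, and the ids are made up.

```python
# Conceptual sketch of Hive bucketing (assumption: Python's built-in
# hash() stands in for Hive's own per-column-type hash function).
NUM_BUCKETS = 4

def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """A row lands in bucket hash(bucketing_column) % num_buckets."""
    return hash(key) % num_buckets

# Distribute some example user ids across the buckets.
buckets = {}
for user_id in [1, 2, 3, 4, 5, 6, 7, 8]:
    buckets.setdefault(bucket_for(user_id), []).append(user_id)

# Sampling one bucket now reads only the rows that hashed into it,
# roughly 1/4 of the table, instead of scanning everything.
print({b: rows for b, rows in sorted(buckets.items())})
```

Each bucket holds the rows whose key hashed to its index, which is exactly what makes "read one bucket" a meaningful sample of the whole table.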

Working with bucketed tables involves the following steps:

1). Enable bucketing

hive> SET hive.enforce.bucketing=true;

2). Create a bucket table

First, let's see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the bucketing column(s) and the number of buckets. The data within each bucket can additionally be sorted by one or more columns with SORTED BY. Because each bucket's data is sorted, a join between buckets becomes an efficient merge-sort, which further improves the efficiency of map-side joins.

hive> CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) SORTED BY (name) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

3). Import data

Physically, each bucket is simply a file in the table (or partition) directory. The file name is not important, but bucket n is the nth file in lexicographic order. In fact, buckets correspond to MapReduce output file partitions: a job produces as many buckets (output files) as it has reduce tasks.

hive> INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;

4). Inspect the resulting files

We can understand the layout by listing the files of the bucketed_users table we just created. Run the following command:

hive> dfs -ls /user/hive/warehouse/bucketed_users;

Four new files are displayed, named as follows (each file name contains a timestamp generated by Hive, so the names change on every run):

attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0

5). View the data in the buckets with the following command:

hive> dfs -cat /user/hive/warehouse/bucketed_users/*;

6). Sample the data in the bucket

We can sample the table with the TABLESAMPLE clause. Instead of scanning the whole table, this clause restricts the query to a subset of the table's buckets:

hive> SELECT * FROM bucketed_users TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);

For a large, evenly distributed dataset, this returns about one quarter of the table's rows.

We can also sample several buckets at a time, or even a fraction of a bucket; since sampling is not an exact operation, the sampling ratio does not have to divide the number of buckets evenly.

Sampling a bucketed table is very efficient, because the query only needs to read the buckets that match the TABLESAMPLE clause. By contrast, sampling a non-bucketed table with the rand() function scans the entire input dataset, even when only a small fraction of the rows is needed:

hive> SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 4 ON rand());
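The cost difference can be simulated in a few lines of Python. The row counts here are hypothetical and chosen for illustration; Hive's real I/O depends on file layout, but the asymmetry is the same.

```python
import random

# Hypothetical simulation: bucket sampling reads one file,
# rand() sampling must still visit every row.
rows = list(range(100))   # pretend table with 100 rows
num_buckets = 4

# Bucketed table: rows are pre-split into one file per bucket, so
# TABLESAMPLE (BUCKET 1 OUT OF 4) scans only a single file.
files = [[r for r in rows if r % num_buckets == b] for b in range(num_buckets)]
rows_read_bucketed = len(files[0])   # only bucket 1's file is read

# Non-bucketed table: rand() is evaluated per row, so every row is read
# even though only ~25% of them survive the filter.
random.seed(0)
sample = [r for r in rows if random.random() < 0.25]
rows_read_unbucketed = len(rows)     # full scan, regardless of sample size

print(rows_read_bucketed, rows_read_unbucketed)
```

The bucketed query touches 25 of 100 rows, while the rand() query touches all 100 to produce a sample of roughly the same size.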

Here is a second, complete worked example.

1). Set the bucketing property

hive> SET hive.enforce.bucketing=true;

2). Create a bucket table

As before, use the CLUSTERED BY clause to specify the bucketing column and the number of buckets, and SORTED BY to sort the data within each bucket by one or more columns. Because each bucket is sorted, joining buckets becomes an efficient merge-sort, which further improves map-side join efficiency.

hive> CREATE TABLE student0 (id INT, age INT, name STRING)
    > PARTITIONED BY (stat_date STRING)
    > CLUSTERED BY (id) SORTED BY (age) INTO 2 BUCKETS
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.292 seconds

hive> CREATE TABLE student1 (id INT, age INT, name STRING)
    > PARTITIONED BY (stat_date STRING)
    > CLUSTERED BY (id) SORTED BY (age) INTO 2 BUCKETS
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.215 seconds

3). Inserting data

[[email protected] hive]# more bucket.txt
1,-,zxm
2,+,ljz
3,+,cds
4,-,mac
5,A,android
6,at,symbian
7,-,wp
hive> LOAD DATA LOCAL INPATH 'bucket.txt' OVERWRITE INTO TABLE student0 PARTITION (stat_date='20120802');
hive> FROM student0
    > INSERT OVERWRITE TABLE student1 PARTITION (stat_date='20120802')
    > SELECT id, age, name WHERE stat_date='20120802'
    > SORT BY age;

4). View the file directory

hive> dfs -ls /user/hive/warehouse/student1/stat_date=20120802;
Found 2 items
-rw-r--r--   1 root supergroup   2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000000_0
-rw-r--r--   1 root supergroup   2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000001_0

5). View tablesample data

hive> SELECT * FROM student1
    > TABLESAMPLE (BUCKET 1 OUT OF 2 ON id);
OK
6    at    symbian    20120802
2    +     ljz        20120802
4    -     mac        20120802
Time taken: 10.871 seconds, Fetched: 3 row(s)

Note: TABLESAMPLE is the sampling clause. Syntax: TABLESAMPLE (BUCKET x OUT OF y)

y must be a multiple or a factor of the table's total bucket count; Hive determines the sampling fraction from y. For example, if a table has 64 buckets: when y=32, (64/32=) 2 buckets of data are sampled; when y=128, (64/128=) 1/2 of one bucket is sampled.

x indicates which bucket to start sampling from. For example, if a table has 32 buckets, TABLESAMPLE (BUCKET 3 OUT OF 16) samples (32/16=) 2 buckets of data: the 3rd bucket and the (3+16=) 19th bucket.
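The x/y rule above can be sketched as a small Python helper. This is an illustration of the rule as just described, not Hive source code, and the function name is mine.

```python
def sampled_buckets(x: int, y: int, total_buckets: int) -> list:
    """Which 1-based bucket numbers TABLESAMPLE (BUCKET x OUT OF y)
    reads on a table with total_buckets buckets, per the rule above.
    Assumes y is a factor or a multiple of total_buckets."""
    if total_buckets % y == 0:
        # y is a factor: read total/y buckets, starting at x, stride y.
        return list(range(x, total_buckets + 1, y))
    # y is a multiple of the bucket count: a fraction (total/y)
    # of a single bucket is read, taken from bucket x.
    return [x]

# The document's example: 32 buckets, BUCKET 3 OUT OF 16
print(sampled_buckets(3, 16, 32))
```

Running this on the examples above gives buckets 3 and 19 for (x=3, y=16, 32 buckets), two buckets for (x=1, y=32, 64 buckets), and a single partial bucket for (x=1, y=128, 64 buckets).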
