Hive Bucket Column: Bucketed Tables

Source: Internet
Author: User

The CLUSTERED BY and SORTED BY clauses used at table creation do not affect how data is inserted into a table, only how it is read. This means that users must be careful to insert data correctly, by setting the number of reducers equal to the number of buckets and by using CLUSTER BY and SORT BY in their query.

In general, distributing rows based on the hash of the bucketing column gives an even distribution of rows across the buckets.
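The two paragraphs above can be illustrated with a small Python model. It is only a sketch under stated assumptions, not Hive's actual code path: it assumes the legacy rule in which a BIGINT is hashed by folding its upper 32 bits into its lower 32 bits and the bucket index is (hash & Integer.MAX_VALUE) % number_of_buckets. Newer Hive versions may use a different hash function. The names legacy_bigint_hash, bucket_for and cluster_by are hypothetical helpers introduced for this illustration.

```python
INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def legacy_bigint_hash(value: int) -> int:
    """Assumed legacy hash of a BIGINT: fold the upper 32 bits into the lower 32."""
    v = value & 0xFFFFFFFFFFFFFFFF        # view the value as a 64-bit pattern
    return (v ^ (v >> 32)) & 0xFFFFFFFF   # keep the low 32 bits, like a Java int cast


def bucket_for(user_id: int, num_buckets: int) -> int:
    """Assumed rule: bucket index = (hash & Integer.MAX_VALUE) % num_buckets."""
    return (legacy_bigint_hash(user_id) & INT_MAX) % num_buckets


def cluster_by(rows, num_reducers):
    """Simplified model of CLUSTER BY user_id: route each row to a reducer
    by hash, then sort the rows inside each reducer's output."""
    outputs = [[] for _ in range(num_reducers)]
    for row in rows:
        outputs[bucket_for(row[0], num_reducers)].append(row)
    return [sorted(out) for out in outputs]


# Hypothetical rows: 1000 consecutive user_id values.
rows = [(uid, "python", "postgresql") for uid in range(1000)]

# With as many reducers as buckets, each reducer output corresponds to one bucket.
files = cluster_by(rows, num_reducers=3)
print([len(f) for f in files])  # [334, 333, 333]
```

In this model, each reducer writes one output, so running with a reducer count different from the bucket count would produce outputs that no longer line up with the buckets declared on the table. That is why the text asks for the two numbers to match.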

set mapred.reduce.tasks = 3;
set hive.enforce.bucketing = true;

CREATE TABLE user_info_bucketed (user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY (ds STRING)
CLUSTERED BY (user_id) INTO 3 BUCKETS;

INSERT INTO TABLE user_info_bucketed
PARTITION (ds='2015-07-25')
VALUES
(100, 'python', 'postgresql'), (101, 'python', 'postgresql'), (102, 'python', 'postgresql'), (103, 'python', 'postgresql'),
(104, 'python', 'postgresql'), (105, 'python', 'postgresql'), (106, 'python', 'postgresql'), (107, 'python', 'postgresql'),
(108, 'python', 'postgresql'), (109, 'python', 'postgresql'), (111, 'python', 'postgresql'), (112, 'python', 'postgresql'),
(113, 'python', 'postgresql'), (114, 'python', 'postgresql'), (115, 'python', 'postgresql'), (116, 'python', 'postgresql'),
(117, 'python', 'postgresql'), (118, 'python', 'postgresql'), (119, 'python', 'postgresql'), (120, 'python', 'postgresql'),
(121, 'python', 'postgresql'), (122, 'python', 'postgresql'),
(2000, 'R', 'Oracle'), (2001, 'R', 'Oracle'), (2002, 'R', 'Oracle'), (2003, 'R', 'Oracle'),
(2004, 'R', 'Oracle'), (2005, 'R', 'Oracle'), (2006, 'R', 'Oracle'), (2007, 'R', 'Oracle'),
(2008, 'R', 'Oracle'), (2009, 'R', 'Oracle'), (2010, 'R', 'Oracle'), (2011, 'R', 'Oracle'),
(2012, 'R', 'Oracle'), (2013, 'R', 'Oracle'), (2014, 'R', 'Oracle'), (2015, 'R', 'Oracle'),
(2016, 'R', 'Oracle'), (2017, 'R', 'Oracle'), (2018, 'R', 'Oracle'), (2019, 'R', 'Oracle'),
(2020, 'R', 'Oracle'), (2030, 'R', 'Oracle'), (2040, 'R', 'Oracle'), (2050, 'R', 'Oracle');

[Spark01 ~]$ hadoop fs -ls -R /user/hive/warehouse/test.db/user_info_bucketed
drwxrwxrwx   - huai supergroup          0 2015-07-20 22:46 /user/hive/warehouse/test.db/user_info_bucketed/ds=2015-07-25
-rwxrwxrwx   3 huai supergroup        266 2015-07-20 22:46 /user/hive/warehouse/test.db/user_info_bucketed/ds=2015-07-25/000000_0
-rwxrwxrwx   3 huai supergroup        288 2015-07-20 22:46 /user/hive/warehouse/test.db/user_info_bucketed/ds=2015-07-25/000001_0
-rwxrwxrwx   3 huai supergroup        266 2015-07-20 22:46 /user/hive/warehouse/test.db/user_info_bucketed/ds=2015-07-25/000002_0

[Spark01 ~]$ hadoop fs -cat /user/hive/warehouse/test.db/user_info_bucketed/ds=2015-07-25/000000_0 | sort
102pythonpostgresql
105pythonpostgresql
108pythonpostgresql
111pythonpostgresql
114pythonpostgresql
117pythonpostgresql
120pythonpostgresql
2001ROracle
2004ROracle
2007ROracle
2010ROracle
2013ROracle
2016ROracle
2019ROracle
2040ROracle
[Spark01 ~]$ hadoop fs -cat /user/hive/warehouse/test.db/user_info_bucketed/ds=2015-07-25/000001_0 | sort
100pythonpostgresql
103pythonpostgresql
106pythonpostgresql
109pythonpostgresql
112pythonpostgresql
115pythonpostgresql
118pythonpostgresql
121pythonpostgresql
2002ROracle
2005ROracle
2008ROracle
2011ROracle
2014ROracle
2017ROracle
2020ROracle
2050ROracle
[Spark01 ~]$ hadoop fs -cat /user/hive/warehouse/test.db/user_info_bucketed/ds=2015-07-25/000002_0 | sort
101pythonpostgresql
104pythonpostgresql
107pythonpostgresql
113pythonpostgresql
116pythonpostgresql
119pythonpostgresql
122pythonpostgresql
2000ROracle
2003ROracle
2006ROracle
2009ROracle
2012ROracle
2015ROracle
2018ROracle
2030ROracle
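
The three listings can be checked against a simple rule. In each line the fields appear run together, presumably because Hive's default field delimiter for text files is a non-printing character, so the leading digits are the user_id. The Python sketch below assumes that, for small non-negative BIGINT values like these, the bucket index is simply user_id modulo the number of buckets, and that the numeric prefix of each file name (000000_0, 000001_0, 000002_0) is the bucket index. Both are assumptions made for illustration and are not stated by the listings themselves.

```python
NUM_BUCKETS = 3

# user_id values read from the three `hadoop fs -cat` listings,
# keyed by the numeric prefix of the file name (000000_0 -> 0, and so on).
listing = {
    0: [102, 105, 108, 111, 114, 117, 120,
        2001, 2004, 2007, 2010, 2013, 2016, 2019, 2040],
    1: [100, 103, 106, 109, 112, 115, 118, 121,
        2002, 2005, 2008, 2011, 2014, 2017, 2020, 2050],
    2: [101, 104, 107, 113, 116, 119, 122,
        2000, 2003, 2006, 2009, 2012, 2015, 2018, 2030],
}

# Collect every id that sits in a file other than the one the modulo rule predicts.
misplaced = [
    (uid, bucket)
    for bucket, ids in listing.items()
    for uid in ids
    if uid % NUM_BUCKETS != bucket
]

# Row count per bucket file.
sizes = {bucket: len(ids) for bucket, ids in listing.items()}

print(misplaced)  # []
print(sizes)      # {0: 15, 1: 16, 2: 15}
```

Every id lands in the file the modulo rule predicts, and the row counts (15, 16, 15) are consistent with the file sizes in the directory listing, where 000001_0 is slightly larger than the other two.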
