"Go" Hive Base: Partition, bucket, Sort Merge bucket Join

Source: Internet
Author: User
Tags: table definition, hadoop fs

Hive is currently the most common and inexpensive solution for building a data warehouse in the big data era. Although rising stars such as Impala exist, Hive's position is not yet shaken in terms of functionality and stability.

In fact, this post is mainly about the SMB join. The join is the core of MapReduce/Hive and something every Hadoop/Hive/DW engineer must master. Several earlier posts already discussed joins in MapReduce and Hive; the underlying mechanism is the same, Hive merely adds a layer of encapsulation on top. If you are not sure what kinds of joins there are and how they are implemented underneath, please refer to the following links:

http://my.oschina.net/leejun2005/blog/95186 Introduction to several schemes for joining two tables in MapReduce

http://my.oschina.net/leejun2005/blog/111963 Hadoop multi-table join: map-side join example

http://my.oschina.net/leejun2005/blog/158491 Hive & Performance learning notes

The last link contains several diagrams (not reproduced here) illustrating the join strategies.

The first diagrams are easy to understand and almost everyone has come across them, but the last one, the SMB join, may be less familiar. The purpose of the SMB join is to handle joins between two big tables: bucketing breaks a big table into "small tables", which can then be joined with a map-side join. This is the classic divide-and-conquer idea. Before we talk about the SMB join, let's review the underlying concepts.

1. Hive Partition Table

In a Hive SELECT query, the whole table is generally scanned, which wastes a lot of time doing unnecessary work. Often you only need to scan part of the data in a table, which is why the partition concept was introduced when creating tables. A partitioned table is a table whose partition space is specified at creation time.

Hive can manage data by one or more partition columns; what partitioning means can be explained with the following example.
Today's internet applications store large volumes of log files every day: a few GB, tens of GB, or even more. Every log record carries an attribute such as the date on which it was generated, so the table can be partitioned by that date column, treating each day's logs as one partition.
Organizing data into partitions mainly improves query speed. It is up to the user to decide which partition each record goes into; that is, when loading data, the user must explicitly specify which partition the data belongs to.

1.1 Implementation Details

1. A table can have one or more partitions; each partition is stored as a separate subdirectory under the table's directory.
2. Table and column names are case-insensitive.
3. A partition exists as a field in the table structure and can be seen with the DESCRIBE command, but the field does not hold actual data; it only marks the partition (a pseudo-column).
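For example (a minimal sketch using the invites table created in section 1.2 below), the partition pseudo-column shows up in the schema like an ordinary column:
hive> DESCRIBE invites;
id    int
name  string
ds    string
The values of ds are not stored in the data files; they appear only in the partition directory names (e.g. .../invites/ds=2013-08-16/).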

1.2 Syntax

1. Create a partitioned table with ds as the partition column:
CREATE TABLE invites (id INT, name STRING) PARTITIONED BY (ds STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
2. Load data into the partition for 2013-08-16:
LOAD DATA LOCAL INPATH '/home/hadoop/desktop/data.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2013-08-16');
3. Load data into the partition for 2013-08-20:
LOAD DATA LOCAL INPATH '/home/hadoop/desktop/data.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2013-08-20');
4. Query data from a partition:
SELECT * FROM invites WHERE ds = '2013-08-12';
5. Insert data into a partition of the partitioned table:
INSERT OVERWRITE TABLE invites PARTITION (ds='2013-08-12') SELECT id, MAX(name) FROM test GROUP BY id;
To view the partition directories on HDFS, use:
hadoop fs -ls /home/hadoop.hive/warehouse/invites
Or:
SHOW PARTITIONS tablename;
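For example (a sketch, assuming only the statements above have been run against invites), SHOW PARTITIONS prints one line per partition:
hive> SHOW PARTITIONS invites;
ds=2013-08-12
ds=2013-08-16
ds=2013-08-20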

2. Hive Bucket

For every table or partition, Hive can further organize the data into buckets, which are a finer-grained division of the data range. Hive buckets on a column: it hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record is stored in.
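A minimal sketch of that rule (hash() and pmod() are built-in Hive UDFs, the users table is the one used later in this section, and for an INT the hash is the value itself):
SELECT id, pmod(hash(id), 4) AS bucket_number FROM users;
-- with 4 buckets: id = 0 -> bucket 0, id = 2 -> bucket 2, id = 4 -> bucket 0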

There are two reasons to organize a table (or partition) into buckets:

(1) To obtain more efficient query processing. Bucketing adds extra structure to the table, and Hive can take advantage of that structure for some queries. In particular, joining two tables bucketed on the same column (which must include the join column) can be implemented efficiently as a map-side join: if both tables are bucketed on the join column, rows with the same value end up in corresponding buckets, which greatly reduces the amount of data each join task has to read.

(2) To make sampling more efficient. When working with large datasets, it is convenient, while developing and refining queries, to run them against a small subset of the data.

1. Create a table with buckets:

CREATE TABLE bucketed_user (id INT, name STRING) CLUSTERED BY (id) SORTED BY (name) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
First, let's see how to tell Hive that a table should be divided into buckets. We use the CLUSTERED BY clause to specify the column to bucket on and the number of buckets:

CREATE TABLE bucketed_user (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

Here we use the user ID to determine the bucket: Hive hashes the value and takes the remainder after dividing by the number of buckets, so each bucket holds an effectively random set of users.

For a map-side join, the two tables must be bucketed in the same way. A mapper processing a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket, so it only needs to fetch that bucket (which is only a small fraction of the data stored in the right table) to perform the join. This optimization does not strictly require the two tables to have the same number of buckets; it also works when one bucket count is a multiple of the other. For joining two bucketed tables in HiveQL, see the configuration sketch below and the map joins section of the Hadoop Definitive Guide (p. 400).

The data in a bucket can additionally be sorted by one or more columns. Because of this, the join of each pair of buckets becomes an efficient merge sort, which further improves the map-side join. The following syntax declares a table with sorted buckets:

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
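The original post never shows the bucketed join itself, so here is a minimal sketch of how such a join is typically enabled (the second table bucketed_orders is hypothetical, and the settings are the standard Hive parameters for bucket map join and sort merge bucket join; treat this as an illustration rather than the post's own example):

-- bucket map join: both tables bucketed on the join key, bucket counts equal or multiples of each other
set hive.optimize.bucketmapjoin = true;
-- sort merge bucket (SMB) join: additionally requires the buckets to be sorted on the join key
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT /*+ MAPJOIN(o) */ u.id, u.name, o.amount
FROM bucketed_users u
JOIN bucketed_orders o ON u.id = o.id;

With both tables bucketed and sorted on id, each mapper streams one bucket of bucketed_users against the corresponding bucket of bucketed_orders as a merge sort, so no reduce phase is needed for the join.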

How do we make sure the data in the table is actually divided into buckets? It is certainly possible to load data generated outside Hive into a bucketed table, but it is usually easier to let Hive do the bucketing, typically from an existing table.

Hive does not check whether the buckets in the data files match the buckets declared in the table definition (either in number or in the bucketing column). If they do not match, you may get errors or undefined results when querying. It is therefore recommended to let Hive populate the bucketed table itself.

Suppose there is a users table that is not bucketed:
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann

2. Force output through multiple reducers:

To populate the bucketed table, set the hive.enforce.bucketing property to true.① That way Hive knows to create the number of buckets declared in the table definition, and you can then simply use an INSERT command. Note that CLUSTERED BY and SORTED BY in the table definition do not affect how data is imported into the table; if you load data by other means, you are responsible for bucketing and sorting it correctly yourself.
'set hive.enforce.bucketing = true' automatically makes the number of reducers in the final stage match the number of buckets. Alternatively, you can set mapred.reduce.tasks to the bucket count by hand, but using 'set hive.enforce.bucketing = true' is recommended.
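A minimal sketch of the two options (the bucket count 4 matches the bucketed_users definition above):

-- recommended: let Hive derive the reducer count from the table definition
set hive.enforce.bucketing = true;
-- alternative: set the reducer count to the bucket count by hand
set mapred.reduce.tasks = 4;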

3. Insert data into the table:

INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;

Physically, each bucket is just a file in the table (or partition) directory. The file names are not significant, but bucket n is the nth file in lexicographic order. In fact, buckets correspond to the output file partitions of a MapReduce job: a job produces as many buckets (output files) as it has reduce tasks. We can verify this by looking at the layout of the bucketed_users table just created. Run the following command:

4. View the structure of the table:

hive> dfs -ls /user/hive/warehouse/bucketed_users;
Four new files are shown. The file names are as follows (they contain a timestamp and are generated by Hive, so they change on every run):
attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0
The first bucket contains users 0 and 4, because the hash of an INT is the integer itself, taken here modulo the number of buckets (4):②

5. Read the data to check the contents of one of the files:

hive> dfs -cat /user/hive/warehouse/bucketed_users/*0_0;
0 Nat
4 Ann

We can get the same result by sampling the table with the TABLESAMPLE clause. Instead of scanning the whole table, the clause restricts the query to a subset of the table's buckets:

6. Sample the data in the bucket:

hive> SELECT * FROM bucketed_users
    > TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
0 Nat
4 Ann

Bucket numbering starts from 1, so the previous query retrieves all users from the first of the four buckets. For a large, evenly distributed dataset, this returns roughly one quarter of the table's rows. We can also sample at other proportions (since sampling is not an exact operation, the proportion does not have to be a whole number of buckets). For example, the following query returns half of the buckets:

7. Sample half of the buckets:

hive> SELECT * FROM bucketed_users
    > TABLESAMPLE (BUCKET 1 OUT OF 2 ON id);
0 Nat
4 Ann
2 Joe

Because the query only has to read the buckets that match the TABLESAMPLE clause, sampling a bucketed table is a very efficient operation. If instead you sample a non-bucketed table with the rand() function, the entire input dataset is scanned, even if only a tiny sample is needed:

hive> SELECT * FROM users
    > TABLESAMPLE (BUCKET 1 OUT OF 4 ON rand());
2 Joe

① From Hive 0.6.0 onwards; in earlier versions, mapred.reduce.tasks had to be set to the number of buckets of the table being populated. If the buckets are sorted, you also need to set hive.enforce.sorting to true.
② The fields look squeezed together when the raw file is displayed because the delimiter is a non-printing control character.

3. A complete small example: (1) Create the student and student1 tables:
create table student(id INT, age INT, name STRING)
partitioned by (stat_date STRING)
clustered by (id) sorted by (age) into 2 buckets
row format delimited fields terminated by ',';

create table student1(id INT, age INT, name STRING)
partitioned by (stat_date STRING)
clustered by (id) sorted by (age) into 2 buckets
row format delimited fields terminated by ',';
(2) Setting environment variables:

set hive.enforce.bucketing = true;

(3) Insert data:
cat bucket.txt
1,20,zxm
2,21,ljz
3,19,cds
4,18,mac
5,22,android
6,23,symbian
7,25,wp

LOAD DATA LOCAL INPATH '/home/lijun/bucket.txt' OVERWRITE INTO TABLE student partition(stat_date="20120802");

from student
insert overwrite table student1 partition(stat_date="20120802")
select id, age, name where stat_date="20120802" sort by age;
(4) View the file directory:

hadoop fs -ls /hive/warehouse/test.db/student1/stat_date=20120802
Found 2 items
-rw-r--r--   2 lijun supergroup   2013-11-24 19:16 /hive/warehouse/test.db/student1/stat_date=20120802/000000_0
-rw-r--r--   2 lijun supergroup   2013-11-24 19:16 /hive/warehouse/test.db/student1/stat_date=20120802/000001_0

(5) View the sampled data:

hive> SELECT * FROM student1 TABLESAMPLE (BUCKET 1 OUT OF 2 ON id);

Total MapReduce jobs = 1
Launching Job 1 out of 1
.......
OK
4 18 mac 20120802
2 21 ljz 20120802
6 23 symbian 20120802
Time taken: 20.608 seconds

Note: TABLESAMPLE is the sampling clause. Syntax: TABLESAMPLE (BUCKET x OUT OF y)
y must be a multiple or a factor of the table's total number of buckets. Hive determines the sampling proportion from y: for example, if the table has 64 buckets, y=32 samples 64/32 = 2 buckets of data, and y=128 samples 64/128 = 1/2 of one bucket. x indicates which bucket to start from: for example, if the table has 32 buckets, TABLESAMPLE (BUCKET 3 OUT OF 16) samples 32/16 = 2 buckets of data, namely bucket 3 and bucket 3+16 = 19.
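As a quick illustration of that arithmetic (the clicks table and its 32-bucket definition are hypothetical, used only to show the rule):

-- clicks is assumed to be CLUSTERED BY (user_id) INTO 32 BUCKETS
SELECT * FROM clicks TABLESAMPLE (BUCKET 3 OUT OF 16 ON user_id);
-- reads 32/16 = 2 buckets: bucket 3 and bucket 3 + 16 = 19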

4. References:

http://rdc.taobao.org/?p=1457 A migration process from MR to Hive

http://blog.573114.com/Blog/Html/A031/516857.html Hadoop: The Definitive Guide, Chapter 12 (Hive), p. 384

http://superlxw1234.iteye.com/blog/1545150 Hive Sort Merge Bucket Map Join

http://blog.csdn.net/yfkiss/article/details/7816916

"Go" Hive Base: Partition, bucket, Sort Merge bucket Join

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.