"Go" Hive Base: Partition, bucket, Sort Merge bucket Join

Source: Internet
Author: User
Tags: table definition, hadoop fs

Hive is currently the most common and inexpensive solution for building a data warehouse in the big data era. Although rising stars such as Impala have appeared, Hive's position in terms of functionality and stability has not yet been shaken.

This post is mainly about the SMB join. Join is the core of MapReduce/Hive and something every Hadoop/Hive/DW engineer must master. Several earlier articles already discussed joins in MR and Hive; the underlying mechanics are the same, Hive just adds a layer of encapsulation on top. If you are not sure how joins work and how they are implemented underneath, please refer to the following links:

http://my.oschina.net/leejun2005/blog/95186 - Introduction to several approaches to joining two tables in MapReduce

http://my.oschina.net/leejun2005/blog/111963 - Hadoop multi-table join: map-side join example

http://my.oschina.net/leejun2005/blog/158491 - Hive & Performance learning notes

The last link contains several diagrams illustrating join strategies.

The first two are easy to understand and most people will have come across them, but the last one, the SMB join, may be less familiar. The purpose of SMB is mainly to solve the problem of joining a big table with another big table: bucketing effectively turns a big table into a collection of "small tables", which can then be joined on the map side. This is a typical divide-and-conquer idea. Before discussing the SMB join, let's review the underlying concepts.

1. Hive Partition Table

A normal Hive SELECT query generally scans the entire table, wasting a lot of time on unnecessary work. Often you only need to scan a subset of the data in the table, which is why the partition concept was introduced at table-creation time. A partitioned table is a table whose partition space is specified when the table is created.

Hive can partition the data by one or more columns; what a partition is can be explained with the following example.
Today's Internet applications store huge volumes of log files every day: a few GB, tens of GB, or even more. Every stored log record carries an attribute that is the date on which the log was generated. When creating partitions, we can partition by the date the log was produced and treat each day's logs as one partition.
Organizing data into partitions mainly improves query speed. It is up to the user to decide which partition each record is stored in; that is, when loading data, the user must explicitly specify the partition the data belongs to.
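As a rough sketch of this idea (the table name web_logs, its columns, and the file path are invented for illustration; the general DDL syntax is shown more formally in section 1.2), a daily log table partitioned by date could look like this:

CREATE TABLE web_logs (ip STRING, url STRING, ts STRING)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- each day's logs are loaded into their own partition, for example:
LOAD DATA LOCAL INPATH '/data/logs/2013-08-16.log' INTO TABLE web_logs PARTITION (log_date='2013-08-16');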

1.1 Implementation Details

1. A table can have one or more partitions; each partition is stored as a separate sub-folder under the table's folder.
2. Table and column names are case-insensitive.
3. A partition exists as a field of the table structure: the DESCRIBE TABLE command shows it as a column, but that field does not hold actual data content, it only marks the partition (a pseudo-column); see the sketch below.
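For illustration, this is roughly how the pseudo-column shows up (using the invites table created in section 1.2 below; the exact output format depends on the Hive version):

hive> DESCRIBE invites;
id      int
name    string
ds      string    (partition pseudo-column: listed like a column, but stored in the directory name, not in the data files)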

1.2 Syntax

1. Create a partitioned table with DS as the partitioning column:
CREATE TABLE invites (id INT, name STRING) PARTITIONED BY (ds STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
2. Add the data to the partition with the time of 2013-08-16:
LOAD DATA LOCAL INPATH '/home/hadoop/desktop/data.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2013-08-16');
3. Add the data to the partition with the time of 2013-08-20:
LOAD DATA LOCAL INPATH '/home/hadoop/desktop/data.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2013-08-20');
4. Query data from a partition:
SELECT * FROM invites WHERE ds = '2013-08-12';
5. Add data to a partition in a partitioned table:
INSERT OVERWRITE TABLE invites PARTITION (ds='2013-08-12') SELECT id, MAX(name) FROM test GROUP BY id;
To view the specifics of the partition, use the command:
hadoop fs -ls /home/hadoop/hive/warehouse/invites
Or:
SHOW PARTITIONS tablename;
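For illustration, after the two LOAD statements above the invites table has two partitions, so the output would look roughly like this (actual formatting depends on the Hive version):

SHOW PARTITIONS invites;
ds=2013-08-16
ds=2013-08-20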

2. Hive Bucket

For every table or partition, Hive can further organize the data into buckets, which are a finer-grained division of the data range. Hive buckets on a column: it computes a hash of the column value and takes it modulo the number of buckets to decide which bucket a record is stored in.
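As a rough illustration of this rule (for an INT column the hash is simply the value itself; the query assumes a table with an INT id column, such as the users table shown later in this section):

-- bucket = hash(value) mod number_of_buckets
-- with 4 buckets: id 0 -> bucket 0, id 2 -> bucket 2, id 3 -> bucket 3, id 4 -> bucket 0
SELECT id, pmod(hash(id), 4) AS bucket FROM users;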

There are two reasons to organize a table (or partition) into buckets:

(1) Higher query-processing efficiency. Bucketing adds extra structure to the table that Hive can exploit for certain queries. In particular, joining two tables that are bucketed on the same column (the join column) can be implemented efficiently as a map-side join: if both tables are bucketed on the join column, rows with the same column value are guaranteed to land in corresponding buckets, so only matching buckets need to be joined, which greatly reduces the amount of data involved in the JOIN.

(2) More efficient sampling. When working with large data sets, it is very handy to be able to run queries against a small subset of the data while developing and refining the queries.

1. Create a table with buckets:

CREATE TABLE bucketed_user (id INT, name STRING) CLUSTERED BY (id) SORTED BY (name) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
First, let's see how to tell Hive that a table should be divided into buckets. We use the CLUSTERED BY clause to specify the column to bucket on and the number of buckets:

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

Here we use the user ID to determine the bucket (Hive hashes the value and takes the remainder after dividing by the number of buckets). In this way, each bucket contains what is effectively a random collection of users (strictly speaking it is deterministic, since the hash of the id decides the bucket).

For a map-side join, the two tables must be bucketed in the same way. A mapper processing a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket, so it only needs to fetch that bucket (which is just a small fraction of the data stored in the right table) to perform the join. This optimization does not strictly require the two tables to have the same number of buckets; it also works when the bucket counts are multiples of each other. For joining two bucketed tables with HiveQL, see the section on map joins (P400).

In addition, the data in a bucket can be sorted on one or more columns. In that case the join of each pair of buckets becomes an efficient merge sort, which further improves the efficiency of the map-side join. The following syntax declares a table with sorted buckets:

CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
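As a sketch of how two such tables are joined with a sort-merge-bucket map join (the settings below are the ones commonly used for bucket map joins; exact behaviour depends on your Hive version, and bucketed_users2 is a hypothetical second table assumed to be bucketed and sorted on id in the same way):

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- both tables are bucketed and sorted on the join column id,
-- with bucket counts that are equal or multiples of each other
SELECT a.id, a.name, b.name
FROM bucketed_users a
JOIN bucketed_users2 b ON a.id = b.id;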

How do we make sure the data in a table is actually bucketed? It is certainly possible to load data generated outside Hive into a bucketed table, but it is usually easier to let Hive do the bucketing itself, typically from an existing table.

Hive does not check whether the buckets in the data files match the buckets declared in the table definition (neither the number of buckets nor the bucketing column). If they do not match, you may get errors or undefined results when querying. It is therefore recommended to let Hive populate the buckets.

Suppose we have a users table that is not bucketed:
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann

2. Force the output through multiple reducers:

To populate the bucketed table, set the hive.enforce.bucketing property to true. ① This way Hive knows to create the number of buckets declared in the table definition, and you can then simply use an INSERT command. Note that CLUSTERED BY and SORTED BY do not affect how data is imported: if you load data by other means, you are responsible for bucketing and sorting the data yourself.
'set hive.enforce.bucketing = true' automatically makes the number of reducers in the final stage match the number of buckets. Alternatively, you can set mapred.reduce.tasks to the bucket count yourself, but using 'set hive.enforce.bucketing = true' is recommended.
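A minimal sketch of the two options, assuming a table declared with 4 buckets as above:

-- option 1 (recommended): Hive matches the reducer count to the declared bucket count
set hive.enforce.bucketing = true;

-- option 2: set the reducer count manually to the number of buckets
set mapred.reduce.tasks = 4;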

3. Insert data into the table:

INSERT OVERWRITE TABLE bucketed_users SELECT * FROM users;

Physically, each bucket is just a file in the table (or partition) directory. The file names are not important, but bucket n is the nth file in lexicographic order. In fact, buckets correspond to MapReduce output file partitions: a job produces as many buckets (output files) as it has reduce tasks. We can verify this by looking at the layout of the bucketed_users table we just created. Run the following command:

4. View the structure of the table:

hive> dfs -ls /user/hive/warehouse/bucketed_users;
4 new files are shown. The file names are as follows (they contain a timestamp generated by Hive, so they change on every run):
attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0
The first bucket contains the users with IDs 0 and 4, because the hash of an INT is the integer itself, taken here modulo the number of buckets (4): ②

5. Read the data to see the contents of each file:

hive> dfs -cat /user/hive/warehouse/bucketed_users/*0_0;
0 Nat
4 Ann

We can get the same result by sampling the table with the TABLESAMPLE clause. Rather than using the whole table, the clause restricts the query to a subset of the table's buckets:

6. Sample the data in the bucket:

hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
0 Nat
4 Ann

Bucket numbering starts at 1, so the previous query returns all users from the first of the four buckets. For a large, evenly distributed data set, this returns roughly one quarter of the table's rows. We can also sample a different number of buckets (since sampling is not an exact operation, the sampling fraction does not have to divide the bucket count evenly). For example, the following query returns half of the buckets:

7. Query returning half of the buckets:

hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
0 Nat
4 Ann
2 Joe

Sampling a bucketed table is very efficient, because the query only needs to read the buckets that match the TABLESAMPLE clause. By contrast, sampling a non-bucketed table with the rand() function scans the entire input data set even if only a small sample is needed:

hive> SELECT * FROM users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
2 Joe

① From Hive 0.6.0. In earlier versions, mapred.reduce.tasks had to be set to the number of buckets in the table being populated. If the buckets are sorted, you also need to set hive.enforce.sorting to true.
② When the raw file is displayed, the fields appear squeezed together because the delimiter is an unprintable control character.

3. A complete small example: (1) Create the student & student1 tables:
create table student(id INT, age INT, name STRING)
partitioned by (stat_date STRING)
clustered by (id) sorted by (age) into 2 buckets
row format delimited fields terminated by ',';

create table student1(id INT, age INT, name STRING)
partitioned by (stat_date STRING)
clustered by (id) sorted by (age) into 2 buckets
row format delimited fields terminated by ',';
(2) Setting environment variables:

Set hive.enforce.bucketing = true;

(3) Insert data:
cat bucket.txt

1,20,zxm
2,21,ljz
3,19,cds
4,18,mac
5,22,android
6,23,symbian
7,25,wp

LOAD DATA local INPATH '/home/lijun/bucket.txt' OVERWRITE INTO TABLE student partition(stat_date="20120802");

from student
insert overwrite table student1 partition(stat_date="20120802")
select id, age, name where stat_date="20120802" sort by age;
(4) View the file directory:

hadoop fs -ls /hive/warehouse/test.db/student1/stat_date=20120802
Found 2 items
-rw-r--r--   2 lijun supergroup   2013-11-24 19:16 /hive/warehouse/test.db/student1/stat_date=20120802/000000_0
-rw-r--r--   2 lijun supergroup   2013-11-24 19:16 /hive/warehouse/test.db/student1/stat_date=20120802/000001_0

(5) View sampling data:

hive> SELECT * FROM student1 TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);

Total MapReduce jobs = 1
Launching Job 1 out of 1
.......
OK
4 18 mac 20120802
2 21 ljz 20120802
6 23 symbian 20120802
Time taken: 20.608 seconds

Note: TABLESAMPLE is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y)
y must be a multiple or a factor of the table's total bucket count; Hive determines the sampling proportion from y. For example, if a table has 64 buckets, y=32 samples (64/32=) 2 buckets of data, and y=128 samples (64/128=) 1/2 of a bucket. x indicates which bucket to start from. For example, for a table with 32 buckets, TABLESAMPLE(BUCKET 3 OUT OF 16) samples (32/16=) 2 buckets of data, namely bucket 3 and bucket (3+16=) 19. In the query above, student1 has 2 buckets and is bucketed on id, so the even ids (4, 2, 6) all hash into the first bucket, which is why those rows are returned.
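For illustration (big_table here is a hypothetical table bucketed into 32 buckets, matching the note above):

-- samples 32/16 = 2 buckets: bucket 3 and bucket 3+16 = 19
SELECT * FROM big_table TABLESAMPLE(BUCKET 3 OUT OF 16 ON id);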

4. References:

http://rdc.taobao.org/?p=1457 - A migration process from MR to Hive

http://blog.573114.com/Blog/Html/A031/516857.html - Hadoop: The Definitive Guide, Chapter 12: Hive, P384

http://superlxw1234.iteye.com/blog/1545150 - Hive Sort Merge Bucket Map Join

http://blog.csdn.net/yfkiss/article/details/7816916

"Go" Hive Base: Partition, bucket, Sort Merge bucket Join

Related Article
Large-Scale Price Reduction
  • 59% Max. and 23% Avg.
  • Price Reduction for Core Products
  • Price Reduction in Multiple Regions
undefined. /
Connect with us on Discord
  • Secure, anonymous group chat without disturbance
  • Stay updated on campaigns, new products, and more
  • Support for all your questions
undefined. /
Free Tier
  • Start free from ECS to Big Data
  • Get Started in 3 Simple Steps
  • Try ECS t5 1C1G
undefined. /

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.