Hive Join


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

LanguageManual Joins

Join Syntax

Hive supports the following syntax for joining tables:

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT | RIGHT | FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)

table_reference:
    table_factor
  | join_table

table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )

join_condition:
    ON equality_expression ( AND equality_expression )*

equality_expression:
    expression = expression

Note: Hive joins, outer joins, and left semi joins support only equality conditions. Inequality conditions are not supported because they are difficult to express as map/reduce jobs.
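The reason equality conditions map so naturally onto map/reduce can be sketched in a few lines of Python. This is only an illustration of a reduce-side equi-join, not Hive's implementation; the function names and the sample rows are made up for the example. The mapper tags each row with its join key, the shuffle groups rows by that key, and the reducer pairs the rows inside each group. An inequality condition has no single key to shuffle on, which is why it does not fit this pattern.

```python
from collections import defaultdict
from itertools import product


def map_phase(side, rows, key_column):
    """Emit (join_key, (side, row)) pairs, as a mapper would."""
    for row in rows:
        yield row[key_column], (side, row)


def reduce_side_join(left_rows, right_rows, left_key, right_key):
    """Equi-join two lists of dicts by shuffling on the join key."""
    # Shuffle: group every emitted record by its join key.
    groups = defaultdict(lambda: {"left": [], "right": []})
    for key, (side, row) in map_phase("left", left_rows, left_key):
        groups[key][side].append(row)
    for key, (side, row) in map_phase("right", right_rows, right_key):
        groups[key][side].append(row)
    # Reduce: inside one key group, pair every left row with every right row.
    joined = []
    for key in sorted(groups):
        joined.extend(product(groups[key]["left"], groups[key]["right"]))
    return joined


# Hypothetical sample data.
a = [{"id": 1, "val": "a1"}, {"id": 2, "val": "a2"}]
b = [{"id": 1, "val": "b1"}, {"id": 1, "val": "b1x"}, {"id": 3, "val": "b3"}]
result = reduce_side_join(a, b, "id", "id")
print([(l["val"], r["val"]) for l, r in result])
# [('a1', 'b1'), ('a1', 'b1x')]
```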

 

Version 0.13.0+: Implicit join notation

Implicit join notation is supported starting from Hive 0.13.0. The FROM clause may list tables separated by commas, omitting the JOIN keyword, as shown below:

SELECT *
FROM table1 t1, table2 t2, table3 t3
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '20140901';

 

Version 0.13.0+: Unqualified column references

Unqualified column references are supported in join conditions starting from Hive 0.13.0, as follows:

CREATE TABLE a (k1 string, v1 string);
CREATE TABLE b (k2 string, v2 string);

SELECT k1, v1, k2, v2
FROM a JOIN b ON k1 = k2;

If a column name appears in more than one table, Hive reports it as an ambiguous reference.

Examples

Some important points about joins in Hive:

1) Only equality joins are supported.

SELECT a.* FROM a JOIN b ON (a.id = b.id);

SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department);

2) More than two tables can be joined.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2);

3) Number of MR jobs generated by a multi-table join

One MR job: if every table in a multi-table join uses the same column in the JOIN clauses, only one MR (map/reduce) job is generated, for example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

Tables a, b, and c are all joined on the same column: b.key1 appears in both JOIN clauses, so only one MR job is generated.

Multiple MR jobs: if one table contributes at least two different columns to the JOIN clauses, at least two MR jobs are generated. The following SQL is converted into two map/reduce jobs:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2);

The three tables are joined on two different columns of table b, b.key1 and b.key2. The join proceeds as follows: first, tables a and b are joined on a.key = b.key1, which is the first MR job; the result of that join is then joined with c on c.key = b.key2, which is the second MR job.
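The rule above can be expressed as a toy model in Python. This is a deliberately simplified sketch and not Hive's query planner: it assumes a left-to-right chain of joins and treats two consecutive JOIN clauses as belonging to the same MR job whenever they share a qualified join column. The function name and the set-of-strings representation are invented for the illustration.

```python
def count_mr_jobs(join_conditions):
    """Count MR jobs for a chain of joins under the simplified rule.

    join_conditions holds one entry per JOIN clause, each a set of the
    qualified columns it compares, for example {"a.key", "b.key1"}.
    """
    jobs = 0
    current_columns = set()
    for columns in join_conditions:
        if current_columns & columns:
            # Shares a join column with the current job: same shuffle key.
            current_columns |= columns
        else:
            # A different join column needs a new shuffle, hence a new job.
            jobs += 1
            current_columns = set(columns)
    return jobs


same_key = [{"a.key", "b.key1"}, {"c.key", "b.key1"}]
two_keys = [{"a.key", "b.key1"}, {"c.key", "b.key2"}]
print(count_mr_jobs(same_key))  # 1
print(count_mr_jobs(two_keys))  # 2
```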

4) Join order optimization

A multi-table join may be converted into several MR jobs; each MR job is called a join stage in Hive. In each stage, the last table in the join order should be the largest one where possible. The rows of the earlier tables are held in the reducer's buffer, while the last table is streamed through the reducer and joined against the buffered rows. Because the buffered side (the earlier tables, or the intermediate result of joining them) is then relatively small, memory overhead stays low: joining the large table only requires looking up the buffered keys as its rows stream by, which is faster and may also avoid overflowing the memory buffer. For example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

This statement generates one MR job. The join order is chosen on the assumption that the data volumes satisfy a < b < c. Rows of a and b that share a join key (a.key = b.key1) are buffered in the reducer; table c is then streamed, and each of its rows is joined with the buffered rows whose key equals c.key.

You can also give Hive a hint that names the table to stream, that is, the table to treat as the large one. For example:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

In this statement, table a is treated as the large table: the rows of b and c are buffered in the reducer, and a is streamed and joined against them.

If the STREAMTABLE hint is omitted, Hive streams the rightmost table in the join.
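The buffering and streaming described in this section can be pictured with a small Python sketch of what a reducer does for a single join key. It is an illustration under simplifying assumptions, not Hive code: rows are plain strings, and `reduce_one_key` is a hypothetical helper. The point is that only the buffered tables occupy memory, while the streamed table is consumed one row at a time.

```python
def reduce_one_key(buffered_tables, streamed_rows):
    """Join the rows that share one join key inside a reducer.

    buffered_tables: list of row lists held in memory (all tables but one).
    streamed_rows: iterable for the streamed table, read one row at a time.
    """
    # Combine the buffered tables first; this is the part that costs memory.
    combos = [()]
    for rows in buffered_tables:
        combos = [combo + (row,) for combo in combos for row in rows]
    # Stream the remaining table: each row is joined and then discarded.
    for row in streamed_rows:
        for combo in combos:
            yield combo + (row,)


a_rows = ["a1"]
b_rows = ["b1", "b2"]
c_rows = (f"c{i}" for i in range(1, 4))  # a generator, never held in full
out = list(reduce_one_key([a_rows, b_rows], c_rows))
print(len(out), out[0])
# 6 ('a1', 'b1', 'c1')
```

Putting the largest table last (or naming it in STREAMTABLE) keeps `combos` small, which is the memory saving the text describes.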

5) LEFT, RIGHT, and FULL OUTER joins exist to provide more control over rows for which the ON clause finds no match.

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key);

6) Condition-based LEFT OUTER JOIN optimization (the same logic applies to RIGHT and FULL joins)

The join is performed before the WHERE clause is evaluated.

In a LEFT OUTER JOIN, every row of the left table is retained; the columns of the right table are NULL for rows that have no match.

For example:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key)
WHERE a.ds = '2017-06-21' AND b.ds = '2017-06-21';

The execution order is: join tables a and b first, then filter the result with the WHERE condition. The join may therefore produce a large number of rows, and filtering them afterwards is time-consuming.

To optimize, you can move the WHERE conditions into the ON clause so that the join output is pre-filtered:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b
ON (a.key = b.key AND b.ds = '2017-06-21' AND a.ds = '2017-06-21');

Note that the two forms are not strictly equivalent: with the conditions in the ON clause, rows of a that find no qualifying match are still returned, with NULL in the b columns, whereas the WHERE clause discards them.
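The difference between filtering in WHERE and filtering in ON can be checked on a small data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, on the assumption that both follow standard SQL outer-join semantics here; the table contents are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER, val TEXT, ds TEXT);
    CREATE TABLE b (key INTEGER, val TEXT, ds TEXT);
    INSERT INTO a VALUES (1, 'a1', '2017-06-21'), (2, 'a2', '2017-06-21'),
                         (3, 'a3', '2017-06-20');
    INSERT INTO b VALUES (1, 'b1', '2017-06-21'), (2, 'b2', '2017-06-20');
""")

# Filter applied after the join: unmatched and non-qualifying rows disappear.
where_rows = conn.execute("""
    SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key)
    WHERE a.ds = '2017-06-21' AND b.ds = '2017-06-21'
    ORDER BY a.key
""").fetchall()

# Filter applied inside the join: every row of a survives, b may be NULL.
on_rows = conn.execute("""
    SELECT a.val, b.val FROM a LEFT OUTER JOIN b
    ON (a.key = b.key AND b.ds = '2017-06-21' AND a.ds = '2017-06-21')
    ORDER BY a.key
""").fetchall()

print(where_rows)  # [('a1', 'b1')]
print(on_rows)     # [('a1', 'b1'), ('a2', None), ('a3', None)]
```

If the NULL-padded rows are not wanted, the conditions on the preserved table (a here) still belong in WHERE.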

 

 

Joins are not commutative. Joins are left-associative, regardless of whether they are LEFT or RIGHT joins, for example:

SELECT a.val1, a.val2, b.val, c.val
FROM a
JOIN b ON (a.key = b.key)
LEFT OUTER JOIN c ON (a.key = c.key);

Tables a and b are joined first, discarding every row whose join key has no match in the other table; the intermediate result is then joined with table c. When a key exists in both a and c but not in b, the whole row (including a.val1, a.val2, and a.key) is lost in the first join, a JOIN b. The intermediate result no longer contains that a.key, so the subsequent join with c cannot bring in the matching c.val; rows that do survive the first join but have no match in c come out as: a.val1, a.val2, b.val, NULL.

If RIGHT OUTER JOIN is used instead of LEFT, the row for such a key comes out as:

NULL, NULL, NULL, c.val

 

Example:

hive (hive)> select * from a;
a.id    a.name
1       jiangshouzhuang
2       zhangyun

hive (hive)> select * from b;
b.id    b.name
1       jiangshouzhuang
3       baobao

hive (hive)> select * from c;
c.id    c.name
2       zhangyun
4       xiaosan

hive (hive)> SELECT a.name, b.name, c.name
           > FROM a
           > JOIN b ON (a.id = b.id)
           > LEFT OUTER JOIN c ON (a.id = c.id);
a.name  b.name  c.name
jiangshouzhuang jiangshouzhuang NULL

hive (hive)> SELECT a.name, b.name, c.name
           > FROM a
           > JOIN b ON (a.id = b.id)
           > RIGHT OUTER JOIN c ON (a.id = c.id);
a.name  b.name  c.name
NULL    NULL    zhangyun
NULL    NULL    xiaosan

hive (hive)> SELECT a.name, b.name, c.name
           > FROM c LEFT OUTER JOIN a ON (c.id = a.id) LEFT OUTER JOIN b
a.name  b.name  c.name
zhangyun        NULL    zhangyun
NULL    NULL    xiaosan
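The first two results can be reproduced outside Hive. The sketch below loads the same three tables into an in-memory SQLite database through Python's sqlite3 module, which is used here only as a stand-in on the assumption that it evaluates these joins the same way. Because older SQLite versions lack RIGHT OUTER JOIN, the second query is written as the equivalent `c LEFT OUTER JOIN (a JOIN b)`; that rewrite and the `ab` alias are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    CREATE TABLE c (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'jiangshouzhuang'), (2, 'zhangyun');
    INSERT INTO b VALUES (1, 'jiangshouzhuang'), (3, 'baobao');
    INSERT INTO c VALUES (2, 'zhangyun'), (4, 'xiaosan');
""")

# (a JOIN b) is evaluated first, so id 2 is already gone before c is joined.
left_rows = conn.execute("""
    SELECT a.name, b.name, c.name
    FROM a JOIN b ON (a.id = b.id)
    LEFT OUTER JOIN c ON (a.id = c.id)
""").fetchall()

# (a JOIN b) RIGHT OUTER JOIN c, written as c LEFT OUTER JOIN (a JOIN b).
right_rows = conn.execute("""
    SELECT ab.a_name, ab.b_name, c.name
    FROM c LEFT OUTER JOIN (
        SELECT a.id AS id, a.name AS a_name, b.name AS b_name
        FROM a JOIN b ON (a.id = b.id)
    ) ab ON (ab.id = c.id)
    ORDER BY c.id
""").fetchall()

print(left_rows)   # [('jiangshouzhuang', 'jiangshouzhuang', None)]
print(right_rows)  # [(None, None, 'zhangyun'), (None, None, 'xiaosan')]
```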

7) LEFT SEMI JOIN

A left semi join implements IN/EXISTS-style query semantics more efficiently. For example, the query

SELECT a.key, a.value
FROM a
WHERE a.key IN
(SELECT b.key FROM b);

can be replaced with:

SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key);

Note that in a LEFT SEMI JOIN, the right-hand table b may be referenced only in the ON clause; it cannot appear in the SELECT or WHERE clauses.
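What distinguishes a semi join from an ordinary inner join is easiest to see on data with duplicate keys. The following Python sketch is an illustration only, with hypothetical helper names and sample rows: the inner join repeats a left row once per matching right row, while the semi join emits each left row at most once and never exposes columns of the right table.

```python
def inner_join(left, right, key):
    """Return one copy of the left row per matching right row."""
    return [l for l in left for r in right if l[key] == r[key]]


def left_semi_join(left, right, key):
    """Return each left row at most once if any right row matches."""
    right_keys = {r[key] for r in right}  # only existence matters
    return [l for l in left if l[key] in right_keys]


a = [{"key": 1, "value": "x"}, {"key": 2, "value": "y"}]
b = [{"key": 1}, {"key": 1}, {"key": 3}]

print(len(inner_join(a, b, "key")))  # 2
print(left_semi_join(a, b, "key"))   # [{'key': 1, 'value': 'x'}]
```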

Hive supports subqueries as follows:

· In Hive 0.12, subqueries are supported only in the FROM clause;

· In version 0.13, subqueries in the WHERE clause are also supported;

· In version 0.13, IN / NOT IN / EXISTS / NOT EXISTS can take a subquery.

8) Map-side join

A map-side join is completed entirely in the map tasks. Map output does not need to be copied to reduce nodes, which reduces the cost of transferring data between nodes over the network.

In a multi-table join, if only one table is large and the others are small, the join can be converted into a map-only job. For example:

SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;

Each mapper that processes table a reads table b in full.

Note: with this map join, a FULL or RIGHT OUTER JOIN of a with b cannot be performed.
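A map-side join is essentially a hash join carried out inside each mapper. The Python sketch below illustrates the idea under simplifying assumptions and is not Hive's implementation: `map_join` is a hypothetical function, the small table fits in a dictionary, and each "split" is just a list of rows from the large table.

```python
def map_join(big_table_split, small_table, key):
    """Join one split of the large table against the whole small table."""
    # Build phase: load the entire small table into an in-memory hash table.
    hash_table = {}
    for row in small_table:
        hash_table.setdefault(row[key], []).append(row)
    # Probe phase: look up each large-table row; no shuffle or reducer needed.
    for row in big_table_split:
        for match in hash_table.get(row[key], []):
            yield row, match


b = [{"key": 1, "name": "one"}, {"key": 2, "name": "two"}]
split_1 = [{"key": 1, "value": "x"}, {"key": 3, "value": "z"}]
split_2 = [{"key": 2, "value": "y"}]

# Each mapper works on its own split and reads all of b.
joined = [pair for split in (split_1, split_2) for pair in map_join(split, b, "key")]
print([(l["value"], r["name"]) for l, r in joined])
# [('x', 'one'), ('y', 'two')]
```

Since a mapper sees only its own split of a, it cannot tell whether a row of b went unmatched in some other split, which is consistent with the restriction on FULL and RIGHT OUTER joins noted above.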

 

Supplement:

MapJoin is one of the built-in optimizations that Hive provides:

Before Hive 0.7, you had to supply the MapJoin hint for Hive to apply the optimization.

From Hive 0.7 on, the optimization can be applied without the MapJoin hint.

It is controlled by the following configuration parameter:

hive> set hive.auto.convert.join = true;

From Hive 0.11 on, a join is converted to a map join by default (hive.ignore.mapjoin.hint = true, hive.auto.convert.join = true) when the table sizes satisfy the following settings:

hive.auto.convert.join.noconditionaltask = true

hive.auto.convert.join.noconditionaltask.size = 10000000

hive.mapjoin.smalltable.filesize = 25000000

In Hive 0.12.0, the MapJoin optimization is enabled by default, that is, hive.auto.convert.join = true.

Hive also provides another parameter, the small-table file size (in bytes), as the threshold for enabling or disabling MapJoin:

hive.mapjoin.smalltable.filesize = 25000000

 

9) Bucket map-side join

If the tables being joined are bucketed on the join column, and the number of buckets in one table is a multiple of the number of buckets in the other, the join can be performed bucket by bucket.

If table a has four buckets and table b also has four buckets, then the join

SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;

can be completed in the mapper stage alone. By default, the mapper for each bucket of table a fetches every bucket of table b, which adds overhead, because only the bucket of b that satisfies the join condition can actually match rows in that bucket of a.

You can set the following parameter to optimize this:

set hive.optimize.bucketmapjoin = true;

The join then proceeds as follows: bucket 1 of table a is joined only with bucket 1 of table b, and buckets 2 to 4 of b are not considered.
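Which buckets have to be paired follows from how rows are assigned to buckets. The Python sketch below assumes non-negative integer keys that are bucketed by `key % num_buckets`, and it numbers buckets from 0 rather than from 1; both the helper names and that bucketing rule are simplifying assumptions for the illustration. It shows why equal bucket counts give a one-to-one pairing, and why counts that are multiples of each other still restrict each bucket to a fixed subset of the other table.

```python
def bucket_of(key, num_buckets):
    """Assumed bucketing rule: non-negative integer key modulo bucket count."""
    return key % num_buckets


def matching_buckets(big_bucket, big_count, small_count):
    """Buckets of the small table that can share keys with one big-table bucket."""
    if small_count % big_count == 0:
        return [j for j in range(small_count) if j % big_count == big_bucket]
    if big_count % small_count == 0:
        return [big_bucket % small_count]
    raise ValueError("bucket counts must be multiples of each other")


print(matching_buckets(1, 4, 4))   # [1]
print(matching_buckets(1, 6, 36))  # [1, 7, 13, 19, 25, 31]
print(matching_buckets(7, 36, 6))  # [1]
```

With four buckets on each side, every bucket has exactly one partner, which is the one-to-one case described above.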

 

Example:

Create table a:

CREATE TABLE a(key INT, value STRING)
CLUSTERED BY(key) INTO 6 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
 
Create table b:
CREATE TABLE b(key INT, value STRING)
CLUSTERED BY(key) INTO 36 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
Now we need to join on a.key and b.key. The join column is also the bucketing column. The JOIN statement is as follows:
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;
 
With bucket map join enabled, each bucket of table a is joined only with the buckets of table b that can contain the same keys (here 6 of b's 36 buckets, because 36 is a multiple of 6) rather than with all 36. If both tables have the same number of buckets (for example, 36 each) and are also sorted, that is, the following constraint is added after CLUSTERED BY(key) in the table definitions:
SORTED BY(key)
then the preceding JOIN statement can be executed as a sort-merge-bucket (SMB) join. The following parameters must also be set to change the default behavior, so that the optimized join traverses only the relevant buckets:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
 
The default values of the preceding three parameters are as follows:
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
hive.optimize.bucketmapjoin=false
hive.optimize.bucketmapjoin.sortedmerge=false
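When two corresponding buckets are both sorted on the join key, they can be joined with a single merge pass instead of building a hash table. The Python sketch below shows that merge step for one pair of buckets; it is an illustration of the general sort-merge technique with a hypothetical function name and sample data, not Hive's SMB join operator.

```python
def sort_merge_join(left, right):
    """Merge-join two lists of (key, value) pairs, each sorted by key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that carry this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row of this key with that run.
            while i < len(left) and left[i][0] == lk:
                for _, right_value in right[j:j_end]:
                    out.append((lk, left[i][1], right_value))
                i += 1
            j = j_end
    return out


bucket_a = [(1, "a1"), (2, "a2"), (2, "a2b"), (4, "a4")]
bucket_b = [(2, "b2"), (2, "b2b"), (3, "b3"), (4, "b4")]
merged = sort_merge_join(bucket_a, bucket_b)
print(merged)
```

Each input is read once from front to back, so memory use is bounded by the largest run of equal keys rather than by the size of a whole bucket.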

10) MapJoin restrictions

SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;

No reducer is required. Each mapper that processes table a reads table b in full.

MapJoin does not support any of the following:

• Union followed by a MapJoin

• Lateral View followed by a MapJoin

• Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) followed by MapJoin

• MapJoin followed by Union

• MapJoin followed by Join

• MapJoin followed by MapJoin

When the parameter hive.auto.convert.join is set to true, joins are automatically converted to map joins where possible, and this setting should be used instead of the mapjoin hint.

The mapjoin hint should be used only for the following case:

All inputs are bucketed or sorted, and the join should be converted to a bucketized map-side join or a bucketized sort-merge join.

 

References

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior

 

