Hive Query Optimization Summary

Source: Internet
Author: User
Tags joins
Hive Query Optimization Summary

Storage, learning, sharing one, join optimization

The basic principle of join lookup operations: You should place a table/subquery with fewer entries on the left side of the join operator. The reason is that in the reduce phase of the join operation, the contents of the table on the left side of the join operator are loaded into memory, and the table with fewer entries is left, which effectively reduces the chance of memory overflow errors.

If multiple joins exist in a join lookup operation, and all of the tables participating in the join have the same key that participates in the join, all joins are merged into a single mapred program.

Case:

SELECT A.val, B.val, c.val from a join B on (A.key = b.key1) join C on (C.key = b.key1) Execute join in a mapre program

SELECT A.val, B.val, c.val from a join B on (A.key = b.key1) join C on (C.key = b.key2) perform join in two mapred programs

The key to a MAP join is the small amount of data in a table in a join operation, case:

SELECT/*+ Mapjoin (b) */A.key, A.value

From a join B on a.key = B.key

The mapjoin limitation is that a full/right OUTER join B cannot be performed, and the hive parameters associated with the map join are: Hive.join.emit.interval Hive.mapjoin.size.key Hive.mapjoin.cache.numrows

Because the join operation is performed before the where operation, the where condition does not serve to reduce the join data when you execute the join; case:

SELECT A.val, B.val from a left OUTER JOIN B on (A.key=b.key)

WHERE a.ds= ' 2009-07-07 ' and b.ds= ' 2009-07-07 '

Best modified to:

SELECT A.val, B.val from a left OUTER JOIN b

On (A.key=b.key and b.ds= ' 2009-07-07 ' and a.ds= ' 2009-07-07 ')

In each of the mapred programs in the join operation, hive streams the data from the table that appears in the join statement, and caches the data relative to the previous variable in memory. Of course, you can also manually specify a stream table: SELECT/*+ streamtable (a) */A.val, B.val, c.val from a join B on (A.key = b.key1) join C on (C.key = B.key1) Second, group by optimization

Map-side aggregation, first on the map end of the initial aggregation, and finally in the reduce end of the final results, related parameters:

· Hive.map.aggr = True if aggregation is done on the map side, which defaults to True

· Hive.groupby.mapaggr.checkinterval = 100000 number of entries in the Map end for aggregation operations

Data skew aggregation optimization, set parameter Hive.groupby.skewindata = True, the selected item is set to True, the generated query plan will have two MR jobs. In the first MR Job, the output set of MAP is randomly distributed to reduce, and each reduce does a partial aggregation operation and outputs the result so that the same Group by Key may be distributed to different reduce to achieve load balancing purposes; Two MR jobs are then distributed to reduce based on the preprocessed data results (this process guarantees that the same group by key is distributed to the same reduce), and finally the final aggregation operation is completed. Iii. Merging of small documents

An excessive number of files can put pressure on HDFS and affect processing efficiency by merging the results files of Map and Reduce to eliminate this effect:

· Hive.merge.mapfiles = True if and Map output file, default to True

· Hive.merge.mapredfiles = False if the Reduce output file is merged, the default is False

· Hive.merge.size.per.task = 256*1000*1000 the size of the merged file four, hive implementation (not) in

Query with the LEFT OUTER join (assuming that table B contains another field Key1

Select A.key from a LEFT outer join B in A.key=b.key where b.key1 is null

Implemented in by the left semi join

SELECT A.key, A.val from a left SEMI JOIN b on (A.key = B.key)

Limitations of the left semi join: The right-hand table in the join condition can only appear in the join condition. Five, sorting optimization

Order BY to achieve global ordering, a reduce implementation, low efficiency

The Sort by implementation is partially ordered, and the results of a single reduce output are ordered and efficient, and are usually used with the distribute by keyword (the Distribute by keyword can specify a map to the reduce-side distribution key)

CLUSTER by col1 equivalent to distribute by col1 SORT by col1 VI, using partitions

Each partition in hive corresponds to a directory on HDFs, and the partition column is not an actual field in the table, but one or more pseudo-columns, which in fact do not save the information and data for the partitioned column in the table's data file. The partition keyword is preceded by the primary partition (only one), followed by the secondary partition

Static partitioning: Static partitions need to be specified in SQL statements when loading data and using them

Case: (stat_date= ' 20120625 ', province= ' Hunan ')

Dynamic partitioning: Use dynamic partitioning to set the Hive.exec.dynamic.partition parameter value to True, the default value is False, and by default, Hive assumes that the primary partition is statically partitioned, the secondary partition uses dynamic partitioning, and if you want to use dynamic partitioning, you need to set Hive.exec.dynamic.partition.mode=nostrick, default is Strick

Case: (stat_date= ' 20120625 ', province) VII, DISTINCT use

Hive supports multiple distinct operations on the same column at group by, but does not support distinct operations on more than one column in the same statement. Viii. hql using a custom mapred script

Caveats: When using a custom mapred script, the keyword map REDUCE is the statement select TRANSFORM (...) Syntax conversion does not mean that a new map process is forced when using the Map keyword, and a red process is generated when using the Reduce keyword.

A custom mapred script can be a more complex function of the HQL statement, but performance is a bit worse than the HQL statement, and should be avoided as much as possible, using the UDTF function to replace the custom mapred script, if available nine, UDTF

UDTF converts a single input row into multiple output lines, and when using UDTF, the SELECT statement cannot contain additional columns, UDTF does not support nesting, and does not support the group by, sort by, and so on. If you want to avoid these limitations, you need to use the lateral view syntax, case:

Select A.timestamp, Get_json_object (a.appevents, ' $.eventid '), Get_json_object (a.appenvets, ' $.eventname ') from log A;

Select A.timestamp, b.*

From log a lateral view json_tuple (a.appevent, ' EventID ', ' eventname ') b as F1, F2;

Where Get_json_object is the UDF function, Json_tuple is the UDTF function.

The UDTF function can greatly improve the performance of HQL statements in some scenarios, such as applications that need to parse JSON or XML data multiple times. 10. Aggregate function count and sum

The count and sum functions are probably the most frequent two aggregate functions used in HQL statements, but the Count function in hive supports conditional filtering when calculating distinct value.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.