Hive query optimization Summary

Source: Internet
Author: User
I. Join Optimization

Basic Principles of join search: You should place tables/subqueries with fewer entries on the left of the join operator. The reason is that in the reduce stage of the join operation, the content of the table on the left of the join operator is loaded into the memory, and the table with fewer entries is placed on the left, this effectively reduces the chance of memory overflow errors.

If multiple joins exist in the join query operation, and all the tables involved in the join have the same keys involved in the join operation, all joins are merged into a mapred program.

Case:

Select. val, B. val, C. val from a join B on (. key = B. key1) join C on (C. key = B. key1) execute join in a mapre Program

Select. val, B. val, C. val from a join B on (. key = B. key1) join C on (C. key = B. key2) execute join in two mapred programs

 

The key to map join is that the data volume of a table in the join operation is small. Case study:

Select/* + mapjoin (B) */a. Key, A. Value

From a join B on A. Key = B. Key

The restriction of mapjoin is that a full/right Outer Join B cannot be executed. The hive parameter related to map join is hive. join. emit. interval hive. mapjoin. size. key hive. mapjoin. cache. numrows

Since the join operation is performed before the where operation, the where condition does not reduce the join data when you perform the join operation. case:

Select a. Val, B. Val from a left Outer Join B on (A. Key = B. Key)

Where a. DS = '2017-07-07 'and B. DS = '2017-07-07'

It is best to modify it:

Select a. Val, B. Val from a left Outer Join B

On (A. Key = B. Key and B. DS = '2017-07-07 'and A. DS = '2017-07-07 ')

In every mapred program of the join operation, hive will stream the data of the table that appears in the join statement relative to the back, and the data that changes relative to the front will be cached in the memory. Of course, you can also manually specify the stream-based table: Select/* + streamtable (a) */. val, B. val, C. val from a join B on (. key = B. key1) join C on (C. key = B. key1)

 

 

Ii. Group by optimization

Map-side aggregation: first, preliminary aggregation is performed on the map-side, and finally the final result is obtained on the reduce side. related parameters:

· Hive. Map. aggr = true: whether to aggregate data on the map end. The default value is true.

· Hive. groupby. mapaggr. checkinterval = 100000 number of entries for aggregation on the map end

Optimize data skew aggregation by setting the parameter hive. groupby. skewindata = true. When the option is set to true, two Mr Jobs are generated for the generated query plan. In the first Mr job, the map output result set is randomly distributed to reduce. Each reduce performs partial aggregation and outputs the result, in this way, the same group by key may be distributed to different reduce tasks to achieve load balancing; the second Mr job is then distributed to reduce according to the pre-processed data results according to the group by key (this process can ensure that the same group by key is distributed to the same reduce ), finally, the final aggregation operation is completed.

 

3. Merge small files

An excessive number of files puts pressure on HDFS and affects processing efficiency. You can merge the map and reduce result files to eliminate this impact:

· Hive. Merge. mapfiles = true: whether to merge with the map output file. The default value is true.

· Hive. Merge. mapredfiles = false: whether to merge reduce output files. The default value is false.

· Hive. Merge. Size. Per. Task = 256*1000*1000 size of the merged File

Iv. Hive implementation (not) in

Query by left Outer Join (assume that table B contains another field key1

Select a. Key from a left Outer Join B on A. Key = B. Key where B. key1 is null

Implement in through left semi join

Select a. Key, A. Val from a left semi join B on (A. Key = B. Key)

Restrictions on left semi join: The table on the right of the join condition can only appear in the join condition.

 

V. Sorting Optimization

Order by implements global sorting and one reduce operation, with low efficiency

Sort by is partially ordered. The output result of a single reduce is ordered with high efficiency. It is usually used together with the distrisponby keyword (the distrisponby keyword can specify the distribution key from map to reduce)

Cluster by col1 is equivalent to distribute by col1 sort by col1

 

6. Use partitions

Each partition in hive corresponds to a directory in HDFS. The partition column is not an actual field in the table, but one or more pseudo columns, in fact, table data files do not store partition column information and data. The partition keyword is prefixed to the primary partition (only one), followed by the secondary Partition

Static partitions: when loading data and using static partitions, you must specify

Case study: (stat_date = '123', Province = 'hunanc ')

Dynamic partition: Use a dynamic partition to set hive.exe C. dynamic. the value of partition is true, and the default value is false. By default, hive assumes that the primary partition is a static partition, and the secondary partition uses dynamic partitions. If you want to use dynamic partitions, set hive.exe C. dynamic. partition. mode = nostrick. The default value is strick.

Case study: (stat_date = '123', Province)

 

VII. Use distinct

Hive supports multiple distinct operations on the same column in group by, but does not support distinct operations on multiple columns in the same statement.

 

8. hql uses a custom mapred script

Note: When using a custom mapred script, the keyword map reduce is the statement select transform (...) syntax conversion does not mean that a new map process will be generated when the map keyword is used, and a red process will be generated when the reduce keyword is used.

Custom mapred scripts can be more complex functions completed by hql statements, but the performance is worse than that of hql statements. Avoid using them whenever possible, replace the custom mapred script with the udtf function.

 

9. udtf

Udtf converts a single input row to multiple output rows. When udtf is used, the SELECT statement cannot contain other columns. udtf does not support nesting, it does not support group by, sort by, and other statements. To avoid the preceding restrictions, use the lateral view syntax. Example:

Select a. timestamp, get_json_object (A. appevents, '$. eventid'), get_json_object (A. appenvets,' $. eventname') from log;

Select a. timestamp, B .*

From log a lateral view json_tuple (A. appevent, 'eventid', 'eventname') B as F1, F2;

 

Specifically, get_json_object is the UDF function, and json_tuple is the udtf function.

Udtf functions can greatly improve the performance of hql statements in some application scenarios, such as scenarios where JSON or XML data needs to be parsed multiple times.

 

10. Aggregate functions count and sum

The count and sum functions may be the two most frequently used Aggregate functions in hql statements. However, in hive, the count function supports conditional filtering when calculating distinct values.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.