Hive Query Optimization Summary
Storage, learning, sharing one, join optimization
The basic principle of join lookup operations: You should place a table/subquery with fewer entries on the left side of the join operator. The reason is that in the reduce phase of the join operation, the contents of the table on the left side of the join operator are loaded into memory, and the table with fewer entries is left, which effectively reduces the chance of memory overflow errors.
If multiple joins exist in a join lookup operation, and all of the tables participating in the join have the same key that participates in the join, all joins are merged into a single mapred program.
Case:
SELECT A.val, B.val, c.val from a join B on (A.key = b.key1) join C on (C.key = b.key1) Execute join in a mapre program
SELECT A.val, B.val, c.val from a join B on (A.key = b.key1) join C on (C.key = b.key2) perform join in two mapred programs
The key to a MAP join is the small amount of data in a table in a join operation, case:
SELECT/*+ Mapjoin (b) */A.key, A.value
From a join B on a.key = B.key
The mapjoin limitation is that a full/right OUTER join B cannot be performed, and the hive parameters associated with the map join are: Hive.join.emit.interval Hive.mapjoin.size.key Hive.mapjoin.cache.numrows
Because the join operation is performed before the where operation, the where condition does not serve to reduce the join data when you execute the join; case:
SELECT A.val, B.val from a left OUTER JOIN B on (A.key=b.key)
WHERE a.ds= ' 2009-07-07 ' and b.ds= ' 2009-07-07 '
Best modified to:
SELECT A.val, B.val from a left OUTER JOIN b
On (A.key=b.key and b.ds= ' 2009-07-07 ' and a.ds= ' 2009-07-07 ')
In each of the mapred programs in the join operation, hive streams the data from the table that appears in the join statement, and caches the data relative to the previous variable in memory. Of course, you can also manually specify a stream table: SELECT/*+ streamtable (a) */A.val, B.val, c.val from a join B on (A.key = b.key1) join C on (C.key = B.key1) Second, group by optimization
Map-side aggregation, first on the map end of the initial aggregation, and finally in the reduce end of the final results, related parameters:
· Hive.map.aggr = True if aggregation is done on the map side, which defaults to True
· Hive.groupby.mapaggr.checkinterval = 100000 number of entries in the Map end for aggregation operations
Data skew aggregation optimization, set parameter Hive.groupby.skewindata = True, the selected item is set to True, the generated query plan will have two MR jobs. In the first MR Job, the output set of MAP is randomly distributed to reduce, and each reduce does a partial aggregation operation and outputs the result so that the same Group by Key may be distributed to different reduce to achieve load balancing purposes; Two MR jobs are then distributed to reduce based on the preprocessed data results (this process guarantees that the same group by key is distributed to the same reduce), and finally the final aggregation operation is completed. Iii. Merging of small documents
An excessive number of files can put pressure on HDFS and affect processing efficiency by merging the results files of Map and Reduce to eliminate this effect:
· Hive.merge.mapfiles = True if and Map output file, default to True
· Hive.merge.mapredfiles = False if the Reduce output file is merged, the default is False
· Hive.merge.size.per.task = 256*1000*1000 the size of the merged file four, hive implementation (not) in
Query with the LEFT OUTER join (assuming that table B contains another field Key1
Select A.key from a LEFT outer join B in A.key=b.key where b.key1 is null
Implemented in by the left semi join
SELECT A.key, A.val from a left SEMI JOIN b on (A.key = B.key)
Limitations of the left semi join: The right-hand table in the join condition can only appear in the join condition. Five, sorting optimization
Order BY to achieve global ordering, a reduce implementation, low efficiency
The Sort by implementation is partially ordered, and the results of a single reduce output are ordered and efficient, and are usually used with the distribute by keyword (the Distribute by keyword can specify a map to the reduce-side distribution key)
CLUSTER by col1 equivalent to distribute by col1 SORT by col1 VI, using partitions
Each partition in hive corresponds to a directory on HDFs, and the partition column is not an actual field in the table, but one or more pseudo-columns, which in fact do not save the information and data for the partitioned column in the table's data file. The partition keyword is preceded by the primary partition (only one), followed by the secondary partition
Static partitioning: Static partitions need to be specified in SQL statements when loading data and using them
Case: (stat_date= ' 20120625 ', province= ' Hunan ')
Dynamic partitioning: Use dynamic partitioning to set the Hive.exec.dynamic.partition parameter value to True, the default value is False, and by default, Hive assumes that the primary partition is statically partitioned, the secondary partition uses dynamic partitioning, and if you want to use dynamic partitioning, you need to set Hive.exec.dynamic.partition.mode=nostrick, default is Strick
Case: (stat_date= ' 20120625 ', province) VII, DISTINCT use
Hive supports multiple distinct operations on the same column at group by, but does not support distinct operations on more than one column in the same statement. Viii. hql using a custom mapred script
Caveats: When using a custom mapred script, the keyword map REDUCE is the statement select TRANSFORM (...) Syntax conversion does not mean that a new map process is forced when using the Map keyword, and a red process is generated when using the Reduce keyword.
A custom mapred script can be a more complex function of the HQL statement, but performance is a bit worse than the HQL statement, and should be avoided as much as possible, using the UDTF function to replace the custom mapred script, if available nine, UDTF
UDTF converts a single input row into multiple output lines, and when using UDTF, the SELECT statement cannot contain additional columns, UDTF does not support nesting, and does not support the group by, sort by, and so on. If you want to avoid these limitations, you need to use the lateral view syntax, case:
Select A.timestamp, Get_json_object (a.appevents, ' $.eventid '), Get_json_object (a.appenvets, ' $.eventname ') from log A;
Select A.timestamp, b.*
From log a lateral view json_tuple (a.appevent, ' EventID ', ' eventname ') b as F1, F2;
Where Get_json_object is the UDF function, Json_tuple is the UDTF function.
The UDTF function can greatly improve the performance of HQL statements in some scenarios, such as applications that need to parse JSON or XML data multiple times. 10. Aggregate function count and sum
The count and sum functions are probably the most frequent two aggregate functions used in HQL statements, but the Count function in hive supports conditional filtering when calculating distinct value.