Causes of data skew:
The ultimate goal is to distribute the map output evenly across the reducers. Because of the limitations of hash algorithms, hashing the keys will always produce some degree of data skew. A great deal of experience shows that data skew is usually caused by human negligence or by business logic, and can be circumvented.
Solution ideas:
Hive executes in stages; the amount of data each map task processes depends on the reduce output of the previous stage, so distributing the data evenly across the reducers is the key to solving data skew.
Specific measures:
Memory and I/O optimization:
Driver table: use the large table as the driver table to prevent memory overflow. By default the right-most table in a join is the driver table; MAPJOIN ignores the join order and always uses the large table as the driver table; the STREAMTABLE hint can also designate the streamed table.
1. Mapjoin: a means of avoiding data skew
Mapjoin performs the join in the map phase: the small table is loaded entirely into memory, and each row of the other table is matched directly against the in-memory table during the map phase. Because the join happens in the map phase, the reduce phase is skipped and efficiency is much higher.
See "Hive: join problems" for concrete steps.
When joining multiple tables, place the small table on the left side of the JOIN and the large table on the right.
The data of the small table is cached in memory when such a join runs, which effectively reduces the chance of out-of-memory errors.
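A minimal sketch of a mapjoin (the table and column names are hypothetical). The `/*+ MAPJOIN(...) */` hint asks Hive to load the named table into memory and join in the map phase; newer Hive versions can also do this automatically when `hive.auto.convert.join=true`.

```sql
-- dim_city is a small lookup table; orders is large.
-- The hint loads dim_city into memory so the join runs map-side.
SELECT /*+ MAPJOIN(c) */
       o.order_id,
       c.city_name
FROM   orders o
JOIN   dim_city c
ON     o.city_id = c.city_id;
```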
2. Setting parameters
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
(There are other relevant parameters as well.)
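A sketch of how these two settings are applied to a skewed aggregation (the table and column names are hypothetical): `hive.map.aggr` enables partial aggregation on the map side, and `hive.groupby.skewindata` splits the GROUP BY into two MapReduce jobs so skewed keys are first spread randomly across reducers.

```sql
SET hive.map.aggr=true;            -- partial aggregation on the map side
SET hive.groupby.skewindata=true;  -- two-job GROUP BY for skewed keys

SELECT user_id, COUNT(*) AS pv
FROM   page_views
GROUP BY user_id;
```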
3. SQL adjustments
For example, when a GROUP BY over some dimension takes hours: replace COUNT(DISTINCT ...) with a SUM(...) ... GROUP BY rewrite to complete the calculation.
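The classic rewrite looks like this (table and column names are hypothetical): COUNT(DISTINCT) forces the distinct work through very few reducers, while GROUP BY in a subquery deduplicates in parallel first.

```sql
-- Instead of:
--   SELECT COUNT(DISTINCT uid) FROM logs;
-- deduplicate with GROUP BY in parallel, then count the groups:
SELECT SUM(1)
FROM (
  SELECT uid
  FROM   logs
  GROUP BY uid
) t;
```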
4. StreamTable
When a join is performed in the reducer, the small tables are buffered in memory and the large table is read as a stream.
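By default Hive streams the right-most table in the join and buffers the others; the `/*+ STREAMTABLE(...) */` hint overrides this, marking the named table as the one to stream (table names below are hypothetical):

```sql
-- Stream the large table b regardless of its position in the join;
-- the small table a is buffered in memory.
SELECT /*+ STREAMTABLE(b) */
       a.key, a.val, b.val
FROM   small_table a
JOIN   big_table b
ON     a.key = b.key;
```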
5. Indexes
Hive has provided bitmap indexes since version 0.8.0; for columns with few distinct values they can speed up the execution of GROUP BY queries.
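A sketch of creating a bitmap index (the table and index names are hypothetical). Bitmap indexes suit columns with few distinct values, such as a status flag; note that Hive's indexing feature was later removed in Hive 3.0.

```sql
CREATE INDEX idx_status
ON TABLE orders (status)
AS 'BITMAP'
WITH DEFERRED REBUILD;

-- Populate the index after creating it:
ALTER INDEX idx_status ON orders REBUILD;
```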
Other optimizations:
1. Column pruning: read only the columns needed for the output.
2. Predicate pushdown: filter data as early as possible (see Figure 7; the filter logic is applied first) to reduce the amount of data processed later.
3. Partition pruning: read only the files that match the partition criteria.
4. Map-join: for joins involving small tables, the join can be done in the map phase; see the map-join section 3.2.2.
5. Join reordering: when the join runs in the reducer, small tables are kept in memory and the large table is read as a stream.
6. Group-by optimization: use local (partial) aggregation, both hash-based and sort-based; for skewed keys (where the row count and data size per key are very uneven at the reducers), the work can be split into two MapReduce passes.
Hive's default configuration parameters are fairly conservative, so out-of-the-box efficiency can be poor; adjusting the configuration can make queries significantly more efficient. The following records a few parameters that have an important impact on query efficiency.
Parallel execution of nested SQL:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
Sorting optimization
ORDER BY achieves a global ordering but runs in a single reducer, so it is inefficient.
SORT BY achieves a partial ordering: the output of each individual reducer is sorted, which is efficient. It is usually used together with the DISTRIBUTE BY keyword (DISTRIBUTE BY specifies the key used to distribute map output to the reducers).
CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1.
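A sketch of the DISTRIBUTE BY / SORT BY combination (table and column names are hypothetical): rows with the same uid go to the same reducer, and each reducer's output is sorted.

```sql
SELECT uid, ts, url
FROM   clicks
DISTRIBUTE BY uid   -- route all rows for a uid to one reducer
SORT BY uid, ts;    -- sort within each reducer's output

-- When distributing and sorting on the same single column,
-- CLUSTER BY uid is the equivalent shorthand for
-- DISTRIBUTE BY uid SORT BY uid.
```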
Merging small files
An excessive number of small files puts pressure on HDFS and hurts processing efficiency; merging the output files of map and reduce tasks mitigates these effects.
hive.merge.mapfiles = true -- merge the output files of map-only jobs; default true
hive.merge.mapredfiles = false -- merge the output files of map-reduce jobs; default false
hive.merge.size.per.task = 256*1000*1000 -- target size of the merged files
These parameters are not listed in the table above because they can be set per task as needed rather than globally; sometimes a global setting actually hurts the performance of jobs over large files.
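As the text notes, these can be set at the session level for a single job rather than globally; a sketch (the target table and query are hypothetical):

```sql
SET hive.merge.mapfiles=true;            -- merge map-only output
SET hive.merge.mapredfiles=true;         -- also merge reduce output
SET hive.merge.size.per.task=256000000;  -- target merged file size, bytes

INSERT OVERWRITE TABLE daily_summary
SELECT dt, COUNT(*) FROM logs GROUP BY dt;
```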
Using partitions, RCFile, LZO, ORCFile, etc.
Each partition in Hive corresponds to a directory on HDFS. The partition columns are not actual fields of the table but one or more pseudo-columns; the values of the partition columns are not stored in the table's data files. In the partition specification, the first column is the primary partition (there is only one), and the columns after it are sub-partitions.
Static partitioning: the partition must be specified explicitly in the SQL statement both when loading data and when querying.
Example: (stat_date='20120625', province='Hunan')
Dynamic partitioning: to use dynamic partitioning, set hive.exec.dynamic.partition=true (the default is false). By default Hive assumes the primary partition is static and the sub-partitions are dynamic; to make all partitions dynamic, also set hive.exec.dynamic.partition.mode=nonstrict (the default is strict).
Example: (stat_date='20120625', province)
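A sketch of a dynamic-partition insert matching the example above (the table and column names are hypothetical): stat_date is given statically, and province is filled in dynamically from the last column of the SELECT list.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE user_stats
PARTITION (stat_date='20120625', province)
SELECT uid, pv, province        -- province maps to the dynamic partition
FROM   raw_stats
WHERE  dt = '20120625';
```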