First, data deduplication
1.1. Deduplication
DISTINCT and GROUP BY
Try to avoid using DISTINCT for deduplication, especially in operations on large tables; use GROUP BY instead.

-- Not recommended
select distinct key from a;

-- Recommended
select key from a group by key;
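The same principle applies when counting distinct values. The sketch below (reusing the table a and column key from the example above, as an illustration) rewrites COUNT(DISTINCT) as a GROUP BY subquery so the deduplication work can be spread across multiple reducers instead of a single one.

-- count(distinct) pushes all distinct keys through one reducer
select count(distinct key) from a;

-- Rewritten: the inner group by can run on many reducers,
-- and the outer count only aggregates the already-deduplicated rows
select count(1) from (select key from a group by key) t;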
1.2. Sorting optimization
Only ORDER BY produces a globally ordered result; choose the sorting method according to the actual scenario.
1. ORDER BY achieves global ordering with a single reduce task; because it cannot run in parallel, it is inefficient.
2. SORT BY achieves partial ordering: the output of each individual reduce task is sorted. It is efficient and is usually used together with the DISTRIBUTE BY keyword.
(DISTRIBUTE BY specifies the key by which map output is distributed to the reduce side.)
3. CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1, but the sort direction (ascending/descending) cannot be specified; see the sketch after this list.
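A minimal sketch of the three forms, assuming a table a with a column col1 as in the items above:

-- order by: one reducer sorts the whole result set (globally ordered)
select * from a order by col1;

-- distribute by + sort by: rows with the same col1 go to the same reducer,
-- and each reducer's output is sorted by col1 (partially ordered)
select * from a distribute by col1 sort by col1;

-- cluster by: shorthand for the statement above, ascending order only
select * from a cluster by col1;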
Second, data skew
The job sits at 99% (or 100%) progress for a long time; the task monitoring page shows that only a small number (one or a few) of reduce subtasks remain unfinished, because the amount of data they process differs too much from that of the other reducers.
The record count of a single reduce task differs too much from the average record count, typically by 30 times or more, and the longest task duration is much longer than the average duration.
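A quick way to confirm this kind of skew is to count rows per key. The sketch below reuses the table and column names a and key from the earlier example as an assumption; it simply surfaces the keys whose row counts are far above the rest.

-- Keys with counts far above the others will all land on one reducer
select key, count(1) as cnt
from a
group by key
order by cnt desc
limit 20;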