017-hadoop Hive SQL Syntax 7: deduplication and sorting, data skew

Source: Internet
Author: User

1. Deduplication and sorting

1.1. Deduplication

DISTINCT and GROUP BY

Avoid using DISTINCT for deduplication where possible, especially on large tables; use GROUP BY instead.

-- Not recommended
SELECT DISTINCT key FROM a;
-- Recommended
SELECT key FROM a GROUP BY key;
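The same idea is often applied to counting distinct values. The following is a minimal sketch, assuming the table a and column key from the example above; the subquery alias t is made up for illustration. In many Hive configurations a bare COUNT(DISTINCT ...) funnels all keys through one reducer, while the GROUP BY form lets several reducers deduplicate in parallel before the final count. Whether this helps depends on the Hive version and optimizer settings, so treat it as something to measure rather than a guarantee.

```sql
-- Single-stage form: deduplication and counting happen together.
SELECT COUNT(DISTINCT key) FROM a;

-- Two-stage form: deduplicate with GROUP BY first, then count the groups.
-- COUNT(DISTINCT key) ignores NULLs, so NULL keys are filtered out here
-- to keep the two results identical.
SELECT COUNT(*)
FROM (
  SELECT key
  FROM a
  WHERE key IS NOT NULL
  GROUP BY key
) t;
```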

1.2. Sorting optimization

Only ORDER BY produces a globally ordered result; choose the sorting clause that matches what the scenario actually requires.

1, ORDER BY produces a global ordering using a single reducer; because the sort cannot run in parallel, it is inefficient.

2, SORT BY produces a partial ordering: the output of each individual reducer is sorted, but the combined result is not. It is efficient and is usually used together with the DISTRIBUTE BY keyword.

(The DISTRIBUTE BY keyword specifies the key used to distribute rows from the map side to the reduce side.)

3, CLUSTER BY col1 is equivalent to DISTRIBUTE BY col1 SORT BY col1, but the sort direction cannot be specified.
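The three options above can be sketched as follows. The table orders and its columns user_id and amount are hypothetical names used only for illustration.

```sql
-- Global ordering: all rows pass through a single reducer.
SELECT user_id, amount
FROM orders
ORDER BY amount DESC;

-- Partial ordering: rows with the same user_id go to the same reducer,
-- and each reducer's output is sorted by amount in descending order.
SELECT user_id, amount
FROM orders
DISTRIBUTE BY user_id
SORT BY amount DESC;

-- Shorthand for DISTRIBUTE BY user_id SORT BY user_id;
-- no ASC or DESC can be given here.
SELECT user_id, amount
FROM orders
CLUSTER BY user_id;
```

If a globally sorted result is not required downstream, the second or third form avoids the single-reducer bottleneck.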

2. Data skew

The job stays at 99% (or 100%) for a long time. The task monitoring page shows that only a small number (one or several) of the reduce subtasks have not completed, because the amount of data they process is far larger than that of the other reducers.

The number of records in a single reducer differs too much from the average number of records, typically by 30 times or more. The longest reducer running time is much longer than the average running time.
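One way to check whether a few keys account for the imbalance is to count records per key and look at the largest groups. This is a minimal sketch, assuming the skewed job groups or joins on a column named key in a table named a (both reused from the earlier example); the limit of 20 is arbitrary.

```sql
-- Show the keys with the most records; a handful of keys with counts
-- far above the rest suggests those keys are overloading single reducers.
SELECT key, COUNT(*) AS cnt
FROM a
GROUP BY key
ORDER BY cnt DESC
LIMIT 20;
```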
