Grouping by multiple keys in Hive

Source: Internet
Author: User

Suppose several statistics have to be computed over the same data, and they differ only in their group by key.
What can we do to make these statistics run as fast as possible?

For example, there are 10 statistics. Each one reads the same data, but the dimension of each statistic is different, that is, each uses a different group by key.
How should this be handled? The statement is similar to this:

From (
Select K1, K2, K3 from table
) TMP

Insert directory... select K1, count(1) group by K1 using 'replicated, skewed ....'
Insert directory... select K2, count(1) group by K2

Although Hive scans the original table only once and then performs the operations that follow, basically each group by gets its own MapReduce job. More importantly, once map-side optimization and data skew are taken into account, a single group by may require two MR jobs. Is there a better way?

Direct implementation using MR:
1. The input is the original table.
2. Key design: the key of each group inherits from a base key, and this base key is the type of the map output key.
3. Partition design: partition by the actual type of the key, so that identical keys are sent to the same reduce task for aggregation.
4. Change the file name for each result (you can set a marker in the first line of the reduce output).

-------------------
Defect: the number of reduce tasks must be greater than or equal to the number of groups, which is unfavorable when the amount of data is large.
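The four MR steps above can be sketched in plain Python as a small in-memory simulation. This is only an illustration under assumptions, not Hadoop code. The names DIMENSIONS, map_row, partition, reduce_partition and run_job are all hypothetical, and the (tag, value) tuple stands in for a subclass of the base key.

```python
from collections import defaultdict

# Hypothetical dimensions: one tag per GROUP BY key, mapped to its column index.
DIMENSIONS = {"K1": 0, "K2": 1, "K3": 2}


def map_row(row):
    """Emit one ((tag, value), 1) pair per dimension for a single input row."""
    for tag, col in DIMENSIONS.items():
        yield (tag, row[col]), 1


def partition(key, tags):
    """Route a key by its tag (its 'actual type'), one reducer per dimension."""
    tag, _value = key
    return tags.index(tag)


def reduce_partition(pairs):
    """Sum the counts of each key seen by one reducer."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals


def run_job(rows):
    """Scan the input once and return {output_name: {value: count}}."""
    tags = list(DIMENSIONS)
    # Shuffle: one bucket per reducer, so reducers >= number of groupings.
    buckets = [[] for _ in tags]
    for row in rows:
        for key, count in map_row(row):
            buckets[partition(key, tags)].append((key, count))

    outputs = {}
    for tag, bucket in zip(tags, buckets):
        totals = reduce_partition(bucket)
        # The tag plays the role of the marker used to name the result file.
        outputs[f"group_by_{tag}"] = {value: n for (_t, value), n in totals.items()}
    return outputs


if __name__ == "__main__":
    sample = [("a", "x", 1), ("a", "y", 1), ("b", "x", 2)]
    for name, counts in run_job(sample).items():
        print(name, counts)
```

In this sketch every row is read once and fans out into one record per dimension, so the shuffle volume grows with the number of group by keys. Because the partitioner looks only at the tag, each dimension is aggregated by a single reducer, which is the defect noted above when the data is large.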
