Suppose there are multiple statistics that are all computed from the same data and differ only in their GROUP BY key.
What can we do to make these statistics run as fast as possible?
For example, there are 10 statistics, each of which reads the same data but along a different dimension, that is, with a different GROUP BY key.
How would you handle this? The statement looks something like this:
FROM (
SELECT K1, K2, K3 FROM table
) TMP
INSERT OVERWRITE DIRECTORY ... SELECT K1, count(1) GROUP BY K1 USING 'replicated, skewed ....'
INSERT OVERWRITE DIRECTORY ... SELECT K2, count(1) GROUP BY K2

Although Hive scans the original table only once and then performs the operations that follow, basically each GROUP BY becomes its own MapReduce job. More importantly, once map-side optimization and data skew are taken into account, a single GROUP BY may require two MR jobs. Is there a better way?

Direct implementation using MR:
1. The input is the original table.
2. The key design is the crucial part. Each group's key inherits from a common base key, and this base key is the type of the map output key.
3. The partitioner works on the actual type of the key, so that identical keys are sent to the same reduce task for aggregation.
4. Change the file name for each result (the marker can be set in the first line of the reduce output).
-------------------
Defect: the number of reduce tasks must be greater than or equal to the number of groups, which is unfavorable when the amount of data is large.
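
The single-job idea above can be sketched roughly as follows. This is only an in-memory simulation in plain Python, not Hadoop code: the names GROUPS, map_row, partition and run_job are made up for illustration, and a (group, value) tuple stands in for the base-key/sub-key class hierarchy. It assumes each group owns an equal block of reducers, which is one possible way to satisfy "reducers >= groups".

```python
import zlib
from collections import defaultdict

# Hypothetical list of statistics: one per GROUP BY key.
GROUPS = ["K1", "K2", "K3"]


def map_row(row):
    """Map step: emit one tagged key per group for every input row.

    The (group, value) tuple plays the role of a key subclass that
    shares a common base key type.
    """
    for group in GROUPS:
        yield (group, row[group]), 1


def partition(tagged_key, num_reducers):
    """Route a tagged key to a reducer based on its group and value.

    Each group owns a contiguous block of reducers (an assumption of
    this sketch), so results of different groups never share a reducer.
    """
    if num_reducers < len(GROUPS):
        raise ValueError("need at least one reducer per group")
    group, value = tagged_key
    per_group = num_reducers // len(GROUPS)
    # crc32 gives a hash that is stable across runs, unlike hash() on str.
    offset = zlib.crc32(repr(value).encode("utf-8")) % per_group
    return GROUPS.index(group) * per_group + offset


def run_job(rows, num_reducers):
    """Simulate map, shuffle and reduce in a single pass over the rows."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, one in map_row(row):
            shuffled[partition(key, num_reducers)].append((key, one))

    # Reduce step: count per tagged key; results are kept per group,
    # standing in for one output directory per statistic.
    results = {group: {} for group in GROUPS}
    for reducer_id in sorted(shuffled):
        for (group, value), one in shuffled[reducer_id]:
            results[group][value] = results[group].get(value, 0) + one
    return results


if __name__ == "__main__":
    sample_rows = [
        {"K1": "a", "K2": "x", "K3": 1},
        {"K1": "a", "K2": "y", "K3": 1},
        {"K1": "b", "K2": "x", "K3": 2},
        {"K1": "a", "K2": "x", "K3": 2},
    ]
    print(run_job(sample_rows, num_reducers=6))
```

The input is read once, but the map output grows with the number of groups (one record per group per row), and a skewed key still lands on a single reducer, so this sketch does not address the skew concern raised in the question.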