Previously, I sorted out the GROUPING SETS clause features introduced by Hive 0.10, and performed simple syntax usage experience and data verification. However, I didn't notice a little complicated at the time. Then, I took a proper shot in the actual use process.
This is the case when a JOIN operation occurs: I want to count the UV and VV data for the fields a, B, and c in data_table of Hive tables, the statistical data from c to B and then to a is required. At the same time, we need to map the value of field B through JOIN with another table that specifically describes the value of B. The HQL statement is as follows:
Select t1.a _ desc, if (t1.group _ bitvector = 1, 'undistinguished ', t2. B _ desc) as B _desc, if (t1.group _ bitvector = 1 or t1.group _ bitvector = 3, 'undistinguished ', t1.c _ desc) as c_desc, t1.group _ bitvector, t1.act _ uv, t1.act _ vvfrom (select a_desc, B, c_desc, grouping _ id as group_bitvector, count (distinctuer_ID) as act_uv, (sum (if (vv_ID is null, 1, 0) + sum (if (vv_ID = "", 1, 0 )) + count (distinct if (vv_ID is not null, if (vv_ID! = "", Vv_ID, null), null ))) as act_vv from (select (case a when 0 then 'str _ val_1 'when 1 then' str _ val_2 'when 2 then' str _ val_3 'else' str _ val_4 'end) as a_desc, B, if (c =-2 or c =-1 or c = 9, 'c _ desc1', 'c _ desc2') as c_desc, uer_ID, vv_ID from data_table where a = xxx) t group by a_desc, B, c_desc grouping sets (a_desc, (a_desc, B), (a_desc, B, c_desc ))) t1 join dim_table t2 on (t1. B = t2. B)
When we run the data, we can see that it is normal. If we look at it carefully, we can't stand the test of our eyes. The group_bitvector column seems to be missing a value, so that grouping sets will come down, group_bitvector should have the values 1, 3, and 7, but the data is not 1. That is to say, the top-level aggregation (only performing aggregation on a) is missing!
After thinking hard, I found this subtle mistake. JOIN threw away some data! Does grouping sets set the columns not involved in aggregation in the group by clause to NULL? These NULL values are ruthlessly discarded during JOIN... At this time, left outer join stood up bravely. It said: As long as I replace the original JOIN (internal JOIN), we can effectively eliminate the problem of data filtering:
Select t1.a _ desc, if (t1.group _ bitvector = 1, 'undistinguished ', t2. B _ desc) as B _desc, if (t1.group _ bitvector = 1 or t1.group _ bitvector = 3, 'undistinguished ', t1.c _ desc) as c_desc, t1.group _ bitvector, t1.act _ uv, t1.act _ vvfrom (select a_desc, B, c_desc, grouping _ id as group_bitvector, count (distinctuer_ID) as act_uv, (sum (if (vv_ID is null, 1, 0) + sum (if (vv_ID = "", 1, 0 )) + count (distinct if (vv_ID is not null, if (vv_ID! = "", Vv_ID, null), null ))) as act_vv from (select (case a when 0 then 'str _ val_1 'when 1 then' str _ val_2 'when 2 then' str _ val_3 'else' str _ val_4 'end) as a_desc, B, if (c =-2 or c =-1 or c = 9, 'c _ desc1', 'c _ desc2') as c_desc, uer_ID, vv_ID from data_table where a = xxx) t group by a_desc, B, c_desc grouping sets (a_desc, (a_desc, B), (a_desc, B, c_desc ))) t1 left outer join dim_table t2 on (t1. B = t2. B)
In this scenario, the usefulness of the grouping _ id function is also apparent. Without it, the lack of data is more likely to be ignored...
Finally, if you JOIN multiple or multiple columns, pay special attention to this issue ~~~ With the return values of left outer join and groupingconsumer ID, you can still clearly express the need to query data and differentiate groups.
Finally, there is a small mine: the return value of gouping _ id is actually a string type, rather than an integer! Sometimes Hive will automatically parse integer and string types for us. After all, many string operations are similar to integer operations. But sometimes Hive won't handle it for us, such as using case... When... Then... When the end clause is used, the field data type must be strictly the same; otherwise, an error will be reported. -- although this error prompt is obvious, it is easier to troubleshoot.