In Hive, we often have this requirement: group rows by id and deduplicate the values of another column within each group. For example, suppose we have the following data:
Columns: id, pic. Sample pic values: 1.jpg, 2.jpg, 1.jpg
Here, neither DISTINCT nor a GROUP BY on both columns does the job, because both return one row per (id, pic) pair rather than one row per id. Instead, we can use the UDAF collect_set(col): within each GROUP BY group it collects the values of col, removes duplicates, and returns the result as an array.
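As a minimal sketch of the behaviour described above, the query below builds a small inline data set and applies COLLECT_SET per id. The id values (1, 2, 1) and the alias names t and pics are assumptions made for illustration only; they are not taken from the original data.

```sql
-- Hypothetical sample rows: the id values are assumed for illustration.
SELECT id,
       COLLECT_SET(pic) AS pics   -- array of distinct pic values per id
FROM (
    SELECT 1 AS id, '1.jpg' AS pic
    UNION ALL
    SELECT 2 AS id, '2.jpg' AS pic
    UNION ALL
    SELECT 1 AS id, '1.jpg' AS pic
) t
GROUP BY id;
```

Each id produces a single row whose pics array holds every distinct pic once. The order of elements inside the array should not be relied on.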
As another example, we can deduplicate the pic values and concatenate them into one string:
SELECT id, CONCAT_WS(',', COLLECT_SET(pic)) FROM tbl GROUP BY id;
Here CONCAT_WS is a UDF and COLLECT_SET is a UDAF. COLLECT_SET deduplicates the pic values in each group and returns them as an array, which CONCAT_WS then joins with the given separator.
PS: If you do not want duplicates removed, use COLLECT_LIST instead; it keeps every value, including repeats.
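To make the difference between the two aggregates concrete, here is a small sketch that compares array sizes on the same inline data. The id values and the aliases t, distinct_cnt, and total_cnt are hypothetical, and COLLECT_LIST is assumed to be available in the Hive version in use (it was added in Hive 0.13).

```sql
-- Hypothetical sample rows: id 1 has the same pic twice, id 2 has one pic.
SELECT id,
       SIZE(COLLECT_SET(pic))  AS distinct_cnt,  -- duplicates removed
       SIZE(COLLECT_LIST(pic)) AS total_cnt      -- duplicates kept
FROM (
    SELECT 1 AS id, '1.jpg' AS pic
    UNION ALL
    SELECT 2 AS id, '2.jpg' AS pic
    UNION ALL
    SELECT 1 AS id, '1.jpg' AS pic
) t
GROUP BY id;
-- For id 1: distinct_cnt = 1, total_cnt = 2
-- For id 2: distinct_cnt = 1, total_cnt = 1
```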
For more UDAFs, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Original article: Deduplication of Group By in Hive. Thanks to the original author for sharing.