Deduplication with Group By in Hive

Source: Internet
Author: User

In Hive, a common requirement is to group rows by id and deduplicate the values of another field within each group. For example, take a table with two columns, id and pic, where the pic values are:

1.jpg
2.jpg
1.jpg

Neither DISTINCT nor a GROUP BY on both columns returns a single row per id here. Instead, use the UDAF collect_set(col): it collects the values of col within each group, removes duplicates, and returns the result as an array.
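The behaviour described above can be sketched as a query. This is a minimal illustration, assuming a table named tbl with an integer column id and a string column pic. The column types and the sample rows in the comments are assumptions, not taken from the original data.

```sql
-- Hypothetical table: tbl(id INT, pic STRING)
-- Assumed sample rows for one id: (1, '1.jpg'), (1, '2.jpg'), (1, '1.jpg')

-- Returns one row per id. pics is an array<string> that holds each
-- distinct pic value once; the order of elements inside the array
-- is not guaranteed.
SELECT id,
       COLLECT_SET(pic) AS pics
FROM tbl
GROUP BY id;
```

With the assumed rows, the array for id 1 would hold two elements, 1.jpg and 2.jpg, since the repeated 1.jpg is collapsed.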

As another example, the deduplicated pic values can be joined into a single string:
SELECT id, CONCAT_WS(',', COLLECT_SET(pic)) FROM tbl GROUP BY id;
Here CONCAT_WS is a UDF and COLLECT_SET is a UDAF. COLLECT_SET deduplicates the pic values in each group and returns them as an array, which the UDF can then consume.

PS: If you want to keep duplicates, use COLLECT_LIST instead.
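To make the difference between the two collecting UDAFs concrete, the following sketch runs both side by side. It again assumes the hypothetical table tbl(id, pic) and the sample rows named in the comments. Note that COLLECT_LIST keeps every value, including repeats, while COLLECT_SET keeps each distinct value once.

```sql
-- Hypothetical table: tbl(id INT, pic STRING)
-- Assumed sample rows for one id: (1, '1.jpg'), (1, '2.jpg'), (1, '1.jpg')

SELECT id,
       CONCAT_WS(',', COLLECT_SET(pic))  AS distinct_pics,   -- each pic once
       CONCAT_WS(',', COLLECT_LIST(pic)) AS all_pics,        -- repeats kept
       SIZE(COLLECT_SET(pic))            AS distinct_count,  -- 2 for the sample rows
       SIZE(COLLECT_LIST(pic))           AS total_count      -- 3 for the sample rows
FROM tbl
GROUP BY id;
```

Neither function guarantees the order of the collected elements, so the joined strings may list the values in any order.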

For more UDAFs, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Original article: Deduplication of Group By in Hive. Thanks to the original author for sharing it.
