Research on Big Data Deduplication in Hive

Source: Internet
Author: User

Existing-data table: store

Incremental table: incre

Fields:

1. p_key: primary key used for deduplication

2. w_sort: field used for sorting

3. info: other information

Method 1 (union all + row_number() over):

    insert overwrite table limao_store
    select p_key, sort_word
    from (
        select tmp1.*,
               row_number() over (distribute by p_key sort by sort_word desc) rownum
        from (
            select * from limao_store
            union all
            select * from limao_incre
        ) tmp1
    ) hh
    where hh.rownum = 1;

Analysis: this method sorts one long table (the union of both tables).

Method 2 (left outer join + union all):

Note: older Hive versions (before 0.13.0) do not support union all at the top level, so the union all must be wrapped in a subquery, and that subquery must have an alias.

    insert overwrite table limao_store
    select t.p_key, t.sort_word
    from (
        select s.p_key, s.sort_word
        from limao_store s
        left outer join limao_incre i on (s.p_key = i.p_key)
        where i.p_key is null
        union all
        select p_key, sort_word from limao_incre
    ) t;

Analysis: duplicate data inside incre cannot be detected. The method joins long tables, which doubles the row width.

Method 3 (left outer join + insert into):

    insert overwrite table store
    select s.*
    from store s
    left outer join incre i on (s.p_key = i.p_key)
    where i.p_key is null;

    insert into table jm_g_l_cust_secu_acct
    select * from jm_g_l_cust_secu_acct_tmp;

Analysis: insert into is not recommended. On HDFS, each insert into creates a new file in the table (or partition) directory to hold the inserted data. This fragments the table into many files and reduces its query efficiency later on.
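To make the first two methods concrete, here is a minimal sketch that runs them on a few made-up rows. It uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made only so the example can run anywhere: SQLite accepts partition by ... order by in the window clause but not Hive's distribute by ... sort by, and the queries select rather than insert overwrite. The last query shows why the anti-join filter has to be written with is null: a comparison written as = null is never true, so every store row would be silently dropped.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table limao_store (p_key text, sort_word text);
    create table limao_incre (p_key text, sort_word text);
    -- Made-up sample rows: key 'b' exists in both tables, 'c' is new.
    insert into limao_store values ('a', '1'), ('b', '1');
    insert into limao_incre values ('b', '2'), ('c', '1');
""")

# Method 1: merge everything, keep the top-ranked row per key.
METHOD_1 = """
    select p_key, sort_word from (
        select tmp1.*,
               row_number() over (partition by p_key
                                  order by sort_word desc) as rownum
        from (select * from limao_store
              union all
              select * from limao_incre) tmp1
    ) hh
    where hh.rownum = 1
    order by p_key
"""

# Method 2: keep store rows with no match in incre, then append all of incre.
METHOD_2 = """
    select t.p_key, t.sort_word from (
        select s.p_key as p_key, s.sort_word as sort_word
        from limao_store s
        left outer join limao_incre i on (s.p_key = i.p_key)
        where i.p_key is null
        union all
        select p_key, sort_word from limao_incre
    ) t
    order by t.p_key
"""

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

result_1 = run(METHOD_1)
result_2 = run(METHOD_2)
# Same query with the filter written as "= null": no store row survives.
broken = run(METHOD_2.replace("is null", "= null"))

print(result_1)  # [('a', '1'), ('b', '2'), ('c', '1')]
print(result_2)  # [('a', '1'), ('b', '2'), ('c', '1')]
print(broken)    # [('b', '2'), ('c', '1')]
```

On clean data like this, both methods produce the same merged table. With = null, key 'a' disappears without any error.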
Table creation statements:

    use nets_life;

    create table limao_store (p_key string, sort_word string)
        row format delimited fields terminated by ',' stored as textfile;

    create table limao_incre (p_key string, sort_word string)
        row format delimited fields terminated by ',' stored as textfile;

Summary: method 2 and method 3 follow the same logic. Method 3 is best avoided.

Method 2 and method 3 rely on two implicit assumptions:

1. When the incremental data (incre) conflicts with the existing data (store), the incremental data is always treated as the latest.

2. Neither the incremental table nor the existing table contains duplicate keys internally.

Method 1 makes neither assumption. It merges all rows and keeps, for each key, strictly the row that ranks first by the sort field.
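The difference shows up as soon as the data breaks those two assumptions. The following is a plain-Python model of the two merge rules, not HiveQL. The sample rows and the function names method_1 and method_2 are hypothetical, chosen only for illustration. Where two rows of a key tie on the sort value, this model keeps the first one seen, whereas row_number() may pick either.

```python
# Made-up rows of (p_key, sort value). Both assumptions are violated:
# incre holds an OLDER version of 'a' than store, and 'b' appears twice in incre.
store = [("a", 5), ("b", 1)]
incre = [("a", 3), ("b", 2), ("b", 4)]

def method_1(store, incre):
    """Model of union all + row_number(): keep the highest sort value per key."""
    best = {}
    for p_key, sort_word in store + incre:
        if p_key not in best or sort_word > best[p_key]:
            best[p_key] = sort_word
    return sorted(best.items())

def method_2(store, incre):
    """Model of left outer join + union all: incre always wins, no internal dedup."""
    incre_keys = {p_key for p_key, _ in incre}
    kept = [row for row in store if row[0] not in incre_keys]
    return sorted(kept + incre)

print(method_1(store, incre))  # [('a', 5), ('b', 4)]
print(method_2(store, incre))  # [('a', 3), ('b', 2), ('b', 4)]
```

Method 2 replaces the newer store row for 'a' with the older incremental one and keeps both copies of 'b'. Method 1 returns one row per key, the one with the highest sort value.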

Test results with 10 million rows in store and 1 million rows in incre:

Method 1: Time taken: 317.677 seconds

Method 2: Time taken: 106.032 seconds

Conclusion: method 2 takes significantly less time than method 1 (about one third in this test), but it cannot deduplicate within a single table. It can only deduplicate by comparing the two tables against each other.
