Research on hive big data deduplication
Inventory table: limao_store
Incremental table: limao_incre
Fields:
1. p_key deduplication primary key
2. sort_word sort field
3. info other information
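Both tables are plain comma-delimited text files (see the creation statements after the methods), so test data can be loaded from local files. A minimal sketch, with hypothetical file paths:

load data local inpath '/tmp/limao_store.csv' overwrite into table limao_store;
load data local inpath '/tmp/limao_incre.csv' overwrite into table limao_incre;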
Method 1 (union all + row_number() over):

insert overwrite table limao_store
select p_key, sort_word
from (
    select tmp1.*,
           row_number() over (distribute by p_key sort by sort_word desc) rownum
    from (select * from limao_store
          union all
          select * from limao_incre) tmp1
) hh
where hh.rownum = 1;

Analysis: the entire merged (long) table has to be sorted.

Method 2 (left outer join + union all):

Note: hive does not support union all at the top level, and the union all result must be given an alias.

insert overwrite table limao_store
select t.p_key, t.sort_word
from (
    select s.p_key, s.sort_word
    from limao_store s
    left outer join limao_incre i on (s.p_key = i.p_key)
    where i.p_key is null
    union all
    select p_key, sort_word
    from limao_incre
) t;

Analysis: a join over the long table, which doubles the table width during the join, and duplicate data inside limao_incre itself cannot be identified.

Method 3 (left outer join + insert):

insert overwrite table limao_store
select s.*
from limao_store s
left outer join limao_incre i on (s.p_key = i.p_key)
where i.p_key is null;

insert into table jm_g_l_cust_secu_acct
select * from jm_g_l_cust_secu_acct_tmp;

Analysis: insert into is not recommended. In HDFS, every insert into creates a new file in the table (or partition) folder to hold the inserted data, which fragments the storage and reduces the query efficiency of the table later on.

==========================================================

Table creation statements:

use nets_life;
create table limao_store (p_key string, sort_word string)
row format delimited fields terminated by ','
stored as textfile;
create table limao_incre (p_key string, sort_word string)
row format delimited fields terminated by ','
stored as textfile;

==========================================================
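Whichever method is used, a quick sanity check is to confirm that no p_key survives more than once; the following query (a minimal sketch against the merged table) should return zero rows if the deduplication worked:

select p_key, count(*)
from limao_store
group by p_key
having count(*) > 1;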
Summary: method 2 and method 3 produce the same result; method 3 should be avoided.
Implicit logic of methods 2 and 3:
1. When the incremental data (limao_incre) conflicts with the existing data (limao_store), the incremental data is always taken to be the latest.
2. Neither the incremental table nor the inventory table contains duplicate p_key values of its own.
Method 1 does not rely on this logic: it merges everything and strictly keeps the top-ranked row per key according to the sort field (see the illustration below).
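A hypothetical two-row illustration of the difference (sample values invented for this sketch, not from the original tests):

-- limao_store holds (1, 'a'); limao_incre holds (1, 'b') and (1, 'c')
-- method 1: one row per p_key survives, here (1, 'c') (largest sort_word wins)
-- methods 2/3: the anti-join drops (1, 'a'), but limao_incre is copied in
-- unfiltered, so both (1, 'b') and (1, 'c') land in limao_store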
Test results with 10 million rows in limao_store and 1 million rows in limao_incre:
Method 1: Time taken: 317.677 seconds
Method 2: Time taken: 106.032 seconds
Conclusion: method 2 takes significantly less time than method 1, but it has no internal deduplication capability; it can only deduplicate by comparing the two tables.
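If the speed of method 2 is wanted together with internal deduplication, one possible variant (a sketch, not benchmarked here; limao_incre_dedup is a hypothetical scratch table) is to deduplicate the small incremental table with row_number() first and then run method 2 against the result:

create table limao_incre_dedup as
select p_key, sort_word
from (
    select p_key, sort_word,
           row_number() over (distribute by p_key sort by sort_word desc) rownum
    from limao_incre
) t
where rownum = 1;

Since the incremental table is only a tenth the size of the inventory table, the extra sort is comparatively cheap.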
==========================================================