Research on Hive Big Data Deduplication

Last Update: 2018-06-01 Source: Internet

Author: User

Existing (stock) table: store

Incremental table: incre

Fields:

1. p_key: primary key used for deduplication

2. w_sort: field used for sorting (named sort_word in the test tables)

3. info: other information

Method 1 (union all + row_number() over):

    insert overwrite table limao_store
    select p_key, sort_word
    from (
        select tmp1.*,
               row_number() over (partition by p_key order by sort_word desc) rownum
        from (
            select * from limao_store
            union all
            select * from limao_incre
        ) tmp1
    ) hh
    where hh.rownum = 1;

Analysis: the combined (long) table has to be sorted.

Method 2 (left outer join + union all):

Note: older Hive versions do not support union all at the top level of a query, so the union all must be wrapped in a subquery, and that subquery must have an alias.

    insert overwrite table limao_store
    select t.p_key, t.sort_word
    from (
        select s.p_key, s.sort_word
        from limao_store s
        left outer join limao_incre i on (s.p_key = i.p_key)
        where i.p_key is null
        union all
        select p_key, sort_word from limao_incre
    ) t;

Analysis: the two long tables are joined, which doubles the row width during the join, and duplicate data inside the incremental table cannot be detected.

Method 3 (left outer join + insert):

    insert overwrite table store
    select s.*
    from store s
    left outer join incre i on (s.p_key = i.p_key)
    where i.p_key is null;

    insert into table jm_g_l_cust_secu_acct
    select * from jm_g_l_cust_secu_acct_tmp;

Analysis: insert into is not recommended. On HDFS, each insert into creates a new file in the table (or partition) directory to hold the inserted rows. This fragments the table into many files and reduces the query efficiency of the table later on.
==========================================================

Table creation statements:

    use nets_life;

    create table limao_store (p_key string, sort_word string)
    row format delimited fields terminated by ','
    stored as textfile;

    create table limao_incre (p_key string, sort_word string)
    row format delimited fields terminated by ','
    stored as textfile;

==========================================================
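The join logic of Method 2 can be tried out on a small scale. The following is a minimal sketch, not the author's test setup: it uses Python's built-in sqlite3 module as a stand-in for Hive, the incremental table is called limao_incre, and the four sample rows are invented for illustration. It also shows why the filter on the joined key has to be written with is null: a comparison with = null never matches, so every row from the existing table would be dropped.

```python
import sqlite3

# SQLite stands in for Hive here only to show the join logic.
# Table contents are hypothetical sample rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table limao_store (p_key text, sort_word text)")
cur.execute("create table limao_incre (p_key text, sort_word text)")
cur.executemany(
    "insert into limao_store values (?, ?)",
    [("k1", "2018-01-01"), ("k2", "2018-01-01")],
)
cur.executemany(
    "insert into limao_incre values (?, ?)",
    [("k2", "2018-02-01"), ("k3", "2018-02-01")],
)

# Method 2: keep store rows that have no match in the incremental table,
# then append every incremental row. The null test is a placeholder so
# that both spellings can be compared.
METHOD_2 = """
select t.p_key, t.sort_word
from (
    select s.p_key as p_key, s.sort_word as sort_word
    from limao_store s
    left outer join limao_incre i on (s.p_key = i.p_key)
    where i.p_key {null_test}
    union all
    select p_key, sort_word from limao_incre
) t
order by t.p_key
"""

merged = cur.execute(METHOD_2.format(null_test="is null")).fetchall()
broken = cur.execute(METHOD_2.format(null_test="= null")).fetchall()

print(merged)  # k1 from store, k2 and k3 from the incremental table
print(broken)  # only the incremental rows: "= null" filtered out all store rows
```

With is null, key k2 is taken from the incremental table and k1 survives from the existing table. With = null, k1 is silently lost.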

Summary: Method 2 and Method 3 follow the same logic. Method 3 is best avoided.

Implicit assumptions of Method 2 and Method 3:

1. When incremental data (incre) conflicts with existing data (store), the incremental data is always treated as the latest.

2. Neither the incremental table nor the existing table contains duplicate keys within itself.

Method 1 makes neither assumption: it merges all rows and keeps strictly the first-ranked row by the sort field.
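The difference between the two behaviours is easiest to see on data that breaks both assumptions. The sketch below models the row-level semantics in plain Python; it is an illustration only, not Hive code. It assumes Method 1 ranks rows per p_key by the sort field in descending order and keeps the first one. The function names and sample rows are hypothetical.

```python
def method_1(store, incre):
    """Union all rows, then keep the row with the largest sort_word per p_key."""
    best = {}
    for p_key, sort_word in store + incre:
        if p_key not in best or sort_word > best[p_key]:
            best[p_key] = sort_word
    return sorted(best.items())


def method_2(store, incre):
    """Keep store rows whose key is absent from incre, then append every incre row."""
    incre_keys = {p_key for p_key, _ in incre}
    kept = [row for row in store if row[0] not in incre_keys]
    return sorted(kept + incre)


# k2: the incremental row is OLDER than the stored one (breaks assumption 1).
# k3: the incremental table holds the same key twice (breaks assumption 2).
store = [("k1", "2018-01-01"), ("k2", "2018-03-01")]
incre = [("k2", "2018-02-01"), ("k3", "2018-02-01"), ("k3", "2018-02-02")]

result_1 = method_1(store, incre)
result_2 = method_2(store, incre)

print(result_1)  # one row per key, always the one with the largest sort_word
print(result_2)  # k2 replaced by the older incremental row, k3 appears twice
```

On this input the Method 1 model returns one row per key, while the Method 2 model overwrites k2 with older data and keeps both k3 rows.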

Test results with 10 million rows in store and 1 million rows in incre:

Method 1: Time taken: 317.677 seconds

Method 2: Time taken: 106.032 seconds

Conclusion: Method 2 takes significantly less time than Method 1 (about one third in this test), but it does not deduplicate within a table; it can only deduplicate by comparing the two tables against each other.

========================================================== ======

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
