Implement incremental update in Hive
An insurance company has a table that records the customer information, including the customer's id, name, and age (only these fields are listed for demonstration ).
Create a Hive table:
Create table customer
(
Id int,
Age tinyint,
Name string
)
Partitioned by (dt string)
Row format delimited
Fields terminated by '|'
Stored as textfile;
Import initialization data:
Load data local inpath '/home/hadoop/hivetestdata/customer.txt 'into table customer partition (dt = '20140901 ');
Hive> select * from customer order by id;
Customer. id customer. agecustomer. name customer. dt
1 25 jiangshouzhuang201506
2 23 zhangyun201506
3 24 yiyi201506
4 32 mengmeng201506
For insurance companies, customers change every day. We use the temporary data table customer_temp to record customer information every day. The fields and attributes are consistent with the customer table,
Create table customer_temp like customer;
Load data local inpath '/home/hadoop/hivetestdata/customer_temp.txt' into table customer_temp partition (dt = '20140901 ');
Examples of included data are as follows:
Hive> select * from customer_temp;
Customer_temp.id customer_temp.agecustomer_temp.name customer_temp.dt
1 26 jiangshouzhuang201506
5 45 xiaosan201506
To incrementally update the customer table, we need to perform full outer join on the two tables to update the modified data in the customer_temp table to the customer table.
Hive (hive)> select * from customer_temp
> Union all
> Select a. * from customer
> Left outer join customer_temp B
> On a. id = B. id where B. id is null;
_ U1.id _ u1.age _ u1.name _ u1.dt
2 23 zhangyun201506
3 24 yiyi201506
4 32 mengmeng201506
1 26 jiangshouzhuang201506
5 45 xiaosan201506
I have seen that the following method is used on the Internet, and it seems that there is a problem:
Hive> select customer. id,
Coalesce (customer_temp.age, customer. age ),
Customer. name,
Coalesce (customer_temp.dt, customer. dt)
From customer_temp
Full outer join customer on customer_temp.id = customer. id;
The execution result is:
Customer. id _ c1customer. name _ c3
1 26 jiangshouzhuang201506
2 23 zhangyun201506
3 24 yiyi201506
4 32 mengmeng201506
NULL 45 NULL 201506
We can see that there is indeed a problem.
If you have any better optimization methods, please kindly advise. Thank you.