"Gandalf" Sqoop1.4.4 implements the import of incremental data from oracle10g into Hive0.13.1 and updates the primary table in hive

Source: Internet
Author: User
Tags: sqoop

Requirement: import the incremental data of the business base table from Oracle into Hive and merge it with the current full table to produce the latest full table.

*** Reprints are welcome; please credit the source: http://blog.csdn.net/u010967382/article/details/38735381 ***

Design: three tables are involved:
    • Full table: the full base-data table, saved as of the last synchronization time
    • Incremental table: a temporary table holding the newly arrived (incremental) data
    • Updated full table: the full data table after the update

Steps:
    1. Import the tables from Oracle into Hive via Sqoop, to simulate the full table and the incremental table
    2. Merge "full table + incremental table" into the "updated full table" in Hive, overwriting the current full table
Step 1: Import the tables from Oracle into Hive via Sqoop, simulating the full table and the incremental table

To simulate the scenario, a full table and an incremental table are needed. Since the data source is limited, both come from the OMP_SERVICE table in Oracle: the full table contains all of its data and is named service_all in Hive; the incremental table contains the data of one time window and is named service_tmp in Hive.

(1) Full table import: export all rows, but only the required columns, into the specified Hive table

To use the import-into-Hive feature, the HCatalog environment variable must be configured first (HCatalog is a Hive submodule). Add the following to /etc/profile:

export HCAT_HOME=/home/fulong/hive/apache-hive-0.13.1-bin/hcatalog
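To apply and quickly verify the setting, something along these lines can be run (a minimal sketch; the path is the one configured above):

source /etc/profile
echo $HCAT_HOME          # should print /home/fulong/hive/apache-hive-0.13.1-bin/hcatalog
ls "$HCAT_HOME/bin"      # the hcat launcher script is expected to be listed here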

Execute the following command to import the data:

fulong@FBI003:~/sqoop/sqoop-1.4.4/bin$ ./sqoop import \
> --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
> --username SP --password fulong \
> --table omp_service \
> --columns "service_code,service_name,service_process,create_time,enable_org,enable_platform,if_del" \
> --hive-import --hive-table service_all
Note: the username must be capitalized.

(2) Incremental table import: export only the data within the desired time range, again only the required columns, into the specified Hive table

Use the following command to import the data:

fulong@FBI003:~/sqoop/sqoop-1.4.4/bin$ ./sqoop import \
> --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
> --username SP --password fulong \
> --table omp_service \
> --columns "service_code,service_name,service_process,create_time,enable_org,enable_platform,if_del" \
> --where "create_time > to_date('2012/12/4 17:00:00', 'yyyy-mm-dd hh24:mi:ss') and create_time < to_date('2012/12/4 18:00:00', 'yyyy-mm-dd hh24:mi:ss')" \
> --hive-import --hive-overwrite --hive-table service_tmp
Note:
    1. Because the --hive-overwrite parameter is used, the statement can be executed repeatedly, overwriting the latest incremental data into the SERVICE_TMP table;
    2. Sqoop can also import the results of an arbitrary SQL query; see the "7.2.3. Free-form Query Imports" section of http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html (a sketch of this approach follows below).
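For illustration, a minimal sketch of such a free-form query import under the same connection settings (the column subset and target directory here are assumptions, not from the original article). With --query, Sqoop requires the literal $CONDITIONS token in the WHERE clause, a --target-dir, and either a --split-by column or a single mapper:

# free-form query import into HDFS (hypothetical column subset and target directory)
./sqoop import \
  --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
  --username SP --password fulong \
  --query 'select service_code, service_name, create_time from omp_service where $CONDITIONS' \
  -m 1 \
  --target-dir /user/fulong/omp_service_query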
(3) Validate the import results: list all tables, count rows, and view the table structure

hive> show tables;
OK
searchlog
searchlog_tmp
service_all
service_tmp
Time taken: 0.04 seconds, Fetched: 4 row(s)

hive> select count(*) from service_all;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1407233914535_0013, Tracking URL = http://FBI003:8088/proxy/application_1407233914535_0013/
Kill Command = /home/fulong/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1407233914535_0013
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2014-08-21 16:51:47,389 Stage-1 map = 0%, reduce = 0%
2014-08-21 16:51:59,816 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 1.36 sec
2014-08-21 16:52:01,996 Stage-1 map = 67%, reduce = 0%
2014-08-21 16:52:07,877 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.96 sec
2014-08-21 16:52:17,639 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.29 sec
MapReduce Total cumulative CPU time: 5 seconds 290 msec
Ended Job = job_1407233914535_0013
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1  Cumulative CPU: 5.46 sec  HDFS Read: 687141  HDFS Write: 5  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 460 msec
OK
6803
Time taken: 59.386 seconds, Fetched: 1 row(s)

hive> select count(*) from service_tmp;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1407233914535_0014, Tracking URL = http://FBI003:8088/proxy/application_1407233914535_0014/
Kill Command = /home/fulong/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1407233914535_0014
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2014-08-21 16:53:03,951 Stage-1 map = 0%, reduce = 0%
2014-08-21 16:53:15,189 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 2.17 sec
2014-08-21 16:53:16,236 Stage-1 map = 100%, reduce = 0%
2014-08-21 16:53:57,935 Stage-1 map = 100%, reduce = 22%
2014-08-21 16:54:01,811 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.34 sec
MapReduce Total cumulative CPU time: 5 seconds 340 msec
Ended Job = job_1407233914535_0014
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1  Cumulative CPU: 5.66 sec  HDFS Read: 4720  HDFS Write: 3  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 660 msec
OK
-
Time taken: 75.856 seconds, Fetched: 1 row(s)

hive> describe service_all;
OK
service_code            string
service_name            string
service_process         string
create_time             string
enable_org              string
enable_platform         string
if_del                  string
Time taken: 0.169 seconds, Fetched: 7 row(s)

hive> describe service_tmp;
OK
service_code            string
service_name            string
service_process         string
create_time             string
enable_org              string
enable_platform         string
if_del                  string
Time taken: 0.117 seconds, Fetched: 7 row(s)

Step 2: Merge "full table + incremental table" into the "updated full table" through Hive, overwriting the current full table

The logic of the merge is as follows:
    • Every record of the TMP (incremental) table goes into the final table
    • Every record of the ALL table whose service_code does not appear in the TMP table also goes into the new table
The merged, updated full table can be produced with the following SQL statement:

hive> select * from service_tmp
    > union all
    > select a.* from service_all a
    > left outer join service_tmp b on a.service_code = b.service_code
    > where b.service_code is null;
We then need to write the query result straight back into the full table:

hive> insert overwrite table service_all
    > select * from service_tmp
    > union all
    > select a.* from service_all a
    > left outer join service_tmp b on a.service_code = b.service_code
    > where b.service_code is null;

Note that there are two kinds of syntax for inserting query results into a table:
    • INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
    • INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
INSERT OVERWRITE replaces the existing data; overwrite mode is used here because the current scenario requires updating the full table.
INSERT INTO does not overwrite existing data; it appends to it.

At this point, the service_all table in Hive has been updated with the latest data. In a real-world scenario, the whole process would be scheduled to run periodically with a shell script plus cron, for example as sketched below.
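A minimal sketch of what such a scheduled job might look like, reusing the commands from Steps 1 and 2 above (the script name, the hourly window logic, and all paths are assumptions, not from the original article):

#!/bin/bash
# sync_omp_service.sh -- hypothetical wrapper around the steps above;
# paths, the hourly window, and the log location are assumptions.
set -e

SQOOP_HOME=/home/fulong/sqoop/sqoop-1.4.4

# Use the previous full hour as the incremental window (GNU date assumed).
START=$(date -d '1 hour ago' +'%Y/%m/%d %H:00:00')
END=$(date +'%Y/%m/%d %H:00:00')

# 1. Import the increment from Oracle into service_tmp, overwriting the previous increment.
"$SQOOP_HOME/bin/sqoop" import \
  --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
  --username SP --password fulong \
  --table omp_service \
  --columns "service_code,service_name,service_process,create_time,enable_org,enable_platform,if_del" \
  --where "create_time > to_date('$START', 'yyyy/mm/dd hh24:mi:ss') and create_time < to_date('$END', 'yyyy/mm/dd hh24:mi:ss')" \
  --hive-import --hive-overwrite --hive-table service_tmp

# 2. Merge service_tmp into service_all using the statement shown above.
hive -e "
insert overwrite table service_all
select * from service_tmp
union all
select a.* from service_all a
left outer join service_tmp b on a.service_code = b.service_code
where b.service_code is null;"

The script could then be scheduled, for example, a few minutes past every hour (script and log paths likewise hypothetical):

5 * * * * /home/fulong/scripts/sync_omp_service.sh >> /home/fulong/logs/sync_omp_service.log 2>&1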
