"Gandalf" Sqoop1.4.4 implements the import of incremental data from oracle10g into Hive0.13.1 and updates the primary table in hive

Source: Internet
Author: User
Tags: sqoop

Requirement: import the incremental data of the business base table from Oracle into Hive and merge it with the current full table to produce the latest full table.

*** Reprints are welcome; please credit the source: http://blog.csdn.net/u010967382/article/details/38735381 ***

Design: three tables are involved:
    • Full table: the full base-data table, saved as of the last synchronization time
    • Incremental table: a temporary table holding the newly arrived (incremental) data
    • Updated full table: the full data table after the update

Steps:
    1. Import the tables from Oracle into Hive via Sqoop, to simulate the full table and the incremental table
    2. Merge "full table + incremental table" into the "updated full table" in Hive, overwriting the current full table
Step 1: Import the tables from Oracle into Hive via Sqoop, simulating the full table and the incremental table

To simulate the scenario, a full table and an incremental table are needed. Since the data source is limited, both come from the OMP_SERVICE table in Oracle: the full table contains all of its data and is named service_all in Hive; the incremental table contains the data of one time window and is named service_tmp in Hive.

(1) Full table import: export all rows, but only the required columns, into the specified Hive table

To use the import-into-Hive feature, the HCatalog environment variable must be configured first (HCatalog is a Hive submodule). Add the following to /etc/profile:

export HCAT_HOME=/home/fulong/hive/apache-hive-0.13.1-bin/hcatalog
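To apply and quickly verify the setting, something along these lines can be run (a minimal sketch; the path is the one configured above):

source /etc/profile
echo $HCAT_HOME          # should print /home/fulong/hive/apache-hive-0.13.1-bin/hcatalog
ls "$HCAT_HOME/bin"      # the hcat launcher script is expected to be listed here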

Execute the following command to import the data:

fulong@FBI003:~/sqoop/sqoop-1.4.4/bin$ ./sqoop import \
> --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
> --username SP --password fulong \
> --table omp_service \
> --columns "service_code,service_name,service_process,create_time,enable_org,enable_platform,if_del" \
> --hive-import --hive-table service_all
Note: the username must be capitalized.

(2) Incremental table import: export only the data within the desired time range, again only the required columns, into the specified Hive table

Use the following command to import the data:

fulong@FBI003:~/sqoop/sqoop-1.4.4/bin$ ./sqoop import \
> --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
> --username SP --password fulong \
> --table omp_service \
> --columns "service_code,service_name,service_process,create_time,enable_org,enable_platform,if_del" \
> --where "create_time > to_date('2012/12/4 17:00:00', 'yyyy-mm-dd hh24:mi:ss') and create_time < to_date('2012/12/4 18:00:00', 'yyyy-mm-dd hh24:mi:ss')" \
> --hive-import --hive-overwrite --hive-table service_tmp
Note:
    1. Because the --hive-overwrite parameter is used, the statement can be executed repeatedly, overwriting the latest incremental data into the SERVICE_TMP table;
    2. Sqoop can also import the results of an arbitrary SQL query; see the "7.2.3. Free-form Query Imports" section of http://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html (a sketch of this approach follows below).
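For illustration, a minimal sketch of such a free-form query import under the same connection settings (the column subset and target directory here are assumptions, not from the original article). With --query, Sqoop requires the literal $CONDITIONS token in the WHERE clause, a --target-dir, and either a --split-by column or a single mapper:

# free-form query import into HDFS (hypothetical column subset and target directory)
./sqoop import \
  --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
  --username SP --password fulong \
  --query 'select service_code, service_name, create_time from omp_service where $CONDITIONS' \
  -m 1 \
  --target-dir /user/fulong/omp_service_query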
(3) Validate the import results: list all tables, count rows, and view the table structure

hive> show tables;
OK
searchlog
searchlog_tmp
service_all
service_tmp
Time taken: 0.04 seconds, Fetched: 4 row(s)

hive> select count(*) from service_all;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1407233914535_0013, Tracking URL = http://FBI003:8088/proxy/application_1407233914535_0013/
Kill Command = /home/fulong/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1407233914535_0013
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2014-08-21 16:51:47,389 Stage-1 map = 0%, reduce = 0%
2014-08-21 16:51:59,816 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 1.36 sec
2014-08-21 16:52:01,996 Stage-1 map = 67%, reduce = 0%
2014-08-21 16:52:07,877 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.96 sec
2014-08-21 16:52:17,639 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.29 sec
MapReduce Total cumulative CPU time: 5 seconds 290 msec
Ended Job = job_1407233914535_0013
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1  Cumulative CPU: 5.46 sec  HDFS Read: 687141  HDFS Write: 5  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 460 msec
OK
6803
Time taken: 59.386 seconds, Fetched: 1 row(s)

hive> select count(*) from service_tmp;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1407233914535_0014, Tracking URL = http://FBI003:8088/proxy/application_1407233914535_0014/
Kill Command = /home/fulong/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1407233914535_0014
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2014-08-21 16:53:03,951 Stage-1 map = 0%, reduce = 0%
2014-08-21 16:53:15,189 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 2.17 sec
2014-08-21 16:53:16,236 Stage-1 map = 100%, reduce = 0%
2014-08-21 16:53:57,935 Stage-1 map = 100%, reduce = 22%
2014-08-21 16:54:01,811 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.34 sec
MapReduce Total cumulative CPU time: 5 seconds 340 msec
Ended Job = job_1407233914535_0014
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1  Cumulative CPU: 5.66 sec  HDFS Read: 4720  HDFS Write: 3  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 660 msec
OK
-
Time taken: 75.856 seconds, Fetched: 1 row(s)

hive> describe service_all;
OK
service_code            string
service_name            string
service_process         string
create_time             string
enable_org              string
enable_platform         string
if_del                  string
Time taken: 0.169 seconds, Fetched: 7 row(s)

hive> describe service_tmp;
OK
service_code            string
service_name            string
service_process         string
create_time             string
enable_org              string
enable_platform         string
if_del                  string
Time taken: 0.117 seconds, Fetched: 7 row(s)

Step 2: Merge "full table + incremental table" into the "updated full table" through Hive, overwriting the current full table

The logic of the merge is as follows:
    • Every record of the TMP (incremental) table goes into the final table
    • Every record of the ALL table whose service_code does not appear in the TMP table also goes into the new table
The merged, updated full table can be produced with the following SQL statement:

hive> select * from service_tmp
    > union all
    > select a.* from service_all a
    > left outer join service_tmp b on a.service_code = b.service_code
    > where b.service_code is null;
We then need to write the query result straight back into the full table:

hive> insert overwrite table service_all
    > select * from service_tmp
    > union all
    > select a.* from service_all a
    > left outer join service_tmp b on a.service_code = b.service_code
    > where b.service_code is null;

Note that there are two kinds of syntax for inserting query results into a table:
    • INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
    • INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
INSERT OVERWRITE replaces the existing data; overwrite mode is used here because the current scenario requires updating the full table.
INSERT INTO does not overwrite existing data; it appends to it.

At this point, the service_all table in Hive has been updated with the latest data. In a real-world scenario, the whole process would be scheduled to run periodically with a shell script plus cron, for example as sketched below.
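A minimal sketch of what such a scheduled job might look like, reusing the commands from Steps 1 and 2 above (the script name, the hourly window logic, and all paths are assumptions, not from the original article):

#!/bin/bash
# sync_omp_service.sh -- hypothetical wrapper around the steps above;
# paths, the hourly window, and the log location are assumptions.
set -e

SQOOP_HOME=/home/fulong/sqoop/sqoop-1.4.4

# Use the previous full hour as the incremental window (GNU date assumed).
START=$(date -d '1 hour ago' +'%Y/%m/%d %H:00:00')
END=$(date +'%Y/%m/%d %H:00:00')

# 1. Import the increment from Oracle into service_tmp, overwriting the previous increment.
"$SQOOP_HOME/bin/sqoop" import \
  --connect jdbc:oracle:thin:@192.168.0.147:1521:orclgbk \
  --username SP --password fulong \
  --table omp_service \
  --columns "service_code,service_name,service_process,create_time,enable_org,enable_platform,if_del" \
  --where "create_time > to_date('$START', 'yyyy/mm/dd hh24:mi:ss') and create_time < to_date('$END', 'yyyy/mm/dd hh24:mi:ss')" \
  --hive-import --hive-overwrite --hive-table service_tmp

# 2. Merge service_tmp into service_all using the statement shown above.
hive -e "
insert overwrite table service_all
select * from service_tmp
union all
select a.* from service_all a
left outer join service_tmp b on a.service_code = b.service_code
where b.service_code is null;"

The script could then be scheduled, for example, a few minutes past every hour (script and log paths likewise hypothetical):

5 * * * * /home/fulong/scripts/sync_omp_service.sh >> /home/fulong/logs/sync_omp_service.log 2>&1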
