Requirement
Import incremental data from the basic business table in Oracle to Hive and merge it with the current full table into the latest full table.
Design
Three tables are involved:
The current full table (service_all in Hive), which holds all of the data;
The incremental table (service_tmp in Hive), which holds the data for a given time period;
The updated full table, produced by merging the two and overwriting the current full table.
Steps:
Import the tables from Oracle into Hive through Sqoop to simulate the full table and the incremental table
Use Hive to merge the full table and the incremental table into an updated full table, overwriting the current full table
Step 1: Use Sqoop to import the tables from Oracle into Hive to simulate the full table and the incremental table
To simulate the scenario, a full table and an incremental table are required. Because the data source is limited, both tables are built from the OMP_SERVICE table in Oracle: the full table contains all of its data and is named service_all in Hive; the incremental table contains the data for a given time period and is named service_tmp in Hive.
(1) Full table import: import all rows, but only the required columns, into the specified Hive table.
To use the Hive import function, you must first configure the environment variable for HCatalog (HCatalog is a Hive submodule) by adding the following to /etc/profile:
export HCAT_HOME=/home/fulong/Hive/apache-hive-0.13.1-bin/hcatalog
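After adding this line, reload the profile so the variable takes effect in the current shell; a quick sanity check (using the path from this article) might look like:
source /etc/profile
echo $HCAT_HOME    # should print /home/fulong/Hive/apache-hive-0.13.1-bin/hcatalog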
Run the following command to import data:
fulong@FBI006:~/Sqoop/sqoop-1.4.4/bin$ ./sqoop import \
> --connect jdbc:oracle:thin:@192.168.0.147:1521:ORCLGBK --username SP --password fulong \
> --table OMP_SERVICE \
> --columns "SERVICE_CODE,SERVICE_NAME,SERVICE_PROCESS,CREATE_TIME,ENABLE_ORG,ENABLE_PLATFORM,IF_DEL" \
> --hive-import --hive-table SERVICE_ALL
Note: The username must be in uppercase.
(2) Incremental table import: import only the rows within the required time range, and only the required columns, into the specified Hive table.
Run the following command to import data:
fulong@FBI006:~/Sqoop/sqoop-1.4.4/bin$ ./sqoop import \
> --connect jdbc:oracle:thin:@192.168.0.147:1521:ORCLGBK --username SP --password fulong \
> --table OMP_SERVICE \
> --columns "SERVICE_CODE,SERVICE_NAME,SERVICE_PROCESS,CREATE_TIME,ENABLE_ORG,ENABLE_PLATFORM,IF_DEL" \
> --where "CREATE_TIME > to_date('2017/4 17:00:00','yyyy-mm-dd hh24:mi:ss') and CREATE_TIME < to_date('2017/4 18:00:00','yyyy-mm-dd hh24:mi:ss')" \
> --hive-import --hive-overwrite --hive-table SERVICE_TMP
Note:
Because the --hive-overwrite parameter is used, this statement can be executed repeatedly, overwriting the service_tmp table with the latest incremental data each time;
Sqoop also supports importing the result of an arbitrary SQL query. For more information, see section 7.2.3, "Free-form Query Imports", of the Sqoop User Guide.
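For reference, a minimal sketch of such a free-form query import is shown below, reusing the connection settings from this article. The query text, the --split-by column, the --target-dir path, and the Hive table name SERVICE_QUERY_DEMO are hypothetical examples, not part of the original steps; note that --query requires the literal $CONDITIONS token and a --target-dir:
# Hedged sketch only: the query, split column, target dir and Hive table name are illustrative
./sqoop import \
  --connect jdbc:oracle:thin:@192.168.0.147:1521:ORCLGBK --username SP --password fulong \
  --query 'SELECT SERVICE_CODE, SERVICE_NAME, CREATE_TIME FROM OMP_SERVICE WHERE $CONDITIONS' \
  --split-by SERVICE_CODE \
  --target-dir /tmp/service_query_demo \
  --hive-import --hive-table SERVICE_QUERY_DEMO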
(3) Verify the import result: list all tables, count the number of rows, and view the table structure
hive> show tables;
OK
searchlog
searchlog_tmp
service_all
service_tmp
Time taken: 0.04 seconds, Fetched: 4 row(s)
hive> select count(*) from service_all;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1407233914535_0013, Tracking URL = :8088/proxy/application_1407233914535_0013/
Kill Command = /home/fulong/Hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1407233914535_0013
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
16:51:47,389 Stage-1 map = 0%, reduce = 0%
16:51:59,816 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 1.36 sec
16:52:01,996 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 2.45 sec
16:52:07,877 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.96 sec
16:52:17,639 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.29 sec
MapReduce Total cumulative CPU time: 5 seconds 290 msec
Ended Job = job_1407233914535_0013
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1  Cumulative CPU: 5.46 sec  HDFS Read: 687141  HDFS Write: 5  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 460 msec
OK
6803
Time taken: 59.386 seconds, Fetched: 1 row(s)
hive> select count(*) from service_tmp;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1407233914535_0014, Tracking URL = :8088/proxy/application_1407233914535_0014/
Kill Command = /home/fulong/Hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1407233914535_0014
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
16:53:03,951 Stage-1 map = 0%, reduce = 0%
16:53:15,189 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 2.17 sec
16:53:16,236 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.38 sec
16:53:57,935 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 3.78 sec
16:54:01,811 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.34 sec
MapReduce Total cumulative CPU time: 5 seconds 340 msec
Ended Job = job_1407233914535_0014
MapReduce Jobs Launched:
Job 0: Map: 3  Reduce: 1  Cumulative CPU: 5.66 sec  HDFS Read: 4720  HDFS Write: 3  SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 660 msec
OK
13
Time taken: 75.856 seconds, Fetched: 1 row(s)
hive> describe service_all;
OK
service_code            string
service_name            string
service_process         string
create_time             string
enable_org              string
enable_platform         string
if_del                  string
Time taken: 0.169 seconds, Fetched: 7 row(s)
hive> describe service_tmp;
OK
service_code            string
service_name            string
service_process         string
create_time             string
enable_org              string
enable_platform         string
if_del                  string
Time taken: 0.117 seconds, Fetched: 7 row(s)
The logic for merging into the new table is as follows:
All rows of the tmp table go into the new table;
Rows of the current full table whose service_code does not appear in the tmp table also go into the new table.
Run the following SQL statement to produce the updated full table:
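A minimal sketch of such a statement, assuming the service_all and service_tmp schemas shown above and Hive 0.13 (where UNION ALL must sit inside a subquery), could look like this; the left outer join keeps only the full-table rows whose service_code is absent from the incremental table:
-- Hedged sketch of the merge logic described above, not necessarily the article's exact statement
INSERT OVERWRITE TABLE service_all
SELECT * FROM (
  SELECT * FROM service_tmp
  UNION ALL
  SELECT a.*
  FROM service_all a
  LEFT OUTER JOIN service_tmp b ON a.service_code = b.service_code
  WHERE b.service_code IS NULL
) t;
Hive writes the result to a staging location before replacing the table's data, so reading from service_all in the same statement is generally safe; writing the merged result to a separate staging table first and then overwriting service_all is a more conservative alternative.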