1. Data Integration Business Scenario
1.1 Background
An adjustment to one of the GA systems meant that the photo data previously obtained from that system's backup database could no longer be retrieved. Going forward, the GA system compresses the previous day's ztry photo data each day and uploads it to a designated FTP server. The archive must be downloaded from the FTP server and decompressed, and the resulting JPEG photo data parsed and loaded into the database.
1.2 Goals
We need to set up a schedulable job that downloads the zip file from the FTP server to a designated local folder, extracts the zip file to obtain the JPEG photos, parses the photos through a Kettle transformation, and derives a unique primary key from each photo's file name and related information, so that the local photo files are integrated into the database for storage. After integration, the zip file is moved to a backup folder, and the extracted folder and photo files are deleted.
1.3 Tools and Components
Tools: IDI (an ETL tool built on Kettle through secondary development)
Components: ① Transformations: input components, output components
② Jobs:
2. Data Integration Design Ideas
2.1 Design ideas of Job scheduling
The design idea is shown in the following process:
1. Get the time variable
When the job starts, call the time-variable transformation to capture the time variable and obtain yesterday's date in the required format.
2. Check the directory
Check the designated local download folder to see whether a zip file for yesterday's date has already been downloaded.
3. Execution judgment
If the download folder is empty or does not contain a zip file for yesterday's date, proceed to the FTP download; otherwise, exit the job and wait for the next scheduled run.
4. FTP file download
Download the zip file matching yesterday's date from the FTP server and save it in the designated local download folder.
5. Unzip the zip file
Unzip the zip file to the specified decompression folder.
6. Data cleaning and integration
Call the data cleansing and integration transformation to capture the data from the zip file and clean, transform, process, and load it.
7. Delete the directory
After the data cleansing transformation completes and the data has been loaded into the database, delete the extracted directories and files to free up space.
8. Moving files
Move the zip file to the backup folder for storage, so that the data can later be recovered or checked if problems arise.
9. End
End the entire job and wait for the next scheduled run.
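The decision in steps 2–4 and the overall sequence can be sketched as a small planning function. This is only an illustration: the real flow is built from Kettle job entries, the step names are hypothetical, and the archive naming (date + "_zp.zip") is an assumption consistent with the directory name used in the script of section 3.2.2.2.7.

```javascript
// Sketch of the job's control flow (hypothetical step names; the real
// scheduling is done by Kettle job entries, not hand-written code).
function planJob(downloadDirFiles, yesterday) {
  var zipName = yesterday + "_zp.zip"; // assumed archive naming, e.g. 20180714_zp.zip
  if (downloadDirFiles.indexOf(zipName) !== -1) {
    return ["stop"]; // zip already present: exit and wait for the next run
  }
  return [
    "ftp-download",   // step 4: fetch yesterday's zip from the FTP server
    "unzip",          // step 5: extract the JPEGs to the decompression folder
    "clean-and-load", // step 6: run the data cleansing integration transformation
    "delete-dir",     // step 7: remove the extracted directory and files
    "move-to-backup", // step 8: archive the zip for later recovery or checks
    "end"             // step 9: finish and wait for the next schedule
  ];
}
```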
2.2 Transformation Extraction Design Ideas
2.2.1 Capturing time variables
The design idea is shown in the following process:
1. Get the system date and time
Get yesterday's midnight relative to the current system time; for example, if the current time is 18:36:47 on July 15, 2018, the captured time should be 2018/07/14 00:00:00.000.
2. Date and time formatting
Format the above 2018/07/14 00:00:00.000 as 20180714.
3. Set the time variable
Save the formatted date as a variable so it can be passed to subsequent steps.
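In plain JavaScript (outside Kettle, using only the standard Date API), the same "yesterday at midnight, formatted as yyyyMMdd" computation can be sketched as:

```javascript
// Compute yesterday's date, truncated to midnight, formatted as yyyyMMdd.
function formatYesterday(now) {
  var d = new Date(now.getTime());
  d.setHours(0, 0, 0, 0);      // truncate to 00:00:00.000
  d.setDate(d.getDate() - 1);  // step back one day (handles month/year rollover)
  var y = d.getFullYear();
  var m = String(d.getMonth() + 1).padStart(2, "0");
  var day = String(d.getDate()).padStart(2, "0");
  return "" + y + m + day;
}
```

For example, any time on July 15, 2018 yields "20180714", matching the example in step 1.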
2.2.2 Data Cleaning Integration
The design ideas are as follows:
1. Capture the photo file name
2. Photo file parsing
Parse the photo file into binary data.
3. Photo data cleaning and integration
Clean, transform, and integrate the data in the stream into the target data stream through the relevant components.
4. Data loading
Load the target data stream into the database for storage.
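The source does not spell out the naming rule used to derive the primary key from the photo file name. As a purely hypothetical illustration, suppose each photo is named "<personId>_zp_<photoIndex>.jpg" (the "_zp_" infix and field names are assumptions, loosely consistent with the zp_1..zp_6 fields in section 3.2.2.2.7):

```javascript
// Hypothetical naming rule: "<personId>_zp_<photoIndex>.jpg".
// Extract the pieces that would feed the unique primary key.
function parsePhotoName(filename) {
  var m = /^(\d+)_zp_(\d)\.jpg$/i.exec(filename);
  if (!m) return null; // not a photo file we recognise
  return { id: m[1], index: Number(m[2]) };
}
```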
3. Data Integration Solution
3.1 Job scheduling model
3.1.1 Job scheduling process
Diagram: Job scheduling model
3.1.2 Job Model Analysis
3.1.2.1 Start
3.1.2.2 Call the time variable transformation
3.1.2.3 Check whether the directory is empty
3.1.2.4 Jump out of the job
3.1.2.5 FTP download
3.1.2.6 Unzip the file
3.1.2.7 Call the data cleansing integration transformation
3.1.2.8 Delete the extraction directory
3.1.2.9 Move the compressed file
3.1.2.10 Job end
3.2 Integrated Transformation Model
3.2.1 Integrated Transformation Process
Diagram: capture time variable transformation
Diagram: data cleansing integration transformation
3.2.2 Integrated Transformation Process Analysis
3.2.2.1 Capture time variable transformation
3.2.2.1.1 Get System Date Time
3.2.2.1.2 Date Time formatting
Script:
var sys_date = date2str(today, 'yyyyMMdd');
3.2.2.1.3 Setting the time variable
3.2.2.2 Data cleansing integration transformation
3.2.2.2.1 Get file name
3.2.2.2.2 Convert photo to binary
Script:
var file = new java.io.File(filename);
var fileInputStream = new java.io.FileInputStream(file);
var content = org.pentaho.di.core.Const.createByteArray(file.length());
fileInputStream.read(content, 0, file.length());
fileInputStream.close();
3.2.2.2.3 Split fields
3.2.2.2.4 String substitution
3.2.2.2.5 Field selection
3.2.2.2.6 Row denormalisation
3.2.2.2.7 Data modification
Script:
var zp01 = zp_1;
var zp02 = zp_2;
var zp03 = zp_3;
var zp04 = zp_4;
var zp05 = zp_5;
var zp06 = zp_6;
var zp01modifytime = zp_1_lastmodifytime;
var zp02modifytime = zp_2_lastmodifytime;
var zp03modifytime = zp_3_lastmodifytime;
var zp04modifytime = zp_4_lastmodifytime;
var zp05modifytime = zp_5_lastmodifytime;
var zp06modifytime = zp_6_lastmodifytime;
var ZPSCRQ = replace(dirname, '_zp', '');
var dirname1 = 'ftp://10.78.1.30/' + dirname + '.zip';
if (zp01 == null && zp02 == null && zp03 == null) {
    zp01 = zp04;
    zp02 = zp05;
    zp03 = zp06;
    zp01modifytime = zp04modifytime;
    zp02modifytime = zp05modifytime;
    zp03modifytime = zp06modifytime;
}
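The script's two ideas, falling back to the secondary photo set (zp_4..zp_6) when the primary set (zp_1..zp_3) is missing, and deriving the photo date from a directory name like "20180714_zp", can be restated as a pure function for clarity (a sketch; field names follow the script above, and the row object shape is an assumption):

```javascript
// Fill missing primary photos from the secondary set, and derive the
// photo date (ZPSCRQ) from a directory name such as "20180714_zp".
function normaliseRow(row) {
  var out = Object.assign({}, row);
  if (out.zp01 == null && out.zp02 == null && out.zp03 == null) {
    out.zp01 = out.zp04;
    out.zp02 = out.zp05;
    out.zp03 = out.zp06;
    out.zp01modifytime = out.zp04modifytime;
    out.zp02modifytime = out.zp05modifytime;
    out.zp03modifytime = out.zp06modifytime;
  }
  out.zpscrq = row.dirname.replace("_zp", ""); // "20180714_zp" -> "20180714"
  return out;
}
```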
3.2.2.2.8 Get system information
3.2.2.2.9 Field selection
3.2.2.2.10 Filter records
3.2.2.2.11 Insert / update
3.2.2.2.12 Discard invalid data
4. Other Notes
None.
"Kettle" Data integration ftp download + local photo file import Oracle Database