ETL Scheduling Development (5): a subroutine that connects to the database to execute database commands, and what ETL scheduling is
In ETL scheduling you need to connect to the database to read and write data. The following subroutine takes a database connection string and a database command (or SQL statement) as input and performs the required operation:
#!/usr/bin/bash
#created by lubinsu
#2014
source ~/.bash_profile

values=`sqlplus -S $1 << EOF
set heading off feedback off pagesize 0 verify off echo off
WHENEVER SQLERROR exit SQL.SQLCODE ROLLBACK
$2
commit;
exit
EOF`

echo "${values}"
The parameters are, in order: the database connection string and the database command (or SQL statement).
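For example, assuming the script above is saved as exec_sql.sh (the file name, connection string, and table name below are illustrative assumptions, not from the original post), it could be invoked like this:

# hypothetical invocation of the helper script above
sh exec_sql.sh "etl_user/etl_pass@ORCL" "select count(*) from ods_stage_customer;"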
What is the ETL scheduling system?
1. ETL process and scheduling design (ETL Schedule) (PSP)
ETL Scheduling Objectives
The Quick Win project system extracts data from four source systems in 39 branches for processing. The data moves from files into the ODS database and from the ODS database into the LDM, and the PI values and summary PIs are then calculated. The entire ETL process must follow certain steps and meet certain conditions, and some key files, such as the exchange rate data and the institution table, affect the processing of the whole ETL run. The project's ETL processing is therefore a complicated process.
Considering the diversity and complexity of future ETL processing, the concept of a Job is introduced: the ETL process is divided into Jobs, where each Job may be a cleansing, loading, conversion, or PI-processing step. To better schedule and monitor such a complex ETL process, a dedicated ETL scheduling system was designed. Through the development and use of the ETL scheduling system, the project's ETL process can be handled clearly and efficiently by scheduling Jobs.
Functions of ETL Scheduling
1. ETL scheduling function diagram
2. ETL scheduling function description
Scheduling and Maintenance
1) Parameter maintenance: set and modify the public parameters of the scheduling system, including the number of periods, the number of processes, the data date, and the start and end dates of the current period.
2) Downstream file maintenance: maintain the name, status, date, and ownership of the downstream files for all regions.
3) Job step definition and maintenance: define the actual ETL processing step that each Job corresponds to, generate the Job number, and define the Job type, the driving (dependency) relationships between Jobs, and the conditions required to run a Job (see the sketch after this list).
4) Scheduling exception handling: handle exceptions that occur during scheduling, and provide error lookup and re-run functions.
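As a concrete illustration of item 3, the sketch below registers a hypothetical job step through the helper script shown at the top of this post (assumed here to be saved as exec_sql.sh); the table etl_step and all of its columns are illustrative assumptions, not the original project's schema.

#!/usr/bin/bash
# Hypothetical example: register a job step (job number, type, predecessor,
# run condition, status) via the exec_sql.sh helper. Names are illustrative only.
CONN="etl_user/etl_pass@ORCL"     # assumed connection string
sh exec_sql.sh "${CONN}" "insert into etl_step (job_no, job_type, pre_job_no, run_condition, job_status)
                          values ('J1020', 'LOAD', 'J1010', 'FILE_REGISTERED', 'WAITING');"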
Log Management
1) Scheduling process logs: manage and record the main steps and exception information during scheduling, such as scheduling start, scheduling completion, database operation exceptions, and file read/write exceptions.
2) Job execution logs: manage the logs that record Job execution information, with functions to query and delete the logs and to reset the execution status (see the sketch after this list).
3) Detailed Job event logs: manage and record detailed events during Job execution (such as the number of cleansed records and the specific database operations), with query and delete functions.
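A minimal sketch of item 2, resetting the execution status of a failed Job so it can be picked up again; the table etl_job_log and its columns are assumed names, not the original schema.

#!/usr/bin/bash
# Hypothetical example: clear the error and set a failed job back to WAITING.
CONN="etl_user/etl_pass@ORCL"     # assumed connection string
JOB_NO="J1020"                    # assumed job number
sh exec_sql.sh "${CONN}" "update etl_job_log
                             set job_status = 'WAITING', err_msg = null
                           where job_no = '${JOB_NO}'
                             and data_dt = to_char(sysdate, 'yyyymmdd');"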
Job Scheduling
Under normal conditions, Jobs are scheduled throughout the entire ETL process, and both step-by-step submission and automatic submission are provided; a sketch of a simple polling scheduler follows.
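To make automatic submission concrete, here is a minimal sketch of a polling loop that asks the metadata table for runnable Jobs (predecessor finished) and submits them. The query, the etl_step table, the status values, and the run_job.sh dispatcher are all assumptions for illustration, not the original implementation.

#!/usr/bin/bash
# Hypothetical polling scheduler: find jobs whose predecessor is done and submit them.
CONN="etl_user/etl_pass@ORCL"     # assumed connection string
while true
do
    ready_jobs=`sqlplus -S ${CONN} << EOF
set heading off feedback off pagesize 0 verify off echo off
select job_no from etl_step s
 where s.job_status = 'WAITING'
   and not exists (select 1 from etl_step p
                    where p.job_no = s.pre_job_no
                      and p.job_status <> 'DONE');
exit
EOF`
    for job in ${ready_jobs}
    do
        # run_job.sh is an assumed dispatcher (sketched further below); it is
        # expected to mark the job RUNNING and then DONE/ERROR in etl_step.
        sh run_job.sh "${job}" &
    done
    wait
    sleep 60
done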
Schedulable Job types
1) C program (cleansing): ETL scheduling provides an interface to C programs so that C programs can be scheduled.
2) C-encapsulated SQL*Loader: Oracle sqlldr is wrapped in a C program so that it can be scheduled.
3) Pro*C program (merging and conversion): corresponding interfaces are provided for the merging and conversion steps so that Oracle Pro*C programs can be scheduled.
4) Stored procedure (conversion): the stored procedure is wrapped in a Pro*C program for scheduling.
5) DataStage (PI processing): the scheduling system provides an interface to DataStage so that various types of DataStage jobs can be scheduled (a dispatch sketch follows this list).
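The run_job.sh dispatcher referenced in the polling sketch above could branch on the Job type roughly as follows. This is a hedged illustration only: the program paths, the stored-procedure name, the DataStage project and job names, and the job-type codes are all assumptions rather than the original project's code.

#!/usr/bin/bash
# run_job.sh -- hypothetical dispatcher: run one job according to its type.
JOB_NO=$1
CONN="etl_user/etl_pass@ORCL"     # assumed connection string
JOB_TYPE=`sh exec_sql.sh "${CONN}" "select job_type from etl_step where job_no = '${JOB_NO}';"`

case "${JOB_TYPE}" in
    CLEAN) ./bin/clean_file  "${JOB_NO}" ;;   # C cleansing program (assumed path)
    LOAD)  ./bin/load_sqlldr "${JOB_NO}" ;;   # C wrapper around Oracle sqlldr (assumed path)
    MERGE) ./bin/merge_proc  "${JOB_NO}" ;;   # Pro*C merge/convert program (assumed path)
    PROC)  sh exec_sql.sh "${CONN}" "call sp_convert('${JOB_NO}');" ;;   # stored procedure (assumed name)
    PI)    dsjob -run -wait ETL_PROJECT "JOB_${JOB_NO}" ;;   # DataStage job (assumed project/job names)
    *)     echo "unknown job type: ${JOB_TYPE}" ;;
esac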
ETL Overall Flowchart
Function types and data processing of the job steps (ETL_Step)
0. File FTP: the source business systems (NLNS, SBS, and NACS) of each branch transfer files in a predefined format to the ETL server of the Quick Win project system over the BOC network.
1. File registration: after the source data files from step 0 are received by FTP and decompressed, they must be registered in the Quick Win project system. Only after file registration can the system confirm which branches' source data has arrived correctly, so that the ETL system can process it accordingly.
2. Data cleansing: the source data files received by FTP from each branch may contain illegal or redundant data, may follow inconsistent data rules and standards, and their file format cannot be used directly by the Quick Win project's ETL process. The data files must therefore be cleansed: illegal and redundant data are removed, data rules and standards are unified, and the files are converted into a format that the ETL process can LOAD.
3. Data loading: load the cleansed data files through SQL ... (a loading sketch follows below)
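The loading step presumably relies on the SQL*Loader wrapper mentioned under the schedulable Job types. As an illustration only, with every file name and path an assumption, a single table load might look like this:

#!/usr/bin/bash
# Hypothetical SQL*Loader invocation for the data-loading step; names are illustrative.
sqlldr userid=etl_user/etl_pass@ORCL \
       control=ctl/ods_customer.ctl \
       data=clean/ods_customer_20140131.dat \
       log=log/ods_customer_20140131.log \
       bad=bad/ods_customer_20140131.bad \
       errors=1000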
Some people have concluded that "ETL is the foundation of BI (business intelligence) and scheduling is the soul of ETL". Why is that?
ETL is the process of data extraction (Extract), cleansing (Cleaning), transformation (Transform), and loading (Load).
Think about it: all of the data originates from ETL, and every later stage of data processing depends on the data that has been extracted.
If this part is not done well, then no matter how brilliant the analysts downstream are, there is nothing they can do; even the cleverest cook cannot prepare a meal without rice.
That is why ETL is the foundation of BI (business intelligence).
As for why scheduling is the soul of ETL, look back at the scheduling function description above.
You can see that the most difficult issues are basically all handled by scheduling. That is why scheduling is the soul of ETL.
Source: Business Intelligence and Data Warehouse Enthusiasts (a community covering business intelligence, cloud computing, and ETL)