ETL Scheduling Development (1) -- Development Notes: ETL Scheduling
Preface:
In database operation and maintenance, files are frequently transferred between systems for data extraction, transformation, and integration, and statistical jobs are scheduled after the data is integrated. Here I describe in detail an ETL scheduler that I developed myself.
Requirement Overview:
Scheduling is divided into the following two methods:
- File scheduling: a shell-based ETL scheduler that supports scheduling of hourly, daily, weekly, and monthly files.
- Process scheduling: calls the processes specified in a configuration table and executes them concurrently.
Process description:
The file scheduling process runs once every minute; the follow-up processing that runs after file scheduling also starts once every minute; the automatic warning process checks every 5 minutes.
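These polling intervals map naturally onto cron entries. A minimal sketch, assuming hypothetical driver scripts under /etl/bin (the paths and script names are illustrative, not part of the original design):

```shell
# Illustrative crontab entries for the three polling loops described above.
# Paths and script names are assumptions, not taken from the original system.

# File scheduling: poll for newly arrived files once every minute
* * * * * /etl/bin/file_sched.sh >> /etl/log/file_sched.log 2>&1

# Follow-up processing after file scheduling: also once every minute
* * * * * /etl/bin/post_proc.sh >> /etl/log/post_proc.log 2>&1

# Automatic warning check: once every 5 minutes
*/5 * * * * /etl/bin/warn_check.sh >> /etl/log/warn_check.log 2>&1
```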
Warning description:
When a daily file fails to arrive within the specified time window, an SMS is sent to the relevant personnel ("ETL scheduling: program file did not arrive"). Several exception cases have since been added to the automatic handling program.
In addition, when a file process or statistical process fails a specified number of times in a row, statistics processing is stopped and an SMS warning is sent.
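The consecutive-failure rule above can be sketched as a small circuit breaker. This is a minimal sketch, assuming a failure threshold of 3 and a placeholder send_sms function; the real threshold and SMS gateway are not specified in the text:

```shell
#!/bin/sh
# Sketch of the consecutive-failure circuit breaker described above.
# MAX_FAILS and send_sms are assumptions for illustration only.

MAX_FAILS=3
FAIL_COUNT_FILE="${FAIL_COUNT_FILE:-/tmp/etl_fail_count}"

send_sms() {    # placeholder for the real SMS gateway call
    echo "SMS: $1"
}

# record_result <exit-code>: returns 1 (and alerts) once the process has
# failed MAX_FAILS times in a row; returns 0 otherwise.
record_result() {
    rc=$1
    count=$(cat "$FAIL_COUNT_FILE" 2>/dev/null)
    count=${count:-0}
    if [ "$rc" -eq 0 ]; then
        echo 0 > "$FAIL_COUNT_FILE"       # a success resets the streak
        return 0
    fi
    count=$((count + 1))
    echo "$count" > "$FAIL_COUNT_FILE"
    if [ "$count" -ge "$MAX_FAILS" ]; then
        send_sms "ETL process stopped after $count consecutive failures"
        return 1                           # caller stops the statistics run
    fi
    return 0
}
```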
The corresponding configuration tables are designed below, together with the relationships between them and how to configure them.
What is the ETL scheduling system?
1. ETL process and scheduling design (ETL Schedule) (PSP)
ETL Scheduling Objectives
The quick win system extracts data from four source systems across 39 branches for processing: data is transferred from files into the ODS database, from the ODS database to the LDM, and then PI values and summary PIs are calculated. The entire ETL process must follow fixed steps and satisfy certain preconditions, and some key files, such as exchange-rate data and the institution table, affect the processing of the whole ETL run. The quick win ETL process is therefore a complicated one.
Considering the diversity and complexity of future ETL processing, the concept of a Job is introduced: the ETL process is divided into Jobs, each of which may be a cleansing, loading, conversion, or PI-processing task. To schedule and monitor this complex ETL process, a dedicated ETL scheduling system was designed. Through the development and use of this system, the whole quick win ETL process can be handled clearly and efficiently via Job scheduling.
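The driving relationships between Jobs can be sketched as a readiness check: a Job becomes runnable only when every predecessor Job has finished. This is a minimal sketch assuming a flat status file with "job status" lines; the real system keeps this state in database tables:

```shell
#!/bin/sh
# Sketch of Job-level driving relationships: a Job is runnable only when
# all of its predecessor Jobs are marked DONE in a status file.
# The status-file layout ("<job> <status>" per line) is an assumption.

STATUS_FILE="${STATUS_FILE:-/tmp/job_status}"

job_done() {     # job_done <job>: 0 if the job is marked DONE
    grep -q "^$1 DONE$" "$STATUS_FILE"
}

# runnable <job> <pred1> [pred2 ...]: 0 if every predecessor is DONE
runnable() {
    job=$1; shift
    for pred in "$@"; do
        job_done "$pred" || return 1
    done
    return 0
}
```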
Functions of ETL Scheduling
1. ETL scheduling function diagram
2. ETL scheduling function description
Scheduling and Maintenance
1) Scheduling-system parameter maintenance: set and modify the public parameters of the scheduling system, including the number of periods, number of processes, data date, and the start and end dates of the current period.
2) Downstream file information maintenance: maintain the name, status, date, and ownership of the downstream files for all regions.
3) Job step definition and maintenance: define the actual ETL processing step that each Job corresponds to, generate the Job number, and define the Job type, the driving relationships between Jobs, and the conditions required to run each Job.
4) Scheduling exception handling: handle exceptions that occur during scheduling, with error lookup and re-run functions.
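The public-parameter maintenance in item 1) amounts to insert-or-update operations on a small key/value store. A minimal sketch, assuming the parameters live in a flat key=value file; the file name and keys are illustrative, and the real system stores them in a database table:

```shell
#!/bin/sh
# Sketch of public-parameter maintenance (item 1 above): parameters such
# as the data date kept as key=value lines in a flat file.
# PARAM_FILE and the key names are assumptions for illustration.

PARAM_FILE="${PARAM_FILE:-/tmp/etl_params}"

get_param() {    # get_param DATA_DATE -> prints its current value
    sed -n "s/^$1=//p" "$PARAM_FILE"
}

set_param() {    # set_param DATA_DATE 20240101 (insert or update)
    grep -v "^$1=" "$PARAM_FILE" 2>/dev/null > "$PARAM_FILE.tmp" || true
    echo "$1=$2" >> "$PARAM_FILE.tmp"
    mv "$PARAM_FILE.tmp" "$PARAM_FILE"
}
```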
Log Management
1) Scheduling process logs: manage and record the main events and exception information during scheduling, such as scheduling start, scheduling completion, database operation exceptions, and file read/write exceptions.
2) Job execution logs: manage the logs that record Job execution information, with support for querying and deleting logs and resetting execution status.
3) Detailed Job event logs: manage and record detailed events during Job execution (such as the number of cleansed records and specific database operations), with support for querying and deleting logs.
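The scheduling process log in item 1) can be sketched as one delimited line per event plus a query over it. The "timestamp|level|message" line format is an assumption for illustration, not the system's actual log layout:

```shell
#!/bin/sh
# Sketch of the scheduling process log (item 1 above): one event per line
# as "timestamp|level|message", plus a query for exception entries.
# LOG_FILE and the line format are assumptions.

LOG_FILE="${LOG_FILE:-/tmp/etl_sched.log}"

log_event() {    # log_event ERROR "database operation exception"
    echo "$(date '+%Y-%m-%d %H:%M:%S')|$1|$2" >> "$LOG_FILE"
}

query_errors() { # list only the exception entries
    grep '|ERROR|' "$LOG_FILE"
}
```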
Job Scheduling
Under normal conditions, Jobs are scheduled throughout the ETL process; both batch submission and automatic submission are provided.
Schedulable Job types
1) C program (cleansing): ETL scheduling provides interfaces to C programs so that they can be scheduled.
2) C-wrapped SQLLDR (loading): Oracle sqlldr is wrapped in a C program for scheduling.
3) Pro*C program (merging and conversion): corresponding interfaces are provided for the merge and conversion processes so that Oracle Pro*C programs can be scheduled.
4) Stored procedure (conversion): stored procedures are wrapped in Pro*C programs for scheduling.
5) DataStage (PI processing): the scheduling system provides interfaces to DataStage to schedule various types of DataStage jobs.
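A scheduler supporting these five Job types needs a dispatch point that maps a Job's configured type onto the right launcher. A minimal sketch, where the type tags and the echoed actions are placeholders; only the dispatch structure is the point:

```shell
#!/bin/sh
# Sketch of dispatching the five schedulable Job types listed above.
# The type tags (C, SQLLDR, PROC, STOREDP, DSJOB) and the stub actions
# are assumptions; a real dispatcher would exec the actual programs.

run_job() {      # run_job <type> <job-name>
    case "$1" in
        C)       echo "exec C cleansing program for $2" ;;
        SQLLDR)  echo "exec C-wrapped sqlldr load for $2" ;;
        PROC)    echo "exec Pro*C merge/convert for $2" ;;
        STOREDP) echo "exec stored procedure for $2" ;;
        DSJOB)   echo "exec DataStage job $2" ;;
        *)       echo "unknown job type: $1" >&2; return 1 ;;
    esac
}
```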
ETL Overall Flowchart
Function type and data processing of each job step (ETL_Step)
0. File FTP: transfer files from the source business systems (NLNS, SBS, and NACS) of each branch, in the predefined file format, over the BOC network to the ETL server of the quick win project system.
1. File registration: after the source data files fetched by FTP in step 0 are decompressed, they must be registered in the quick win project system. Only after file registration can the system confirm which branches' source data has arrived correctly, so that the ETL system can process the data accordingly.
2. Data cleansing: the source data files FTPed from the branches may contain illegal or redundant data, or follow inconsistent data rules and standards, and their file format cannot be used directly by the quick win ETL process. The data files must therefore be cleansed: illegal and redundant data deleted, data rules and standards unified, and the files converted to a format the ETL process can "LOAD".
3. Data loading: load the cleansed data files through SQL ...
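The steps above form a guarded pipeline: each stage runs only if the previous one succeeded. A minimal sketch with stub stage bodies; the real stages would call the FTP, registration, cleansing, and loading programs described above:

```shell
#!/bin/sh
# Sketch of chaining ETL steps 0-3 for one branch file. Each step runs
# only if the previous one succeeded. The stage bodies are stubs; only
# the chaining structure reflects the process described above.

ftp_fetch() { echo "fetch $1";    }   # step 0: file FTP
register()  { echo "register $1"; }   # step 1: file registration
cleanse()   { echo "cleanse $1";  }   # step 2: data cleansing
load_data() { echo "load $1";     }   # step 3: data loading

run_steps() {
    f=$1
    ftp_fetch "$f" && register "$f" && cleanse "$f" && load_data "$f"
}
```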
Why use an ETL tool? Can't I just write scripts by hand and run them?
First, we need to distinguish between "ETL", "ETL tool", and "ETL scheduling tool".
A "script" corresponds to a "task" instance in the ETL field.
An "ETL tool" usually corresponds to a "job scheduling" tool in the ETL field.
Some tools, such as Microsoft's SSIS, are also called ETL tools. In practice, large ETL projects in China, such as bank projects, do not use SSIS; they usually use professional ETL scheduling tools from vendors outside China, such as Control-M and DataStage.
Of course, there is also an excellent domestic scheduling product, taskctl. It puts forward several new ideas:
It fundamentally solves the difficulty of expressing process monitoring that plagues the ETL scheduling products currently on the market.
It is a database-free product, which in deployment practice is very convenient and lowers the difficulty of setting up the runtime environment (with some ETL scheduling products, deploying the runtime environment is itself a headache).
It is the first to adopt a "programming" approach to process definition, enabling rapid development of scheduling processes. (Some ETL scheduling products configure processes through GUI settings; that may work for dozens or hundreds of jobs, but with tens of thousands of tasks it falls short.)
Why use an ETL tool? The taskctl example shows the jump from quantitative change to qualitative change.
Reference: wenku.baidu.com/...4.html