ETL Specification

1 Overview

1.1 Meaning: ETL is the abbreviation of Extract, Transform, and Load.
Data extraction: the process of obtaining the required data from the data source. The extraction process filters out the source fields or records that are not needed in the target dataset.
Data transformation: based on the data structure of the target table, the fields of one or more source datasets are translated, matched, and aggregated to produce the fields of the target dataset. Data transformation mainly includes format conversion, field merging and splitting, data translation, data matching, and other more complex aggregation computations.
Data loading: loading the transformed data into the target database.
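As a concrete illustration of the three steps above, the following is a minimal Python sketch of an extract, transform, and load flow. All field names, code values, and the in-memory "tables" are hypothetical and only mirror the kinds of rules described in this section (record filtering, format conversion, data translation, field splitting).

```python
# Minimal ETL sketch (illustrative only; all names and values are hypothetical).

def extract(source_rows):
    """Extract: keep only the records and fields required by the target dataset."""
    return [
        {"cust_no": r["cust_no"], "gender": r["gender"], "birth": r["birth"]}
        for r in source_rows
        if r.get("status") == "ACTIVE"          # data source filtering rule
    ]

GENDER_MAP = {"1": "Male", "2": "Female", "0": "Unknown"}   # data translation table

def transform(rows):
    """Transform: format conversion, data translation, field splitting, etc."""
    out = []
    for r in rows:
        out.append({
            "customer_id": r["cust_no"].strip(),              # format conversion
            "gender_desc": GENDER_MAP.get(r["gender"], "Unknown"),
            "birth_year": int(r["birth"][:4]),                # field splitting
        })
    return out

def load(rows, target_table):
    """Load: write the transformed rows into the target (a plain list stands in for a table)."""
    target_table.extend(rows)

if __name__ == "__main__":
    source = [
        {"cust_no": " C001 ", "gender": "1", "birth": "1980-05-01", "status": "ACTIVE"},
        {"cust_no": " C002 ", "gender": "9", "birth": "1990-07-15", "status": "CLOSED"},
    ]
    target = []
    load(transform(extract(source)), target)
    print(target)   # [{'customer_id': 'C001', 'gender_desc': 'Male', 'birth_year': 1980}]
```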
1.2 ETL application: a complete ETL application consists of three phases.
Design phase: analyze the data structures of the source and target datasets and define reasonable data transformation logic.
Implementation phase: code the extract, transform, and load processes according to the logic and rules defined in the design phase.
Maintenance phase: for data integration projects that are not one-off, the ETL process runs repeatedly and must be maintained and improved continuously.

1.3 Purpose of the specification: ETL specifications are the rules and methods defined to ensure that ETL is designed, implemented, and maintained correctly. They include ETL design specifications, development specifications, and maintenance specifications.

1.4 Design Specifications
Design specifications apply mainly to the early stage of ETL coding. In this phase, mapping documents describing the data flows at the different layers should be produced.

1.4.1 A mapping document should include the following parts:
Attributes of the data source, including:
- Object name, including DSN, owner, and other information;
- Field name: English name;
- Field description: Chinese name; for parameterized fields the possible values should be listed, for example a gender field (1: Male; 2: Female; 0: Unknown);
- Type: field type, including length and precision information;
- Non-null attribute: whether the field may be empty.
Attributes of the target dataset, including:
- Object name, including DSN, owner, and other information;
- Field name: English name; fields should be named according to their meaning rather than simply in pinyin (this is controlled by the person responsible for designing the dataset);
- Field description: Chinese name; reserved fields should be given default values;
- Type: field type, including length and precision information;
- Non-null attribute: whether the field may be empty.
Rules, which describe the conversion rules of each part of the ETL, including:
- Data source filtering rules: rules for filtering records when retrieving data from the source dataset;
- Association rules: the associations between multiple source datasets;
- Column conversion rules: the conversion rules (business logic) that map source fields to target fields;
- Target dataset update rules: the update policy of the target dataset, including the update mechanism and frequency, such as daily full update or weekly incremental update.

1.4.2 ETL job list: the jobs developed in ETL contain business logic and coding logic, so they must be scheduled in a defined logical order. The job list includes:
- Job name: the name of the job that implements the mapping, with a description of the job's function;
- Scheduling sequence: the job scheduling order, described by serial numbers or a flowchart. Business logic, coding logic, and system resources should all be taken into account; on the premise that the business logic and coding logic are preserved, scheduling should make the fullest possible use of system resources;
- Parameter list: the parameters used in each job. The same parameter should use the same name across different jobs to simplify control during scheduling.

1.4.3 Version management: the ETL design changes as understanding of the business and the system deepens and as the structural framework changes, so the mapping design must be updated in step. During development, one rule must be strictly observed: when a rule changes, the mapping must be changed before the corresponding job design is changed. For mapping change management, a detailed version change record should be kept to track the changes made during ETL development. The change record includes:
- Version: a new version number for each change;
- Author: the person who made the change;
- Update time: when the change was made;
- Update content: a brief description of the change;
- Remarks: the reason for the change and other relevant information.
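To make the mapping document structure concrete, the sketch below models one mapping entry, its rules, and its change record as plain Python data structures. It is only an illustration of the parts listed in 1.4.1 to 1.4.3; the object names, field, and change-log values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FieldDef:
    """One field of a source or target dataset (see 1.4.1)."""
    name: str          # English field name
    description: str   # field description (Chinese name)
    type_: str         # field type, including length and precision
    nullable: bool     # non-null attribute

@dataclass
class MappingEntry:
    """One mapping from a source dataset to a target dataset."""
    source_object: str                                       # object name, incl. DSN and owner
    target_object: str
    source_fields: List[FieldDef] = field(default_factory=list)
    target_fields: List[FieldDef] = field(default_factory=list)
    filter_rule: str = ""                                    # data source filtering rule
    association_rule: str = ""                               # associations between source datasets
    column_rules: List[str] = field(default_factory=list)    # column conversion rules
    update_rule: str = ""                                    # target update mechanism and frequency
    change_log: List[dict] = field(default_factory=list)     # version change records (1.4.3)

# Hypothetical example entry.
mapping = MappingEntry(
    source_object="SRCDB.CRM.CUSTOMER",
    target_object="DWDB.ODS.T_CUSTOMER",
    source_fields=[FieldDef("gender", "性别", "CHAR(1)", nullable=False)],
    filter_rule="status = 'ACTIVE'",
    column_rules=["gender -> gender_desc via code table (1: Male; 2: Female; 0: Unknown)"],
    update_rule="daily full update",
    change_log=[{"version": "1.0", "author": "developer", "update_time": "2012-01-01",
                 "update_content": "initial version", "remarks": ""}],
)
print(mapping.target_object, mapping.update_rule)
```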
1.5 Development Specifications
Certain development specifications should be followed during development to ensure smooth transitions and handovers between the stages of the project. They mainly cover naming rules, functional definition specifications, and structural specifications.

1.5.1 Naming rules

1.5.1.1 Job naming rule: logical layer_business domain_target table[_all][_Sn].
Case 1: if a job involves only one target table, the job is named logical layer_business domain_target table[_Sn].
Case 2: if a job involves two target tables that are in a master-detail relationship, the job is named logical layer_business domain_target master table name_all.
Case 3: if two jobs named according to rules 1 and 2 would have the same name, they are distinguished by serial numbers (01, 02, ...). A rough sketch of this rule follows the naming table below.

1.5.1.2 Naming rules for links within a job: name the stages and links in a job according to the table below. In addition, add comments on the job design interface, mainly covering the job's function, module, development time, and developer.
Elements          | Description           | Example
------------------|-----------------------|--------------------------
Stages            | Name_desc             | Db2_product, Lkp_product
Links             | StageToStage          | LkpToLoad, SortToDb2
Connection type   |                       |
  - Native        | Stage_database        | Db2_dc
  - ODBC          | Odbc_dbtype_odbcname  | Odbc_iq_tdc
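As a rough illustration of the job naming rule in 1.5.1.1, the helper below assembles a job name from its parts. The layer, business domain, and table values in the usage examples are hypothetical; the rule text above remains authoritative.

```python
def job_name(layer, domain, target_table, all_flag=False, seq=None):
    """Build a job name: logical layer_business domain_target table[_all][_Sn]."""
    parts = [layer, domain, target_table]
    if all_flag:                      # master-detail case: use the master table name plus _all
        parts.append("all")
    if seq is not None:               # distinguish otherwise identical names: 01, 02, ...
        parts.append(f"{seq:02d}")
    return "_".join(parts)

# Hypothetical examples:
print(job_name("ods", "crm", "t_customer"))               # ods_crm_t_customer
print(job_name("ods", "crm", "t_order", all_flag=True))   # ods_crm_t_order_all
print(job_name("ods", "crm", "t_customer", seq=1))        # ods_crm_t_customer_01
```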
1.5.1.3 File definition: during job design, files are inevitably used as intermediate storage or as final result output. In DataStage, the Sequential File stage is used to define sequential files. Sequential files are usually named according to the following rule:
File name: logical layer_business domain_function_filename, where "function" indicates the purpose of the file, for example exf (extraction), ld (loading), rej (rejection).

1.5.1.4 ETL parameter specification: parameters and variables are named in uppercase, with words or abbreviations joined by underscores ("_"). In the parameter file, each line holds one parameter in the form parameter name = parameter value, with no trailing spaces. An example parameter file is shown below.
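The actual parameters vary by project; the following is a minimal, hypothetical illustration of the "NAME = value, one parameter per line" format described above, together with a small Python reader for it. Every parameter name and value shown here is invented.

```python
# A hypothetical parameter file in the format described above
# (uppercase names with underscores, one "NAME = value" pair per line).
PARAM_TEXT = """\
APP_CODE = 01
SRC_DB_SCHEMA = crmuser
EXTRACT_START_DATE = 2012-01-01
"""

def read_params(text):
    """Parse 'NAME = value' lines into a dict, ignoring blank lines."""
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params

print(read_params(PARAM_TEXT))
# {'APP_CODE': '01', 'SRC_DB_SCHEMA': 'crmuser', 'EXTRACT_START_DATE': '2012-01-01'}
```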
Parameter settings:
- Create parameter sets: add the parameters required by the project to parameter sets, so that each job can reference them.
- When a job runs, the name of the specific parameter value file is passed to the parameter set in the sequence job.
In general, the parameters of a project fall into the following categories:
Case 1: parameters that change slowly or not at all. These fall into three types:
- Case 1.1: the app_code values used to identify the source of each record, and the schema names used to access the databases.
- Case 1.2: database connection parameters.
- Case 1.3: paths of temporarily output text files.
Case 2: parameters whose values change. These fall into two types:
- Case 2.1: parameters that control the start and end time of data extraction.
- Case 2.2: other parameters that depend on the specific project business.
Handling of Case 1.1: define the parameter sets para_set_appcode and para_set_schema. para_set_appcode defines all the parameter names and values used to identify the source systems; its value file is para_set_appcode.txt, stored under the IIS installation directory at $project_name\parametersets\para_set_appcode\para_set_appcode.txt. para_set_schema defines all the schema names used to access the database of each data source; its value file is para_set_schema.txt, stored under the IIS installation directory at $project_name\parametersets\para_set_schema\para_set_schema.txt.
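As an illustration of the value-file location convention just described, the snippet below builds the expected path for a parameter set under a project directory. The project directory shown is hypothetical; only the parametersets/set name/set name.txt layout comes from the text above.

```python
from pathlib import Path

# Hypothetical IIS project directory; substitute the real $project_name location.
IIS_PROJECT_DIR = Path("/opt/IBM/InformationServer/Server/Projects/myproj")

def value_file(param_set):
    """Return the expected value-file path for a parameter set."""
    return IIS_PROJECT_DIR / "parametersets" / param_set / f"{param_set}.txt"

print(value_file("para_set_appcode"))  # .../parametersets/para_set_appcode/para_set_appcode.txt
print(value_file("para_set_schema"))   # .../parametersets/para_set_schema/para_set_schema.txt
```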
Handling of Case 1.2 (database connection parameters): define data connection objects in DataStage, assign the parameter values when defining the parameters of the data connection objects, and load the corresponding connection object in the job. Because there are only a few data connection objects, user names, passwords, and other related parameters are maintained directly in the data connection objects rather than passed in as parameter values.
Handling of Case 1.3 (paths of temporarily output text files): the path information is defined in environment variables. Two environment variables are defined: ds_file_temp_dir and ds_file_error_dir. The value of ds_file_temp_dir is a path under the IIS installation directory, $project\ds_file_temp, which stores the temporary intermediate files generated while jobs run; the ds_file_temp folder must be created in the corresponding directory beforehand. The value of ds_file_error_dir is a path under the IIS installation directory, $project\sdetl\ds_file_error, which stores the data quality error information produced while jobs run; this error information is written to text files as part of the job design, and the ds_file_error folder must be created in the corresponding directory beforehand.
Handling of Case 2.1: define the parameter set para_set_business domain_data, which defines all date-related parameters. The parameter values are supplied through the parameter set value file: each business domain has its own value file, which controls the time parameters of that business domain independently. The value file for each business domain is generated automatically through an ETL job.
Handling of Case 2.2: depends on the specific project business.
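As a small illustration of the ds_file_temp_dir / ds_file_error_dir convention in Case 1.3 above, a helper script might resolve these paths and pre-create the folders roughly as follows. The fallback paths are hypothetical; a real project would point the variables at the IIS installation directory as described in the text.

```python
import os
from pathlib import Path

def resolve_dir(var_name, fallback):
    """Read a directory path from an environment variable and make sure the folder exists."""
    path = Path(os.environ.get(var_name, fallback))
    path.mkdir(parents=True, exist_ok=True)   # the folder must exist before jobs write to it
    return path

temp_dir = resolve_dir("ds_file_temp_dir", "/tmp/ds_file_temp")     # intermediate files
error_dir = resolve_dir("ds_file_error_dir", "/tmp/ds_file_error")  # data quality errors

print(temp_dir, error_dir)
```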
1.6 Maintenance Specifications
After an ETL run completes, it is usually good practice to obtain the relevant log information and record it in a database. For DataStage-based development, a complete log processing solution can be provided to obtain the log information: the code that writes logs back to the database can, with simple modifications, import the logs into a database or other container to generate maintenance reports.

1.6.1 Log file reports
Summary report: per sequence job, including the number of jobs, the number of jobs with warnings, the number of successful jobs, the number of failed jobs, the start time, the end time, and so on.
Detail report: sequence job name, job name, start time, end time, running status, and so on.
Clicking a job name in the job maintenance document should open the mapping design of that job.
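To show how the detail report could be rolled up into the summary report described above, here is a rough Python sketch. The detail rows, job names, and status values are hypothetical sample data; only the report columns themselves come from the text.

```python
from collections import Counter

# Hypothetical detail-report rows: one per job run within a sequence job.
detail = [
    {"seq_job": "seq_crm_daily", "job": "ods_crm_t_customer",
     "start": "2012-01-02 01:00", "end": "2012-01-02 01:10", "status": "OK"},
    {"seq_job": "seq_crm_daily", "job": "ods_crm_t_order",
     "start": "2012-01-02 01:10", "end": "2012-01-02 01:25", "status": "WARNING"},
]

def summarize(rows, seq_job):
    """Build the summary-report line for one sequence job from its detail rows."""
    mine = [r for r in rows if r["seq_job"] == seq_job]
    status = Counter(r["status"] for r in mine)
    return {
        "seq_job": seq_job,
        "jobs": len(mine),
        "warning_jobs": status["WARNING"],
        "successful_jobs": status["OK"],
        "failed_jobs": status["FAILED"],
        "start": min(r["start"] for r in mine),
        "end": max(r["end"] for r in mine),
    }

print(summarize(detail, "seq_crm_daily"))
```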
1.6.2 Backup, recovery, and version control
ETL system backup is mainly used to back up the ETL running environment.
Running backup ensures that ETL can still be completed through a backup ETL system if the running ETL system crashes. To achieve this, two DataStage environments with the same configuration should be installed, one in the running state and the other in standby. After the daily ETL completes, all files of the running environment are backed up by dumping the ETL running directory to a removable disk or other external storage media.
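The following is a minimal sketch of such a post-run backup step, assuming the ETL running directory and the mount point of the external storage media are known; both paths are hypothetical, and a real installation would follow the site's own backup tooling and directory layout.

```python
import shutil
import time
from pathlib import Path

# Hypothetical locations: the ETL running directory and the mounted external media.
ETL_RUN_DIR = Path("/opt/etl/run")
BACKUP_MOUNT = Path("/mnt/etl_backup")

def backup_run_dir():
    """After the daily ETL completes, dump the running directory to external media."""
    stamp = time.strftime("%Y%m%d")
    archive = shutil.make_archive(str(BACKUP_MOUNT / f"etl_run_{stamp}"),
                                  "gztar", root_dir=str(ETL_RUN_DIR))
    return archive

if __name__ == "__main__":
    print("backup written to", backup_run_dir())
```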
ETL system recovery mainly refers to operation recovery.
Operation recovery means switching to the backup system when the running system encounters a serious fault, such as a hardware failure or an operating system crash, that cannot be repaired in time: the ETL environment backed up on the previous day is restored to the standby system, and the standby system is then started to run the daily ETL.
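A matching restore step might look roughly like the sketch below, which unpacks the most recent backup archive onto the standby system's ETL directory; the paths mirror the hypothetical backup sketch above, and starting the standby DataStage environment itself is left to the site's own administration procedures.

```python
import shutil
from pathlib import Path

BACKUP_MOUNT = Path("/mnt/etl_backup")   # same hypothetical media as in the backup sketch
STANDBY_RUN_DIR = Path("/opt/etl/run")   # ETL running directory on the standby host

def restore_latest():
    """Restore the most recent ETL backup archive onto the standby system."""
    archives = sorted(BACKUP_MOUNT.glob("etl_run_*.tar.gz"))
    if not archives:
        raise FileNotFoundError("no ETL backup archive found")
    shutil.unpack_archive(str(archives[-1]), extract_dir=str(STANDBY_RUN_DIR))
    return archives[-1]

if __name__ == "__main__":
    print("restored from", restore_latest())
```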