BI Architecture Key Links: ETL-Related Knowledge


BI Architecture - BI Key Links: ETL-Related Knowledge

Main function: load the data of the source systems into the data warehouse and data mart layers. The main difficulty is the complexity of the source data environment: a wide variety of data types, huge load volumes, intricate data relationships, and uneven data quality.

Common terminology:
- ETL: Extract / Transform / Load
- EXF: Extract File
- CIF: Common Interface File
- PLF: Preload File
- LDF: Load File
- DW: Data Warehouse
- DM: Data Mart
- GC: CIF Group, a job group that combines a pair of EX (extract) and CV (convert) programs
- GE: Entity Group, a job group that combines TR (transform) and LD (load) programs

ETL function architecture:
The ETL function architecture can be divided into three parts.

1. Management and scheduling
A daily ETL schedule is developed according to the update cycle of each target data table and the time at which the source data is ready. Using the job-scheduling function of the ETL tool, the administrator sets run times so that the tool automatically starts the corresponding ETL jobs when the specified conditions are met. The ETL process of each target data table corresponds to a set of sequentially executed entity jobs (transform jobs and load jobs) that form a sequence (Sequence); the ETL process of each CIF (Common Interface File) corresponds to a set of sequentially executed CIF jobs (extract jobs and convert jobs). These jobs tie together the individual ETL steps, namely extraction, conversion, transformation, and loading. Job scheduling controls the running of the ETL process by linking the CIF-level and entity-level activities through the correspondence between GC (CIF Group) and GE (Entity Group).

2. Application functions
The ETL application module hierarchy contains the programs that implement each ETL step, plus the procedures that combine these steps and set dependencies among them, i.e. the Extract, Convert, Transform, and Load programs. Each module implements a specific function:
- Data extraction (Extract)
- Data transformation (Convert/Clean)
- Data conversion (Transform)
- Data load (Load)

Each stage uses data files as its interface (see the sketch at the end of this section): the Extract stage reads the data source and generates EXF data; the Convert/Clean stage reads EXF data and produces CIF data; the Transform stage reads CIF data and produces LDF data (and, if there is a preload step, possibly intermediate PLF data); the Load stage reads the LDF data into the data warehouse or data mart.

Because extraction, conversion, and loading are separated, with the CIF format acting as a bridge between the target tables and the data sources, each function is relatively independent. This reduces the coupling between functions; since each module's work is subdivided and its logic is simpler, development errors are easier to control and development efficiency improves. It also makes error tracing and recovery from abnormal conditions easier during system operation. To verify the correctness of the data loaded by the ETL, a data balance check (Amount Balance Check) may need to be performed in the data warehouse or data mart, depending on business requirements; this must be done after the relevant table data has been loaded.

There are no calls between ETL modules; the only possible relationship between them is that the output file of one module is a file that another module reads and processes. An ETL function module is an ETL job, and each module's logic is only a relatively independent part of the whole ETL process.

3. Control environment
The ETL function modules need to be controlled by corresponding parameters; there are also many control files shared between modules, calls to common functions, reject files that modules may produce while running, and monitoring information produced as they run. The environment that controls and supports the operation of the ETL function modules, together with the corresponding maintenance and management programs, makes up the ETL control environment. The control environment supports both layers of the ETL application described above; the independent function modules at the application layer are connected through the higher-level logical relationships, which keeps each module's function clear.
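The file-chained flow above can be pictured with a minimal Python sketch. It is not the article's actual tooling: the file names, the CSV layout, and the toy validity rule are hypothetical, and the load step is stubbed out.

```python
# Minimal sketch: four ETL stages chained purely through files, so no module
# ever calls another. All file names and the validity rule are hypothetical.
import csv

def extract(source_csv, exf_path):
    """Extract stage: read the source and write an EXF file unchanged."""
    with open(source_csv, newline="") as src, open(exf_path, "w", newline="") as exf:
        csv.writer(exf).writerows(csv.reader(src))

def convert(exf_path, cif_path, reject_path):
    """Convert stage: standardise fields; non-conforming rows go to a reject file."""
    with open(exf_path, newline="") as exf, \
         open(cif_path, "w", newline="") as cif, \
         open(reject_path, "w", newline="") as rej:
        ok, bad = csv.writer(cif), csv.writer(rej)
        for row in csv.reader(exf):
            if all(field.strip() for field in row):   # toy validity rule: no empty fields
                ok.writerow([field.strip() for field in row])
            else:
                bad.writerow(row)

def transform(cif_path, ldf_path):
    """Transform stage: reshape CIF records into the target-table layout (LDF)."""
    with open(cif_path, newline="") as cif, open(ldf_path, "w", newline="") as ldf:
        writer = csv.writer(ldf)
        for row in csv.reader(cif):
            writer.writerow(row)        # real logic would merge/split/lookup/aggregate here

def load(ldf_path):
    """Load stage: bulk-load the LDF file into the warehouse (stubbed as a print)."""
    with open(ldf_path, newline="") as ldf:
        print(f"would bulk-load {sum(1 for _ in ldf)} rows from {ldf_path}")

# The stages never call each other; they communicate only through files:
# extract("source.csv", "cust.exf")
# convert("cust.exf", "cust.cif", "cust.rej")
# transform("cust.cif", "cust.ldf")
# load("cust.ldf")
```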
ETL modes
- Full refresh (Refresh, Type 1): the database table contains only the most recent data; each time it is loaded, the existing data is deleted and the latest source data is loaded in full. This mode is used for loading most parameter tables. In this mode the extractor extracts all records from the source, the target table is emptied before loading, and then all records are loaded. To speed up deletion, TRUNCATE is generally used to clear the table rather than SQL DELETE.
- Snapshot append (Type 2): records in the source data are updated regularly, but each record includes a record-time field and the source keeps the data's history, so the ETL can use the record time to extract the incremental data and load it into the database, where the history is preserved.
- Event append (Type 3): every record is a new event; new records are not changes to the values of existing records. Records include a time field, which can be used to extract the new data and load it into the database.
- Snapshot delta (Type 4): the data warehouse data has effective-date fields to keep the history of the data, while the source does not keep history and may be updated daily. The only option is to compare the new snapshot with the image of the data loaded last time, find the changed part (delta), close the effective end date of the records whose history changed, and insert the changed data.

Data extraction (Extract)
Data extraction is the process of getting the required data from a data source. Its main tasks are:
- Range filtering: extract all records of the source table, or extract incrementally by a specified date.
- Field filtering: take all fields of the source table, or filter out the source fields that are not needed.
- Condition filtering: for example, keep only the records that meet specified conditions.
- Sorting: for example, sort the data by selected fields.

Extraction can be done in two different ways, pull and push. Push means that the source system, following a data format agreed by both parties, actively extracts the data that meets the requirements and forms an interface data table or view for the ETL system to use. Pull means that the ETL program accesses the data source directly to get the data (a pull-style incremental extraction is sketched below).
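As an illustration of pull-style incremental extraction by record time (the snapshot/event append modes above), here is a minimal sketch. The table, columns, and file names are hypothetical, and sqlite3 merely stands in for the real source database.

```python
# Minimal sketch of pull-style incremental extraction by record time.
# Table, column and file names are hypothetical; sqlite3 stands in for the
# actual source database.
import csv
import sqlite3

def extract_incremental(conn, last_load_ts, exf_path):
    """Pull only records whose record_time is newer than the last load."""
    cur = conn.execute(
        "SELECT account_id, amount, record_time "
        "FROM src_transactions WHERE record_time > ? ORDER BY record_time",
        (last_load_ts,),
    )
    with open(exf_path, "w", newline="") as exf:
        writer = csv.writer(exf)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)

# usage: extract everything recorded after the previous ETL run
# conn = sqlite3.connect("source.db")
# extract_incremental(conn, "2024-01-31 23:59:59", "transactions.exf")
```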
Data transformation (Convert)
The task of the Convert step is to check each record and convert every field to the data warehouse's standard data format, that is, to convert data types and formats, give appropriate default values to empty fields, and give the data a regular structure; records that do not conform are written to a reject file. The main work of data transformation is as follows (a minimal sketch follows the list):
- Format conversion, e.g. unifying all date formats to yyyy-mm-dd.
- Default values: for a field that is defined to have a value in the data warehouse, some records in the corresponding source field may have no value. Depending on business needs there are two ways to handle this: write the record to the reject file, so that the business unit can check and patch the source data based on it, or assign a default value directly in the Convert phase.
- Type conversion, e.g. converting a numeric type in the source system to VARCHAR2; or, after some fields have undergone a code upgrade, converting old codes to new codes, and so on.
- Value conversion, e.g. changing the unit of an amount from millions to yuan.
- Removing spaces: trimming trailing spaces from character data.
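A minimal sketch of the Convert step, under the assumptions that each record has three columns (date, customer type, amount) and that the listed source date formats are the ones encountered; both assumptions are hypothetical.

```python
# Minimal sketch of the Convert step: unify dates to yyyy-mm-dd, assign a
# default for an empty field, and write non-conforming records to a reject
# file. The column layout (date, customer type, amount) is hypothetical.
import csv
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")   # assumed source formats

def standardise_date(value):
    """Try the known source formats and return a yyyy-mm-dd string, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def convert(exf_path, cif_path, reject_path):
    with open(exf_path, newline="") as exf, \
         open(cif_path, "w", newline="") as cif, \
         open(reject_path, "w", newline="") as rej:
        ok, bad = csv.writer(cif), csv.writer(rej)
        for date_str, cust_type, amount in csv.reader(exf):
            date = standardise_date(date_str)
            cust_type = cust_type.strip() or "UNKNOWN"   # default for an empty field
            if date is None or not amount.strip():       # non-conforming record
                bad.writerow([date_str, cust_type, amount])
            else:
                ok.writerow([date, cust_type, amount.strip()])
```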
Data conversion (Transform)
According to the data structure of the target table, the fields of one or more source datasets are translated, matched, aggregated, and so on to obtain the fields of the target data. Data conversion mainly includes field merging and splitting, data translation, data matching, data aggregation, and other complex computations.
- Field merging and splitting: field merging combines multiple fields of the source data into one field of the target data; field splitting splits one field of a source table into multiple fields of the target data.
- Assigning default values: for fields in the warehouse that have no corresponding source field, a value may need to be assigned by default, depending on the model design. This may involve more complex data remediation rules than the default-value handling in the Convert phase.
- Data sorting (Sort): the transform step sometimes needs to merge two or more CIF files; before merging, the CIF files must be sorted on the required key values so that the merge runs faster. The sorting takes place before the transform.
- Data lookup (Lookup): translating codes in the source system, such as status or type codes, directly into the meaning they express, or vice versa. Data translation uses a reference table, which is usually a dictionary table or is generated manually from the definitions of the source and target data. If a code cannot be matched in the reference table, the corresponding record is rejected or assigned a default value, according to business rules.
- Data merging (Merge): merging data according to a certain condition (usually equal key values) to find the records that describe the same object in different tables and link them. Data merging is really a special case of data lookup, used mainly when the data volume is very large. It is generally implemented by first sorting the two tables to be merged and then matching and merging their records sequentially, which greatly speeds up processing.
- Data aggregation (Aggregate): summarizing the data with statistical calculations over different groupings, generally used to compute summary tables. The main types of aggregation are sum, average, record count, minimum, and maximum. In principle, ETL handles only regular, repetitive aggregations over large volumes of data, such as sums, averages, maxima, and minima, and not complex computations, in order to reduce development cost and system load. Irregular and complex computations should be done either on the source system side or by developing a dedicated calculation program (such as a stored procedure) on the data warehouse side that is called after the ETL load.
- File comparison (File Compare): corresponding to the four ETL modes, the files produced by data conversion need different handling. In Type 1, Type 2, and Type 3 modes, the generated LDF file can be loaded directly into the data warehouse by the load process. In Type 4, because of the effective dates, the day's snapshot data (PLF file) must be compared with the image of the historical data (PLF file) to identify the records that need to be inserted and those that need to be updated; only then is the LDF file produced that the load process can actually load into the database.
- Surrogate key assignment: for surrogate keys designed into the data warehouse, there are generally two ways to allocate the key values: use the database's auto-increment feature, or assign the key values in the ETL process. Surrogate key values are generally numeric, and the assignment must be guaranteed not to produce duplicates. The latter approach is taken in this project.
- Slow change capture: for slowly changing dimension tables ("zipper" tables) that contain fields such as effective date and end date, if the source cannot provide incremental information (including via timestamps), the ETL needs to generate a PLF file for the day's snapshot and compare it with the previous day's data image (obtained from the target data snapshot area) to identify the records that need to be inserted and updated, and then generate the LDF file that can actually be loaded into the database (a sketch of this snapshot comparison follows this list).
- Row/column conversion (Pivot): some special requirements of the model make it necessary to convert a horizontal table of the source system into a vertical table, or vice versa. For example, if the source table has 12 fields for the 12 months and the target table uses a single month field instead, one source record is converted into 12 records, one per month.
- RI check: for tables with referential-integrity relationships, RI is checked and data with RI problems is rejected. Because the Greenplum database used in this project cannot enforce foreign-key constraints, using the ETL to check RI is a critical point for ensuring data quality.
- Other complex calculations: some fields defined in the warehouse need to be calculated according to complex business rules. There are two main categories: 1. fields that have no direct counterpart in the data source and need to be derived from related fields during the ETL process; 2.

In principle, complex calculations are not done in the ETL. For a field whose complex calculation does need to be completed by the ETL, the field is left blank when the ETL loads, and a stored procedure is called to calculate it after the load completes; for unified management and scheduling, the stored procedure can be called from the ETL scheduling tool.
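The snapshot comparison used by Type 4 loads and slow change capture can be sketched as follows; the key is assumed to be the first column, which is a hypothetical layout.

```python
# Minimal sketch of snapshot comparison (Type 4 / slow change capture):
# compare today's snapshot with yesterday's image, keyed on a hypothetical
# primary key in the first column, and split the delta into inserts and updates.
import csv

def snapshot_delta(today_plf, yesterday_plf, key=0):
    """Return (inserts, updates): rows new today, and rows whose values changed."""
    with open(yesterday_plf, newline="") as f:
        previous = {row[key]: row for row in csv.reader(f)}
    inserts, updates = [], []
    with open(today_plf, newline="") as f:
        for row in csv.reader(f):
            old = previous.get(row[key])
            if old is None:
                inserts.append(row)   # brand-new record
            elif old != row:
                updates.append(row)   # changed record: close old effective date, add new
    return inserts, updates

# usage:
# new_rows, changed_rows = snapshot_delta("customer_today.plf", "customer_prev.plf")
```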
Data load (Load)
The structure of the LDF file generated by data conversion is exactly the same as the structure of the database table, so it can be bulk-loaded directly into the data warehouse by the data loading tool. The loading work is divided into three steps.
- Pre-load: before the data is actually loaded, and mainly for detail tables and large fact tables, the following preparation may be needed for performance reasons, depending on the table being loaded:
  - Delete the table's indexes in the data warehouse.
  - Delete the primary key.
- Load: load the data of the LDF file into a table in the database. Three loading methods are required:
  - insert: simply insert all the data from the LDF file into the target table.
  - upsert: both update and insert against the target table. Based on the primary key, existing records are updated and non-existent records are inserted. For tables with large data volumes this operation is very inefficient, so the LDF file can first be split into a delete file and an insert file; the records in the delete file are then removed from the data warehouse by primary key, and all the records in the insert file are inserted into the target table (a sketch of this split approach follows this section).
  - refresh: the target table's data is fully replaced; the usual practice is to TRUNCATE the target table and then insert all the records to be loaded.
- Post-load: finish the related environment work after the data load completes, mainly:
  - Rebuild the indexes: if the table's indexes were deleted in the pre-load phase, rebuild them in the post-load phase.
  - Re-create the primary key: if the primary key was deleted in the pre-load phase, re-create it in the post-load phase.
  - File cleanup: delete unnecessary temporary files, such as empty reject files.
  - Generate post-load key files: after the load completes, if a subsequent ETL process needs to perform RI checks against this table, extract the table's primary keys into a separate key file.
  - Extract data files: if the loaded table will subsequently be used heavily for lookups by the loads of other tables, consider extracting it to a file.
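A minimal sketch of the split-upsert approach described above, with hypothetical table and column names; sqlite3 stands in for the warehouse, which in this project would really be Greenplum with its own bulk-load tooling.

```python
# Minimal sketch of the "split upsert": instead of row-by-row upserts, delete
# the keys listed in the delete file, then bulk-insert the insert file.
# Table and column names are hypothetical; sqlite3 stands in for the warehouse.
import csv
import sqlite3

def upsert_by_split(conn, delete_file, insert_file):
    with open(delete_file, newline="") as df:
        keys = [(row[0],) for row in csv.reader(df)]       # primary keys to remove
    with open(insert_file, newline="") as inf:
        rows = [tuple(row) for row in csv.reader(inf)]     # full records to insert
    with conn:                                             # one transaction for both steps
        conn.executemany("DELETE FROM dw_account WHERE account_id = ?", keys)
        conn.executemany("INSERT INTO dw_account VALUES (?, ?, ?)", rows)

# usage (assuming dw_account has three columns):
# conn = sqlite3.connect("warehouse.db")
# upsert_by_split(conn, "account.del", "account.ins")
```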
ETL job dependencies and process scheduling
ETL jobs have many dependencies related to the loading order of the data tables. This design expresses the dependencies in two ways. First, after each job group completes, it produces a message file whose name corresponds to the job; a job group that depends on that job must wait for the message file to appear before continuing (a polling sketch appears at the end of this section). Second, jobs that start at the same time and have dependencies among them are combined into a sequence job (Sequence job) with a network of relationships, which is used for ETL process scheduling.

The process-scheduling function of ETL jobs is simple: start the programs at the specified times, and record how the system ran and the results of the run.
- Different data tables have different update cycles, so process scheduling must support a variety of start cycles, such as daily and weekly, and determine when each task starts by setting a start time.
- Only routine data loading needs to consider process scheduling. The loading of initial data and historical data is one-off work and is started manually, so there is no need to establish institutionalized process scheduling for it.
- Before the ETL extraction process runs, the source data it extracts must be ready. This project uses Oracle GoldenGate to maintain a copy of the source system tables in the data buffer; at specified times in the night batch window, a unified snapshot is captured and loaded into the snapshot area (detailed in Chapter 3).
- Manually uploaded data goes through data quality audits on the day of upload and is loaded into a transition table. However, such data flows into the downstream data marts only when it is combined with the system source data in the night run window; in principle, manually uploaded data is available from the front-end application only on the next day.
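The message-file dependency mechanism can be sketched as a simple polling wait; the path, polling interval, and timeout below are hypothetical.

```python
# Minimal sketch of the message-file dependency idea: a dependent job group
# polls for the upstream job's completion file before it starts.
# The path, polling interval, and timeout are hypothetical.
import os
import time

def wait_for_message_file(path, poll_seconds=60, timeout_seconds=6 * 3600):
    """Block until the upstream job's message file appears, or give up."""
    waited = 0
    while not os.path.exists(path):
        if waited >= timeout_seconds:
            raise TimeoutError(f"upstream message file not found: {path}")
        time.sleep(poll_seconds)
        waited += poll_seconds

# usage: wait for the customer-table load to signal completion, then run the
# dependent job group
# wait_for_message_file("/etl/msg/ld_customer.done")
```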
Logical Deployment
Broadly, the deployment is divided into three layers.

User layer: users access the system in three ways: mobile, web, and as system users.

Application layer:
- Load balancer: balances the load of BI-related service requests across the application servers to improve availability. The load-balancing device sits in front of two or more servers and routes each request to the server with the lightest load.
- Data application server: the data assembly part of the whole BI application architecture, supporting efficient use of the management cockpit, indicator queries, thematic analysis, strategic management, and so on. To meet the requirements of high availability and parallel scaling of performance, the data application services need to be deployed in a cluster.
- Data integration server: supports data extraction, data transfer, and data auditing; the servers must meet the requirements of high availability and parallel scaling of performance.
- Systems management server: supports system management control and user rights management.
- Web server: supports web page access, serving static content including HTML, JavaScript, and image files such as JPG; the servers must meet high-availability requirements.

Data resource layer: contains the data buffer, TDR, the data warehouse and data marts, the file library, the log library, and the backup library.
System Overall Module Flow
Module Dependency Relationship
