The Incremental Data Extraction Mechanism in ETL


Incremental extraction is an important consideration in the implementation of data warehouse ETL (Extraction, Transformation, Loading). In the ETL process, the efficiency and feasibility of incremental updating is one of the key problems of ETL implementation; the incremental updating mechanism is comparatively complicated and usually depends on the type of the source data system and the performance requirements for incremental updates.

1 ETL Overview

ETL comprises data extraction, transformation, and loading. ① Data extraction: extract data from the source data system. ② Data transformation: convert the data obtained from the source into the form required by the destination data source according to business requirements, and clean up erroneous and inconsistent data. ③ Data loading: load the transformed data into the destination data source.

As a link in building a data warehouse, ETL is responsible for extracting data from distributed, heterogeneous data sources, such as relational data and flat data files, into a temporary middle layer, where it is cleaned, transformed, and integrated before finally being loaded into the data warehouse or data mart, forming the basis for online analytical processing and data mining. ETL was originally used mainly in building data warehouses and business intelligence projects; nowadays it is also increasingly used for data migration, exchange, and synchronization in general information systems.

Among the three links of ETL, data extraction directly faces a wide variety of scattered, heterogeneous data sources. How to extract the correct data from these sources stably and efficiently is one of the key issues to consider in ETL design and implementation.

When data is initialized on the integration side, all the data generally has to be loaded from the data source side; this is where full extraction is needed. Full extraction is similar to data migration or data replication: it extracts the data of a table or view in the data source from the database in its entirety, then performs the subsequent transformation and loading operations. Full extraction can be implemented with data replication, import, or backup, so its mechanism is relatively simple. Once a full extraction has completed, subsequent extraction operations only need to extract the data added or modified in the table since the last extraction; this is incremental extraction.

In a data warehouse, whether the extraction is full or incremental, it is typically performed by data warehousing tools, such as Oracle's OWB, SQL Server's Integration Services, or professional commercial ETL products such as Informatica PowerCenter. If the budget is limited, the open source project Pentaho can also be considered. These tools share one characteristic: they do not implement any specific incremental extraction mechanism themselves. After completing a full extraction, the user can schedule tasks that extract from the source system the incremental data generated in the current period; but as for how that incremental data is produced, the tools offer no feature that generates it automatically. The mechanism for producing incremental data in the ETL process is therefore a problem that users must research and choose for themselves.

2 Incremental extraction mechanisms

The key to incremental extraction is how to capture the changed data accurately and quickly. A good incremental extraction mechanism must let ETL capture the changed data in the business system accurately, at a certain frequency, without putting too much pressure on the business system or disturbing the existing business. Designing incremental extraction is much more complicated than full extraction; there is even an ETL design idea of automatically converting a full extraction process into an incremental one, but the changed data must still be captured. The methods commonly used to capture changed data in incremental extraction are as follows:

2.1 Trigger method

The trigger method is a commonly adopted incremental extraction mechanism. Based on the extraction requirements, three triggers (insert, update, delete) are created on the source table to be extracted. Whenever the data in the source table changes, the corresponding trigger writes the change into a delta log table. ETL incremental extraction then reads from the delta log table instead of directly from the source table, and the extracted records in the delta log table are promptly marked as processed or deleted.

For simplicity, the delta log table generally does not store the complete field information of the changed data, only the source table name, the key value of the updated record, and the type of update operation (insert, update, or delete). During incremental extraction, the ETL process first uses the source table name and key value to fetch the corresponding complete record from the source table, and then applies the appropriate processing to the target table according to the operation type.

For example, for an Oracle source database, the process of capturing incremental data with triggers is as follows:

(1) Create the incremental log table DML_LOG:

CREATE TABLE DML_LOG (
    ID           NUMBER PRIMARY KEY,   -- surrogate primary key, populated from a sequence
    TABLE_NAME   VARCHAR2(200),        -- name of the source table
    RECORD_ID    NUMBER,               -- primary key value of the changed source record
    DML_TYPE     CHAR(1),              -- change type: I = insert, U = update, D = delete
    EXECUTE_DATE DATE                  -- time the change occurred
);

(2) Create a sequence SEQ_DML_LOG for DML_LOG, so that the trigger can generate ID values when writing to the delta log table.
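In Oracle this is a single statement:

-- Sequence used by the trigger to populate DML_LOG.ID
CREATE SEQUENCE SEQ_DML_LOG START WITH 1 INCREMENT BY 1;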

(3) Create a trigger for each table to be monitored, for example the following trigger on table T:

CREATE OR REPLACE TRIGGER TRI_DML_T
    BEFORE INSERT OR UPDATE OR DELETE ON T
    FOR EACH ROW
DECLARE
    L_DML_TYPE VARCHAR2(1);
BEGIN
    IF INSERTING THEN
        L_DML_TYPE := 'I';
    ELSIF UPDATING THEN
        L_DML_TYPE := 'U';
    ELSIF DELETING THEN
        L_DML_TYPE := 'D';
    END IF;

    IF DELETING THEN
        -- for deletes the key must be read from :OLD, because :NEW is empty
        INSERT INTO DML_LOG (ID, TABLE_NAME, RECORD_ID, EXECUTE_DATE, DML_TYPE)
        VALUES (SEQ_DML_LOG.NEXTVAL, 'T', :OLD.ID, SYSDATE, L_DML_TYPE);
    ELSE
        INSERT INTO DML_LOG (ID, TABLE_NAME, RECORD_ID, EXECUTE_DATE, DML_TYPE)
        VALUES (SEQ_DML_LOG.NEXTVAL, 'T', :NEW.ID, SYSDATE, L_DML_TYPE);
    END IF;
END;
/

In this way, all DML operations on table T are recorded in the delta log table DML_LOG. Note that the delta log table does not record the complete delta data itself, only where the incremental data came from. When running an incremental ETL, the real incremental data is obtained by joining back to the source table using the records in the delta log table.
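As a minimal sketch of that back-join (assuming, as above, that table T keeps its primary key in column ID; how the consumed log entries are cleaned up is left to the implementation):

-- Full records for logged inserts and updates, joined back from the source table
SELECT t.*, l.DML_TYPE
FROM DML_LOG l
JOIN T t ON t.ID = l.RECORD_ID
WHERE l.TABLE_NAME = 'T'
  AND l.DML_TYPE IN ('I', 'U');

-- Deleted keys come from the log alone, since those rows no longer exist in T
SELECT l.RECORD_ID
FROM DML_LOG l
WHERE l.TABLE_NAME = 'T'
  AND l.DML_TYPE = 'D';

-- After processing, mark or purge the consumed log entries, e.g.
DELETE FROM DML_LOG WHERE TABLE_NAME = 'T';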

2.2 Timestamp method

The timestamp method means that, during incremental extraction, the extraction process decides which data to extract by comparing the value of a timestamp field in the source table against the time of the last extraction. This approach requires adding a timestamp field to the source table and updating its value whenever the system updates the table's data.

Some databases (such as SQL Server) support automatically updated timestamps: whenever data in any other field of the row changes, the timestamp field is automatically updated to the moment of the change. In this case the ETL implementation only needs to add a timestamp field to the source table. For databases that do not support automatic timestamp updates, the business system must update the timestamp field programmatically whenever it updates the business data.
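As a minimal sketch for an Oracle source (the table T and its added LAST_MODIFIED column are hypothetical), the timestamp can be maintained by a trigger, after which extraction reduces to a range query:

-- Keep LAST_MODIFIED current on every insert or update
CREATE OR REPLACE TRIGGER TRI_T_STAMP
    BEFORE INSERT OR UPDATE ON T
    FOR EACH ROW
BEGIN
    :NEW.LAST_MODIFIED := SYSDATE;
END;
/

-- Extract only rows changed since the previous run; :LAST_EXTRACT_TIME is a bind
-- variable holding the time recorded at the end of the previous extraction
SELECT * FROM T WHERE LAST_MODIFIED > :LAST_EXTRACT_TIME;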

The timestamp method can normally capture insert and update operations on the source table, but it can do nothing about deletes, which must be handled in combination with other mechanisms.

2.3 Full-table delete-insert method

Full-table delete-insert means deleting the target table's data before each extraction and then loading the data in full. This approach actually turns the incremental extraction into a full extraction. When the data volume is small, the time cost of a full extraction can be lower than the algorithmic and bookkeeping costs of performing an incremental extraction, and this approach can then be used.
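A minimal sketch of this approach (SOURCE_T and TARGET_T are hypothetical names):

-- Empty the target table, then reload it completely from the source
TRUNCATE TABLE TARGET_T;
INSERT INTO TARGET_T SELECT * FROM SOURCE_T;
COMMIT;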

2.4 Full-table comparison method

During incremental extraction, the full-table comparison method has the ETL process compare the records of the source table and the target table one by one, reading out the records that are new or modified.

An optimized form of full-table comparison uses MD5 checksums. An MD5 temporary table with a structure similar to the source table is created in advance; it records, for each source record, the primary key value and an MD5 checksum computed over all of the record's fields. On each extraction, the checksums in the source table and the MD5 temporary table are compared. If the checksums for a key differ, an update operation is performed; if a primary key value appears in the source but not in the MD5 temporary table, the record is new and an insert operation is performed; and key values that remain in the MD5 temporary table but no longer exist in the source table require a delete operation on the target.
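As a hedged sketch of the checksum comparison (all table and column names are hypothetical; Oracle 12c's STANDARD_HASH is used here as a stand-in for a hand-computed MD5, and the '|' separator merely reduces accidental collisions when concatenating fields):

-- New or changed rows: no matching key, or a differing checksum, in the MD5 table
SELECT s.ID, s.MD5
FROM (SELECT ID,
             STANDARD_HASH(COL1 || '|' || COL2, 'MD5') AS MD5
      FROM SOURCE_T) s
LEFT JOIN MD5_TEMP m ON m.ID = s.ID
WHERE m.ID IS NULL            -- new record: insert into the target
   OR m.MD5 <> s.MD5;         -- changed record: update the target

-- Deleted rows: keys still in the MD5 table but gone from the source
SELECT m.ID
FROM MD5_TEMP m
WHERE NOT EXISTS (SELECT 1 FROM SOURCE_T s WHERE s.ID = m.ID);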

2.5 Log table method

For a production database underlying a business system, a business log table can be created in the database, and the corresponding business system program modules update the log table's contents whenever the business data that needs to be monitored changes. During incremental extraction, the log table is read to decide which data to load and how to load it. Maintaining the log table must be done in code by the business system's programs.
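A sketch of what such application-maintained logging might look like (all names hypothetical): the business module writes a log row in the same transaction as its own DML.

-- Business DML and its log entry committed together
UPDATE ORDERS SET STATUS = 'SHIPPED' WHERE ORDER_ID = 1001;
INSERT INTO BIZ_LOG (LOG_ID, TABLE_NAME, RECORD_ID, OP_TYPE, OP_TIME)
VALUES (SEQ_BIZ_LOG.NEXTVAL, 'ORDERS', 1001, 'U', SYSDATE);
COMMIT;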

2.6 System log analysis method

This method identifies changed data by analyzing the database's own logs. A relational database system stores all DML operations in log files to support its backup and recovery functions. By analyzing the database logs and extracting the DML operation information recorded against the relevant source tables after a certain point in time, the ETL incremental extraction process can learn how the tables' data has changed since the last extraction, and use that to drive the incremental extraction.

Some database systems provide dedicated packages for accessing the logs (such as Oracle's LogMiner), which greatly simplifies parsing the database logs.
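As a hedged sketch of LogMiner usage (the log file path is hypothetical, and the available options vary by Oracle version):

BEGIN
    -- Register the redo/archive log file to be analyzed
    DBMS_LOGMNR.ADD_LOGFILE(
        LOGFILENAME => '/oracle/arch/arch_0001.log',
        OPTIONS     => DBMS_LOGMNR.NEW);
    -- Start the analysis session, reading the dictionary from the online catalog
    DBMS_LOGMNR.START_LOGMNR(
        OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
END;
/

-- DML recorded against the source table; SQL_REDO holds a reconstructed statement
SELECT SCN, OPERATION, SQL_REDO
FROM V$LOGMNR_CONTENTS
WHERE SEG_NAME = 'T'
  AND OPERATION IN ('INSERT', 'UPDATE', 'DELETE');

-- Close the LogMiner session
BEGIN
    DBMS_LOGMNR.END_LOGMNR;
END;
/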

2.7 Methods specific to particular databases

Incremental extraction methods unique to particular database systems include the following:

2.7.1 Oracle Change Data Capture (CDC)

The Oracle CDC feature was introduced in the Oracle 9i database. CDC helps identify the data that has changed since the last extraction. With CDC, changes can be captured at the same time as insert, update, or delete operations are performed on the source table, and the changed data is stored in a change table in the database. The captured change data can then be provided to the ETL extraction process in a controlled way through database views, as the basis for incremental extraction.

CDC captures changes to source table data in two ways: synchronous CDC and asynchronous CDC. Synchronous CDC uses source database triggers to capture changed data; it is real-time, with no delay, and the change data appears in the change table as soon as the DML operation commits. Asynchronous CDC instead reads the database redo log (redo log) files to capture data after the changes have occurred in the source database.
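As a heavily hedged sketch of creating a synchronous CDC change table with the DBMS_CDC_PUBLISH package (named DBMS_LOGMNR_CDC_PUBLISH in 9i; the schema, table, and column names here are hypothetical, and the exact parameters should be checked against the documentation for the Oracle version in use):

BEGIN
    DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
        owner             => 'CDCPUB',      -- hypothetical publisher schema
        change_table_name => 'T_CT',
        change_set_name   => 'SYNC_SET',    -- the predefined synchronous change set
        source_schema     => 'SCOTT',       -- hypothetical source schema
        source_table      => 'T',
        column_type_list  => 'ID NUMBER, NAME VARCHAR2(50)',  -- hypothetical columns
        capture_values    => 'both',        -- keep both old and new values for updates
        rs_id             => 'y',
        row_id            => 'n',
        user_id           => 'n',
        timestamp         => 'y',
        object_id         => 'n',
        source_colmap     => 'y',
        target_colmap     => 'y',
        options_string    => NULL);
END;
/

-- The ETL process then reads captured changes from the change table;
-- OPERATION$ encodes the DML type (I, D, UO = update old, UN = update new)
SELECT OPERATION$, ID, NAME FROM CDCPUB.T_CT;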

2.7.2 Oracle flashback query

The Oracle 9i database introduced a flashback query mechanism that lets users query the state of the database at some point in the past. With it, the extraction process can compare the current state of the source tables with their state at the time of the last extraction and quickly obtain the changes to the source table records.
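A minimal sketch, assuming table T and a bind variable :LAST_EXTRACT_TIME holding the previous extraction time (which must still fall within the database's undo retention window):

-- Rows present now but not at the last extraction time: inserts and new versions
SELECT * FROM T
MINUS
SELECT * FROM T AS OF TIMESTAMP :LAST_EXTRACT_TIME;

-- Rows present then but absent or changed now: deletes and old versions
SELECT * FROM T AS OF TIMESTAMP :LAST_EXTRACT_TIME
MINUS
SELECT * FROM T;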

3 Comparison and analysis

As can be seen, various mechanisms are available for incremental extraction in ETL. Their pros and cons are compared and analyzed below from four aspects: compatibility, completeness, performance, and intrusiveness.

Compatibility

Data extraction has to face the source systems as they are, and not all of them are relational database systems. It is not uncommon for an ETL process to need to extract data from Excel or CSV text files left behind by a legacy system built years ago. In such cases, none of the incremental mechanisms built on relational database features can work; the timestamp method and the full-table comparison method may retain some value, and in the worst case the idea of incremental extraction has to be abandoned in favor of full-table delete-insert.

Completeness

In terms of completeness, the timestamp method cannot capture delete operations on the source table and must be used in combination with other methods.

Performance

The performance factors of incremental extraction show up in two aspects: the performance of the extraction process itself, and the negative effect on the performance of the source system. The trigger, log table, and system log analysis methods give better extraction performance, because no comparison step is needed during extraction; full-table comparison requires a complex comparison process to identify the changed records and has the worst extraction performance. As for the impact on the source system: the trigger method creates triggers directly on the source business tables and writes to a delta table on every change, so frequently updated business systems may suffer some performance loss, and bulk operations on the business tables can be seriously affected; synchronous CDC is implemented by means of triggers and has the same problem. The full-table comparison method has no effect on the source database's performance, and the log table method adds only the extra log-writing statements the business system must issue, whose time cost is small. The timestamp method and system log analysis (including asynchronous CDC and flashback query) likewise have very little impact on database performance.

Intrusiveness

Intrusiveness toward the data source system refers to whether the business system must be functionally modified, or must perform extra operations, to implement the incremental extraction mechanism. Here the timestamp method deserves special attention: besides changing the source system's table structure, for relational database products that do not support automatic timestamp updates, the business system itself must be modified to explicitly update the table's timestamp field on every operation on the source table. This demands a high degree of control over the data source system during ETL implementation, and in most cases such a demand on the data source system seems "excessive", which is the main reason the timestamp method is not widely used. In addition, the trigger method requires creating triggers on the source tables, which is also rejected in some situations. Methods that need temporary tables, such as full-table comparison and the log table method, may be impossible to implement because of restrictions on the database permissions granted to the ETL process. The same can happen with system log analysis, because most database products allow only a specific group of users, or even only DBAs, to analyze the logs. Flashback query has the least intrusive impact.

Summary:

Comparing the various incremental extraction mechanisms shows that none has an absolute advantage; taken across the different factors, their strengths and weaknesses are roughly balanced. Mechanisms with poor compatibility, such as CDC and flashback query, make full use of the source DBMS's own features and therefore do comparatively well overall, while the simplest and most compatible mechanism, full-table delete-insert, pays with extraction performance. The system log analysis method requires no functional changes to the source business system and no triggers on the source tables, and its extraction performance is good, but it may require opening DBA privileges on the source system to the ETL extraction process, and parsing the logs oneself is difficult because log formats differ between database systems, which limits its applicability to some extent. Therefore, which incremental extraction mechanism to choose in an ETL implementation must be decided according to the actual data source environment, weighing factors such as the type of the source database, the volume of data to be extracted (which determines how severe the performance requirements are), the degree of control over the source business system and its database, and the difficulty of implementation. For data source systems in differing environments, several incremental mechanisms may even be combined within one ETL implementation.

4 Concluding remarks

Incremental extraction is an important part of ETL data extraction for keeping data warehouse data efficiently up to date, and its implementation mechanism directly determines the overall effect of the ETL implementation. By comparing several common incremental extraction mechanisms, this article has summarized their characteristics and analyzed their merits and drawbacks. Every incremental extraction mechanism has its value and its inherent restrictions; in ETL design and implementation, only by comprehensively considering the project's actual environment, and if necessary testing the candidate mechanisms in practice, can an optimal incremental extraction method be determined.
