Incremental extraction methods in ETL

Source: Internet
Author: User

1. Trigger mode
Trigger mode is a commonly used incremental extraction mechanism. Based on the extraction requirements, three triggers (insert, update, and delete) are created on each source table to be extracted. Whenever data in the source table changes, the corresponding trigger writes the change to a delta log table. Incremental ETL extraction then reads from the delta log table rather than directly from the source table, and entries that have been extracted are promptly marked or deleted. For simplicity, the delta log table usually does not store every field of the changed data, only the source table name, the key value of the changed row, and the operation type (insert, update, or delete). The incremental extraction process first uses the source table name and the key value to fetch the complete record from the source table, then applies it to the target table according to the operation type.

For example, when the source table lives in an Oracle database, incremental data capture using a trigger works as follows.

With the setup in the SQL code below, every DML operation on table T is recorded in the delta log table DML_LOG. Note that the delta log table does not store the changed data itself, only where each change came from. During an incremental ETL run, you only need to look up the source table using the records in the delta log table to obtain the actual changed rows.

SQL code:
(1) Create the delta log table DML_LOG:
CREATE TABLE DML_LOG (
  ID           NUMBER PRIMARY KEY,  -- self-incrementing primary key
  TABLE_NAME   VARCHAR2(200),       -- source table name
  RECORD_ID    NUMBER,              -- primary key value of the changed source row
  DML_TYPE     CHAR(1),             -- change type: I = insert, U = update, D = delete
  EXECUTE_DATE DATE                 -- time the change occurred
);

(2) Create a sequence SEQ_DML_LOG for DML_LOG so that the trigger can generate the ID value when it writes to the delta log table.
(3) Create a trigger on each table you want to monitor. For example, a trigger on table T looks like this:
CREATE OR REPLACE TRIGGER T
BEFORE INSERT OR UPDATE OR DELETE ON T
FOR EACH ROW
DECLARE
  l_dml_type VARCHAR2(1);
BEGIN
  -- Map the triggering statement to a one-letter change type
  IF INSERTING THEN
    l_dml_type := 'I';
  ELSIF UPDATING THEN
    l_dml_type := 'U';
  ELSIF DELETING THEN
    l_dml_type := 'D';
  END IF;

  IF DELETING THEN
    -- A deleted row has no :NEW values, so log the old key
    INSERT INTO DML_LOG (ID, TABLE_NAME, RECORD_ID, EXECUTE_DATE, DML_TYPE)
    VALUES (SEQ_DML_LOG.NEXTVAL, 'T', :OLD.ID, SYSDATE, l_dml_type);
  ELSE
    INSERT INTO DML_LOG (ID, TABLE_NAME, RECORD_ID, EXECUTE_DATE, DML_TYPE)
    VALUES (SEQ_DML_LOG.NEXTVAL, 'T', :NEW.ID, SYSDATE, l_dml_type);
  END IF;
END;
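
The extraction side of trigger mode can be sketched in a runnable form. The following is a minimal illustration only, not the Oracle implementation: it assumes Python's built-in sqlite3 module as a stand-in database, and the target table t_target, the column name, and the function extract_increment are hypothetical names introduced for the example. SQLite needs one trigger per event, whereas the Oracle trigger above handles all three.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT);         -- source table
    CREATE TABLE t_target (id INTEGER PRIMARY KEY, name TEXT);  -- hypothetical ETL target
    CREATE TABLE dml_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name TEXT, record_id INTEGER, dml_type TEXT,
        execute_date TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER t_ins AFTER INSERT ON t BEGIN
        INSERT INTO dml_log (table_name, record_id, dml_type) VALUES ('T', NEW.id, 'I');
    END;
    CREATE TRIGGER t_upd AFTER UPDATE ON t BEGIN
        INSERT INTO dml_log (table_name, record_id, dml_type) VALUES ('T', NEW.id, 'U');
    END;
    CREATE TRIGGER t_del AFTER DELETE ON t BEGIN
        INSERT INTO dml_log (table_name, record_id, dml_type) VALUES ('T', OLD.id, 'D');
    END;
""")

def extract_increment(conn):
    """Replay dml_log entries against t_target, then delete the processed entries."""
    entries = conn.execute(
        "SELECT id, record_id, dml_type FROM dml_log "
        "WHERE table_name = 'T' ORDER BY id"
    ).fetchall()
    for log_id, record_id, dml_type in entries:
        if dml_type == "D":
            conn.execute("DELETE FROM t_target WHERE id = ?", (record_id,))
        else:
            # The log holds only the key, so fetch the full row from the source.
            row = conn.execute(
                "SELECT id, name FROM t WHERE id = ?", (record_id,)
            ).fetchone()
            if row is not None:  # the row may have been deleted since it was logged
                conn.execute(
                    "INSERT OR REPLACE INTO t_target (id, name) VALUES (?, ?)", row
                )
        conn.execute("DELETE FROM dml_log WHERE id = ?", (log_id,))
    conn.commit()
    return len(entries)

conn.execute("INSERT INTO t VALUES (1, 'a')")
conn.execute("INSERT INTO t VALUES (2, 'b')")
conn.execute("UPDATE t SET name = 'a2' WHERE id = 1")
conn.execute("DELETE FROM t WHERE id = 2")
processed = extract_increment(conn)
target_rows = conn.execute("SELECT id, name FROM t_target ORDER BY id").fetchall()
print(processed, target_rows)
```

Because the log stores only keys, an insert followed by a delete of the same row leaves nothing to fetch by the time the log is replayed, which is why the sketch skips rows that no longer exist in the source.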

2. Timestamp mode

In timestamp mode, the incremental extraction process decides which rows to extract by comparing a reference time, such as the system time of the previous extraction, with the value of a timestamp field in the source table. This approach requires adding a timestamp field to the source table and updating its value whenever the system modifies the table's data. Some databases (such as SQL Server) provide a timestamp-style column that is updated automatically: when other fields of a row change, the column's value changes too, so the ETL implementation only needs to add the field to the source table. (In SQL Server the timestamp type, also called rowversion, is an automatically incremented version number rather than a clock time, but it serves the same change-detection purpose.) For databases that do not support automatic updates, the business system must explicitly update the timestamp field in its own code whenever it updates business data. Timestamps normally capture insert and update operations on the source table, but they cannot detect deletions, which have to be handled with some other mechanism.
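
As a rough illustration of timestamp-based extraction, the sketch below uses Python's built-in sqlite3 module. The table orders, the column last_modified, and the function extract_since are assumptions made for the example, and the business system is simulated by setting the timestamp by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount REAL,
        last_modified TEXT   -- timestamp column added for extraction (assumed name)
    );
""")

def extract_since(conn, last_extract_time):
    """Return rows whose timestamp is later than the previous extraction time."""
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY id",
        (last_extract_time,),
    ).fetchall()

# The business system sets last_modified on every insert and update.
conn.execute("INSERT INTO orders VALUES (1, 10.0, '2024-01-01 09:00:00')")
conn.execute("INSERT INTO orders VALUES (2, 20.0, '2024-01-01 09:05:00')")
first_batch = extract_since(conn, "1970-01-01 00:00:00")
checkpoint = max(row[2] for row in first_batch)  # remember the newest timestamp seen

conn.execute(
    "UPDATE orders SET amount = 25.0, last_modified = '2024-01-02 08:00:00' "
    "WHERE id = 2"
)
conn.execute("DELETE FROM orders WHERE id = 1")  # leaves no timestamp behind
second_batch = extract_since(conn, checkpoint)
print(second_batch)
```

The second extraction picks up the updated row but gives no hint that row 1 was deleted, which is the completeness gap described above.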

3. Full table delete-insert method

Full table delete-insert means deleting all data in the target table before each extraction and then loading the source data in full. In effect, this approach replaces incremental extraction with full extraction. When the data volume is small and a full extraction costs less time than the logic and condition checks needed for an incremental extraction, this method is a reasonable choice.
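
A minimal sketch of this reload, again assuming sqlite3 as a stand-in database and the hypothetical table names src and tgt:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tgt (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b');
    INSERT INTO tgt VALUES (1, 'stale'), (3, 'gone from source');
""")

def full_reload(conn):
    """Empty the target and copy every source row in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM tgt")
        conn.execute("INSERT INTO tgt (id, name) SELECT id, name FROM src")

full_reload(conn)
tgt_rows = conn.execute("SELECT id, name FROM tgt ORDER BY id").fetchall()
print(tgt_rows)
```

Running the delete and the load in one transaction is a design choice of the sketch: readers of the target never see it half-empty if the load fails.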

4. Full table comparison method

In full table comparison, the ETL process compares the records of the source table and the target table one by one during incremental extraction and reads out the new and modified records. An optimized form of this comparison uses MD5 checksums. For each table to be extracted, you create an MD5 temporary table with a similar structure; it records the primary key value of each source row together with an MD5 checksum computed over all fields of that row. On each extraction, the checksums of the source rows are compared with those in the MD5 temporary table. If a checksum differs, an update is performed. If the primary key value does not yet exist in the target table, the record is new and an insert is performed.

Finally, a delete must be performed for every primary key value that no longer exists in the source table but is still present in the target table.
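
The MD5 comparison logic can be sketched in a few lines of Python. This is an illustration under stated assumptions: the MD5 temporary table is modelled as an in-memory dictionary, and the function names row_md5 and compare are hypothetical.

```python
import hashlib

def row_md5(row):
    """MD5 checksum over all fields of a row (None is hashed as an empty string)."""
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def compare(source_rows, md5_table):
    """Classify source rows against the checksums stored by the previous run.

    source_rows: {primary_key: tuple_of_fields}
    md5_table:   {primary_key: md5_hex}, updated in place
    Returns (inserts, updates, deletes) as sorted lists of primary keys.
    """
    inserts, updates = [], []
    for key, row in source_rows.items():
        digest = row_md5(row)
        if key not in md5_table:
            inserts.append(key)       # key never seen before: new record
        elif md5_table[key] != digest:
            updates.append(key)       # same key, different checksum: modified
        md5_table[key] = digest
    # Keys still in the checksum table but missing from the source were deleted.
    deletes = [key for key in md5_table if key not in source_rows]
    for key in deletes:
        del md5_table[key]
    return sorted(inserts), sorted(updates), sorted(deletes)

md5_table = {}
first = compare({1: (1, "a"), 2: (2, "b")}, md5_table)
second = compare({1: (1, "a"), 2: (2, "B"), 3: (3, "c")}, md5_table)
third = compare({2: (2, "B"), 3: (3, "c")}, md5_table)
print(first, second, third)
```

Every run still has to read and hash the whole source table, which is why this method has the weakest extraction performance of the approaches described here.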

5. Log table mode

For the production database behind a business system, a business log table can be created in the database. When business data that needs to be monitored changes, the corresponding module of the business system updates the contents of the log table. During incremental extraction, the ETL process reads the log table to decide which data to load and how to load it. Maintaining the log table is the responsibility of the business system's own code.
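
To make the division of labour concrete, here is a small sketch in which the business system's own code maintains the log table. It assumes sqlite3 as a stand-in database; the tables customer and biz_log and the functions save_customer and delete_customer are hypothetical names, not part of any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE biz_log (
        log_id INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name TEXT, record_id INTEGER, op TEXT
    );
""")

def write_log(conn, record_id, op):
    conn.execute(
        "INSERT INTO biz_log (table_name, record_id, op) VALUES ('customer', ?, ?)",
        (record_id, op),
    )

def save_customer(conn, customer_id, name):
    """Business code: write the row and its log entry in one transaction."""
    with conn:
        exists = conn.execute(
            "SELECT 1 FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        if exists:
            conn.execute(
                "UPDATE customer SET name = ? WHERE id = ?", (name, customer_id)
            )
            write_log(conn, customer_id, "U")
        else:
            conn.execute(
                "INSERT INTO customer (id, name) VALUES (?, ?)", (customer_id, name)
            )
            write_log(conn, customer_id, "I")

def delete_customer(conn, customer_id):
    """Business code: delete the row and log the deletion together."""
    with conn:
        conn.execute("DELETE FROM customer WHERE id = ?", (customer_id,))
        write_log(conn, customer_id, "D")

save_customer(conn, 1, "Ann")
save_customer(conn, 1, "Anne")
delete_customer(conn, 1)
log = conn.execute("SELECT record_id, op FROM biz_log ORDER BY log_id").fetchall()
print(log)
```

Unlike trigger mode, nothing in the database enforces the logging here: any code path that changes customer without calling write_log produces a change the ETL process will never see.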

6. System log analysis method

This method identifies changed data by analyzing the database's own logs. Relational database systems record all DML operations in log files to support backup and restore. By analyzing the database log, the ETL incremental extraction process can pull out the DML operations performed on the relevant source tables after a given point in time. This tells it how the table data has changed since the last extraction and guides the incremental extraction. Some database systems provide a dedicated package for accessing the logs (such as Oracle's LogMiner), which greatly simplifies log analysis.

7. Database-specific methods (Oracle)
The following are common extraction methods that are specific to one database system.
7.1 Oracle Change Data Capture (CDC) mode: The Oracle CDC feature was introduced in the Oracle9i database. CDC helps identify data that has changed since the last extraction.
With CDC, changed data is captured as insert, update, or delete operations are performed on the source table, and it is saved in change tables in the database. The captured changes can then be exposed to the ETL extraction process in a controlled manner through database views, as the basis for incremental extraction. CDC captures changes to source table data in two ways: synchronous CDC and asynchronous CDC. Synchronous CDC uses triggers on the source database to capture changed data. This approach is real-time, without any delay: when the DML operation commits, the change data appears in the change table. Asynchronous CDC uses the database redo log files to capture data after the changes have occurred in the source database.
7.2 Oracle flashback query mode: Oracle9i and later versions of the database provide a flashback query mechanism that lets the user query the state of the database at some point in the past. The extraction process can therefore compare the current state of the source database with its state at the time of the last extraction and quickly obtain the changes to the source table's records.

8. Comparison and analysis

As shown above, there are many mechanisms to choose from for incremental extraction in ETL. Their pros and cons are compared below from four aspects: compatibility, completeness, performance, and intrusiveness.

In terms of compatibility, data extraction has to deal with source systems that are not necessarily relational database systems. It is not unusual for an ETL process to have to extract Excel or CSV text data from a legacy system built years ago. In that case, none of the incremental mechanisms that rely on a relational database product will work. The timestamp method and the full table comparison method may still be of some use; in the worst case, the idea of incremental extraction has to be abandoned in favor of the full table delete-insert method.

In terms of completeness, the timestamp method cannot capture delete operations and needs to be combined with other methods.

Performance has two sides: the performance of the extraction process itself, and the negative effect on the performance of the source system. Trigger mode, log table mode, and system log analysis give better extraction performance because no comparison step is needed during extraction. Full table comparison needs a complex comparison process to identify changed records and has the worst extraction performance. As for the impact on the source system, trigger mode creates triggers directly on the source system's business tables and also writes to a log table, so it may cost a busy business system some performance, and the impact can be severe when bulk operations are run against the business tables. Synchronous CDC is also implemented with triggers and has the same problem. Full table comparison and the log table method have little effect on the performance of the source database; they only require some extra computation and database operations from the business system, at a small cost in time. The timestamp method, system log analysis, and the Oracle-specific methods (asynchronous CDC and flashback query) also have very little impact on database performance.

Intrusiveness refers to whether the business system has to change its functionality or perform additional operations so that the extraction mechanism can work. Here the timestamp method deserves special attention. Besides modifying the table structure of the source system, for relational database products that do not support automatically updated timestamp fields, the business system's functions must also be modified so that they explicitly update the timestamp field every time an operation is performed on source table T. This requires close cooperation from the source system during the ETL implementation, and in most cases the source system regards such a requirement as excessive, which is the main reason the timestamp method cannot be widely used. Trigger mode requires creating triggers on the source table, which is also refused in some cases. Methods that need to create temporary tables, such as full table comparison and the log table method, may be impossible to implement because of the limited database permissions granted to the ETL process. The same can happen with system log analysis, because most database products allow only certain user groups, or even only DBAs, to perform log analysis. Flashback query is the least intrusive of these methods.
