Bi project notes incremental ETL data extraction policies and methods

Source: Internet
Author: User

Incremental extraction incremental extraction only extracts new or modified data from the table to be extracted from the database since the last extraction. During ETL usage. Incremental extraction is more widely used than full extraction. How to capture changed data is the key to incremental extraction. There are generally two requirements for the capture method: accuracy, which can accurately capture the changed data in the Business System at a certain frequency; performance, cannot cause too much pressure on the business system, affects existing services. Currently, the following methods are commonly used to capture changed data in incremental data extraction:
A. Trigger: Create a required trigger on the table to be extracted. Generally, create three triggers: insert, modify, and delete. Each time the data in the source table changes, the corresponding trigger writes the changed data to a temporary table, the extraction thread extracts data from the temporary table, and the data extracted from the temporary table is marked or deleted. The advantage of the trigger method is that the data extraction performance is high, but the disadvantage is that a trigger must be created for the business table, which has a certain impact on the business system.

B. Timestamp: It is a Change Data Capture Method Based on Snapshot comparison. It adds a timestamp field to the source table. When the system updates and modifies table data, it also modifies the value of the timestamp field. When data is extracted, you can compare the values of the system time and timestamp fields to determine which data to extract. Some databases support automatic update of timestamps, that is, when the data of other fields in the table changes, the value of the timestamp field is automatically updated. Some databases do not support automatic timestamp update, which requires the Business System to manually update the timestamp field when updating business data. Like triggers, the timestamp method provides better performance and data extraction is relatively clear and simple, however, the service system is also highly skewed (with additional timestamp fields), especially for databases that do not support automatic timestamp updates, the business system is also required to perform additional timestamp update operations. In addition, the delete and update operations on data before the timestamp cannot be captured, and the data accuracy is limited.

C. Full table comparison: A typical full table comparison method uses the MD5 verification code. The ETL Tool creates an MD5 temporary table with a similar structure for the table to be extracted. The temporary table records the primary key of the source table and the MD5 Verification Code calculated based on the data of all fields. During each data extraction, compare the source table with the MD5 temporary table to determine whether to add, modify, or delete the data in the source table, and update the MD5 verification code. The advantage of the MD5 method is that the source system has a small inclination (only an MD5 temporary table needs to be created), but the disadvantage is also obvious. It is different from the active notification in the trigger and timestamp methods, the MD5 method is used to passively compare the full table data, resulting in poor performance. When a table does not have a primary key or a unique column and contains duplicate records, the accuracy of the MD5 method is poor.

D. Log comparison: Analyze the logs of the database to determine the changed data. Oracle's Change Data Capture (CDC, changeddatacapture) technology is representative in this regard. The CDC feature is introduced in the Oracle9i database. CDC helps you identify data that has changed since the last extraction. Using CDC, you can extract data while performing insert, update, or delete operations on the source table, and the changed data is stored in the database change table. In this way, you can capture the changed data and use the database view to provide it to the target system in a controllable manner. The CDC architecture is based on the Publisher/subscriber model. The publisher captures the changed data and provides it to the subscriber. The subscriber uses the changed data obtained from the publisher. Generally, the CDC system has one publisher and multiple subscribers. The publisher must first identify the source table required to capture the changed data. It then captures the changed data and saves it to the specially created change table. It also enables the subscriber to control access to changed data. The subscriber needs to know what changes he is interested in. A subscriber may not be interested in all the data published by the publisher. The subscriber needs to create a subscriber view to access the changed data authorized by the publisher. CDC is divided into synchronous mode and asynchronous mode. The synchronous mode captures change data in real time and stores it in the change table. Both the publisher and subscription are located in the same database. The asynchronous mode is based on the stream replication technology of oracle.

To sum up, this bi project adopts the timestamp method.

Bi project notes incremental ETL data extraction policies and methods

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.