ETL Extraction Scheme

Source: Internet
Author: User
Tags: system log

2. ETL Extraction Scheme
The main stages of the ETL process are data extraction, data transformation and processing, and data loading. To implement these functions, ETL tools usually add supporting features such as workflow, a scheduling engine, a rule engine, script support, and statistics.
2.1 Data Extraction
Data extraction is the process of pulling data out of a data source. In practice, the data source is most often a relational database. There are several ways to extract data from a database:
2.1.1 Full Extraction
Full extraction is similar to data migration or data replication. It takes the data of a table or view in the data source out of the database unchanged and converts it into a format that the ETL tool can recognize. Full extraction is relatively simple.
2.1.2 Incremental Extraction
Incremental extraction pulls only the rows that have been added to or modified in the source table since the last extraction. In ETL practice, incremental extraction is used more widely than full extraction. The key to incremental extraction is how the changed data is captured. A capture method has to meet two requirements: accuracy, meaning that it captures the changed data in the business system correctly at the required frequency; and performance, meaning that it must not put so much load on the business system that existing business is affected. The methods commonly used today to capture changed data in incremental extraction are:
2.1.2.1 Trigger method (also called the snapshot method):
Create the required triggers on the table to be extracted, typically three: one each for insert, update, and delete. Whenever the data in the source table changes, the corresponding trigger writes the changed data to a temporary table. The extraction thread reads from the temporary table, and data that has been extracted is marked or deleted there.
Advantages: Data extraction performs well, the ETL loading rules are simple and fast, the table structure of the business system does not need to be modified, and incremental loading of data is possible.
Disadvantages: Triggers must be created on the business tables, which has some impact on the business system.
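The trigger method can be sketched as follows. This is a minimal illustration, not a production design: it uses SQLite (through Python's standard sqlite3 module) as a stand-in for the business database, and the table names orders and orders_changes, the trigger names, and the function names are all hypothetical.

```python
import sqlite3

def create_source_with_triggers(conn):
    # orders is a hypothetical business table; orders_changes plays the
    # role of the temporary table that the triggers write to.
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE orders_changes (
            seq    INTEGER PRIMARY KEY AUTOINCREMENT,
            op     TEXT NOT NULL,      -- 'I', 'U' or 'D'
            id     INTEGER NOT NULL,
            amount REAL
        );
        CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
            INSERT INTO orders_changes (op, id, amount)
            VALUES ('I', NEW.id, NEW.amount);
        END;
        CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
            INSERT INTO orders_changes (op, id, amount)
            VALUES ('U', NEW.id, NEW.amount);
        END;
        CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
            INSERT INTO orders_changes (op, id, amount)
            VALUES ('D', OLD.id, OLD.amount);
        END;
    """)

def extract_changes(conn):
    """Read pending changes in order, then delete the ones that were read."""
    rows = conn.execute(
        "SELECT seq, op, id, amount FROM orders_changes ORDER BY seq"
    ).fetchall()
    if rows:
        # Remove only what was extracted, so later changes are kept.
        conn.execute("DELETE FROM orders_changes WHERE seq <= ?", (rows[-1][0],))
        conn.commit()
    return [(op, row_id, amount) for _, op, row_id, amount in rows]
```

Each call to extract_changes returns the operations recorded since the previous call, in the order they happened, which is what allows the ETL side to replay them incrementally.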
2.1.2.2 Timestamp method:
This is a change data capture method based on snapshot comparison. A timestamp field is added to the source table, and the system updates its value whenever the table data is inserted or modified. At extraction time, the time of the last extraction is compared with the value of the timestamp field to decide which rows to extract. Some databases support automatic timestamp updates, that is, the timestamp field is updated automatically when other fields of the row change. Other databases do not, in which case the business system has to update the timestamp field itself whenever it updates business data.
Advantages: As with the trigger method, performance is good. The ETL design is clear, source data extraction is relatively simple, and incremental loading of data is possible.
Disadvantages: The timestamp has to be maintained by the business system, which makes the method highly intrusive (an extra timestamp field must be added). This is especially true for databases that do not support automatic timestamp updates, where the business system must perform the extra timestamp update itself: the workload is heavy, the changes are large, and the risk is high. In addition, delete and update operations on data that predates the timestamp cannot be captured, which limits the accuracy of the data.
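A minimal sketch of timestamp-based extraction follows. It assumes a hypothetical customers table with an updated_at field, uses SQLite as a stand-in source database, and stores timestamps as ISO-8601 text; none of these names come from a specific product.

```python
import sqlite3

def create_customers(conn):
    # Hypothetical source table; updated_at is the added timestamp field,
    # stored as ISO-8601 text so that string order matches time order.
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )

def extract_since(conn, last_extracted_at):
    """Return (rows changed after last_extracted_at, new watermark)."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_extracted_at,),
    ).fetchall()
    # Advance the watermark to the newest timestamp seen;
    # keep the old one if nothing changed.
    new_watermark = max((row[2] for row in rows), default=last_extracted_at)
    return rows, new_watermark
```

The ETL job would persist the returned watermark and pass it to the next run. A row that is physically deleted from the source leaves no timestamp behind, so a query of this kind cannot see it.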
2.1.2.3 Full table delete-and-insert method:
Each ETL run deletes the data in the target table and then reloads the data in full.
Advantages: The ETL loading rules are simple and fast.
Disadvantages: The method is not suitable for dimension tables that use surrogate keys. When the business system deletes data, the integrated database keeps no record of the deleted history. Incremental loading of data is not possible, and any relationships established on the target table have to be re-created.
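As a rough sketch, a delete-and-insert load could look like the following. The products table, its columns, and the function names are hypothetical, and SQLite stands in for both the source and the target database.

```python
import sqlite3

def create_products(conn):
    # Hypothetical table with the same layout in source and target.
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")

def full_reload(source, target):
    """Empty the target table and reload every source row; return the row count."""
    rows = source.execute("SELECT id, name FROM products").fetchall()
    # The connection context manager commits both steps together,
    # or rolls both back if either one raises an exception.
    with target:
        target.execute("DELETE FROM products")
        target.executemany("INSERT INTO products (id, name) VALUES (?, ?)", rows)
    return len(rows)
```

Because the target is rebuilt from scratch on every run, whatever the target held before the run, including rows the source has since deleted, is simply gone afterwards.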
2.1.2.4 Full table comparison method:
Full table comparison uses MD5 checksums. For each table to be extracted, the ETL tool creates in advance an MD5 temporary table with a similar structure, which records the primary key of the source table and an MD5 checksum computed over the data of all fields. On each extraction, the source table is compared with the MD5 temporary table by checksum. If the checksums differ, an update is performed. If the primary key value does not yet exist in the target table, the record is new and an insert is performed.
Advantages: The method has no impact on the table structure of the existing system and requires no change to business operation procedures. All extraction rules are handled by the ETL and managed in one place, incremental loading of data is possible, and there is no risk to the business system.
Disadvantages: The ETL comparison is more complex, the design is more complex, and the method is slow. Unlike triggers and timestamps, which actively signal changes, full table comparison passively compares the whole table, so performance is poor. When a table has no primary key or unique column and contains duplicate records, the accuracy of full table comparison is poor.
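The comparison step can be illustrated with a small in-memory sketch. It assumes the source rows are available as a dictionary keyed by primary key and models the MD5 temporary table as another dictionary. A key that is missing from the MD5 table is treated here as "not yet in the target", which is a simplification. The function names are hypothetical.

```python
import hashlib

def row_md5(row):
    """MD5 checksum over all field values of one row."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    # NULLs are mapped to the empty string in this sketch.
    joined = "\x1f".join("" if value is None else str(value) for value in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def compare_with_md5_table(source_rows, md5_table):
    """Classify source rows against the checksums stored by the previous run.

    source_rows: {primary_key: tuple of field values}
    md5_table:   {primary_key: checksum}, refreshed in place
    Returns (keys to insert, keys to update).
    """
    inserts, updates = [], []
    for pk, row in source_rows.items():
        checksum = row_md5(row)
        previous = md5_table.get(pk)
        if previous is None:
            inserts.append(pk)      # key not seen before: new record
        elif previous != checksum:
            updates.append(pk)      # same key, different content
        md5_table[pk] = checksum
    return inserts, updates
```

Every run has to read and hash every source row, which is where the performance cost described above comes from.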
2.1.2.5 Log table method:
A system log table is added to the business system, and its contents are maintained whenever business data changes. When the ETL loads, it reads the log table to determine which data to load and how to load it.
Advantages: The table structure of the business system does not need to be modified, source data extraction is clear and fast, and incremental loading of data is possible.
Disadvantages: The log table has to be maintained by the business system, so the business operation procedures must be modified to record log information. Maintaining the log table is cumbersome and has a large impact on the original system. The workload is heavy, the changes are large, and there is some risk.
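A minimal sketch of the log table method follows, with SQLite as a stand-in database. The accounts and change_log tables and the function names are hypothetical. The point it illustrates is that the business operation itself, not a trigger, writes the log entry.

```python
import sqlite3

def create_schema(conn):
    # accounts is a hypothetical business table; change_log is the
    # system log table added to the business system.
    conn.executescript("""
        CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
        CREATE TABLE change_log (
            log_id     INTEGER PRIMARY KEY AUTOINCREMENT,
            table_name TEXT NOT NULL,
            row_id     INTEGER NOT NULL,
            op         TEXT NOT NULL    -- 'I', 'U' or 'D'
        );
    """)

def update_balance(conn, account_id, balance):
    """Business operation, modified so that it also records the change."""
    # Both statements commit together, so the log never disagrees
    # with the business data.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = ? WHERE id = ?", (balance, account_id)
        )
        conn.execute(
            "INSERT INTO change_log (table_name, row_id, op) VALUES ('accounts', ?, 'U')",
            (account_id,),
        )

def read_log(conn, after_log_id):
    """ETL side: fetch log entries newer than the last one already processed."""
    return conn.execute(
        "SELECT log_id, table_name, row_id, op FROM change_log "
        "WHERE log_id > ? ORDER BY log_id",
        (after_log_id,),
    ).fetchall()
```

Every business operation that changes data needs the same kind of modification as update_balance, which is the maintenance burden noted in the disadvantages.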
2.1.2.6 Oracle Change Data Capture (CDC) method:
This method identifies changed data by analyzing the database's own logs. Oracle's Change Data Capture (CDC) technology is a representative example. CDC was introduced in the Oracle9i database. It helps you identify the data that has changed since the last extraction. With CDC, data can be extracted at the same time as insert, update, or delete operations are performed on the source table, and the changed data is stored in change tables in the database. The changed data is thus captured and then made available to the target system in a controlled way through database views.
The CDC architecture is based on the publisher/subscriber model. The publisher captures the change data and provides it to subscribers, and subscribers consume the change data obtained from the publisher. Typically, a CDC system has one publisher and multiple subscribers. The publisher first identifies the source tables from which change data is to be captured. It then captures the changed data and saves it in specially created change tables. It also gives subscribers controlled access to the change data. Subscribers need to know which change data they are interested in, since a subscriber may not be interested in everything the publisher publishes. A subscriber creates subscriber views to access the change data that the publisher has authorized it to access.
CDC has a synchronous mode and an asynchronous mode. Synchronous mode captures change data in real time and stores it in the change tables, with the publisher and the subscribers in the same database. Asynchronous mode is based on Oracle's Streams replication technology.
Advantages: An easy-to-use API is provided for setting up the CDC environment, which shortens ETL time. The table structure of the business system does not need to be modified, and incremental loading of data is possible.
Disadvantages: The database products and versions used by the business systems are not uniform, which makes a unified implementation difficult. The implementation process is relatively complex and requires in-depth study. CDC has been on the market for only a short time, so bugs are inevitable.
