What is Data extraction

Source: Internet
Author: User

What is Data extraction

Data extraction refers to the data that is needed to extract the destination data source system from the source data source system. In practical application, the data source is mostly used in relational database.

Edit

How data is pumped

(a) Full-scale extraction

Full-volume extraction is similar to data migration or data replication, which extracts the data from a table or view in the data source intact from the database and transforms it into a format that your ETL tools can recognize. Full-volume extraction is relatively simple.

(b) Incremental extraction

Incremental extraction extracts only the data that has been added or modified in the tables that were extracted from the database since the last extraction. In the process of ETL use, incremental extraction is more widely used than full-scale extraction. How to capture changing data is the key to incremental extraction. There are two requirements for the capture method: accuracy, the ability to accurately capture the change data in the business system at a certain frequency; performance, can not cause too much pressure on the business system, affect the existing business. The methods of capturing change data commonly used in incremental data extraction are:

(a) Flip-flop mode (also applause)

To create the required trigger on the table to be extracted, it is generally necessary to create insert, modify, delete three triggers, whenever the data in the source table changes, the corresponding trigger will change the data to a temporary table, the extraction thread from the temporary table to extract data, temporary table in the extracted data is marked or deleted. Advantages: Data extraction performance is high, ETL loading rules are simple, fast, do not need to modify the business system table structure, can realize the data of the recursive increase load. Disadvantage: The business table is required to set up a trigger, the business system has a certain impact on the source database is easy to pose a threat.

(b) Time stamp mode

It is a kind of change data capture method based on snapshot comparison, adding a timestamp field on the source table, and modifying the value of the timestamp field while updating the table data in the system. When you do a data pump, you decide which data to extract by comparing the values of the time and timestamp fields that were last extracted. Some database timestamps support Automatic Updates, that is, when the data for other fields in the table changes, the value of the timestamp field is automatically updated. Some databases do not support automatic timestamp updating, which requires the business system to update the timestamp field manually when updating the business data.

Advantages: Same as the trigger way, the performance of time stamping is also better, ETL system design clear, the source data extraction is relatively clear and simple, can achieve the data of the recursive increase load.

Cons: Time-stamping maintenance needs to be done by the business system, and the business system also has a great inclination

(Adding additional timestamp fields), especially for databases that do not support time stamps for automatic Updates, and require additional update timestamp operations on the business system, and the inability to capture delete and update operations on previous data in the timestamp has been limited in data accuracy.

(c) Full table Delete Insert mode

Each ETL operation deletes the target table data, and the ETL loads the data completely.

Advantages: ETL loading rules are simple and fast.

Disadvantage: For the dimension table plus foreign key is not suitable, when the business system produces delete data operation, the comprehensive database will not be recorded to the deleted historical data, can not achieve the data of the recursive increase load, and the target table to establish an association relationship, you need to re-create.

(d) Full table alignment

The whole table is the way the ETL tool in advance for the table to be extracted a similar structure of the temporary table, the temporary table record source table primary key and based on the data of all fields calculated, each time the data extraction, the source table and the temporary table to do the comparison, if there are different, update operation, If the primary key value does not exist for the target table, it indicates that the record is not already in the insert operation.

Advantages: There is no impact on the existing system table structure, do not need to modify the business operation procedures, all extraction rules are completed by ETL, management and maintenance unified, you can achieve the data of the recursive increase load, no risk.

Disadvantage: ETL is more complex than the comparison, the design is more complex, the speed is slow. Unlike the active notification in the trigger and timestamp mode, the full table alignment method is passive for full table data comparison and performance is poor. When there is no primary key or unique column in the table and there are duplicate records, the accuracy of the full table alignment is poor.

(e) Log table mode

Add the System Log table to the business system and, when the business data changes, update the maintenance log table content, as the ETL loads, by reading the log table data to determine how to load the data and how to load it. Advantages: No need to modify the business system table structure, the source data extraction is clear, faster. Can achieve the data of the recursive increase load. Disadvantage: Log table maintenance needs to be completed by the business system, the business system business operation procedures need to be modified to log information. Log table maintenance is more troublesome, the original system has a greater impact. Heavy workload, large changes, there is a certain risk.

(f) Oracle change data Capture (CDC mode)

The data is judged by analyzing the log of the database itself. Oracle's change data Capture (cdc,changed) technology is representative of this. The CDC feature was introduced in the Oracle9i database. CDC can help you identify data that has changed since the last extraction. With CDC, data can be extracted at the same time as an INSERT, update, or delete operation on the source table, and the changed data is stored in the database's change table. This allows you to capture the changed data and then make it available to the target system in a controlled manner using the database view. The CDC architecture is based on the publish/subscribe model. The publisher captures the change data and provides it to subscribers. Subscribers use the change data obtained from the publisher. Typically, the CDC system has one publisher and multiple subscribers. The publisher first needs to identify the source tables that are required to capture the change data. It then captures the changed data and saves it in a specially created change table. It also enables subscribers to control access to changing data. Subscribers need to be aware of what change data they are interested in. A subscriber may not be interested in all the data published by the publisher. Subscribers need to create a subscriber view to access change data that is authorized by the Publisher. CDC is divided into synchronous mode and asynchronous mode, the synchronous mode captures the change data in real time and stores it in the change table, both the Publisher and the subscription are in the same database; the asynchronous Pattern is Oracle-based streaming replication technology.

Pros: Provides an easy-to-use API to set up the CDC environment and Shorten ETL time. There is no need to modify the business system table structure, can realize the data of the recursive increase load.

Disadvantage: The Business System database version and product is not unified, it is difficult to achieve a unified implementation, the implementation process is relatively complex, and needs in-depth research to achieve. or through third-party tools, expensive.

What is Data extraction

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.