Different data source characteristics and corresponding incremental data capture strategies

Introduction to CDC (Change Data Capture)

Usually, when the amount of data is small and we need to load all the data from a source into a target database, we can take the following approach: first empty out all the data in the target database, then reload everything from the data source. This is one of the simplest, most intuitive, and least error-prone solutions, but in many cases it causes performance problems.

If our data sources come from different business systems, the data can easily run to millions or tens of millions of rows. A full load is required the first time, but repeating a full load on the second or third cycle wastes a great deal of time and physical resources. Some data sources may not have changed at all, and others may have added only a small amount of data.

The issue to consider is that the data already in the target database is historical data; for the data in the source, we only need to consider the newly modified records and the newly inserted records. Incremental processing is therefore essentially the processing of changed data.

Characteristics of different data sources and their corresponding processing methods

The strategy for incremental loading is not fixed, and choosing a loading strategy depends heavily on how the data source is designed. A well-designed data source may directly provide the basis for subsequent incremental processing, such as auto-increment columns or date-time stamps. Other designs incorporate triggers that log changes as data is added, modified, or deleted, or add audit tables that record and track the details of each operation on the data.
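
As a rough illustration of the trigger-based design, the following sketch (T-SQL style) logs inserts, updates, and deletes from a main table into an audit table. The Member table, the MemberAudit table, and all column names are assumptions for illustration only; they are not taken from the original article.

-- Hypothetical audit table; names are illustrative.
CREATE TABLE MemberAudit (
    AuditID    INT IDENTITY(1,1) PRIMARY KEY,
    MemberID   INT NOT NULL,
    Operation  CHAR(1) NOT NULL,          -- 'I' = insert, 'U' = update, 'D' = delete
    OperatedAt DATETIME NOT NULL DEFAULT GETDATE()
);
GO

-- Log every change made to the (hypothetical) Member table.
CREATE TRIGGER trg_Member_Audit
ON Member
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Rows only in 'inserted' are inserts; rows in both pseudo-tables are updates.
    INSERT INTO MemberAudit (MemberID, Operation)
    SELECT i.MemberID,
           CASE WHEN d.MemberID IS NULL THEN 'I' ELSE 'U' END
    FROM inserted i
    LEFT JOIN deleted d ON d.MemberID = i.MemberID
    UNION ALL
    -- Rows only in 'deleted' are deletes.
    SELECT d.MemberID, 'D'
    FROM deleted d
    LEFT JOIN inserted i ON i.MemberID = d.MemberID
    WHERE i.MemberID IS NULL;
END;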

Let's look at how incremental data detection and extraction can be achieved for different data source designs.

Class I - absolute historical data tables with timestamps or auto-increment columns

This class represents a characteristic subset of data sources: absolute historical, factual data. The data in such a table is irreversible; rows are only inserted, never deleted or modified, and they represent factual business events completed over a period of time. For example, a table recording product downloads inserts one record into the database each time a user downloads a product.

Such a table typically provides a column that records the time each record was generated or each operation took place; earlier operations carry earlier times and later operations carry later times.

The incremental processing strategy for this type of table is:

    1. After the first load completes, record the maximum time point and save it to a load record table.
    2. From the second load onward, compare the source data against the last (largest) time point saved by the previous load, and load only the data after that point.
    3. When the load completes successfully, update the load record table with the new last time point.

In addition, if such a table has an auto-increment column, the auto-increment column can serve as the same kind of marker.
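
As a minimal sketch of this strategy, assume a hypothetical load record table LoadRecord(TableName, LastLoadTime) and a download fact table ProductDownload with a DownloadTime column; none of these names appear in the original text.

-- Step 2: extract only the rows newer than the last recorded time point.
SELECT d.*
FROM ProductDownload d
WHERE d.DownloadTime > (SELECT r.LastLoadTime
                        FROM LoadRecord r
                        WHERE r.TableName = 'ProductDownload');

-- Step 3: after the load succeeds, advance the recorded time point.
UPDATE LoadRecord
SET LastLoadTime = (SELECT MAX(DownloadTime) FROM ProductDownload)
WHERE TableName = 'ProductDownload';

With an auto-increment column, the same pattern applies, tracking the maximum ID instead of the maximum timestamp.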

Class II - data tables with a modified-time characteristic

The data in this type of table is generally maintainable data that can be modified, such as a membership information table. When a member record is created, CreateDate is set and UpdateDate initially stores the same value as CreateDate. When CreateDate and UpdateDate are equal, the record came from an insert operation; but since a member's information can be edited, each update also refreshes the UpdateDate timestamp.

Assume the data above has been loaded into the target database for the first time, and the source table then adds a new member record and, at the same time, modifies the information of an existing member.

Then the strategy for incremental data processing in this case can be as follows:

    1. After the first load completes, record the maximum UpdateDate timestamp and save it to the load record table. (In this example, 2010-10-23.)
    2. On the second load, compare the timestamp in the load record table with UpdateDate in the source table; any row with a larger UpdateDate is newly added or modified data. (Rows greater than 2010-10-23 are the updated first record and the newly inserted fourth record.)
    3. When the entire load succeeds, write the new maximum UpdateDate back to the load record table. (2010-10-26 is recorded.)

It is important to note, however, that not every table with a modified-time column is designed this way: some designs write CreateDate when a row is inserted but leave UpdateDate empty. In that case, each load may need to compare both CreateDate and UpdateDate against the saved time point.
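
A rough sketch of that comparison, reusing the hypothetical LoadRecord table from above and assuming a Member source table with CreateDate and UpdateDate columns:

SELECT m.*
FROM Member m
WHERE m.UpdateDate > (SELECT r.LastLoadTime
                      FROM LoadRecord r
                      WHERE r.TableName = 'Member')
   OR (m.UpdateDate IS NULL
       AND m.CreateDate > (SELECT r.LastLoadTime
                           FROM LoadRecord r
                           WHERE r.TableName = 'Member'));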

Class III - data tables without time characteristics but with associated edit information

This type of table has no identifiable auto-increment ID or timestamp of its own; it holds only the basic information, and all edit operations and related details are recorded in a separate table. This design keeps the complete edit history on its own while preserving the independence of the main information, so queries against the main table touch less data and run more efficiently. Processing is similar to Class I and Class II: in this example, associate the main table with the Member Audit History table and judge changes by that table's timestamp or auto-increment ID.
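
A minimal sketch of that association, assuming a hypothetical Member main table and a MemberAuditHistory detail table with MemberID and UpdateDate columns (again, illustrative names only):

SELECT DISTINCT m.*
FROM Member m
INNER JOIN MemberAuditHistory h ON h.MemberID = m.MemberID
WHERE h.UpdateDate > (SELECT r.LastLoadTime
                      FROM LoadRecord r
                      WHERE r.TableName = 'Member');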

The first three classes of tables can all share a single load record table that stores each table's last timestamp or auto-increment ID.

For Table A:

-- The original snippet was garbled; column names here are placeholders.
SELECT column1, column2 FROM Table_A WHERE TimeStamp_Column > (SELECT LastLoadTime FROM LoadRecord WHERE TableName = 'table_a')

For Table C:

-- Same pattern; the tracked column could equally be an auto-increment ID. Names are placeholders.
SELECT column1, column2 FROM Table_C WHERE TimeStamp_Column > (SELECT LastLoadTime FROM LoadRecord WHERE TableName = 'table_c')

Class IV - data tables with no identifying features

Comparison of data based on unique columns.

The concept is very simple: each time data is loaded from the source, look up each row in the target table by its primary key or unique column. If the row does not exist, insert it; if it exists, compare the non-key column values, and update the row when they are not equal.
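
One way to express this compare-and-apply step is a MERGE statement; the Target_Member and Source_Member tables and their columns below are placeholders, not part of the original text.

MERGE INTO Target_Member AS t
USING Source_Member AS s
    ON t.MemberID = s.MemberID                              -- match on the primary/unique key
WHEN MATCHED AND (t.Name <> s.Name OR t.Email <> s.Email)
    THEN UPDATE SET t.Name = s.Name, t.Email = s.Email      -- key exists but data differs: update
WHEN NOT MATCHED BY TARGET
    THEN INSERT (MemberID, Name, Email)
         VALUES (s.MemberID, s.Name, s.Email);               -- key missing from target: insert

Note that the <> comparisons above skip rows where a compared column is NULL on either side; a production version would need explicit NULL handling.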

Summary

The strategy for incremental data loading depends heavily on the data source and just as much on the actual requirements. When designing incremental loading, we need to think carefully about the actual requirements and the incremental characteristics the data source can provide, and even test repeatedly, in order to make the loading mechanism stable and reliable.

