Several strategies for incremental data processing of general database

Source: Internet
Author: User
Tags implement

Several scenarios of general incremental data processing

Usually in the case of less data, we can take all the data from a data source to the target database when the strategy can be taken is: first of all the data of the target database emptied out, and then all again loaded from the data source. This is one of the simplest and most intuitive and not error-prone solutions, but in many cases it poses a performance problem.

If our data source comes from different business systems, the data is millions, tens of billions of dollars. The first need to load all, if the second cycle or the third cycle is still full load, the cost of a great physical and temporal resources. It is possible that some of the data sources have not changed, and some data sources may just add a small amount of data.

The problem we have to consider is that for the data in the target database that already exists, we should consider only the newly modified records and the newly inserted records, only those two data should be considered for the data in the data source. So incremental processing is essentially a process of changing data.

Let's take a look at these tables, ignoring the point of view of the Data warehouse design, to consider only how to implement incremental data detection and extraction.

Class One-absolute history tables with timestamps or self-growing columns

This table can represent some of the characteristics of the data source-Absolute historical fact data. It refers to the fact that the data in the table is irreversible and that only the insert operation does not delete or modify the operation to represent the factual business data that has been completed over a period of time. For example, this table represents some of the products downloaded information, when the user downloaded the product will be recorded in the database a data.

This kind of data table usually provides a list of historical time that can record the generation of this recording, or the time that the operation occurs, the sooner the operation time is higher, the later the operation time is later.

The incremental processing strategy for this type of table is:

After the first load action completes, record the maximum point in time and save it to a loading record table.

Start with the second load to compare the last/maximum time saved by the last operation, loading only the data after this point in time.

Update the Load record table after all successful completion of the loading process, updating the last point in time.

Also, you can use the self-growth column to implement this identity feature if the table has a self growing column.

Category II-Data tables with modified time features

The data in such a table is generally a kind of data that can be modified with maintenance, such as membership information table, when creating a member will generate a record, will be marked in the CreateDate, and in the updatedate is also saved createdate value. When CreateDate and updatedate the same time that this data is inserted, but this member's information can be edited and modified, so each update also updated the updatedate timestamp.

Assuming that the above data is first loaded into the target database, the source table adds a new member record and modifies a member's information at the same time.

So the strategy for incremental data processing like this can be:

After the first load action completes, record the maximum updatedate timestamp and save it to a loading record table. (first time is 2010-10-23)

When the data is loaded for the second time, the time stamp in the loading record table is compared with the updatedate in the source, and the description of the larger time stamp is the newly added or modified data. (More than 2010-10-23 is the first Update data and the fourth new data)

When the entire load process succeeds, update the largest updatedate to the record table. (recorded in the Record form 2010-10-26)

Note, however, that not every data table with the modified time feature is designed so that it is possible to insert the data only into createdate but not write to Updatedate. In this way, it may be necessary to compare createdate and updatedate at the same time during each load.

Class III-Non-temporal feature data tables for associative editing information

This type of table itself does not have any ID or timestamp that can be identified, only the basic information, all editing operations, and so on, have a single table to record. Such a design can be to record all the editing history information separately, but at the same time retain the independence of the main information, query the main table when the smaller size of the query to provide efficiency. A design similar to this one can refer to the first and second categories of design, and in this example is to associate the Member Audit History table with a timestamp or a self growing ID.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.