[Post] Kettle incremental update design skills

Source: Internet
Author: User

First, determine whether you are processing a dimension table. A dimension table may involve slowly changing dimensions (SCD), which you can handle with Kettle's Dimension Lookup step. Fact tables call for a different approach; the main difference between the two is how the primary key is matched.

Fact tables generally hold a large amount of data. Start by restricting the changed data to a specific range under a specific condition (for example a time window, or certain field values) so that the result set to be processed is as small as possible. Then use each record's ID to determine its status: insert it if it is new, update it if it already exists, or delete it if it no longer exists in the source, and perform the corresponding operation for each ID.
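As a minimal sketch of this ID-based classification (not Kettle itself; the table and column names `src_orders` and `dw_orders` are invented for illustration), the comparison can be expressed in Python with SQLite:

```python
import sqlite3

# Classify changed source rows against the target table by ID, the way
# Kettle's Insert/Update and Delete steps do row by row.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE src_orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE dw_orders  (id INTEGER PRIMARY KEY, amount REAL);
INSERT INTO src_orders VALUES (1, 10.0), (2, 25.0);  -- 1 changed, 2 new
INSERT INTO dw_orders  VALUES (1, 9.0),  (3, 7.0);   -- 3 gone from source
""")

src = dict(cur.execute("SELECT id, amount FROM src_orders"))
dw  = dict(cur.execute("SELECT id, amount FROM dw_orders"))

to_insert = [i for i in src if i not in dw]                  # new IDs
to_update = [i for i in src if i in dw and src[i] != dw[i]]  # existing, changed
to_delete = [i for i in dw if i not in src]                  # no longer in source

print(to_insert, to_update, to_delete)
```

In Kettle the same decision is made per row by the Insert/update and Delete steps; the point here is only that the three outcomes are all driven by the ID comparison.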

For deletions, use the Delete step. Its principle is the same as the Insert/update step, except that when an ID matches it performs a delete rather than an update. Handle the Insert/update operation in a separate transformation if necessary, and use a Job to define the execution order between the two transformations.

If a large share of the data has changed, say more than a certain percentage, and row-by-row processing is too slow, it can be more efficient to rebuild the table from scratch.

In addition, consider how dimension-table data is deleted. Foreign key constraints may prevent a dimension row from being removed while the fact table (or any other table that depends on the dimension) still references it; in other words, the dependent fact data may need to be processed first. How you handle this depends on the application. The simplest option is to delete the corresponding fact rows as well. If you want to retain the fact rows instead, add a placeholder record to the dimension table: a record with only a primary key and all other fields empty. After the dimension row is deleted, update the affected fact rows to point to this empty dimension record.
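A minimal sketch of this placeholder technique, assuming invented table names (`dim_product`, `fact_sales`) and a placeholder key of -1:

```python
import sqlite3

# Note: SQLite does not enforce foreign keys by default, so this runs
# without PRAGMA foreign_keys; the pattern is the same on databases
# where the constraint is enforced.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (sale_id INTEGER PRIMARY KEY,
                          product_key INTEGER REFERENCES dim_product);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales  VALUES (100, 1), (101, 2);
""")

# Placeholder row: only a primary key, all other fields empty (NULL).
cur.execute("INSERT INTO dim_product (product_key, name) VALUES (-1, NULL)")

# Before deleting dimension row 2, repoint its fact rows at the
# placeholder so the fact records are retained.
cur.execute("UPDATE fact_sales SET product_key = -1 WHERE product_key = 2")
cur.execute("DELETE FROM dim_product WHERE product_key = 2")

remaining = cur.execute(
    "SELECT sale_id, product_key FROM fact_sales ORDER BY sale_id").fetchall()
print(remaining)
```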

Scheduled incremental update

Sometimes we perform updates on a fixed schedule, such as every day or every Monday. In that case we do not need to add a timestamp field to the target table and query its maximum value in the ETL process; the changed data can be selected directly with a condition such as:

startdate > ? AND enddate < ?

Or, when there is only a startdate:

startdate > ?  (the time of yesterday, or of last week)

In this case you need to pass in a parameter. The Get System Info step can supply it, and it also lets you control the time precision, for example truncating to the day rather than the second.
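A sketch of the scheduled variant in Python with SQLite: yesterday's date is computed and truncated to the day (the role Kettle's Get System Info step plays), then passed as the `?` parameter. The table name `src_events` is an assumption for illustration.

```python
import sqlite3
from datetime import date, datetime, timedelta

# Yesterday, truncated to day precision (00:00:00), not to the second.
yesterday = datetime.combine(date.today() - timedelta(days=1),
                             datetime.min.time())

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_events (id INTEGER, startdate TEXT)")
cur.executemany(
    "INSERT INTO src_events VALUES (?, ?)",
    [(1, (yesterday + timedelta(hours=3)).isoformat(sep=' ')),   # in range
     (2, (yesterday - timedelta(days=2)).isoformat(sep=' '))])   # too old

# Equivalent of:  startdate > ?   with ? = yesterday 00:00:00
rows = cur.execute("SELECT id FROM src_events WHERE startdate > ?",
                   (yesterday.isoformat(sep=' '),)).fetchall()
print(rows)
```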

Of course, you also need to consider how to handle update failures. If the update fails for some reason one day, that day's records may need to be processed manually. If failures occur frequently, it is still better to add a time field to the target database and obtain the maximum timestamp from it, even though this leaves a seldom-used field.

Execution efficiency and complexity

Deleting and updating are time-consuming operations: both must repeatedly query the database for matching records and then execute the delete or update row by row, so execution efficiency is predictably low. Reduce the original dataset as much as possible before these steps; shrinking the transmitted dataset also reduces the complexity of the ETL process.

Advantages and disadvantages of the timestamp method

Advantages: the implementation is simple, it is easy to apply across databases, and scheduled runs are easy to design.

Disadvantages: it wastes some storage space, since the timestamp field is not used outside the ETL process, and if a scheduled run fails, some data may be lost.

Other incremental update methods

The core problem of incremental updates is how to find the data that changed since the last update. Most databases can actually capture such changes; the common mechanisms are incremental backup and data replication. Handling incremental updates through database administration requires stronger database management skills, but most mature databases provide both incremental backup and data replication. Using them for ETL incremental updates requires a full backup and a complete standby database.

A trigger-based approach is relatively simple to implement: create a table with a structure similar to the original table, then create three kinds of triggers (insert, update, and delete) on the original table to maintain the new table. During the ETL process, stop the incremental backup or data replication, then read the new table; after reading it, delete its data. However, this method is not easy to run on a schedule and requires a certain amount of database-specific knowledge.

If you have high requirements for real-time data, implement a database replication solution; if the requirements are low, incremental backup is easier.
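A hedged sketch of the trigger idea, using SQLite in place of a production database. The table names (`orders`, `orders_delta`) and the `op` flag column are assumptions; the shadow table mirrors the original structure plus an operation marker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders       (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE orders_delta (id INTEGER, amount REAL, op TEXT);

-- Three triggers maintain the shadow table for the ETL to read.
CREATE TRIGGER trg_orders_ins AFTER INSERT ON orders
BEGIN INSERT INTO orders_delta VALUES (NEW.id, NEW.amount, 'I'); END;

CREATE TRIGGER trg_orders_upd AFTER UPDATE ON orders
BEGIN INSERT INTO orders_delta VALUES (NEW.id, NEW.amount, 'U'); END;

CREATE TRIGGER trg_orders_del AFTER DELETE ON orders
BEGIN INSERT INTO orders_delta VALUES (OLD.id, OLD.amount, 'D'); END;
""")

cur.execute("INSERT INTO orders VALUES (1, 5.0)")
cur.execute("UPDATE orders SET amount = 6.0 WHERE id = 1")
cur.execute("DELETE FROM orders WHERE id = 1")

# The ETL reads the delta table, then clears it, as described above.
delta = cur.execute(
    "SELECT id, op FROM orders_delta ORDER BY rowid").fetchall()
cur.execute("DELETE FROM orders_delta")
print(delta)
```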

Notes:

1. Trigger

Whether you use incremental backup or data replication, if the original table has triggers, do not keep them on the backup database: what we need is not a working backup but only the data in it. It is also recommended not to replicate unnecessary database objects and small tables.

2. Logical and physical consistency

In database backup and synchronization there is a distinction between logical and physical consistency. Simply put, the same query returns the same total data on the backup database and the primary database, but the rows may come back in a different order whenever the query has no explicit sort (including GROUP BY, DISTINCT, and UNION). This can affect how surrogate primary keys are generated, so take it into account when designing the key-generation method, for example by adding an explicit ORDER BY, to avoid primary key errors when the data has to be re-read.
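For example, if surrogate keys are assigned in read order, an explicit ORDER BY makes the assignment deterministic on both primary and backup databases. This is a hypothetical sketch; the table name `customers` is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (business_key TEXT)")
cur.executemany("INSERT INTO customers VALUES (?)",
                [("C",), ("A",), ("B",)])

# Without ORDER BY, the physical row order may differ between primary
# and backup; with it, enumerate() assigns the same surrogate key on
# every database the query runs against.
rows = cur.execute(
    "SELECT business_key FROM customers ORDER BY business_key").fetchall()
surrogate = {bk: i + 1 for i, (bk,) in enumerate(rows)}
print(surrogate)
```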

Summary

Incremental update is a common ETL task, and different application environments may call for different policies. This article cannot cover every scenario, for example aggregating multiple data sources into one target database, ID generation policies, or inconsistencies between business keys and surrogate keys. I only hope it offers some ideas for handling the more common situations and proves helpful.
