In general, ETL incremental updates are implemented in one of two ways. The first is to record the maximum value of a column: if the data source contains a continuously increasing column, record the maximum value of that column in the last processed dataset. The second is to save a HashValue: quickly scan all of the source data, and after detecting which rows have changed, synchronize only the modified data.
1. Record the maximum value of a column: use a DateTime column to record the last time the source data was updated, or use a timestamp/rowversion column to record the last updated value
In general, when a DBA designs a table, a LastUpdatedTime column of type DateTime or DATETIME2(7) is created in the table to record when the data was last updated. When using DateTime to record the last update time, the LastUpdatedTime column must be set to the current time whenever an update or insert operation is performed on the data source.
If you instead create a column of type timestamp or rowversion, the system maintains the column automatically whenever a row is updated or inserted, and guarantees that its value keeps increasing.
A characteristic of this approach is that the reference value changes whenever any column in the table is updated. If the source system contains columns that are not imported into the ETL, and the source system updates those columns frequently, the ETL will reload those rows every time even though none of the "relevant" data has changed.
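The watermark pattern described above can be sketched in T-SQL as follows. The table and column names (dbo.SourceTable, dbo.EtlWatermark, LastUpdatedTime) are hypothetical, and this is a minimal outline rather than a production implementation:

```sql
-- Option A: DateTime watermark maintained by the application.
DECLARE @LastProcessed DATETIME2(7);
SELECT @LastProcessed = LastValue
FROM dbo.EtlWatermark
WHERE TableName = 'SourceTable';

-- Extract only the rows modified since the last load.
SELECT *
FROM dbo.SourceTable
WHERE LastUpdatedTime > @LastProcessed;

-- After a successful load, advance the watermark to the
-- maximum value of the dataset just processed.
UPDATE dbo.EtlWatermark
SET LastValue = (SELECT MAX(LastUpdatedTime) FROM dbo.SourceTable)
WHERE TableName = 'SourceTable';

-- Option B: a rowversion column maintained automatically by SQL Server;
-- the stored watermark is then a binary(8) value compared the same way.
-- ALTER TABLE dbo.SourceTable ADD RowVer rowversion;
-- SELECT * FROM dbo.SourceTable WHERE RowVer > @LastRowVer;
```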
2. Save a HashValue: use checksum or binary_checksum to compute a HashValue, and synchronize only the updated data
If the table has no continuously increasing column, recording the last updated value of a column is not feasible. In that case, you can use hash functions such as checksum or binary_checksum to quickly examine the source data and save the HashValue into a HashTable (the HashTable has three columns: BusinessKey, HashValue, SyncStatus). If the HashValue computed for a data row differs from the HashValue recorded in the HashTable, that row has changed.
The characteristic of this approach is that the full data source is scanned every time to compute the HashValue; the HashValues are then compared, and only the changed rows are synchronized. Since the full scan and hashing of the source data take time, test thoroughly before using this method, and monitor IO, CPU, and memory usage.
Note: hash only the columns that the ETL needs, and save that HashValue. That way, even if the source table updates the other columns, the ETL does not consider the data row to have changed.
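One way to maintain such a HashTable is a MERGE that compares a hash computed over only the ETL-relevant columns. The table and column names (dbo.SourceTable, dbo.HashTable, Col1, Col2, Col3) are hypothetical; this is a sketch, not the article's exact implementation:

```sql
-- HashTable schema: BusinessKey, HashValue, SyncStatus.
;WITH src AS
(
    SELECT BusinessKey,
           -- Hash only the columns the ETL needs, per the note above.
           BINARY_CHECKSUM(Col1, Col2, Col3) AS HashValue
    FROM dbo.SourceTable
)
MERGE dbo.HashTable AS tgt
USING src
    ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED AND tgt.HashValue <> src.HashValue THEN
    -- The row changed: record the new hash and mark it for synchronization.
    UPDATE SET tgt.HashValue = src.HashValue, tgt.SyncStatus = 0
WHEN NOT MATCHED BY TARGET THEN
    -- A new row: insert it as pending.
    INSERT (BusinessKey, HashValue, SyncStatus)
    VALUES (src.BusinessKey, src.HashValue, 0);
```

Note that BINARY_CHECKSUM is collision-prone (distinct rows can hash to the same value), so this detects most, but not strictly all, changes.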
3. Use the HashValue to split the source data into batches for ETL processing
If processing the source data is time-consuming, you can import the source data into the ETL in batches, increasing the degree of parallelism of the data processing to reduce the ETL execution time.
The HashTable has three columns: BusinessKey, HashValue, SyncStatus, recording the HashValue of each source row at the time of the last synchronization.
Step 1: scan the source data and compute the HashValue for each row. If a row's current HashValue differs from the HashValue recorded in the HashTable, set its SyncStatus = 0; the ETL then processes all rows whose SyncStatus = 0.
Step 2: split the processing into n batches, routing the rows with HashValue % n = 0, 1, 2, ..., n-1 into the corresponding batch, so that the n batches process disjoint subsets of the data at the same time. After a batch finishes processing a row, it sets that row's SyncStatus = 1.
Optional Step 3: within a batch, you can use a loop that processes only m rows at a time and sets SyncStatus = 1 for the rows it has processed. For example, in the 10th batch (index 9 of n = 10), one loop iteration processes only 200 rows, characterized by SyncStatus = 0 and HashValue % n = 9.
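The batch loop in the optional step can be sketched as follows, for the worker handling batch index 9 of n = 10 with a chunk size of 200. Names are hypothetical, and ABS() is used because BINARY_CHECKSUM can return negative values, whose remainder in T-SQL would also be negative:

```sql
DECLARE @n INT = 10, @batch INT = 9, @m INT = 200;

WHILE 1 = 1
BEGIN
    ;WITH chunk AS
    (
        -- Pick up to @m pending rows that belong to this batch.
        SELECT TOP (@m) BusinessKey
        FROM dbo.HashTable
        WHERE SyncStatus = 0
          AND ABS(HashValue) % @n = @batch
    )
    UPDATE h
    SET SyncStatus = 1            -- mark the chunk processed after it is synchronized
    FROM dbo.HashTable AS h
    JOIN chunk
        ON chunk.BusinessKey = h.BusinessKey;

    IF @@ROWCOUNT = 0 BREAK;      -- no pending rows left in this batch
END
```

In practice, the actual data-movement work for the chunk would run between selecting the chunk and flipping SyncStatus; the sketch shows only the bookkeeping.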
This design suits the scenario where the task that processes the data is time-consuming. Because each task is synchronous and must wait until its data is processed, dividing the dataset into multiple non-overlapping subsets that are processed simultaneously, one task per subset, increases the concurrency of data processing and reduces the overall ETL time.
SSIS Design2: Incremental update