Original: SQL Server audit feature Getting Started: CDC (change Data Capture)
Introduction
SQL Server 2008 introduces CDC (Change Data Capture), which can record:
1. What data lines have changed
2. The history of data row changes, not just the final value.
It implements asynchronous change tracking (like transactional replication) through the job, compared to the Tracking CT, which is implemented synchronously. As a result, it has less impact on performance and does not affect transactions.
A typical application is to extract, transfer, and load data to other data sources, just like the Data warehouse in the diagram.
Realize
Microsoft recommends that the CDC be used in conjunction with snapshot snapshot isolation levels to avoid read-write blocking of reading change data and change data writing.
It is important to note that the snapshot isolation level has additional overhead, especially tempdb (all data changes are versioned to tempdb).
Use Master Go Create Database cdctest Go Alter Database Set on Go -- enable CDC on database Cdctest Use cdctest Go exec sys.sp_cdc_enable_db Go
When CDC is enabled, a schema called CDC and a series of system tables, SPS, and view are added. The official recommendation is not to query the system tables directly but to use the corresponding system SP/FN to obtain CDC data.
System objects |
Description |
Recommended objects for use |
Cdc.captured_columns |
Returns a row for each column tracked in the capture instance |
Sys.sp_cdc_get_source_columns |
Cdc.change_tables |
Returns a row for each change table in the database |
Sys.sp_cdc_help_change_data_capture |
Cdc.ddl_history |
Returns one row for each data definition language (DDL) change made for a table with change data capture enabled |
Sys.sp_cdc_get_ddl_history |
Cdc.lsn_time_mapping |
Returns a row for each transaction that has rows in the change table |
SYS.FN_CDC_MAP_LSN_TO_TIME (Transact-SQL), SYS.FN_CDC_MAP_TIME_TO_LSN (Transact-SQL) |
Cdc.index_column |
Returns a row for each indexed column associated with the change table |
Sys.sp_cdc_help_change_data_capture |
Msdb.dbo.cdc_jobs |
Store change data capture configuration parameters for capturing and purging jobs |
NA |
Cdc.<capture_instance>_ct |
The change table that is created when change data capture is enabled on the source table. The table returns one row for each insert and delete operation performed on the source table, and two rows for each update operation performed on the source table. capture_instance Format =schamename_tablename |
Cdc.fn_cdc_get_all_changes_<capture_instance>, Cdc.fn_cdc_get_net_changes_<capture_instance> |
Create a test table and enable CDC for the period. Use sys.sp_cdc_enable_table to enable CDC for a table.
--Create a test table for CDC UsecdctestGOCreate TableTB (IDint Primary Key, namevarchar( -), Weightdecimal(Ten,2));GoEXECUTEsys.sp_cdc_enable_table@source_schema =N'dbo' , @source_name =N'TB' , @role_name = NULL;GO
Sys.sp_cdc_enable_table also creates a capture and cleanup job for the database if the source table is the first table in the database to enable change data capture, and the database does not have a transactional publication. It sets the IS_TRACKED_BY_CDC column in the Sys.tables catalog view to 1.
The corresponding trace table cdc.dbo_tb_ct contains all the change data for the source table. It contains all of the original columns and 5 new columns, the structure
Validation
When you conduct data change operations in the source table, the table cdc.dbo_tb_ct is recorded. Try it:
Why is there no data? Because the CDC was working on a job to capture change data, my agent hasn't run yet.
When enabled manually, data is available.
Meaning of the result column:
bitmask that corresponds to one bit for each column in the source table that is tracked by the CDC. If __$operation = 1 or 2, the value sets all defined bits to 1. If __$operation = 3 or 4, only those bits that correspond to the changed column are set to 1.
Column name |
Data type |
Description |
__$START_LSN |
Binary (Ten) |
Change the committed LSN. Changes committed in the same transaction will share the same commit LSN value. |
__$seqval |
Binary (Ten) |
There may be multiple changes within a transaction, and this value is used to sort them. |
__$operation |
int |
Change the type of operation: 1 = delete 2 = insert 3 = Update (the captured column value is the value before the update operation). 4 = update (the captured column value is the value after the update operation). |
__$update_mask |
varbinary (+) | TD Width= "411" and a
Now insert the row again, update it, and then delete the Id=1 row. To see the results again:
A brief description of the tracked query results: A total of 5 rows, the first and second rows are inserted data, the third row and the fourth row is the data before and after the update, and the fifth row deletes the data. The type of operation is known by the _$operation value. Simple Application
The TB tables created in the previous article record each person's name and weight change information. Another database (table Tb_rs), which is the data source for the weight change trend report. It synchronizes data once a day and updates its own data. How to use CDC to achieve this demand?
START_LSN is logged in the CDC if you know the maximum LSN that was synchronized in TB when the Tb_rs last synchronization was completed. The next time you synchronize, you only need to synchronize the change records in the TB table that are larger than this LSN.
The problem is simple: get the maximum LSN of the last synchronization complete TB, get all the change records larger than this LSN, update tb_rs.
- The SYS.FN_CDC_MAP_TIME_TO_LSN can map the time to the corresponding LSN, which is the day before.
- The cdc.fn_cdc_get_net_changes_<capture_instance> can get all the Net Change records in a day.
- Custom synchronization logic and statements are made by the change record.
Insert intoTBValues(1,'Ken',70.2),(3,'Joe', the),(4,'Rose', -)UpdateTBSetWeight= -whereId=3;Delete fromTbwhereName='Rose';GoDECLARE @begin_time datetime,@end_time datetime,@begin_lsn binary(Ten),@end_lsn binary(Ten); --Get the intervalSelect @begin_time=GETDATE()-1,@end_time=GETDATE();--Map the time to LSN of the CDC table TBSelect @begin_lsn =SYS.FN_CDC_MAP_TIME_TO_LSN ('smallest greater than or equal',@begin_time), @end_lsn =SYS.FN_CDC_MAP_TIME_TO_LSN ('largest less than or equal',@end_time);--get the net changes within the specified LSNsSELECT * fromCDC.FN_CDC_GET_NET_CHANGES_DBO_TB (@begin_lsn,@end_lsn,' All');
There's no record of Rose? Joe's information has been updated, how is a record?
This is because this is the net Change line, which is the meaning of the final result. Added and then deleted, without affecting the final result, so no. Updates a column of data on the same row more than once, returning only the results of the last update.
With this result, we can define the logic of synchronizing the data based on the __$operation and the actual data. Like what:
--Generate Sync StatementsSELECT( Case__$operation when 2 Then 'INSERT into tb_rs values ('+cast(ID as varchar(2))+', '+Name+', '+cast(weight as varchar(Ten))+')' when 4 Then 'Update tb_rs Set name='+Name+', weight='+cast(weight as varchar(Ten))+'where id='++cast(ID as varchar(2))END) fromCDC.FN_CDC_GET_NET_CHANGES_DBO_TB (@begin_lsn,@end_lsn,' All');
For the updated rows, when synchronizing the data, I want to first determine if the dequeue has been changed or changed. The changed columns need to be synchronized, not all columns. Take name as an example:
DECLARE @begin_time datetime,@end_time datetime,@begin_lsn binary(Ten),@end_lsn binary(Ten); --Get the intervalSelect @begin_time=GETDATE()-1,@end_time=GETDATE();--Map the time to LSN of the CDC table TBSelect @begin_lsn =SYS.FN_CDC_MAP_TIME_TO_LSN ('smallest greater than or equal',@begin_time), @end_lsn =SYS.FN_CDC_MAP_TIME_TO_LSN ('largest less than or equal',@end_time);--get the all changes within the specified LSNsSELECT *,( CaseSys.fn_cdc_has_column_changed ('DBO_TB','name', __$update_mask) when 1 Then 'Yes' when 0 Then 'No' End) asisnameupdated,sys.fn_cdc_map_lsn_to_time (__$START_LSN) asUpdateTime fromCDC.FN_CDC_GET_ALL_CHANGES_DBO_TB (@begin_lsn,@end_lsn,' All')where__$operationinch(3,4);Go
The CDC can record not only DML operations, but also DDL operations. Query cdc.ddl_history.
But one thing to keep in mind : The new columns can be tracked by the CDC DDL, but the new column's data changes cannot be tracked by the CDC. If you need to track it, disable CDC on the table before you enable it.
CDC Agent Job
When CDC is first enabled in the specified database and there is no transactional replication, two jobs are created for capture and cleanup:
The capture job is used to scan the log files and write the change records to the change table. Called Sp_mscdc_capture_job, the scan parameters and scan intervals can be set based on the actual transaction throughput of the current library, resulting in a reasonable balance between performance overhead and tracking requirements.
The cleanup job is to clean up the data in the change table by default for three days.
Therefore, it is very important to set the interval of cleanup rationally.
The related configuration for these two jobs is stored in msdb.dbo.cdc_jobs. The current default configuration
Summary
1. CDC is easy to use, easy to configure, and can be used in conjunction with applications such as synchronous extraction.
2. The CDC can meet most of the data audit requirements, but cannot tell you who changed the data.
3. Although the CDC is asynchronous and has a small impact on performance, it increases overhead, especially IO Read and write and capacity. When CDC is turned on, at least one additional data file write and log file write is added to each change.
Getting Started with SQL Server audit features: CDC (change Data Capture)