SQL Server Integration Services (SSIS) has made a lot of progress over its predecessor, Data Transformation Services (DTS). In terms of usability, performance, and parallelism it has grown into an enterprise-class ETL (Extract, Transform and Load) product, and beyond ETL it also provides a variety of built-in tasks for managing SQL Server instances. While the internal architecture of SSIS is designed to deliver excellent performance and parallel processing, its performance can be optimized further if you follow best practices. In this series of articles I will discuss SSIS best practices and share the experience I have gained and applied over the past few years.
As mentioned above, SSIS is the replacement for DTS (SQL Server 7/2000). If you have ever used DTS, you will find that SSIS packages look very similar to DTS packages, but the changes are substantial: SSIS is not an enhanced version of DTS but a new product built from scratch. It provides better performance and parallel processing than DTS and overcomes many of DTS's limitations.
SSIS 2008 further enhances the internal data flow pipeline engine to provide even better performance; you may have seen the ETL world record set with SSIS 2008, loading 1 TB of data in half an hour.
The biggest benefit of SSIS is that it is a component of SQL Server: it comes free with the SQL Server installation and requires no additional license, and BI developers, database developers, and DBAs can all use it to transform data.
Best Practice 1: Extracting large amounts of data
Recently we extracted data from a large table with 300 million records. At first, when the SSIS package started, everything was fine and the data was transferred as expected, but then performance began to decline and the data transfer rate plummeted. Through analysis we found that the target table had a clustered primary key and two nonclustered indexes, and because a large amount of data was being inserted into the table, its index fragmentation level reached 85%-90%. We used the online index rebuild feature to rebuild/reorganize the indexes, but during the load the fragmentation climbed back to 90% every 15-20 minutes. In the end, the data transfer together with the online index rebuilds running in parallel took more than 12-13 hours, far beyond our expectations.
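For reference, index fragmentation of this kind can be checked with the sys.dm_db_index_physical_stats DMV; the target table name below is only a placeholder:

-- Check fragmentation of all indexes on the target table (hypothetical name dbo.TargetTable).
SELECT  i.name AS index_name,
        ips.index_type_desc,
        ips.avg_fragmentation_in_percent
FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.TargetTable'), NULL, NULL, 'LIMITED') AS ips
JOIN    sys.indexes AS i
        ON i.object_id = ips.object_id
       AND i.index_id  = ips.index_id;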
We then came up with the idea of dropping the indexes on the target table and recreating them after the transfer. After that, the entire process took 3-4 hours, exactly as we expected.
I have drawn the whole process in the picture below. So my suggestion is: if possible, drop all indexes on the target table before inserting the data, especially when inserting large amounts of data.
Before loading the data, drop all indexes on the target table, then rebuild the indexes after the load completes.
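A minimal sketch of this pattern in T-SQL, typically run from Execute SQL Tasks placed before and after the Data Flow Task; the table and index names here are hypothetical:

-- Before the data flow: drop the nonclustered indexes on the target table.
DROP INDEX IX_TargetTable_Col1 ON dbo.TargetTable;
DROP INDEX IX_TargetTable_Col2 ON dbo.TargetTable;

-- ... the SSIS data flow loads dbo.TargetTable here ...

-- After the load: recreate the indexes in one pass.
CREATE NONCLUSTERED INDEX IX_TargetTable_Col1 ON dbo.TargetTable (Col1);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col2 ON dbo.TargetTable (Col2);

-- If you prefer to keep the indexes and defragment instead:
-- ALTER INDEX ALL ON dbo.TargetTable REBUILD;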
Best Practice 2: Avoid using SELECT *
The SSIS Data Flow Task (DFT) uses a buffer as a staging area for data transfer and transformation. When data moves from the source table to the target table, it first enters the buffer, the transformations are performed in the buffer, and only when the transformation is complete is the data written to the target table.
The size of the buffer is limited by the server hardware. The engine estimates the row size by summing the maximum sizes of all the columns in a row, so the more columns there are, the fewer rows fit in a buffer, the more buffers are required, and the slower the performance. It is therefore better to explicitly specify only the columns that need to be transferred to the target table. Even if you need all the columns of the source table, you should still list the column names explicitly in the SELECT statement; if you use SELECT *, an extra round trip is made to collect the source table's column metadata, and the statement naturally takes longer to run.
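For example, rather than SELECT * (or the equivalent "Table or view" access mode), list only the columns the data flow actually needs in the OLE DB source's SQL command; the AdventureWorks-style table below is just an illustration:

-- Avoid: SELECT * FROM HumanResources.Employee;
-- Prefer an explicit column list (OLE DB Source, data access mode "SQL command"):
SELECT  BusinessEntityID,
        NationalIDNumber,
        JobTitle,
        HireDate
FROM    HumanResources.Employee;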
If you transfer a column that the target table does not need, SSIS will raise a warning message such as:
[SSIS.Pipeline] Warning: The output column "SalariedFlag" (1) on output "OLE DB Source Output" (1) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.
[SSIS.Pipeline] Warning: The output column "CurrentFlag" (1) on output "OLE DB Source Output" (1) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.
When you use the "Table or view" or "Table name or view name variable" data access mode in the OLE DB source, be aware that it behaves like SELECT *: all columns are transferred. Use those modes only when you genuinely need to bring every column of the source table into the target table.
Best Practice 3: The impact of OLE DB destination settings
The following is a set of OLE DB destination settings that affect data transfer performance:
Data access mode: This setting provides the "fast load" option ("Table or view - fast load"), which uses a BULK INSERT against the target table instead of simple INSERT statements (one row at a time), so do not change this fast-load default unless you have a special need (the T-SQL sketch after this list relates these options to BULK INSERT hints).
Keep identity: By default this is unchecked, which means the target table (if it has an identity column) generates its own identity values; if you check it, the data flow engine ensures that the identity values from the source are preserved and inserted unchanged into the target table.
Keep nulls: By default this is also unchecked, which means that when a source column is NULL, the default value defined on the target column (if any) is inserted instead; if you check it, the NULL values from the source are preserved and inserted into the target table.
Table lock: By default this is checked, and it is recommended to keep the default unless another process is using the same table at the same time. It acquires a table lock on the target table instead of many row-level locks, which could otherwise lead to lock-escalation problems.
Check constraints: By default this is checked. This setting tells the data flow pipeline engine to validate the data written against the target table's constraints, so if you can be sure the incoming data does not violate those constraints, it is recommended to uncheck it; performance improves considerably because the cost of checking is avoided.
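To relate these checkboxes to plain T-SQL, they correspond roughly to the bulk-load hints shown below; the staging file and table in this BULK INSERT sketch are hypothetical and given purely for illustration:

-- Hypothetical file and table names; the interesting part is the option flags.
BULK INSERT dbo.TargetTable
FROM 'C:\staging\target_data.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    KEEPIDENTITY,        -- roughly "Keep identity" checked
    KEEPNULLS,           -- roughly "Keep nulls" checked
    TABLOCK,             -- roughly "Table lock" checked (the default)
    CHECK_CONSTRAINTS    -- roughly "Check constraints" checked (the default)
);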
Best Practice 4: The effect of the Rows per batch and Maximum insert commit size settings
Rows per batch: The default value of this setting is -1, which means all incoming rows are treated as a single batch. You can change this default behavior and split the incoming rows into multiple batches; the value only accepts positive integers and specifies the maximum number of rows in each batch.
Maximum insert commit size: The default value of this setting is 2147483647, which specifies the maximum number of rows committed at a time. You can change this value; note that if it is set too small, the number of commits increases, but this relieves pressure on the transaction log and tempdb, which is considerable during large bulk inserts.
These two settings are very important for understanding and relieving pressure on tempdb and the transaction log. For example, if you keep the default maximum insert commit size, the transaction log and tempdb keep growing during the extraction, and if you are transferring a large amount of data, memory runs out quickly and the extraction fails. It is therefore best to set a reasonable value based on your own environment (see the rough T-SQL analogy below).
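As a rough T-SQL analogy (again with hypothetical names), BATCHSIZE in BULK INSERT plays a role similar to Maximum insert commit size, since each batch is committed in its own transaction and the log can be truncated between batches, while ROWS_PER_BATCH is a hint comparable to Rows per batch:

BULK INSERT dbo.TargetTable
FROM 'C:\staging\target_data.csv'
WITH (
    BATCHSIZE = 100000,        -- commit every 100,000 rows instead of one huge transaction
    ROWS_PER_BATCH = 5000000,  -- hint for the approximate total number of rows in the file
    TABLOCK
);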
Note: The above recommendations are drawn from my years of experience with DTS and SSIS, but as indicated earlier, other factors such as the infrastructure and network environment also affect performance, so it is a good idea to test these measures thoroughly before pushing them into production.