SSIS Best Practices: Improving SQL Server Integration Services Execution Performance

Source: Internet
Author: User
Tags: bulk insert, OLE DB, SSIS

Originally published on [Wind Letter Network]; if reprinted, please retain the original link: http://www.ithov.com/server/94832.shtml

SQL Server Integration Services (SSIS) improves substantially on its predecessor, DTS (Data Transformation Services), in usability, performance, and parallelism, and has grown into an enterprise-class ETL (extraction, transformation, and loading) product. Beyond ETL, it also provides a variety of built-in tasks for managing SQL Server instances. Although the internal architecture of SSIS is designed for excellent performance and parallel processing, its performance can be pushed further by following best practices. In this series of articles I will discuss SSIS best practices and share the experience I have gathered with SSIS over the past few years.

As mentioned above, SSIS is the replacement for DTS (SQL Server 7/2000). If you have used DTS, you will find that SSIS packages look very similar to DTS packages, but under the hood a great deal has changed: SSIS is not an enhanced version of DTS but a new product built from scratch. Compared with DTS, SSIS provides better performance and parallel processing and removes many of DTS's limitations. SSIS 2008 further enhances the internal data flow pipeline engine to deliver even better performance; you may have seen the ETL world record set with SSIS 2008, loading 1 TB of data in under half an hour. The biggest benefit of SSIS is that it is a component of SQL Server: it comes free with the SQL Server installation and requires no additional license, so BI developers, database developers, and DBAs can all use it to transform data.

Best Practice 1: Extracting large volumes of data

We recently had to extract data from a large table with 300 million records. At first, when the SSIS package started, everything went normally and data was being transferred as expected, but performance gradually declined and the transfer rate plummeted. Analysis showed that the target table had a clustered primary key and two non-clustered indexes. Because so much data was being inserted into the table, index fragmentation reached 85%-90%. We used online index rebuilding to re-create/reorganize the indexes, but during the load the fragmentation climbed back to 90% every 15-20 minutes, and the combination of the data transfer and the concurrent online index rebuilds ended up taking 12-13 hours, far longer than expected.

We then took a different approach: drop all indexes on the target table before the transfer starts and re-create them after it ends. With this change the whole process took 3-4 hours, fully in line with our expectations. We therefore recommend dropping all indexes on the target table before inserting data, especially for large data volumes: remove the indexes before the load, and rebuild them after the load completes.
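To make the pattern concrete, here is a minimal T-SQL sketch with hypothetical table and index names. The article drops and re-creates the indexes; disabling the non-clustered indexes and rebuilding them afterwards achieves the same effect without keeping the index definitions in a separate script (the clustered index must stay in place, since disabling it would make the table inaccessible).

```sql
-- Hypothetical index and table names: disable the non-clustered indexes
-- before the load (the clustered index must remain enabled).
ALTER INDEX IX_TargetTable_Key1 ON dbo.TargetTable DISABLE;
ALTER INDEX IX_TargetTable_Key2 ON dbo.TargetTable DISABLE;

-- ... run the SSIS data flow that loads dbo.TargetTable here ...

-- A single rebuild re-enables the disabled indexes and removes the
-- fragmentation introduced by the load.
ALTER INDEX ALL ON dbo.TargetTable REBUILD;
```

If you prefer the article's exact approach, script out the index definitions, DROP them before the load, and run the CREATE INDEX statements afterwards; the effect on load performance is the same.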
Best Practice 2: Avoid SELECT * in the Data Flow Task (DFT)

The SSIS data flow uses buffers as the staging area for data transfer and transformation: rows read from the source first enter a buffer, the transformations are applied inside that buffer, and only then are the rows written to the target. The buffer size is limited by the server hardware, and to size a buffer SSIS must estimate the row size, which is the sum of the maximum sizes of all columns in a row. The more columns you bring along, the fewer rows fit into a buffer, the more buffers are required, and the lower the performance. It is therefore best to select only the columns that actually need to be transferred to the target; even if you need every column in the source table, you should still list the column names explicitly in the SELECT statement. SELECT * forces an extra round trip to retrieve the metadata of all columns in the source table, so the statement naturally takes longer to execute. If you bring columns into the data flow that are never used downstream, warnings like the following are raised:

[SSIS.Pipeline] Warning: The output column "SalariedFlag" (64) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.

[SSIS.Pipeline] Warning: The output column "CurrentFlag" (73) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.

Also be careful with the "Table or view" and "Table name or view name from variable" data access modes in the OLE DB source: like SELECT *, they bring every column of the source table into the data flow. Use them only when you genuinely need all columns transferred to the target.
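As an illustration, a source query that lists only the needed columns might look like the following. The table and column names are assumed; they match the AdventureWorks HumanResources.Employee table that the warnings above appear to refer to.

```sql
-- Select only the columns the data flow actually needs (names assumed),
-- instead of SELECT * or the "Table or view" access mode.
SELECT BusinessEntityID,
       NationalIDNumber,
       JobTitle,
       HireDate
FROM   HumanResources.Employee;
```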
Best Practice 3: The impact of OLE DB destination settings

A few settings on the OLE DB destination affect data transfer performance:

Data access mode: this setting offers a "fast load" option, which writes data to the target table as a bulk insert operation instead of plain INSERT statements (one row at a time). Unless you have a special requirement, do not change this default fast-load option.

Keep identity: unchecked by default, which means the target table (if it has an identity column) generates its own identity values. If you check it, the data flow engine preserves the source identity values and inserts them into the target table.

Keep nulls: unchecked by default, which means that where the source supplies a NULL and the target column has a default constraint, the default value is inserted; check it if you want the NULL values from the source preserved in the target table.

Table lock: checked by default, and we recommend keeping it checked unless other processes need to use the same table at the same time. Taking a table lock up front avoids acquiring many row-level locks that may escalate to a table lock anyway.

Check constraints: checked by default; this tells the data flow pipeline engine to validate the data written to the target table against its constraints. If you can guarantee that the incoming data does not violate the constraints of the target table, uncheck it; skipping the check removes that overhead and noticeably improves performance.

Best Practice 4: Rows per batch and maximum insert commit size

Rows per batch: the default value is -1, which means all incoming rows are treated as a single batch. You can change this default behavior and split the rows into multiple batches; the value accepts only a positive integer, which specifies the maximum number of rows in each batch.

Maximum insert commit size: the default value is 2147483647, which specifies the maximum number of rows committed at a time. You can lower this value; note that a smaller value increases the number of commits, but it relieves the pressure on the transaction log and tempdb, which is very high when large volumes of data are inserted.

These two settings matter chiefly for controlling tempdb and transaction log growth. For example, if you keep the default maximum insert commit size, the transaction log and tempdb keep growing throughout the extraction, and with a very large data transfer they are quickly exhausted and the extraction fails. You should therefore set a reasonable value based on your own environment.

Note: the suggestions above come from my years of experience with DTS and SSIS, but as shown above, other factors such as infrastructure and the network environment also affect performance, so run thorough tests before pushing these changes to production.
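The fast-load settings map closely onto the options of a T-SQL BULK INSERT statement, which can make their effect easier to reason about. The sketch below is only an analogy with a hypothetical file and table; the OLE DB destination performs its own bulk load rather than this exact statement.

```sql
-- T-SQL analogy of the fast-load options (file path and table are hypothetical).
BULK INSERT dbo.TargetTable
FROM 'C:\staging\extract.dat'
WITH (
    TABLOCK,                    -- "Table lock"
    KEEPIDENTITY,               -- "Keep identity"
    KEEPNULLS,                  -- "Keep nulls"
    -- CHECK_CONSTRAINTS,       -- add only if the data must be re-validated
    ROWS_PER_BATCH = 100000,    -- "Rows per batch"
    BATCHSIZE      = 100000     -- rows committed per transaction, analogous
                                -- to "Maximum insert commit size"
);
```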
Best Practice 5: The SQL Server destination adapter

If your target database is local, we recommend using the SQL Server destination adapter. It provides insert performance similar to the Bulk Insert task plus some additional enhancements: you can transform the data before writing it to the target table, which the Bulk Insert task cannot do. Beyond the options available on the OLE DB destination adapter, the SQL Server destination adapter exposes further options, as shown in Figure 1. For example, you can specify whether to fire insert triggers on the target table; this option is False by default, meaning triggers on the target table are not fired. Firing them degrades performance, but triggers are sometimes unavoidable for enforcing data and business rules. You can also specify the first and last row numbers to load, the maximum number of errors, and the insert column sort order.

Figure 1: SQL Server destination adapter options
Figure 2: OLE DB destination adapter options

If your SQL Server database is on a remote server, you cannot use the SQL Server destination adapter; in that case you can only use the OLE DB destination adapter. Likewise, if the target database might move from local to remote, or from one instance to another, it is better to use the OLE DB destination adapter to minimize future changes.

Best Practice 6: Avoid asynchronous transformations as much as possible

Before discussing how the different types of transformations affect performance, let's briefly review how SSIS works. The SSIS runtime engine executes the package; when it encounters a Data Flow Task, it hands the task over to the data flow pipeline engine. The pipeline engine splits the data flow task into one or more execution trees and may run two or more execution trees at the same time to improve concurrency and performance. You may not know what an execution tree is; as the name suggests, it has a tree-like structure. Each execution tree has its own buffers, whose scope is tied to that execution tree, and each execution tree is assigned an operating system thread; unlike the buffers, a thread may be shared with other execution trees, i.e. one thread can execute one or more execution trees. In SSIS 2008 the process of splitting a data flow task into execution trees has been greatly enhanced: it creates execution paths and sub-paths so that your package can exploit high-end multi-processor systems.

A synchronous transformation takes a record, processes it, and passes it on to the next transformation or to the destination; processing a record does not depend on other incoming rows. Because the number of output records equals the number of input records, it needs no new buffer (processing happens in the same incoming buffer), and that is why it is fast. For example, the Derived Column transformation adds a column to each incoming row but does not change the number of rows. An asynchronous transformation, by contrast, may output a different number of records than it received, and it needs a new buffer because an output row can depend on one or more input rows; this is also called a blocking transformation. For example, the Sort transformation is a fully blocking transformation: it must receive all incoming rows before it can produce any output. As discussed, an asynchronous transformation needs an additional buffer for its output, cannot reuse the incoming buffer, and waits for all incoming rows to arrive before processing; that is why asynchronous transformations are slow and should be avoided whenever possible. For example, instead of a Sort transformation you can use an ORDER BY clause and get already-sorted results from the source itself.
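For example, the source query below (hypothetical table and columns) pushes the sort down to SQL Server. If downstream components rely on the order, remember to set IsSorted = True on the OLE DB source output and the appropriate SortKeyPosition on the key column in the advanced editor so the pipeline knows the data is ordered.

```sql
-- Let the database engine do the sorting instead of a Sort transformation;
-- table and column names are assumed.
SELECT BusinessEntityID,
       NationalIDNumber,
       JobTitle,
       HireDate
FROM   HumanResources.Employee
ORDER BY BusinessEntityID;
```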
Best Practice 7: DefaultBufferMaxSize and DefaultBufferMaxRows

As mentioned in Best Practice 6, each execution tree creates buffers to hold the incoming data and perform the transformations. How many buffers should be created, how many rows go into a single buffer, and what is the impact on performance? The size of a buffer depends on how many rows enter it, and that in turn depends on several factors. The first is the estimated size of each row, which is the sum of the maximum sizes of all columns in the incoming rows. The second is the DefaultBufferMaxSize property of the data flow task, which specifies the default maximum size of a buffer; its default value is 10 MB, and its upper and lower limits are constrained by two internal SSIS properties, MaxBufferSize (100 MB) and MinBufferSize (64 KB), so a buffer can range from 64 KB to 100 MB. The third factor is DefaultBufferMaxRows, also a data flow task property, which specifies the default number of rows per buffer; its default value is 10,000. SSIS uses these properties to calculate a suitable buffer size: if the estimated size exceeds DefaultBufferMaxSize, it reduces the number of rows that go into each buffer.

To get the most out of the buffers, first remove unnecessary columns at the source and set the correct data type for each column, especially when your source is a flat file; this lets each buffer hold as many rows as possible. Second, if your system has enough memory, you can tune these properties so that you end up with a small number of large buffers, which improves performance. Note that if you push these properties to the point where buffers start spilling (paging) to disk, performance suffers instead, so test thoroughly in your own environment before settling on values. You can enable logging of the BufferSizeTuning event to see how many rows actually go into each buffer, and you can monitor the "Buffers spooled" performance counter to check whether SSIS has started paging buffers to disk.
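As a rough aid for that estimate, the following query (table name assumed) sums the maximum column lengths recorded in the catalog; multiplying the result by DefaultBufferMaxRows and comparing it with DefaultBufferMaxSize gives a first approximation of whether the defaults fit.

```sql
-- Approximate the row size of a source table (name assumed) from the catalog.
-- Note: (max)/LOB columns report -1 and are handled separately as BLOB data.
SELECT SUM(max_length) AS EstimatedRowBytes
FROM   sys.columns
WHERE  object_id = OBJECT_ID('HumanResources.Employee');
```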
Best Practice 8: BufferTempStoragePath and BLOBTempStoragePath

If memory runs low, Windows raises a low-memory notification event; under that memory pressure SSIS spills incoming (non-BLOB) records out of its buffers to the file system. The spill location is set by the BufferTempStoragePath property of the data flow task; its default is blank, in which case the location falls back to the directories named in the TEMP/TMP environment variables. Similarly, SSIS may write BLOB data to the file system before sending it to the destination, because BLOB data is usually too large to hold in an SSIS buffer; that location is set by the BLOBTempStoragePath property of the data flow task, which is also blank by default and likewise falls back to TEMP/TMP. In other words, if you do not assign explicit values to these properties, the TEMP and TMP locations are used as the spill targets. If you enable logging of the PipelineInitialization event of the data flow task, this is recorded in the log, for example:

User:PipelineInitialization,ARSHADALI-LAP,Fareast\arali,Data Flow Task,{C80814F8-51A4-4149-8141-D840C9A81EE7},{D1496B27-9FC7-4760-821E-80285C33E74D},10/11/2009 1:38:10 AM,10/11/2009 1:38:10 AM,0,0x,No temporary BLOB data storage locations were provided. The buffer manager will consider the directories in the TEMP and TMP environment variables.

The important point is to change the defaults of BufferTempStoragePath and BLOBTempStoragePath, ideally to paths on different drives; this spreads the I/O and improves overall performance.

Best Practice 9: Make good use of the DelayValidation property

SSIS uses validation to determine whether a package would fail at run time, and it performs two kinds of validation. The first is package validation, which validates the package and all of its components before execution starts; the second is component validation, which validates a component just before that component executes. Suppose the first component in a package creates an object, say a temporary table, and the second component references that table. During package validation the first component has not run yet, so the table does not exist; validation of the second component therefore fails, SSIS throws a validation exception, and the package never starts executing. How do you handle a package like this? Every component has a DelayValidation property for exactly this scenario. Its default value is False; if you set it to True, the up-front validation of that component is skipped, and the component is validated only at the component level, i.e. just before it executes during the run.
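A minimal sketch of that scenario, with hypothetical object names: the first Execute SQL Task creates a staging table at run time, and the downstream data flow that reads from it needs DelayValidation = True because the table does not exist when the package is validated.

```sql
-- Task 1 (Execute SQL Task): create the staging table at run time.
-- All object names here are hypothetical.
CREATE TABLE dbo.StagingOrders (
    OrderID   INT      NOT NULL,
    OrderDate DATETIME NOT NULL,
    Amount    MONEY    NOT NULL
);

-- Task 2 (data flow source query): reads the table created above, so it
-- fails package validation unless its DelayValidation property is True.
-- SELECT OrderID, OrderDate, Amount FROM dbo.StagingOrders;
```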
Best Practice 10: Use parallel execution to improve performance

SSIS achieves better performance by executing packages and data flow tasks in parallel, and this parallelism is controlled by two properties.

MaxConcurrentExecutables: specifies the maximum number of executables (different tasks in the package) that can run in parallel, i.e. the number of threads the SSIS runtime engine can create. If your package is a purely sequential workflow this property makes no difference, but if the package contains parallel tasks it matters. The default value is -1, which means the number of available processors plus 2; if your processors are hyper-threaded, it is the number of logical processors plus 2.

EngineThreads: while MaxConcurrentExecutables applies to the SSIS runtime engine, EngineThreads is used by the data flow pipeline engine. Its default is 5 in SSIS 2005 and 10 in SSIS 2008. It specifies the number of source threads (which extract data from the sources) and worker threads (which perform the transformations and loading) that the pipeline engine can create to manage data movement and transformation inside a data flow task. If EngineThreads is set to 5, up to five source threads and five worker threads can be created. Note that this property is only a suggestion to the pipeline engine, which may create more or fewer threads as needed.

Suppose you have a package with five parallel data flow tasks and MaxConcurrentExecutables set to 3. When the package starts, three data flow tasks run concurrently; whenever one finishes, the next waiting data flow task starts, and so on. What happens inside each data flow task is then governed by EngineThreads. As mentioned in Best Practice 6, a data flow task is split into execution trees, and the pipeline engine creates source and worker threads whose number equals the EngineThreads value, which is also the number of execution trees that can run in parallel. Setting EngineThreads to 5 for a data flow task split into five execution trees does not guarantee that all five execution trees actually run in parallel. Be very careful when changing these properties and test thoroughly before applying them in production: the correct settings depend on the available system resources, and while parallelism can improve performance, a poor configuration can hurt it because of excessive context switching from one thread to another. A good rule of thumb is not to create more parallel threads than there are available processors.

Best Practice 11: When to use event logging and when to avoid it

Logging is the best way to diagnose problems that occur at run time, and it helps a great deal when your code does not behave as expected; almost every programming language provides a logging mechanism for identifying the root cause of an exception or failure. SSIS lets you enable logging, choose which events and which components to log, and specify where the log is stored. Although logging helps you find the root cause of a problem, it degrades performance, and the more you log the more noticeable the degradation becomes, so enable logging only when necessary. You can set the LoggingMode property dynamically to turn logging on or off; when you suspect a package has a problem, enable logging and select the appropriate events to record.

Best Practice 12: Monitor SSIS performance with performance counters

Besides logging, SSIS provides performance counters for monitoring the performance of the runtime and data flow pipeline engines. For example, the "SSIS Package Instances" counter shows the number of SSIS packages running on the system; the "Rows read" and "Rows written" counters show the total number of rows read from sources and written to destinations; the "Buffers in use" and "Buffer memory" counters show the number of buffers created and the memory they consume. "Buffers spooled" is a particularly important counter: it shows how many buffers have been written to disk because physical memory ran short. The "BLOB bytes read", "BLOB bytes written", and "BLOB files in use" counters show the BLOB bytes read and written and the number of BLOB spool files the data flow engine is currently using while transferring BLOB data. If you upgrade a Windows Server 2003 machine with SQL Server and SSIS installed to Windows Server 2008, the SSIS performance counters disappear because the upgrade process removes them; to restore them, refer to this Knowledge Base article: http://support.microsoft.com/kb/955632.
Best Practice 13: Distributed transactions in SSIS and their impact

SSIS lets you group multiple executables and run them inside a single transaction by using distributed transactions, which requires the Distributed Transaction Coordinator service to be running in Windows Services. This sounds appealing at first, but it can cause blocking problems, especially when one of the tasks runs for a very long time. For example, suppose you have a data flow task, then a web service task, then another data flow task in sequence. The first data flow task takes a few minutes to extract data from its source, the web service task takes several hours to pull data from a web service, and the final data flow task merges the data and loads it into the final table. If you run these three tasks in one transaction, the resources locked by the first task stay locked until the whole transaction ends, even though the web service task does not need them.
Therefore, even though SSIS provides distributed transactions, use the feature with caution; if you really must use it, keep unrelated tasks out of the transaction group and set the IsolationLevel property carefully. My suggestion is to avoid this feature where possible and look for alternative solutions.

Best Practice 14: The checkpoint feature helps restart packages

SSIS has a very useful feature called checkpoints, which lets the next execution of your package resume from the point of the last failure instead of starting over, saving a great deal of execution time. You enable it by setting three package properties (CheckpointFileName, CheckpointUsage, and SaveCheckpoints), and you also need to set the FailPackageOnFailure property to True on every task you want to be restartable; with that property set, the relevant information is captured into the checkpoint file when execution fails, and the next execution resumes from the point of the previous failure. How does it work? Once checkpoints are enabled on a package, the execution status is written to the checkpoint file whose name and location are specified by CheckpointFileName. On the next run, the runtime engine consults the checkpoint file before executing the package; if it finds that the previous execution failed, it knows where the failure occurred and resumes from that point. If you delete the file before the next run, the package starts from the beginning. With this feature enabled, a rerun can save a lot of time because the steps that already succeeded are skipped. Note that you can bring a task, including a data flow task, under checkpoint control, but checkpoints cannot be applied inside a data flow task: you can only enable them at the data flow task level, not within the data flow. Suppose you have a data flow task with FailPackageOnFailure set to True so that it participates in checkpoints, and inside it there are five consecutive transformations; if the first four succeed and the fifth fails, the next execution starts again from the first transformation, and the first four transformations are executed again.
