Gradient Dimension Conversion and Its Implementation

Source: Internet
Author: User
Tags ssis

Gradient Dimension Conversion and Its Implementation

Author: Chen Li

In SQL Server 2005, Bi (Business Intelligence) module functions are greatly enhanced. An important module is called SQL Server Integrated Services (SSIS), which is the SQL Server database integration service, its main function is to extract, convert, and load data from the business database or OLTP database (extract-transform-load, ETL) to the data warehouse ), this allows you to perform subsequent data warehouse modeling and data analysis on business data as needed.

Slowly changing dimension is an important component of the SSIS integration service, this article mainly discusses the low efficiency of the gradient Dimension Conversion component encountered during the implementation of the actual project and proposes a solution.  About gradient Dimension ConversionFirst, we will introduce the concept of dimensions. A dimension is a hierarchical relationship that corresponds to the objective world, that is, the angle from which people observe things. In a data warehouse, multiple attributes of an object are defined as the dimensions of the object, allowing you to compare data in different dimensions. There are roughly three types of dimensions: stability, gradient, and quick change. In SSIS of SQL Server 2005, you can convert the gradient dimension. In a long time period, dimension tables often change over time in a data warehouse. To meet this requirement, you must specify a policy to handle changes for each dimension attribute. The main function of the gradient Dimension Conversion component in SSIS is to coordinate Record Update and insertion in the dimension table in the data warehouse, 1.
In SSIS, gradient Dimension Conversion provides the following functions:
  • Match the input row with the row in the search table to identify the new row and the existing row.
  • When changes are not allowed, the incoming lines containing changes are identified.
  • Identifies the inferred member record to be updated.
  • Identifies the incoming rows that contain the historical changes that require new records to be inserted and updated.
  • Detects incoming rows that contain changes that require updating existing records (including expired records.
Common usage situations include the following three types: 1) rewrite the attribute values of Dimension Records: it is easy to implement and convenient, but does not save the attribute change history, therefore, you cannot maintain the historical data of the old attribute. 2) incremental addition of dimension record rows: it is easy to distinguish the history of fact tables and accurately track the changes of dimension attributes. However, if you are targeting massive data volumes, at the same time, the conversion of all the changed historical attribute data will generate a large amount of historical data, resulting in a dramatic increase in the data volume of dimension tables. 3) Add a specified dimension column: You can easily create a new attribute relationship as required. If there are too many attributes associated with the middle, the performance may be greatly affected. In actual project work, incremental addition of dimension record rows is the most common. Incremental addition of dimension record rows-Analysis of dimension gradientAssume that the data is converted from one table (source table) to another table (target table), and the dimension record function is added incrementally for gradient Dimension Conversion. Initial Condition 2 of the source table and target table. Figure 2 Relationship between the target table column and the source table column, as shown in table 1:
Target table column Aa1 Bb1 PC3 Ee1
Source table column AA Bb CC EE
Table 1: Relationship between the target table and the source table mainly considers the following dimension gradient situations: 1) by comparing with the target table row by row, if the source table contains records not found in the target table, the data row should belong to the newly added record. In this case, the newly added record is inserted into the target table and the start time is set to the time set during execution, the end time is the data validity period. 3. 2) if there are records not found in the source table in the target table, that is, the corresponding records in the source table have been deleted, in this case, set the end time for these data rows in the target table to the time set for execution. For all records, as long as it is not the data validity period, it indicates that the marked record is invalid. 4. 3) if the records in the source table are modified after they are imported to the target table, the record will be inserted into the target table as a new row during the next import, set the start time to the execution time, end time to the maximum effective time, and set the end time of the original record in the target table to the current time, indicates that the original record has expired. 5. Transformation of gradient dimensions to ETL in actual projects Problems inIn SQL Server 2005, you can add Dimension Records incrementally and save historical records. Currently, you can use gradient Dimension Conversion, although not as intended as a gradient Dimension Conversion (slowly changing dimension, that is, the amount of changed data should be small), but there is no ideal solution for specific SSIS components. The implementation of specific projects shows that when the data update volume reaches 100,000, the efficiency of the SSIS gradient Dimension Conversion component has been significantly reduced, and when the update volume reaches millions, it is difficult to use gradient Dimension Conversion for ETL. When the existing components cannot meet the project ETL requirements, you often need to write some SQL statements, such as defining user functions or stored procedures, to ensure the smooth process of project ETL. Comparison of EfficiencyIn order to facilitate the comparison, we made a test: SQL Server 2005's gradient Dimension Conversion component and T-SQL stored procedure are for the same source table and target table operations. The source table structure is shown in Figure 1. Expand the record to 163840 rows. use SQL Server 2005 to incrementally convert the dimension and configure the component according to "add dimension record row incrementally, the fully-converted running time is 18 minutes and 22 seconds. Directly use the stored procedure to "incrementally add dimension record rows" and completely convert the running time to 9 seconds. We continued to expand the source table to 786432 rows of records, and waited patiently for nearly an hour. After tens of thousands of rows of records were imported, the records in the SSIS package increased quite slowly. Under the same conditions, it takes about 20 seconds to use the Stored Procedure for conversion. SummaryFrom the above test, we can see that the use of the gradient Dimension Conversion component is conditional. When the amount of data to be converted reaches a certain order of magnitude (such as more than 1 million records ), the conversion efficiency will be greatly reduced. Only when a small amount of historical data is updated in the source table, the gradient Dimension Conversion component will show its convenience and stability. You can also use hardware devices to improve the conversion efficiency when the gradient Dimension Conversion component is low, for example, servers with large memory, multiple CPUs, high I/O throughput, and high network bandwidth are used. However, this method obviously cannot solve the problem fundamentally, and requires more investment, which is detrimental to the implementation of projects under specific conditions. In actual projects, considering the convenience and Stability of the gradient Dimension Conversion component, you can write SQL user functions with the same functions or combine stored procedures with the conversion component. For example, in the initial stage of project data conversion, it is obviously unfavorable to insert a large amount of historical data using the gradient dimension component. In this case, you can use SQL statements with higher efficiency to convert data, when you modify or update Dimension Data in the middle of a project, you can use components to convert data in the form of jobs on a regular basis. References[1] Microsoft. SQL Server 2005 book online (May 2007) [2] Brian knight and so on, SQL Server 2005 advanced integration service programming, Tsinghua University Press, 2007 About the author

Chen Li, Bi engineer of Beijing maisiqi Technology Co., Ltd. Specializes in data warehouse modeling and ETL process development.

Link: http://www.winitpro.com.cn/html/2007/08/20070806155529-1.shtml

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.