Slowly changing dimensions
An important concept in dimensional modeling for data warehouses is the slowly changing dimension, often abbreviated as SCD. In the real world, dimension attributes are not static; they change slowly as time passes. Such a time-varying dimension is generally called a slowly changing dimension, and the problem of handling the historical changes of a dimension table is called the slowly changing dimension problem, or the SCD problem for short. This article describes solutions for handling slowly changing dimensions in a data warehouse.
Definition of slowly changing dimensions in Wikipedia:
Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly changing dimensions (SCD) are dimensions that have data that slowly changes.
In short, dimensions whose data changes slowly are called "slowly changing dimensions".
The following is an example:
In a retail data warehouse, the fact table stores the sales records of each salesperson. One day a salesperson transfers from the Beijing branch to the Shanghai branch. How should this change be saved? In other words, how should the salesperson dimension handle this change properly? First, let's answer a question: why should we handle or save this change at all? Suppose we want to total the sales of Beijing or of Shanghai. Should this salesperson's sales records be counted under Beijing or under Shanghai? Sales before the transfer obviously belong to Beijing and sales after the transfer belong to Shanghai, but how is each record marked with the salesperson's region? This is where the dimension data has to be handled, and that is exactly what slowly changing dimension processing does.
The following solutions are available for handling slowly changing dimensions:
1. Overwrite old data with new data
This method has a precondition: you do not care about the change history of this data. For example, if the English name of a salesperson changes and you do not care about the old name, you can directly overwrite (modify) the data in the data warehouse.
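The overwrite approach can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a definitive implementation: the salesperson table, its columns, and the names "Tom" and "Thomas" are hypothetical and do not come from the article.

```python
import sqlite3

def overwrite_attribute(conn, salesperson_key, new_english_name):
    """Overwrite the attribute in place; the previous value is lost."""
    conn.execute(
        "UPDATE salesperson SET english_name = ? WHERE salesperson_key = ?",
        (new_english_name, salesperson_key),
    )

# Hypothetical dimension table with a single salesperson.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salesperson (salesperson_key TEXT PRIMARY KEY, english_name TEXT)"
)
conn.execute("INSERT INTO salesperson VALUES ('001', 'Tom')")

overwrite_attribute(conn, "001", "Thomas")
rows = conn.execute(
    "SELECT salesperson_key, english_name FROM salesperson"
).fetchall()
print(rows)
```

After the update the table still holds one row per salesperson, so nothing in the warehouse remembers the old name.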
2. Keep multiple records and add a field to distinguish them
With this method, a new record is added, the original record is kept, and a dedicated field distinguishes the two. For example:
(supplier_state in the tables below corresponds to the region in the preceding example. For clarity of description, no surrogate key is used.)
supplier_key  supplier_code  supplier_name                supplier_state  disable
001           ABC            Phlogistical Supply Company  CA              Y
002           ABC            Phlogistical Supply Company  IL              N
Or:
supplier_key  supplier_code  supplier_name                supplier_state  version
001           ABC            Phlogistical Supply Company  CA              0
002           ABC            Phlogistical Supply Company  IL              1
The two methods above add a flag or a version number to tell new data from old data.
The following table instead adds an effective date and an expiration date to each record to tell new data from old data:
supplier_key  supplier_code  supplier_name                supplier_state  start_date   end_date
001           ABC            Phlogistical Supply Company  CA              01-Jan-2000  21-Dec-2004
002           ABC            Phlogistical Supply Company  IL              22-Dec-2004
An empty end_date marks the current version of the data. Alternatively, a default far-future date (such as 12/31/9999) can replace the null value so that the column can take part in an index.
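The effective-date variant can be sketched as follows, assuming the far-future default date rather than a null end_date and ISO-formatted date strings. The function name add_new_version and the HIGH_DATE constant are illustrative choices, not part of the article.

```python
import sqlite3
from datetime import date, timedelta

HIGH_DATE = "9999-12-31"  # assumed sentinel meaning "still current"

def add_new_version(conn, new_key, supplier_code, supplier_name,
                    new_state, change_date):
    """Expire the current row the day before the change, then add a new row."""
    previous_day = (change_date - timedelta(days=1)).isoformat()
    conn.execute(
        "UPDATE supplier SET end_date = ?"
        " WHERE supplier_code = ? AND end_date = ?",
        (previous_day, supplier_code, HIGH_DATE),
    )
    conn.execute(
        "INSERT INTO supplier VALUES (?, ?, ?, ?, ?, ?)",
        (new_key, supplier_code, supplier_name, new_state,
         change_date.isoformat(), HIGH_DATE),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supplier (supplier_key TEXT PRIMARY KEY, supplier_code TEXT,"
    " supplier_name TEXT, supplier_state TEXT, start_date TEXT, end_date TEXT)"
)
conn.execute(
    "INSERT INTO supplier VALUES"
    " ('001', 'ABC', 'Phlogistical Supply Company', 'CA', '2000-01-01', ?)",
    (HIGH_DATE,),
)

add_new_version(conn, "002", "ABC", "Phlogistical Supply Company", "IL",
                date(2004, 12, 22))

history = conn.execute(
    "SELECT supplier_key, supplier_state, start_date, end_date"
    " FROM supplier ORDER BY start_date"
).fetchall()
current = conn.execute(
    "SELECT supplier_state FROM supplier"
    " WHERE supplier_code = 'ABC' AND end_date = ?",
    (HIGH_DATE,),
).fetchone()
print(history)
print(current)
```

Selecting the current version is then a plain equality test on end_date, which is one reason a sentinel date can be easier to work with than a null.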
3. Save different values for different fields
supplier_key  supplier_name                original_supplier_state  effective_date  current_supplier_state
001           Phlogistical Supply Company  CA                       22-Dec-2004     IL
This method records the change in separate fields. However, unlike the second method, it cannot keep every change record. It can only keep two values, the original and the current one, so it is suitable only for dimensions that need no more history than that.
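A minimal sketch of this method, using sqlite3, shows both the update and the limitation. The change_state function is an illustrative name, and the second change to "NY" on 2007-01-15 is a made-up event added only to show what gets lost.

```python
import sqlite3

def change_state(conn, supplier_key, new_state, effective_date):
    """Overwrite the current-state field; the original-state field is untouched."""
    conn.execute(
        "UPDATE supplier SET current_supplier_state = ?, effective_date = ?"
        " WHERE supplier_key = ?",
        (new_state, effective_date, supplier_key),
    )

def read_states(conn, supplier_key):
    return conn.execute(
        "SELECT original_supplier_state, effective_date, current_supplier_state"
        " FROM supplier WHERE supplier_key = ?",
        (supplier_key,),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supplier (supplier_key TEXT PRIMARY KEY, supplier_name TEXT,"
    " original_supplier_state TEXT, effective_date TEXT,"
    " current_supplier_state TEXT)"
)
conn.execute(
    "INSERT INTO supplier VALUES"
    " ('001', 'Phlogistical Supply Company', 'CA', NULL, 'CA')"
)

change_state(conn, "001", "IL", "2004-12-22")
after_first = read_states(conn, "001")

# Hypothetical second change: the intermediate value IL is no longer stored.
change_state(conn, "001", "NY", "2007-01-15")
after_second = read_states(conn, "001")

print(after_first)
print(after_second)
```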
4. Create a new table and save the history
Create another history table to store the change history, while the dimension only saves the current data.
Supplier:
supplier_key  supplier_name                supplier_state
001           Phlogistical Supply Company  IL
Supplier_history:
supplier_key  supplier_name                supplier_state  create_date
001           Phlogistical Supply Company  CA              22-Dec-2004
This method only records the trail of historical changes; it is inconvenient for statistical analysis.
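The history-table approach might look like the sketch below, again using sqlite3. The change_state function name is illustrative, and the assumption that create_date stores the date the old row was moved into the history table is an interpretation of the example above.

```python
import sqlite3

def change_state(conn, supplier_key, new_state, change_date):
    """Copy the current row into supplier_history, then overwrite it."""
    conn.execute(
        "INSERT INTO supplier_history"
        " SELECT supplier_key, supplier_name, supplier_state, ?"
        " FROM supplier WHERE supplier_key = ?",
        (change_date, supplier_key),
    )
    conn.execute(
        "UPDATE supplier SET supplier_state = ? WHERE supplier_key = ?",
        (new_state, supplier_key),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supplier (supplier_key TEXT PRIMARY KEY,"
    " supplier_name TEXT, supplier_state TEXT)"
)
conn.execute(
    "CREATE TABLE supplier_history (supplier_key TEXT,"
    " supplier_name TEXT, supplier_state TEXT, create_date TEXT)"
)
conn.execute(
    "INSERT INTO supplier VALUES ('001', 'Phlogistical Supply Company', 'CA')"
)

change_state(conn, "001", "IL", "2004-12-22")

current_rows = conn.execute("SELECT * FROM supplier").fetchall()
history_rows = conn.execute("SELECT * FROM supplier_history").fetchall()
print(current_rows)
print(history_rows)
```

Because the dimension itself only ever holds the current state, any analysis by historical state has to reach into the second table, which is the inconvenience noted above.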
5. Hybrid mode
This mode is a mixture of the methods above. It is more comprehensive and copes better with complicated, frequently changing user requirements.
row_key  supplier_key  supplier_code  supplier_name                supplier_state  start_date   end_date     current_indicator
1        001           ABC001         Phlogistical Supply Company  CA              22-Dec-2004  15-Jan-2007  N
2        001           ABC001         Phlogistical Supply Company  IL              15-Jan-2007  1-Jan-2099   Y
This method has the following advantages:
1. You can use simple filtering conditions to select the current value of a dimension.
2. It is easy to associate the value of fact data at any time in history.
3. If the fact table has several date fields (such as order date, shipping date, confirmation date), we can easily choose which one to use when selecting the dimension data for analysis.
The row_key and current_indicator fields are optional, but adding them makes things more convenient. After all, dimension tables are not large, and a few redundant fields take little space while improving query efficiency.
In this design, the fact table should use supplier_key as the foreign key. Because this field does not uniquely identify a dimension row, the relationship between the fact table and the dimension table becomes many-to-many, so a timestamp field (or the indicator field) must be added to the condition when joining facts to dimensions.
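The join conditions described here can be sketched with sqlite3 as follows. The delivery_fact table is a hypothetical fact table, dates are stored as ISO strings, and the date range is treated as half-open (start_date inclusive, end_date exclusive) because the two dimension rows share the boundary date 15-Jan-2007; that interpretation is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier_dim (
    row_key INTEGER PRIMARY KEY, supplier_key TEXT, supplier_code TEXT,
    supplier_name TEXT, supplier_state TEXT,
    start_date TEXT, end_date TEXT, current_indicator TEXT);
INSERT INTO supplier_dim VALUES
    (1, '001', 'ABC001', 'Phlogistical Supply Company', 'CA',
     '2004-12-22', '2007-01-15', 'N'),
    (2, '001', 'ABC001', 'Phlogistical Supply Company', 'IL',
     '2007-01-15', '2099-01-01', 'Y');
-- Hypothetical fact table keyed by supplier_key only.
CREATE TABLE delivery_fact (supplier_key TEXT, delivery_date TEXT, quantity INTEGER);
INSERT INTO delivery_fact VALUES
    ('001', '2006-12-22', 132), ('001', '2007-01-15', 324);
""")

# Joining on supplier_key alone matches every fact to every version.
unfiltered = conn.execute("""
    SELECT COUNT(*) FROM delivery_fact f
    JOIN supplier_dim d ON f.supplier_key = d.supplier_key
""").fetchone()[0]

# Historical view: pick the version that was valid on the delivery date.
as_of_delivery = conn.execute("""
    SELECT f.delivery_date, d.supplier_state, f.quantity
    FROM delivery_fact f
    JOIN supplier_dim d
      ON f.supplier_key = d.supplier_key
     AND f.delivery_date >= d.start_date
     AND f.delivery_date < d.end_date
    ORDER BY f.delivery_date
""").fetchall()

# Current view: report every fact under the current version.
current_view = conn.execute("""
    SELECT d.supplier_state, SUM(f.quantity)
    FROM delivery_fact f
    JOIN supplier_dim d
      ON f.supplier_key = d.supplier_key
     AND d.current_indicator = 'Y'
    GROUP BY d.supplier_state
""").fetchall()

print(unfiltered)
print(as_of_delivery)
print(current_view)
```

Without the extra condition the two facts fan out to four joined rows, which is the many-to-many effect; either filter brings the result back to one dimension row per fact.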
6. Unconventional hybrid mode
The fifth method above has a drawback: the relationship between the fact table and the dimension table is not many-to-one but many-to-many. This relationship cannot be resolved at modeling time; it can only be resolved in the reporting layer, and adding time filter conditions when building the BI semantic layer or at report run time is cumbersome.
The following solution resolves the many-to-many relationship, but it requires modifying the fact table:
Supplier dimension:
version_number  supplier_key  supplier_code  supplier_name                supplier_state  start_date   end_date
1               001           ABC001         Phlogistical Supply Company  CA              22-Dec-2004  15-Jan-2007
0               001           ABC001         Phlogistical Supply Company  IL              15-Jan-2007  1-Jan-2099
Fact delivery: (for clarity of description, no surrogate key is used to identify the dimension)
delivery_key  supplier_key  supplier_version_number  quantity  product  delivery_date  order_date
1             001           0                        132       bags     22-Dec-2006    15--200-2006
2             001           0                        324       chairs   15-Jan-2007    1-Jan-2007
In this solution, the version number of the current data in the dimension table is always 0. When new dimension data arrives, the version_number of each old version is incremented (the previous current version becomes 1), and then the new data is inserted with version 0, so the current version is always 0.
When data is inserted into the fact table, the dimension version number it references is always 0.
Therefore, this solution completely resolves the many-to-many relationship between the fact table and the dimension table. It also guarantees referential integrity between them, and when modeling with tools such as ERwin or PowerDesigner, version_number and supplier_key can serve as a composite primary key to link the two entities.
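The version-shifting step can be sketched with sqlite3 as below. This is one possible way to do it under stated assumptions: the function name insert_new_version is illustrative, the two-step shift through negative numbers is a workaround chosen here so that the composite primary key never collides during the update, and foreign key enforcement is left off, so referential integrity is not actually checked in this sketch.

```python
import sqlite3

HIGH_DATE = "2099-01-01"  # far-future end date, as in the dimension table above

def insert_new_version(conn, supplier_key, supplier_code, supplier_name,
                       new_state, change_date):
    """Age every stored version by one, then insert the new row as version 0."""
    # Close the row that is current right now.
    conn.execute(
        "UPDATE supplier_dim SET end_date = ?"
        " WHERE supplier_key = ? AND version_number = 0",
        (change_date, supplier_key),
    )
    # Shift versions in two steps through negative numbers so that
    # (version_number, supplier_key) stays unique part-way through.
    conn.execute(
        "UPDATE supplier_dim SET version_number = -(version_number + 1)"
        " WHERE supplier_key = ?",
        (supplier_key,),
    )
    conn.execute(
        "UPDATE supplier_dim SET version_number = -version_number"
        " WHERE supplier_key = ? AND version_number < 0",
        (supplier_key,),
    )
    conn.execute(
        "INSERT INTO supplier_dim VALUES (0, ?, ?, ?, ?, ?, ?)",
        (supplier_key, supplier_code, supplier_name, new_state,
         change_date, HIGH_DATE),
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier_dim (
    version_number INTEGER, supplier_key TEXT, supplier_code TEXT,
    supplier_name TEXT, supplier_state TEXT, start_date TEXT, end_date TEXT,
    PRIMARY KEY (version_number, supplier_key));
INSERT INTO supplier_dim VALUES
    (0, '001', 'ABC001', 'Phlogistical Supply Company', 'CA',
     '2004-12-22', '2099-01-01');
CREATE TABLE delivery_fact (
    delivery_key INTEGER PRIMARY KEY, supplier_key TEXT,
    supplier_version_number INTEGER, quantity INTEGER, product TEXT);
INSERT INTO delivery_fact VALUES
    (1, '001', 0, 132, 'bags'), (2, '001', 0, 324, 'chairs');
""")

insert_new_version(conn, "001", "ABC001", "Phlogistical Supply Company",
                   "IL", "2007-01-15")

versions = conn.execute(
    "SELECT version_number, supplier_state, start_date, end_date"
    " FROM supplier_dim ORDER BY version_number"
).fetchall()
# Joining on both key columns gives exactly one dimension row per fact.
joined = conn.execute(
    "SELECT f.delivery_key, d.supplier_state, f.quantity"
    " FROM delivery_fact f JOIN supplier_dim d"
    " ON f.supplier_key = d.supplier_key"
    " AND f.supplier_version_number = d.version_number"
    " ORDER BY f.delivery_key"
).fetchall()
print(versions)
print(joined)
```

Since every fact row carries version 0, the plain two-column join always lands on the current dimension row; looking up the state that was valid at delivery time would still need the start_date and end_date fields.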