A concept in a data warehouse for dimensional modeling is slowly changing dimensions, which is often abbreviated as SCD. The slow change dimension is proposed because in the real world, the dimension attributes are not static and will change slowly with the loss of time. This time-varying dimension is generally called a slow-changing dimension, and the problem of processing the historical changes of the dimension table is called the problem of processing the slow-Changing Dimension, it is also referred to as SCD. !>! -->! [Endif] -->! -->! [Endif] -->! -->
There are usually three methods to process slow-changing dimensions.
I-when the data of a dimension Member changes, the latest column value will overwrite the previous dimension record, thus clearing the historical records of the Dimension member.
II-when the data of a dimension Member changes, the latest column value is stored as a new record in the dimension, providing multiple instances of a dimension member, so that the historical records are retained.
III-when the column data of a dimension Member changes and the Data Warehouse wants to retain the last version of the change column, the original data will be moved to the last version column of the dimension record, all new dimension information overwrites existing columns.
Among the above three methods, the I processing is the simplest, but will lead to distortion of the analysis results. For example, when the Department of a salesperson changes, if the first method is used, when processing the drill-up and drill-down operations between departments and sales personnel, it will lead to an aggregation error, bringing the sales performance of sales personnel in the original department to the new department.
When using method II, you need to add a proxy key. Note that you have corresponding processing skills for different data warehouse platforms. Otherwise, unreasonable situations may occur. For example, in Microsoft Analysis service2005, when a salesperson is transferred from department a to department B and transferred back to department a, if the key of the dimension attribute is set as the agent key, A salesperson may appear twice in department.
Method iii may involve dynamic changes to the data warehouse model, which brings about a large amount of overhead, so it is not often used in actual data warehouse implementation.
In general, policies that are used in combination with method I and method II are used to directly overwrite descriptive attributes of a dimension, the second method is used for attributes that will appear in layers and may participate in aggregate computing.
========================================================== ======================================
When the Department shared this evening, the speaker spoke about this point. He gave the answer to the question of how to handle slow changes and dimensions: "select an appropriate method based on business needs, there is no correct answer"
###########################
Definition of slowly changing dimensions
Definition in Wikipedia:
Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information.
Slowly changing dimensions (SCD) are dimensions that have data that slowly changes.
The dimension in which data changes slowly is called "slowly changing dimensions ".
The following is an example:
In a retail data warehouse, the fact table stores the sales records of each salesperson. One day a salesperson transferred from the Beijing branch to the Shanghai branch. How can he save this change? That is to say, how to properly handle this change in the sales personnel dimension. First, let's answer a question. Why should we handle or save this change? If we want to count the total sales in Beijing or Shanghai, should the sales records of this salesperson be calculated in Beijing or Shanghai? Of course, it is in Beijing before the transfer, but after the transfer, it is in Shanghai, but if it is marked as the region of the salesperson? Here we need to process the data of this dimension, that is, what we need to do when we slowly change the dimension.
The following solutions are available for handling slow change dimensions:
I. New Data overwrites old data
This method must be conditional, that is, you do not care about the changes in this digital drama. For example, if the English name of a salesperson is changed, you can directly overwrite (modify) the data in the data warehouse if you do not care about the change in the employee's English name.
2. Save multiple records and add fields to differentiate
In this case, a new record is directly added, the original record is retained, and the difference is saved using a dedicated field. For example:
(Supplier_state in the following table indicates the region in the preceding example. The description is clear and does not need a proxy key)
Supplier_key supplier_code supplier_name supplier_state disable
001 ABC phlogistical supply company ca y
002 ABC phlogistical supply company il n
Or:
Supplier_key supplier_code supplier_name supplier_state version
001 ABC phlogistical supply company Ca 0
002 ABC phlogistical supply company IL 1
The preceding two methods are used to add data version information or to identify new and old data.
The following are the effective date and expiration date of the added record to identify the new and old data:
Supplier_key supplier_code supplier_name supplier_state start_date end_date
001 ABC phlogistical supply company CA 01-Jan-2000 21-dec-2004
002 ABC phlogistical supply company il 22-dec-2004
An empty end_date indicates the data of the current version, or you can use a default large time (such as 12/31/9999) to replace the null value, so that the data can be recognized by the index.
3. Save different values for different fields
Supplier_key supplier_name original_supplier_state 1_tive_date current_supplier_state
001 phlogistical supply company ca 22-dec-2004 Il
This method saves the variation trace with different fields. however, this method cannot save all change records as in the second method. It can only save two change records. applicable to dimensions with no more than two changes.
4. Create a table to save historical records
Create another history table to store the change history, while the dimension only saves the current data.
Supplier:
Supplier_key supplier_name supplier_state
001 phlogistical supply company Il
Supplier_history:
Supplier_key supplier_name supplier_state create_date
001 phlogistical supply company ca 22-dec-2004
This method only records the historical traces of changes, but it is inconvenient to start statistical operations.
V. hybrid mode
This mode is a mixture of the above models. This method is relatively more comprehensive and can better cope with complicated and easy-to-change user needs.
Row_key supplier_key supplier_code supplier_name supplier_state start_date end_date current indicator
1 001 abc001 phlogistical supply company ca 22-dec-2004 15-Jan-2007 n
2 001 abc001 phlogistical supply company IL 15-Jan-2007 1-Jan-2099 y
This method has the following advantages:
1. You can use simple filtering conditions to select the current value of a dimension.
2. It is easy to associate the value of fact data at any time in history.
3. If the fact table contains some time fields (such as order date, shipping date, and confirmation date), we can easily select which Dimension Data for association analysis.
The row_key and current indicator fields are dispensable and easier to add. After all, the data in the dimension table is not large. Multi-Point redundant fields do not occupy much space but can improve the query efficiency.
In this design mode, the fact table should use supplier_key as the foreign key. Although this field cannot uniquely identify a dimension data, it forms a multi-to-many relationship between the fact table and the dimension table, therefore, the timestamp field (or indicator field) should be added when associating facts and dimensions ).
6. unconventional hybrid mode
The fifth implementation method mentioned above has some drawbacks, that is, fact tables and dimension tables do not have many-to-one relationships, but many-to-many relationships. Such relationships cannot be solved only on the report Layer During modeling, it is complicated to add time filter conditions when creating a bi semantic layer during report running.
The following solution can solve the many-to-many relationship, but you must modify the fact table:
Supplier dimension:
Version_number supplier_key supplier_code supplier_name supplier_state start_date end_date
1 001 abc001 phlogistical supply company ca 22-dec-2004 15-Jan-2007
0 001 abc001 phlogistical supply company IL 15-Jan-2007 1-Jan-2099
Fact delivery: (the description is clear and the agent key is not used to identify the dimension)
Delivery_key supplier_key supplier_version_number quantity product delivery_date order_date
1 001 0 132 bags 22-dec-2006 15--200-2006
2 001 0 324 chairs 15-Jan-2007 1-Jan-2007
In this solution, the current data version number in the dimension table is always 0, that is, when the Dimension Data is inserted, the version_number of the old version is changed to 1 (incrementing), and then the current data is inserted, in this case, the current data version is always 0.
When inserting data in a fact table, all Dimension Data versions are always 0.
Therefore, this solution can completely solve the problem of multi-to-Multi-Relationship between fact tables and dimension tables. In addition, it can ensure the integrity of reference between fact tables and dimension tables, and we are using Erwin, in modeling by powerdesigner and other modeling tools, version_number and supplier_key can be used as composite primary keys to establish links between two entities.