The primary key in a dimension table usually has two choices: the Natural key (Natural key), which is already present in the business system, usually a character-type marker with a certain business meaning, which uniquely flags each of the dimension tablesRecording. Like whatOrganization's code, abbreviations, time tags, and so on. The other is the surrogate key (surrogate key), which is usuallyA numeric value given by a database system that is self-increasing, assigned in order, has no built-in meaning but can uniquely identify a dimensionInformation.
Project experience, it is recommended to use the second type, that is, the surrogate key. The reasons are as follows:
First, although the natural key can logically uniquely identify a dimension information, but it is usually a character type, and generally longer, if it is used as the primary key in the dimension table, it means that the same foreign key information is added to the fact table, and the fact table records the number of rows is often huge, Repeating this practice on multiple dimension tables causes the fact table to fall sharply because of the column widths that are too expansive.
Second, surrogate keys can be used asThe "buffering" between the data warehouse and the source system. Natural keys usually have a certain business meaning, but mercifully, the information is likely to change, such as the identity card number, from the original 15 bits into the current 18. If this primary key changes, because it acts as a foreign key in the fact table, it will inevitably have an impact on the fact table, because the existing fact record has not found a matching dimension record, which brings a lot of trouble. However, if the surrogate key is used as the primary key in the dimension table, these changes can be masked in the dimension table without any effect on the fact table (this also combines the processing of slowly changing dimensions).
Finally, from the relational efficiency, the correlation of numerical type is much faster than that of character type.
The choice of key constraint is also a noticeable problem in the design of Data Warehouse. In an OLTP system environment, the integrity of data is usually guaranteed by two ways, one is the logic guarantee of the application and the other is the constraint mechanism of the database structure itself. These two ways complement each other, and the situation in the Data Warehouse environment is completely different, the data integrity of data Warehouse is more dependent on the application, that is, the guarantee of ETL system.
First of all, ETL system running time is very long, but its structure is simple, repeatedly grasping, cleaning, transformation, loading action. Instead, an OLTP system might perform a large number of parallel business operations on a single table .
Secondly, the only entrance to the fact table is the dimension table, which realizes the ETL program according to the dimension modeling idea, only produces the inaccurate dimension information, it is impossible to produce duplicate records in the fact table. Thirdly, compared with the OLTP system, the Data Warehouse system has no interactive human-machine input interface, and there is no "man-made" error.
Therefore, it is best to remove some unnecessary constraints from the data warehouse, including primary KEY constraints, foreign key constraints, and unique index constraints, which can be implemented externally, when we are looking at data load times.
About the use of surrogate key in dimension table structure settings