This article discusses two common methods of data warehouse model design. A data warehouse typically serves two kinds of workload: answering repetitive, predefined questions, and answering interactive questions. Dynamic queries are strongly interactive, and this interactive process is often called data mining or knowledge discovery.
Features of the Data Warehouse Model
Traditional OLTP systems are modeled around applications; in other words, they are application-oriented. A data warehouse, by contrast, is generally modeled around subjects; it is subject-oriented. What is an application, and what is a subject? Consider a simple example.
A bank typically runs several business systems, such as personal savings, corporate savings, and credit cards. Each is application-oriented, supporting a simple, fixed set of transaction types. For implementation reasons these systems may run on different platforms with no connection to one another, and data is duplicated across them; for example, each system keeps its own customer data. When building a data warehouse for the bank, the data in these production systems is converted and loaded into the warehouse. From the perspective of the bank as a whole, the data model is no longer oriented to individual applications but to enterprise-wide subjects such as customer, product, and channel. Information related to customers, products, and channels in each production system is therefore mapped to the corresponding subjects in the data warehouse, providing a consistent view of information across the bank's business.
Data warehouse modeling methods
Logical modeling is an important part of data warehouse implementation: it directly reflects the needs of the business departments and also guides the physical implementation of the system. The two most commonly used modeling methods are the so-called 3NF (Third Normal Form) and the star schema. Below we focus on the characteristics of each method and its applicability in a data warehouse system.
What is the third normal form?
Normal forms are the basic theory behind logical database model design. A relational model can be losslessly decomposed from the first normal form through the fifth normal form; this process is called normalization. In data warehouse model design, the third normal form (3NF) is generally used. It has a strict mathematical definition, but informally, a relation that conforms to 3NF must satisfy the following three conditions:
1. Each attribute value is atomic and unambiguous;
2. Each non-key attribute must depend on the entire primary key, not on only part of it;
3. No non-key attribute may depend on an attribute of another relation; if it does, that attribute belongs in the other relation.
As we can see, the definition of the third normal form is essentially about the relationship between the primary key and the non-key attributes. A relation that satisfies only the first condition is in first normal form; one that satisfies the first two conditions is in second normal form; and so on. Each higher normal form therefore includes the requirements of the lower ones.
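As a small illustration of these rules, the sketch below stores hypothetical bank-style data in third normal form: attributes that depend only on the customer or only on the product are kept in their own relations rather than repeated in every order row. All table and column names are invented for the example.

```python
import sqlite3

# A hypothetical 3NF decomposition. A single wide table
# (order_id, customer_id, customer_branch, product_id, product_name, amount)
# would violate 3NF: customer_branch depends only on customer_id and
# product_name depends only on product_id, not on the order itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT,
    branch        TEXT              -- depends only on the customer key
);
CREATE TABLE product (
    product_id    INTEGER PRIMARY KEY,
    product_name  TEXT              -- depends only on the product key
);
CREATE TABLE account_order (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES customer(customer_id),
    product_id    INTEGER REFERENCES product(product_id),
    amount        REAL              -- depends on the whole order key
);
""")
conn.close()
```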
What is the star schema?
The star schema is a multidimensional data model consisting of a fact table and a set of dimension tables. Each dimension table has its own dimension key as the primary key, and together these keys form the composite primary key of the fact table; in other words, each component of the fact table's primary key is a foreign key referencing a dimension table. The non-key attributes of the fact table are called facts; they are generally numeric or otherwise computable values, while dimension data is typically textual, temporal, or similarly descriptive.
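A minimal sketch of this structure, using the car-sales example that appears later in the article (the table and column names here are invented for illustration):

```python
import sqlite3

# A minimal star-schema sketch: three dimension tables and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_model (model_id INTEGER PRIMARY KEY, model_name TEXT);
CREATE TABLE dim_color (color_id INTEGER PRIMARY KEY, color_name TEXT);
CREATE TABLE dim_agent (agent_id INTEGER PRIMARY KEY, agent_name TEXT);

-- The fact table's composite primary key is built from the dimension keys,
-- and each key column is a foreign key into a dimension table.
CREATE TABLE fact_sales (
    model_id   INTEGER REFERENCES dim_model(model_id),
    color_id   INTEGER REFERENCES dim_color(color_id),
    agent_id   INTEGER REFERENCES dim_agent(agent_id),
    units_sold INTEGER,             -- facts: numeric, computable measures
    revenue    REAL,
    PRIMARY KEY (model_id, color_id, agent_id)
);
""")
conn.close()
```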
Application of the third normal form and the star schema in the data warehouse
The basic structure of a data warehouse can be divided into four layers; the discussion below focuses mainly on the central data warehouse and the department-level data marts.
Some enterprises do not build a data warehouse for the entire enterprise but instead build independent data marts around departmental applications (for a comparison of data marts and data warehouses, see the author's article, based on Bill Inmon, in issue 27 of this year).
The logical model of the central data warehouse is usually designed according to the third normal form. In the physical implementation, limitations of the database engine often force us to denormalize the logical model to improve response time. This of course comes at the cost of greater system complexity, more maintenance work, a worse disk expansion ratio (disk space occupied relative to raw data size), and a reduced ability to execute dynamic queries.
According to the TPC-D decision-support benchmark specification, the operations that challenge a database engine most in a data warehouse system are multi-table joins, aggregation, data sorting, and large-table scans. The following are some of the compromises adopted in practice to cope with these difficulties:
1. Avoiding multi-table joins: merge tables at model design time, that is, pre-join them (see the sketch after this list). When the data volume is small, the star schema can also be used to speed up the system, at the cost of more data redundancy.
2. Avoiding aggregation at query time: add summarized (pre-aggregated) data to the model. This too adds redundancy, and ad hoc adjustments are needed whenever a question falls outside the pre-built summaries.
3. Avoiding sorting at query time: sort the data in advance. However, as the warehouse keeps running and new data arrives, the database administrators' workload grows considerably; a great deal of time is spent reorganizing the system, and its availability drops.
4. Avoiding large-table scans: a large number of indexes can prevent scans over huge amounts of data, but they increase system complexity and reduce the system's ability to handle dynamic queries.
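As a rough illustration of the first two measures, the sketch below pre-joins and pre-summarizes detail data so that a report query needs neither a join nor aggregation at query time. The table and column names are invented for the example.

```python
import sqlite3

# Sketch of pre-joining and pre-summarizing (measures 1 and 2 above).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, branch TEXT);
CREATE TABLE sales    (sale_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       amount REAL);

INSERT INTO customer VALUES (1, 'North'), (2, 'South');
INSERT INTO sales VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);

-- Pre-join + pre-aggregate: the summary is computed once at load time,
-- trading redundancy and refresh work for fast, join-free reporting queries.
CREATE TABLE branch_sales_summary AS
SELECT c.branch, SUM(s.amount) AS total_amount, COUNT(*) AS sale_count
FROM sales s JOIN customer c ON s.customer_id = c.customer_id
GROUP BY c.branch;
""")

# A report query now reads one small table instead of joining and
# aggregating the detail tables.
for row in conn.execute("SELECT branch, total_amount FROM branch_sales_summary"):
    print(row)
conn.close()
```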
Most of these measures amount to denormalization. As discussed above, when a normalized logical model is physically implemented, the limitations of the database engine often force some degree of denormalization. For example, when the data volume is small, say a few gigabytes, the response time of complex queries such as multi-table joins is tolerable. But when the volume grows to hundreds of gigabytes or even terabytes, a table may hold millions, tens of millions, or more records, and such queries may fail to respond for a long time. It then becomes necessary to merge tables to minimize the number of joins. How far to denormalize depends, of course, on the parallel processing capability of the database engine. When selecting an engine, design a test plan based on your own situation rather than relying only on published benchmark results, and choose the database system that best fits your enterprise's decision-support requirements.
Where to apply denormalization
Now let's discuss at which layer denormalization should be applied when it cannot be avoided.
Because the data model of the central data warehouse reflects the business rules of the entire enterprise, denormalizing it can easily affect the whole system and hinder future expansion. Moreover, the redundancy introduced by denormalization rapidly inflates the overall data volume, increasing both the DBA's workload and the system investment. Therefore, when performance declines and denormalization becomes necessary, it is better to apply it in a department-level data mart, where the problems are more concentrated. This improves performance effectively without affecting the rest of the system, and it is essentially the approach taken in a number of successful enterprise-level data warehouse projects abroad.
So can the star schema be used to model the central data warehouse? As noted, a star schema consists of a fact table and a group of dimension tables, and a fact can be regarded as the value at the intersection of the dimensions. For example, when a car manufacturer studies its sales, it may examine factors such as model, color, and sales agent. These factors are the dimensions, and the sales figures are the facts. Such a multidimensional model can quickly produce reports along each dimension, but the dimensions must be determined in advance.
The star schema is fast because a great deal of preprocessing is done along each dimension, such as pre-computing statistics, classifying, and sorting by dimension. In the example above, sales are pre-aggregated by model, color, and agent. A data warehouse designed with a star schema therefore generates reports quickly, but the modeling process itself is relatively slow because of all this preprocessing. When the business questions change and the original dimensions no longer suffice, new dimensions must be added; since the fact table's primary key is composed of the primary keys of all the dimension tables, such a change is complex and time-consuming. Another notable drawback of the star schema is its large amount of data redundancy. From this discussion it is not hard to conclude that the star schema suits predefined questions, such as producing a large number of fixed reports, and is less suitable when there are many dynamic queries, when high scalability is required, or when the data volume is very large. Hence the star schema is widely used in department-level data marts that need many reports.
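A rough sketch of this kind of pre-aggregation along predefined dimensions, using pandas with invented sample data:

```python
import pandas as pd

# Invented sample data for the car-sales example: each row is a sales record
# with its dimension values.
sales = pd.DataFrame({
    "model": ["sedan", "sedan", "suv", "suv"],
    "color": ["red", "blue", "red", "red"],
    "agent": ["A", "A", "B", "A"],
    "units": [10, 5, 7, 3],
})

# Pre-aggregate along the predefined dimensions; reports along these
# dimensions are then simple lookups in the summary.
summary = sales.groupby(["model", "color", "agent"], as_index=False)["units"].sum()
print(summary)

# Adding a new dimension (e.g. region) later would mean changing the grouping
# key and recomputing the summary, which mirrors the cost of changing the
# fact table's composite key described above.
```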
Summary
The discussion above has covered the two methods used in data warehouse model design. A data warehouse serves two kinds of workload: answering repetitive, predefined questions and answering interactive questions. Dynamic queries are strongly interactive, with each query building on the answer to the previous one; this interactive process is often called data mining or knowledge discovery. For a department-level data mart carrying the first kind of workload, the star schema works well when the data volume is modest and the reports are relatively fixed. For the central data warehouse, considering scalability, investment cost, and ease of management, the third normal form is the better choice.