Introduction
The business environment is changing rapidly, and so is the type of business data. The foundation of a successful data warehouse solution is flexible design, which can adapt to changing business data. Data warehouse architecture and data modeling are the core processes in warehouse design.
Data warehouse architecture
When you use a data model to capture business requirements, you have completed some of the work in the data warehouse design. However, the formal data warehouse design should begin with the data warehouse architecture.
The repository architecture is a key decision made based on some factors, these factors include the current infrastructure, business environment, the expected management and control structure, the commitment and scope to achieve work, the functions of the technical environment used by the enterprise, and available resources.
Architecture Selection
The repository architecture determines the location of the data warehouse and data mart, as well as the location where the Data Warehouse resides, or vice versa. For example, data can reside in the central location for centralized management. Alternatively, data can reside in a centralized or independently managed distributed local and/or remote location.
You have the following Architecture options:
Business-wide data warehouse
Independent data mart
Interconnected data mart
These architecture options can also be used in combination. For example, the data warehouse architecture can be physically distributed or centrally managed.
Data warehouse architecture for business scope
A data warehouse within the business scope is a data warehouse that will support the entire or most businesses. This business requires a more fully integrated data warehouse, line of business) high data access and utilization. Design and construct a Warehouse Based on the overall business needs. It can be regarded as a public repository that can support data across the entire enterprise decision, or a large subset. The term "business-wide" here reflects the range of data access and use, rather than the physical structure. In the entire enterprise, data warehouses within the business scope can be physically centralized or distributed.
Independent data mart Architecture
An independent data mart architecture implies a separate data mart, which is controlled by a specific working group, department, or business line and fully constructed to meet its needs. In fact, they do not even have any connectivity with data marketplaces in other working groups, departments, or business lines.
Figure 1. Data warehouse architecture Selection
Interconnected data mart Architecture
The interconnected data mart architecture is basically a distributed implementation. Although different data marketplaces are implemented in specific working groups, departments, or production lines, they can be integrated and interconnected to provide a more global data view of the business scope. In fact, they can become data warehouses within the business scope at the highest integration level. This means that end users in one department can access and use the data in the data mart in another department.
Which architecture should you choose?
If your customer's business and data sources are relatively concentrated, the centralized data warehouse architecture within the business scope is the smartest choice. This is a common situation for companies in the intermediate market. Otherwise, for geographically widely distributed businesses, the interconnected data mart and distributed data warehouse within the business scope are more practical options.
An independent data mart architecture is not a good method because it violates the key concept of data warehouse: data integration.
Data Warehouse Implementation Method
The selection of implementation methods is affected by these factors: current IT infrastructure, available resources, selected architecture, scope of implementation, requirements for data access to a larger business scope across enterprises, and return on investment (return-on- investment) requirement and implementation speed. The selection of implementation methods is an important part of data warehouse design. This decision needs to be made in the early stages of the Data Warehouse project.
Top-down implementation
The top-down approach is to implement a data warehouse in a single project phase. The top-down implementation requires more planning and design work at the beginning of the project. This involves the personnel in each working group, department, or business line involved in data warehouse implementation. Data sources, security, data structures, data quality, data standards, and decisions on the entire data model need to be completed before the real implementation starts.
Bottom-up implementation
The bottom-up implementation includes data warehouse planning and design, without waiting for a data warehouse design with a larger business scope. This does not mean that data warehouse design will not be developed for a larger business scope; with the expansion of the initial data warehouse implementation, the construction of it will gradually increase. This method is more widely accepted than the top-down method, because the direct results of the Data Warehouse can be implemented and can be used as a proof to extend the implementation of a larger business scope.
Which implementation should you choose?
Each implementation method has advantages and disadvantages. In many cases, the best method may be a combination of two. One of the keys of this method is to determine the degree to which the architecture of the business scope needs to be used to support the integration plan and design, because the data warehouse is built using a bottom-up method.
When you use a bottom-up or phased data warehouse project model to build a series of data marketplaces in a business-scope architecture, You can integrate data marketplaces in different business topics one by one, to form a well-designed business data warehouse. This method is applicable to the business. Each data mart can handle Identifiable business issues or subject areas to calculate the ROI. Building teams can test and adjust products, and this method also provides a valuable learning curve for building teams.
For companies in the intermediary market, there are some additional reasons to adopt a bottom-up approach:
In the business and business data structure of the intermediate market, there is more variability than the enterprise business data.
Small companies usually have limited project budgets.
Companies in the intermediate market need quick solutions to ease their business difficulty.
The personnel required for such projects must have a broad understanding of the business and detailed knowledge in specific business areas. It is very difficult to find such a person, but even if they can, it is more difficult to use their time for data modeling than to make them do their common business duties.
Data warehouse infrastructure
Now that you have made some decisions about the advanced data warehouse architecture, you can start to consider what components the data warehouse should have.
Figure 2. Advanced view of business intelligence infrastructure components
The data warehouse shall have all components in the Business Intelligence infrastructure described in. This article will focus on the construction of the data warehouse, which involves all these components except information analysis.
These business intelligence components can be defined:
Data source: clearly defines currently available business data sources, external data sources, and data sources that may exist in the future.
Data Acquisition: ETL (Extract, transform, and load) processes used to obtain, cleanse, convert, and integrate data.
Business Data Warehouse: the database of the warehouse data warehouse.
Data transmission: ETL process used to aggregate and enhance data for the data mart.
Data mart: a subset of data warehouses in a more user-friendly data structure.
Information Analysis (not covered in this solution ).
Metadata Management: business requirements, data models, ETL process design, user manuals, and so on.
System Management: data management, data warehouse security, system and database backup and recovery, and so on.
Data warehouse modeling
Data is only a record of all business activities, resources, and enterprise results. Data Models are well-organized and abstract to those data. Therefore, it is extremely natural that data models become the best way to understand and manage enterprise businesses. Data Models play a role in guiding or planning the implementation of data warehouses. Before the real implementation starts, the combination of data models in each business field can help ensure that the results are effective data warehouses and reduce implementation costs.
Modeling of the target warehouse data refers to the process of converting a requirement into a picture and supporting the metadata that represents those requirements. For the purpose of being easy to understand, this article separates requirements from modeling, but in fact these steps are usually overlapping. Once some initial requirements are recorded in the document, the initial model begins to take shape. As the requirements become more complete, the same applies to the model.
The most important thing is to provide end users with a well-integrated and easy-to-interpret data warehouse logic model. These logical models are the core of Data Warehouse metadata. The simplicity provided for end users and the integration and combination of historical data are the key principles that should be provided by modeling methods.
Modeling of warehouse data and Modeling of operational databases
Remember the following questions during Modeling:
The data warehouse is intended for end users. In database operations, users do not directly interact with the database. They use applications that have pre-defined or fixed queries. The database of the data warehouse, especially the data mart, is very close to the end user. It usually does not have a fixed query. Therefore, it must be easier to understand.
The data warehouse should be designed for data analysis. End users process data almost directly, and there is no fixed workflow (except here and there are a few exceptions ). End users are not interested in recording data in warehouses, but they need to obtain information from them. They asked the warehouse a question, tested and verified the assumptions through the extracted information, re-constructed the event chain, and analyzed those events to detect possible patterns or seasonal trends, and make inferences and designs for the future.
The requirements of end users may be vague or incomplete. These incomplete requirements require flexible modeling processes and technologies suitable for evolutionary development. The risk of flexible evolutionary software development is the result of inconsistent and inconsistent terminals. Pay attention to these issues when developing data models.
A data warehouse is a collection of integrated databases, rather than a single database. It should be conceived as a single source of information for all decision-making support processing and all information applications of the entire enterprise. A data warehouse is an "organic" thing. If it is not big enough at the beginning, it will become larger.
A data warehouse contains data of different information topics. These subject fields can be the basis for dividing the Data Warehouse logic into several different (conceptual or even physical) databases. Data Warehouses can also contain different types of data.
A data warehouse usually contains historical data rather than snapshots of daily operation data ). Necessary legacy databases may be unavailable, or may not be captured at a sufficiently fine level, unless you spend money and make efforts to change the legacy input environment. Therefore, project activation in a data warehouse usually involves reengineering of business processes and source applications ).
Two-layer data warehouse design
Data warehouse modeling may be one of the most controversial issues in the field of business intelligence, which will not be discussed in this article. This section describes the two-layer warehouse modeling method, which is most suitable for bottom-up implementation.
Figure 3. Two-layer warehouse modeling
The data repository layer or backend layer contains all the manual model components and complete model structures that process the integrated business data from all necessary data sources. The data warehouse layer has two data modules: staging and data warehouse. The data classification module stores the initial and temporary data of all data sources for ETL processing. The data repository stores all the integrated business data, which is the final goal of the data acquisition ETL process.
The data mart layer contains all data marketplaces, which are a subset of the data repository module so that specific end user groups can be used easily in their data analysis activities.
Because data warehouse and data mart are interchangeable in many cases, it is necessary to specify their definitions based on the two-layer warehouse model. In this model, the business data warehouse is a collection of data storage database (module), and the data mart is a database in the data mart layer. Business data in the data mart can only come from the business data warehouse.
Benefits of Two-layer data warehouse design include:
Flexible and easy to maintain. You can modify the data structure of the data mart at any time according to your report requirements. However, it does not affect the data structure of databases in the data repository.
Easy to scale and integrate. The relational business data repository database is easier to scale and integrate than the non-formal (denormalized) and summary (summarized) databases in the dataset.
User-friendly. Separating the data mart from the data repository makes the design of the data mart more end-user-driven, because the data modeler does not need to consider the problem of data integration and mode scalability too much.
Better security design. The two-layer method separates data storage from Data Access Management. End users can only access the data mart authorized to them, not all data warehouses.
Better data management design. A data warehouse is designed to store historical data in the integrated business scope. However, many data marketplaces in a data warehouse do not need so much historical data. Two-layer design is a good place to store historical data.
Remember that the above two-layer warehouse design is the concept warehouse layout. For example, at the data storage database layer, the database on stage and the database can be on different servers, on the same server, or even in the same database in different modes.
Warehouse data modeling Layer
Data Modeling involves three layers: Concept, logic, and physics. In the design of a data warehouse, each layer of data modeling has its own purpose.
Concept
The advanced data model is a consistent definition of public data elements of all business theme domains and businesses, from an advanced business view to a general Logical Data design. You can give birth to a general scope and gain an understanding of your business needs. This conceptual data model is the basis of the current and future data warehouse development stages.
Logic
The Logical Data Model contains more detailed information about the business subject area. It captures detailed business requirements in the target business theme field. The Logical Data Model is the basis for the physical data modeling of the current project.
Starting from this phase, the solution applies the bottom-up method, which means that this Logical Data Model only targets the most important and urgent business theme fields.
Functions of the Logical Data Model include:
All entities and their relationships.
The specification of each object's attributes.
Rules for all primary keys and Foreign keys.
Normalization and aggregation.
Multi-dimensional data structure specifications.
Physical
Physical constraints of physical data modeling applications, such as space, performance, and physical distribution of data. The physical data model is closely related to the database system and the data warehouse tools you will use. The purpose of this phase is to design a real physical implementation.
It is particularly important to clearly separate logical modeling from physical modeling. Good logic modeling practices focus on the nature of problem domains. Logical modeling solves problems such as "what. Physical Modeling solves problems such as "how" for the model, which indicates the authenticity of the implementation in the given computing environment. Because the business computing environment changes over time, the separation of logical and physical data modeling will help stabilize the logic model from one stage to another.
Once the data warehouse is implemented and the customer starts to use it, they often generate new requests and requirements. This will start another development cycle and continue the iteration and evolution of Data Warehouse Construction. As you can see, the Logical Data Model is the activity part of the Data Warehouse. It is used and maintained throughout the lifecycle of the data warehouse. The data warehouse modeling process is indeed endless.
Figure 4. lifecycle of the Logical Data Model of the Data Warehouse
Database Modeling
Warehouse data warehouse the database stores all clean (cleaned) integrated business data from all business data sources, which is the end point of data integration and transformation in the data warehouse. Although the conceptual data model is designed based on business needs, the Logical Data Model of the data repository is designed based on the analysis of business needs and available business data sources. It is a natural checkpoint used to verify whether existing business data supports business requirements in the data warehouse project.
The database is basically a regular (normalized) relational database. Because they are source-driven, the source data model can be used to assist in the full development of the warehouse data repository model. You may need to construct the source data model by using reverse engineering (reverse engineering) technology to develop the Entity and Relation (ER) model from the existing source database. You may need to first integrate several of these models into a global model to represent the source in the form of logical integration.
Data in the data warehouse is cleaned during data conversion. It should at least be as good as the data in the source operating system. In the ETL process, real-time reference integrity and check constraints are extremely helpful for detecting data problems, and they are not efficient in the final data warehouse table.
An important feature of a data warehouse is that it contains historical data. Based on the update frequency, there are two types of historical data: Slow update and fast update. For historical data with slow updates, the update is performed using a historical sub-table with valid state and time frame data.
Transactions and Web traversal data are typical examples of fast data updates. They usually have large (Old and New) data volumes. Performance is the most important design issue for storing and updating historical data quickly. For example, there was a large volume of transaction data starting in 1999, but as shown in business requirements, only transaction data in the past three years was frequently accessed to create reports. Figure 5 shows the logic and physical partition design of an advanced transaction table.
Figure 5. logical and physical partitions of the transaction table
Data Modeling of data mart
Because the warehouse end users directly interact with the data mart, data mart modeling is one of the most effective tools to capture the business needs of end users. The data mart modeling process depends on many factors. The three most important items are described below:
Data mart modeling is end-user-driven. End users must participate in the modeling process of the data mart because they are clearly the people who want to use the data mart. Because you should expect end users to be completely unfamiliar with complex data models, the modeling technology and modeling process should be organized as a whole to make complexity transparent to end users.
Data mart modeling is driven by business needs. The data mart model is very useful for capturing business needs because it is usually used directly by end users and is easy to understand.
Data mart modeling is greatly affected by the data analysis technology. Data analysis technology can affect the type and content of the selected data model. Currently, there are several common data analysis technologies: Query and report production, multidimensional analysis, and data mining.
If you only want to provide the query and report production functions, the ER model with a regular (normalized) or informal (denormalized) data structure is the most suitable. The dimension data model may also be a good choice because it is user-friendly and has better performance. If the objective is to perform multi-dimensional data analysis, the Dimension Data Model is the only choice here. However, data mining usually works best at the available level of detail. Therefore, if a data warehouse is used for data mining, the model should contain level of detail data.
Because ER modeling is not included in this article, this section will discuss Dimensional Data Modeling, which is the most important data modeling method in data mart design.
Star and snow pattern Models
There are two basic data models that can be used in dimensional modeling:
Star mode (or model)
Snowflake Model
The star schema has become a public term used to represent a dimension model. Database designers have long used the term "star mode" to describe the dimension model, because the result structure looks like a star. Some business users are not used to the term "model", so they accept the more simple term "star model ". The terms "Star Model" and "star mode" can be exchanged.
A star model is the basic structure of a dimension model. It usually has a large central table (called a fact table) and a group of small tables (called dimension tables). dimension tables focus on fact tables in a radiation mode.
The traditional ER model has a uniform and balanced complex relationship between entity styles and entities, while the star model is completely asymmetrical. A single connection exists between fact tables in the dimension model and all other dimension tables.
Dimensional modeling usually specifies facts and dimensions after collecting business requirements. The initial dimension model is usually like a star in the External table. A fact is in the center, and multiple dimensions at the same level are surrounded. Snowflake models are the decomposition results of one or more dimensions, which sometimes have hierarchies. You can define a multi-to-one relationship between members in a dimension table as a separate dimension table to form a hierarchy.
The decomposed snowflake structure shows the dimension hierarchy. The snowflake model is easy for data modelers to understand and use by database designers for dimensional analysis. However, the structure of the snowflake looks more complex and may tend to make business users feel less comfortable with a simpler star model. Developers can also select a snowflake because it usually saves data storage. Consider a banking application, which has an extremely large account table for a dimension. Instead of storing text fields that are frequently duplicated, you can store them in a subdimension table to save a lot of space.
Although the snowflake model does not save space, it is usually not important compared with fact tables. Most database designers do not consider space saving as the main decision-making criterion for choosing modeling technology.
Figure 6. Star Schema example
Dimensions and measures
Users usually need to evaluate or analyze some aspects of enterprise business. The collected requirements must represent two key elements of the analysis: the analyzed content and the evaluation criteria used for the analysis content. The evaluation standard is called measurement (the digital attribute of fact), and the analyzed content is called dimension (the descriptive attribute of fact ). A group of dimensions and their related measurements form the so-called fact.
Dimension
The basic structure of a dimension is a hierarchy. Dimension hierarchies are used to aggregate business measurements, such as Total Revenue of Sales, at a level of detail that is smaller than the basic granularity presented in the dimensional model ). In this example, the operation is called roll-up processing. Volume processing is performed on basic facts or measures in the dimension model.
If you roll a measure to a lower level of detail, the end user can obviously perform the inverse operation (drill down), which includes viewing more detailed measures, or, you can explore lower detail-level aggregation measurements based on the dimension hierarchy.
One of the most important activities of dimensional modeling is to capture the clustering path or the aggregation hierarchy that the end user uses to perform the drill-up and drill-down operations. This process will generate a dimension model. You will expand and modify this model when you execute other modeling activities later, for example, modeling of time dimensions with slow changes, processing constraints in dimensions, and capturing cross-dimensional relationships and constraints.
Dimensional modeling is closely related to end users and business processes. In order to make the dimension model last longer and apply to more user groups, it is particularly important to build a dimension from a conceptual perspective, find the basic clustering path and clustering level that the end user community will be interested in.
Generally, you can add a measure to a well-defined fact, which has no significant impact on the model. However, this is certainly incorrect for dimensions, because the structure of the dimension hierarchy may become complex.
When you consider problem domains in a broad environment, you expect multiple clustering paths in one dimension. Partitions in the dimension hierarchy can appear at different levels. The split hierarchies can be split again later. This process may lead to complicated models, which may be too complicated for end users to handle. Please consult your end users to avoid unnecessary disputes.
An important decision that is hard to make is whether a clustering layer is an element of a (structured) hierarchy or whether it is only an attribute of a dimension. For example, is it wise or necessary to use a product packaging unit, brand, or storage type as an explicit element of a dimension path (that is, as a potential entity type in a dimension hierarchy? Or can we just consider them as product attributes?
Finding the basic clustering path in a dimension (such as the Product dimension) usually means investigating many typical relationships in a dimension.
Constructor: information analysts use these relationships to explore the constructor relationships between products and their components. For example, cost related to product components and product construction can be used to calculate product costs.
Variation relationships: changes are used to differentiate products in terms of product models, versions, implementations, component mixing, and deployment.
Changes can also be used to specify product replacement. Information Analysts use change relationships to combine related products and aggregate correlation measurements, because lower-level products may only exist for a limited period of time, or because they are often used to replace each other in a certain process (for example, when an "initial" product is out of stock and a certain version of the product is sold to the customer ).
Classification relationship: classification refers to dividing similar products into groups. Link classification is clearly the most frequent relationship between products, and information analysts are used to measure the details of the volume. Note that multiple types of classification are usually required. For example, products can be classified based on the characteristics of sales, manufacturing, reserve, or supply. Information Analysts use classification to measure the aggregation in a statistical group. The statistical group includes the total number, average number, minimum value, and maximum value.
Fact
Fact tables only contain IDs used to reference dimension tables, and measurements used to measure changes or performance of all dimension members. The next step is to organize dimensions and measures into facts. This is the process of grouping dimensions and measurements in a way that can meet specified requirements.
Several important issues should be solved in the design of fact tables:
Granularity (detailed records of facts): To effectively analyze data, it must all be at the same granularity level. Generally, you should put the data at the most detailed granularity level. This is because you cannot modify the data to the details level you have set. However, you can always roll (aggregate) data to create a coarse-grained table.
Addition (ability of measure to be summarized): a measure is divided into three categories: fully additive, nonadditive, and semiadditive ). The example of a non-sum measure is the percentage. You cannot simply add the percentages of two facts together to produce meaningful results. An example of a semi-sum measure is the balance. Although you can add the balances of the two accounts to obtain the total balance, you cannot add the two balances of the same account at two different time points. Because the balance is only added across some dimensions, we call it a semi-sum measure. Values that can be added across all dimensions are regarded as all values. When you consider the possible summary of a fact table, addition becomes important. Generally, the sum measure is the most ideal. When measures are not fully summed up, they should be considered to be divided into their atomic elements.
Key Selection: key selection in multi-dimensional data modeling is a challenge. It includes trade-off between performance and ease of management ). Key selection mainly applies to dimensions. The key you select for the dimension must be the foreign key of the fact. The dimension key has two options: You can allocate an arbitrary key or use an identifier in the operating system. A key is usually a serial number. When a new key is required, an available number is assigned.
To uniquely represent a dimension using an identifier in the operating system, you sometimes need to use a composite key. A composite key is a key consisting of multiple columns. Any key is a column, which is usually smaller than the key derived from the operation. Therefore, any key can usually execute connections faster.
The last factor in key selection is its impact on the fact table. In real-time creation, the key of each dimension must be assigned to it. If a dimension uses a key derived from an operation with a timestamp for historical data, there is no additional work in the real-time creation process. The connection will automatically occur. For any key or any historical identifier, a key must be allocated to the fact in real time during creation.
There are two ways to assign keys. One is maintenance operation and data warehouse key conversion table. The other is to store operation keys and, if necessary, store timestamps as attribute data on dimensions.
Then, the selection is between the better performance of any key and the easier maintenance of the operation key. You must evaluate the performance increase and maintenance increase in your own organization.
No matter what choices you make, you must use documents to record the process of generating them in the metadata. This information is necessary for technicians who manage and maintain data warehouses. If the tool you are using does not hide connection processing, you may need to understand this.
Now that you understand how to process dimensions and fact tables, let's take a look at real-world examples to explore how to identify dimensions and measurements from business needs. This example is only a basic analysis of a series of business problems. These business problems are defined as example requirements:
What is the average customer balance and transaction volume for this month based on the bank sub-branch?
According to the summary of sub-branches, products and regions, what is the annual net profit and interest to be paid to each customer?
What percentage of customers are profitable? Classify them by sub-branch, region, and year.
What is the total transaction volume of a customer this year?
What are the top 5 most profitable products by region?
In the past five years, what are the top five most profitable branches?
What are the population and geographical characteristics of the most profitable customers?
By analyzing these problems, we define the dimensions and measurements to meet the requirements (see table 1 ).
Table 1. dimension and measurement tables
Dimensions and measures |
Q1 |
Q2 |
Q3 |
Q4 |
Q5 |
Q6 |
Q7 |
Dimension |
? |
? |
? |
? |
? |
? |
? |
Sub-branch |
? X |
? X |
? X |
? |
? |
? X |
? X |
Region |
? |
? X |
? X |
? |
? X |
? |
? X |
Customer |
? X |
? X |
? X |
? X |
? |
? |
? X |
Product |
? |
? X |
? X |
? |
? X |
? |
? |
Time |
? X |
? X |
? X |
? X |
? |
? X |
? |
? |
? |
? |
? |
? |
? |
? |
? |
Measure |
? |
? |
? |
? |
? |
? |
? |
Balance |
? X |
? |
? |
? |
? |
? |
? |
Transaction volume |
? |
? |
? |
? X |
? |
? |
? |
Transaction count |
? X |
? |
? |
? |
? |
? |
? |
Net profit |
? |
? X |
? X |
? |
? X |
? X |
? X |
Interest paid |
? |
? X |
? |
? |
? |
? |
? |
In this case, you need to check the dimension to ensure that:
You have data to answer questions.
All measures are defined at the finest level.
Use these simplified analysis questions to determine what is included in the final star model and what is excluded:
The balances and transactions are based on the aggregation of transaction volumes, so they are derived measurements.
The calculated interest is to multiply the account interest rate by the balance. This is based on the account and month. Because interest rate is an attribute of the Account table, you need to add an Account dimension table. As you can see, the interest paid is also a derived measure.
Assume that the net profit is calculated based on (investment income)-(interest paid. Because the investment income is a bank-level measure, and the interest paid is a derived measure, the net profit is also a derived measure.
The conclusions from the above analysis are as follows:
Transaction volume is the only required measurement.
The account dimension is required to generate interest rates and investment income information.
Model metadata
In a traditional development cycle, a model is used only when it needs to be modified or data is required by other projects. However, the model must be used continuously in the repository. Repository users constantly access this model to determine the data they need to use to analyze the organization. Data Structures in warehouses are modified much faster than operational data structures. Therefore, technical users (administrators, modelers, designers, and so on) in the repository will also use your model on a regular basis.
This is the task of metadata. The model must be a complete representation of the stored data. Otherwise, it is useless to anyone.
To correctly understand the model and confirm that it meets your needs, you must access metadata that describes the repository according to easy-to-understand business terms. Therefore, you should record non-technical metadata at this time. In the design phase, you will add technical metadata to it.
At the warehouse level, you should provide a list of available items in the warehouse. This list contains available models, dimensions, facts, and measurements, because these are used as initial entry points when you start to analyze data.
Name, definition, and purpose are provided for each model. The name only provides users with things that interest users during search. It is usually the same as the fact. Define the content of the specified modeling, and the purpose is to describe what the model is used. The metadata of the model should also contain a list of related dimensions, facts, and measurements, as well as the name of the contact person, so that the user can obtain additional information when there is a question about the model.
Model Verification
Before investing a lot of time and effort to implement a warehouse database, it is a good idea to verify the model with the user, especially the data mart model. The purpose of this check is double. First, it confirms that the model can truly meet user needs. Second, check whether the user can understand the model. Remember that once a warehouse is implemented, users regularly rely on the model to access the data in the warehouse. No matter how well the model meets your needs, Your repository fails if you cannot understand the model to access data.
In this case, the verification is completed at a high level. Check the model with the user to make sure it is understandable. Together with the user, test the model by solving how you will answer certain questions specified in the requirement.
This is good because the model does not have to satisfy all user needs. This does not mean that you should stop and return to start. We expect that the first cut-in of your model will meet about 50% of your needs. Accept the 50% (or no matter how much) of the model and start the physical design. The remaining items should be sent back to the collection phase of the requirement. Either you need to better understand the requirements, or you usually need to modify and redefine them. This often leads to the addition and modification of the created model. At the same time, the effective part of the model will pass the design phase and begin to provide users with benefits.
Repeated development and continuous creation of some complete models are key elements that provide the ability to quickly develop data warehouses.
Figure 7. Warehouse data model evaluation process
During requirement verification, you will:
Check the consistency, integrity, and effectiveness of the initial Dimension Model Based on the given user requirements. Analyze the initial model with end users. This will allow the demand analyst to perform more surveys and adjust these initial models before passing them to the Requirement Modeling phase (attempts to follow the change requirements described in the model ).
Specifies a candidate data source. Create Inventory of necessary and available data sources.
Map the initial dimension model and the information end user requirements that may be configured to the specified data source. This is usually a tedious task. Source data ing must investigate the following ing problems:
Which source data items are available and which are unavailable? Should the source application be extended for unavailable ones? Can I find them using external data sources? Or should end users be notified of their unavailability and, therefore, should the range of dimension models be reduced?
Are there any other interesting data items available but not requested in the data source? Specifying available but not requested data items may expose other interesting aspects of the information analysis activity, and thus greatly affect the content and structure of the constructed dimension model.
Verify that the data mart design does not violate any business security settings. Because the data mart is designed for a specific terminal user group, it is a good idea to confirm that it only contains the necessary information of this group.
Execute the initial model size adjustment. If possible, the initial size adjustment should also investigate the capacity and performance related to the data warehouse filling.
The results of requirement verification will help you evaluate the scope and complexity of the Data Warehouse development project, and (re) Evaluate Business adjustments, including the evaluation of technology, finance, and resources. Requirement verification must be performed in collaboration with end users to expose and correct all problems in an incomplete or incorrect initial model. Requirement verification may involve building a prototype of a dimension model.
The requirement verification process will confirm or recreate end user requirements and expectations. As a result of requirement verification, you may also specify and evaluate the reengineering suggestions. At the end of the requirement verification, you will receive a (new) "signature" for the data warehouse modeling project ".
Follow-up
In the next article in this series, we will continue to explore the design and implementation of the warehouse ETL process, warehouse performance, and security.