The basic concept and process of dimension modeling

Source: Internet
Author: User

absrtact: This paper first introduces the 2 basic elements of the dimension table and fact table in the dimension model, and secondly, introduces the 4 basic steps of the design dimension model, and thirdly, around the need of a bank to realize the integration of the business value chain data, Introduces 3 key concepts in Multidimensional architectures: Data Warehouse bus structure, conformance dimension, and conformance facts.

keywords: dimension table; fact table; dimension model design process; Data Warehouse bus structure; consistency dimension; consistency fact.

0 Introduction

Unlike popular parlance, Ralph Kimball himself does not define terms such as "dimensions" and "facts". The term "dimensions" and "facts" were originally proposed in the 1960s in a joint research project sponsored by General Mills and Dartmouth University. In the 70 's, both AC Nielsen and IRI consistently used these terms to describe their data-publishing applications, which are, more accurately, the Dimension Data mart for retail data Mart. In the long term before simplicity became a lifestyle trend, early database monopolies were dedicated to simplifying the information used for analysis. They realized that no one would use it unless the database was made easy to use. Therefore, the construction idea of dimension model is produced by driving the understanding and performance as the highest goal.

1 dimension tables and fact tables

1.1 Fact Sheet

The fact table is the basic table for the dimension model, where 1.1 stores a large number of business performance measures. The measure data obtained from a business process is attempted to be stored in a single data mart. Because measure data is overwhelmingly the largest part of any data mart, you should avoid storing its copies in different parts of the enterprise. Use the term "facts" to represent a business measure. Consider a scenario where a customer is queried for a point-in-time balance in a currency of a product contract account in an organization, and a measure can be obtained at the intersection of each dimension value (customer, product contract, account, institution, currency, date). A list of dimension values gives the granularity definition of the fact table and determines what the value range of the measure is.

Figure 1.1 Example fact table

A row of the fact table corresponds to a measure, one measure is a row of the fact table, and all the measure values of the fact table must have the same granularity. The most useful fact is that numeric types such as account balances are facts that can be additive. Scalability is critical because data warehouse applications not only retrieve single-row data for the fact table. Conversely, the fact that you tend to bring back hundreds of, thousands of, or millions of rows at a time, and the most useful thing to do with so many lines is to add them up.

Of course, some facts are semi-additive, while others are non-additive in nature. The semi-additive fact is only added along some dimensions, such as the sales ratio, the cycle balance, and non-additive facts cannot be added at all, such as state. For non-additive facts, if you want to summarize a row, you have to use a count or an average, or you can print out all the fact rows one line at a time.

Measurement facts can be in the form of text in theory, but this rarely happens. In most cases, a text measure can be a description of something and taken from the value of a discrete list. Designers should make every effort to convert text metrics into dimensions, because the dimensions can be more effectively correlated with other text dimension attributes and consume much less space. The redundant text information cannot be stored in the fact table. Unless the text is unique for each row of the fact table, it should be attributed to the dimension table. True textual facts are seldom seen in data warehouses, and text facts have unpredictable content like free-text content, which is almost impossible to analyze.

All fact tables have two or more than two outer keywords (the part of the FK symbol marked in 1.1), and the outer keyword is used to connect to the primary key of the dimension table. For example, the product contract keyword in the fact table always matches a specific product contract keyword for the Product Contract dimension table. If all the keywords in the fact table match correctly with the primary key in the corresponding dimension table, you can say that the tables meet the referential integrity requirements. The fact table is accessed through the dimension tables that are attached to it.

The fact table can be divided into the Transaction fact table, the periodic Snapshot fact table and the cumulative snapshot fact table according to the granularity of the role division. The Transaction fact table is used to host transactional data, which is usually less granular, such as the facts of a product transaction transaction, the facts of an ATM transaction transaction, and a periodic snapshot fact table used to record a regular, fixed time interval of business cumulative data, usually with a higher granularity, such as a fact table for the monthly average balance of accounts The cumulative Snapshot fact table is used to record information about the entire process of a business process with a timespan, which is often rare. It is important to note that in the design of the fact table, it is important to note that a fact table can have only one granularity, and the fact that different granularity cannot be established in the same fact table.

1.2 dimension table

A dimension table is an integral part of a fact table. 1.2, the dimension table contains a textual description of the business. In a well-designed dimension model, dimension tables have many columns or properties that give a description of the rows of the dimension table. Should include as many meaningful textual descriptions as possible. It is not uncommon for a dimension table to contain 50 to 100 properties. Dimension tables tend to make a fairly small number of rows (typically less than 1 million rows), and the number of columns is particularly large. Each dimension is defined with a single primary key (the part marked with the PK symbol in 1.2), and the primary key is the basis for ensuring referential integrity between any fact tables that are connected to the same.

Figure 1.2 Sample dimension table

Dimension attributes are the basic source of query constraints, groups, and report label generation. In Query and report requests, attributes are identified by this word. For example, if a user indicates that they want to view the account balance by "product contract number" and "Institution number", then "product contract number" and "Institution number" must be available dimension attributes.

Dimension table Properties assume a significant role in the Data Warehouse. Since they are actually the source of all the interesting constraints and report labels, they are the key to making the data warehouse easy to learn and use. In many ways, a data warehouse is just a reflection of a dimension attribute. The capabilities of the data warehouse are directly proportional to the quality and depth of the dimension attributes. The more time it takes to provide detailed business language attributes, the better the Data warehouse will be. The more time you spend on a given aspect of a property column value, the better the Data warehouse will be. The more time it takes to ensure the quality of attribute column values, the better the Data warehouse is.

The dimension table is the entry into the fact table. Rich dimension attributes give a wealth of analytical cutting power. Dimensions provide the user with an interface to use the Data warehouse. The best attributes are textual and discrete. The attribute should be real text and should not be a shorthand symbol for coding. You should try to minimize the use of encoding in dimension tables by replacing the encoding with more detailed text attributes. Sometimes when designing a database, it is not certain whether a numeric data field extracted from a data source should be treated as a fact or a dimension attribute. It is often possible to make a decision that the field is a measure that contains many values and participates in the operation (when viewed as a matter of fact), or a description of the discrete values that are not much changed and participate as constraints (as the dimension attribute treats).

In the dimension type, there is an important dimension called the degenerate dimension (degenerate Dimension), which refers to simply placing some simple dimensions in the fact table rather than just doing a dimension table. The degenerate dimension is a very important concept in the domain of dimension modeling, which is very important for understanding dimension modeling, and the degenerate dimension is often combined with some other dimensions to form the primary key of the fact table. Degenerate dimensions can be used for grouping in analysis.

fusion of 1.3 dimension tables and fact tables

Now that you have understood the facts and dimension tables, consider merging the two blocks together into the dimension model. As shown in 1.3, a fact table consisting of a numeric measure is connected to a set of dimension tables that fill the description attribute-the star feature structure is often called a star join scheme. The term can be traced back to the earliest relational database period.

Figure 1.3 Fact and dimension tables in the dimension model

The first thing to notice about the dimension scheme used in it is its simplicity and symmetry. Clearly, business users benefit from simplicity because the data is easy to understand and navigate.

The simplicity of the dimension model also brings performance benefits. The database optimizer can more efficiently handle these simple scenarios with fewer connection relationships. The very strong approach that the database engine can take is to first centralize and constrain (filter) the dimension tables that have sufficient indexes, and then process all the fact tables with the Cartesian product of the dimension table keyword that satisfies the user's constraints. Surprisingly, by using this method only one index of the fact table is used, you can calculate any n connection results to the fact table.

Finally, the dimension model can be extended naturally to suit the needs of change. The predictable framework of the dimension model can withstand the challenges of unpredictable changes in user behavior. Each dimension is equal, and all dimensions are peer portals into the fact table. This logic model does not have a built-in bias on the form of a desired query, and there is no business problem to ask this month, as compared to next month's priority considerations. No one would want to adjust the design if business users are doing business analysis in a new way.

Optimal granularity or atomic data has the best dimensions. The aggregated atomic data is the most expressive data. Atomic data should be the basis for each fact table design to withstand special attacks caused by queries that business users cannot foresee. For a dimension model, it is entirely possible to add a new dimension to the scheme, as long as its value is unique for each existing event. Similarly, new unforeseen facts can be added to the fact table, as long as the level of detail is consistent with the existing fact table. A previously existing dimension table can be supplemented with a new, unanticipated attribute, or an existing dimension row can be decomposed at a lower level of granularity from a forward point-in-time perspective. In each case, you can simply add a new row of data to the table or execute a SQL ALTER table command to make the appropriate modifications to the existing table. Data does not need to be reloaded, and all existing data access applications can continue to run without producing different results.

2 dimensional modeling design process

In this paper, the design of dimension database is carried out in the way of four steps with a certain order in Figure 2.1.

Figure 2.1 Four step dimension design process

2.1 First step to select business processing

A business process is a natural business activity in an organization that is generally supported by a source system. Listening to users ' opinions is the most efficient way to pick a business process. In selecting the business phase, the data Model designer needs to have a global and developmental perspective, and should understand the overall business process and choose the business process from a global perspective.

The important thing to remember is that the business process mentioned here does not refer to the business unit or function. By focusing on the business process rather than the business sector, it is possible to submit consistent data more economically within the organization. If the established dimension model is bundled with the department, it is not possible to avoid the possibility of a copy of the data with different tags and terms. The flow of multiple data to a separate dimension model can make users vulnerable to problems with inconsistencies. The best way to ensure consistency is to publish the data once and for all. A single release process can also reduce the amount of ETL development, as well as subsequent data management and disk storage burdens.

2.2 Second step defining granularity

The granularity definition implies a clear explanation of what is actually represented in each fact table row. Granularity passes information about the degree to which the details associated with the fact table measure are reached. It gives the answer to the following question: "How to describe a single row of the fact table?".

Particle size definition is a crucial step that cannot be underestimated. The dimension model should be developed in order to define granularity as a priority when obtaining the most atomic information for business processing. Atomic data is the most detailed information collected, and such data cannot be further subdivided. By assembling data at the lowest level, most atomic granularity shows its value in applications with multiple front ends. Atomic data is highly dimensional structured. The more subtle and atomic the fact metric is, the more accurate it is to know more things, all the things that are exactly known to be converted to dimensions. At this point, atomic data can be said to be an excellent match for the dimension method.

Atomic data provides maximum flexibility for analysis because it accepts any possible form of constraint and can appear in any possible form. The detail data of the dimension model is steady and ready to accept special attacks from business users.

Of course, you can always define a higher level of granularity for business processing, which represents the aggregation of the most atomic data. However, simply choosing a higher level of granularity means limiting yourself to a dimension that is less or more likely to be smaller in detail. A model with less granularity can be easily attacked by unpredictable user requests that go deep into the details of the content. Aggregating summary data plays a very important role as a means of tuning performance, but it is definitely not a substitute for user access to the lowest level of detail. Unfortunately, some pundits have been vague about this. They claim that the dimension model is only suitable for summative data, and that it criticizes the perception that the dimensional modeling approach can meet the forecasted business needs. Such misunderstandings slowly fade along with the advent of detail-based atomic data in the dimension model.

2.3 The third step of the selected dimension

The question that the dimension raises is, "How will the business people describe the data that is obtained from the business process?" The fact table should be dressed up with a set of rich descriptions that take a single value in each measure context and represent all possible situations. If the granularity of the content is clear, then the determination of the dimension is generally very easy. With the selection of dimensions, you can list the discrete text attributes that make each dimension table plump. Examples of common dimensions include dates, products, customers, accounts, institutions, and so on.

2.4 The fourth step to determine the facts

The fourth and final step in the design process is to carefully determine which facts appear in the fact table. The determination of facts can be done by answering the question "What content to evaluate". Business users have a strong interest in the analysis of these business processing performance metrics. All the information that is available for selection in the design must meet the granularity requirements defined in step 2nd. Facts that are obviously of different granularity must be placed in a separate fact table. It is generally possible to create a fact table from the following three angles [2]:

1, for a specific behavior action, to establish a minimum unit of behavioral activity as the granularity of the fact table. The definition of the minimum activity unit relies on the analysis of business requirements. such as the user's one page click Behavior, a site login behavior, a phone call record. This fact table is mainly used to analyze the data from multiple dimensions, the occurrence of behavior, mainly for the distribution of business, performance evaluation and comparison.

2. The state of the current time for an entity object. By storing a snapshot of the entity object at different stages, such as the balance of the account, the number of products the user has, and so on, we can count the key quantitative indicators of the entity objects in different life cycles.

3, for the business activities in the important analysis and tracking objects, statistics in the entire enterprise different business activities of the occurrence. Members, for example, can perform or participate in a number of specific behavioral activities. This fact table is a summary and generalization of the two fact tables. It is mainly used for tracking and investigating the activity objects in our business.

3 Data Warehouse bus structure

Business and IT organizations are generally interested in the integration of different business processes. The desire of low-level business analysts in this regard may not be urgent, but those at the higher levels of management are well aware that data viewing across the business is necessary to improve performance. Many data warehousing projects focus on the end-to-end perspective to better understand the management needs of customer relationships. 3.1, in a large state-owned bank, in the business value chain of product operations, including many related business processing, such as marketing support, product operations, risk control, financial performance and many other business processes.

Figure 3.1 Business Value chain

If the dimensions of these business processes are modeled separately, independent data marts are created, and the data marts do not share a common dimension, then problems arise, the data mart becomes an isolated bazaar, it cannot be combined into a data warehouse, and a consistent dimension is formally proposed to address this problem. Figure 3.2 shows a logical representation of this dimension sharing scenario.

Figure 3.2 Dimension sharing between business processes

Shared public dimensions are absolutely decisive for designing data marts that can be integrated. Doing so allows performance metrics from different processes to be combined into a single report. The implementation process is to use multi-channel SQL to query individual markets separately, and then to impose outer joins on the query results based on a common dimension attribute. This connection, commonly referred to as cross-probing (Drill Across), is straightforward in cases where the dimension table attribute is identity.

The bus structure is the most basic element in the process of processing a set of related business in various locations into a comprehensive data warehouse.

3.1 Data Warehouse bus structure

Obviously, it is daunting to build an enterprise data warehouse in a single step, but dividing it into isolated fragments to build will frustrate the overriding goal of consistency. To enable the data warehouse to run successfully over the long term, there is a need for an architecture that can incrementally build an enterprise data warehouse. One of the methods advocated here is the Data Warehouse bus structure.

By defining a standard bus interface for a data warehouse environment, a separate data mart can be implemented by different groups at different times. As long as this standard is followed, independent data marts can be inserted together and effectively coexist. All business processes will create a family of dimension models that share a comprehensive set of consistent, common dimensions, 3.3.

Figure 3.3 Data Warehouse bus structure

The Data Warehouse bus architecture provides a reasonable way to decompose enterprise Data Warehouse planning tasks. Within a relatively short period of time in the architecture establishment phase, the development team designed a set of standardized dimensions and facts that have a unified interpretation across the enterprise. In this way, the framework of the data architecture is set up. The development team can then go all out to implement an independent data mart that is developed in an iterative way, in strict accordance with the architecture. As independent data marts are put into use, they are built together like blocks of bricks. In a sense, there is a need to have enough data marts to bring good prospects for an integrated enterprise Data warehouse.

The bus structure enables data warehouse managers to gain two advantages. On the one hand, they have an architectural framework that guides the overall design and divide the problem into a byte-metered data mart block that can be implemented according to a specific time frame. On the other hand, each data mart development team follows the guidance of the system and can work asynchronously with relative independence.

3.2 conformance Dimension

After understanding the importance of the bus structure, it is now possible to further develop the consistency standard dimension of the cornerstone role of the Data Warehouse bus. The consistency dimension is either the same or a subset of dimensions with the best granularity and detail in a strictly mathematical sense . For example, if you establish a month dimension session, the various descriptions of the month dimension must be exactly the same as in the date dimension, and the most common practice is to establish a monthly dimension of the view on the date dimension. This allows the month dimension to be a subset of the date dimension, which can be consistent during subsequent drillthrough operations.

Consistent dimensions have consistent dimension keywords, consistent attribute column names, consistent attribute definitions, and consistent attribute values (which will be translated into consistent report labels and group identities). If the tag of a property tag is different or contains a different value, the dimension table is not consistent (not handled as consistent). If the customer or Product dimension is configured in a non-consistent manner, either the scattered data marts cannot be used together or, more seriously, trying to use them together will produce invalid results.

A consistent dimension appears in several different styles. At the most basic level, a consistent dimension means identical content to each of the possible fact tables that are connected to them. Connect to a product service contract the de facto Date dimension table is the same as the Date dimension table that is connected to the product service account balance in fact. In fact, a consistent dimension may be the same physical table within a database scope. However, based on the typical complexity of a data warehouse technology environment with multiple database platforms, dimensions are more likely to have copies at the same time in each data mart. In either case, the date dimension of the two data marts will have the same number of rows, the same keyword value, the same property label, the same attribute definition and the same property value, and so on. Similarly, there are consistent data content, data interpretation and user presentation.

3.3 Consistency Facts

So far, we've discussed the central task of building a consistent dimension to keep the data mart together. This covers a great deal of effort in the development of data Warehouse migrations, and the remaining effort is devoted to establishing a coherent factual definition.

Often, metrics like profit, economic capital, product coverage, customer satisfaction, and other key indicators (KPIs) that need to be shared at the enterprise level are the facts that must be consistent. Generally, fact table data is not explicitly copied across data marts. However, if the fact exists in more than one position, then the definition of supporting these facts must be the same as the equations (formulas), and if they are treated as if they were the same, they would need to be defined in the same dimensional context if they were the same. At the same time, it has the same unit of measure across the data marts. You must accept the constraints of the specification in the data naming practice, and if it is not possible to make the facts exactly the same, you should give different names to different interpretations. This can reduce the likelihood of using incompatible facts in the calculation.

4 Summary

In this paper, as a summary of dimensional modeling, based on the theory of dimension modeling and the practice of dimension modeling in an enterprise, the basic concepts of fact table, dimension table, data Warehouse bus structure, consistency dimension, consistency fact, and dimension modeling are introduced.

5 references

[1]. Ralph Kimball, Tanmingkin translated. Data Warehouse Toolbox: A complete Guide to Dimensional Modeling (second edition), Electronic industry Press, 2003.

2 reference to 3 different types of fact tables

The basic concept and process of dimension modeling

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.