Discussion on the dimension modeling of Data Warehouse __ Data Warehouse

Source: Internet
Author: User
0x00 Preface

The following content, is the author in the study and work of some summary, of which the concept of most of the content from the book, the practical content mostly from their own work and personal understanding. Due to the lack of qualifications, there will inevitably be many mistakes, I hope to criticize. Overview

The Data warehouse contains a lot of content, which can include architecture, modeling, and methodologies. For specific work, it can include the following: A data architecture system centered on Hadoop, Spark, hive, etc. Various data modeling methods, such as dimension modeling. Auxiliary systems such as scheduling system, metadata system, ETL system and visualization system.

For the moment, regardless of the scope of the Data Warehouse, in the Data Warehouse system, the core position of the data model is irreplaceable.

Therefore, the following is a detailed description of the typical data modeling representative: Dimension modeling, the relevant theory of it and the actual use of the in-depth analysis. article Structure

This article will be described in the following order: First, compare the classic and common data Warehouse model, and analyze its advantages and disadvantages. The basic concepts and related theories of dimension modeling are introduced in detail. To get a better understanding of what dimension modeling is, I'm going to simulate an electric arcade that we all know well, and use the theory we've talked about before modeling. The theory and the actual work scene will have the difference after all, this piece, I will share the enterprise in the actual application to make the choice. 0x01 Classic Data Warehouse model

Here are four types of data warehouse models, among them, the first three models correspond to three books: Data Warehouse, Data Warehouse toolbox and large Data warehouse and database vault, all three books are in Chinese, so coincidentally, I only have three data warehouse books, Exactly corresponds to these three theories.

The anchor model I am not particularly familiar with, put here for reference. Entity Relationship (ER) model

The parent of Data Warehouse Immon the method of designing a 3NF model from the height of the whole enterprise, the data model described by entity plus relation describes enterprise business structure, which conforms to 3NF in the paradigm, and differs from 3NF in OLTP system, in which the 3NF on the Data Warehouse is based on the topic-oriented abstraction of the enterprise. Rather than an abstraction of the entity object relationship for a particular business process, it is more data-oriented consolidation and consistent governance, as Immon would like to achieve: "Single version of the Truth".

But there is also a challenge to build with this approach: a comprehensive understanding of the enterprise business and data implementation cycle is very long to the ability of the modeling staff is also very high two, dimension model

Dimension model is the Data Warehouse field another master Ralph Kimball advocated, his "the Datawarehouse toolkit-the Complete Guide to Dimensona modeling, the Chinese name of the Data Warehouse Toolbox," Is the most popular data warehouse in the field of mathematical modeling Classic. Dimension modeling to analyze the requirements of decision building model, the construction of the data model for the analysis of demand services, so it focuses on how users can quickly complete the analysis requirements, but also a better response performance of large-scale complex queries.

Typical representations are the star models we know better, and the snowflake models that are applicable in some special scenarios. Third, Datavault

Datavault is a model methodology created by Dan Linstedt, which derives from the ER relational model, and is designed to achieve data integration and is not directly used for data decision analysis. It emphasizes the establishment of an auditable base data tier, it emphasizes the historical traceability and atomicity of data, and does not require the excessive consistency processing and integration of data, but also organizes the enterprise data according to the topic concept, and introduces further paradigm processing to optimize the model to deal with the expansion of the source system change.

It mainly consistsof: Hub (key core business entity), Link (relationship), Satellite (Entity attribute) three parts. four, the anchor model

The anchor model is made up of Lars. Rönnbäck is designed to design a highly scalable model, the core idea: All extensions are only added rather than modified, so it will model specifications to 6NF, basically into the K-V structure model.

The anchor model consists of:anchors, Attributes, Ties, Knots , and the relevant details can be referred to anchormodeling-agile information modeling in evolving Data environments 0x02 Dimension Modeling One, what is dimension modeling

Dimension model is the Data Warehouse field Master Ralph Kimball advocated, his "Data Warehouse toolbox", is the Data Warehouse engineering field most popular warehouse modeling classic. Dimension modeling to analyze the requirements of decision building model, the construction of the data model for the analysis of demand services, so it focuses on how users can quickly complete the analysis requirements, but also a better response performance of large-scale complex queries.

Let's explain what dimension modeling is in a different way. The children's shoes who have studied the database should know the star model, and the star model is a typical dimension model. When we do dimension modeling, we build a fact table, which is the center of the star model, and then there's a bunch of dimension tables, which are outward-diverging stars. So what is a fact table, what is a dimension table, and the following are specifically explained. Two, the basic elements of dimensional modeling

There are some more important concepts in dimension modeling that understand these concepts and basically understand what dimension modeling is.

1. Fact Table

An action-type event occurring in the real world that produces measurable values that are stored in the fact table. From the lowest granularity level, the fact table row corresponds to a metric event and vice versa.

Well, having read this sentence, it is not easy to understand what the fact table is.

such as a purchase behavior we can understand as a fact, the following example. P1.png

The order sheet in the diagram is a fact sheet, and you can understand that he is an operational event that happens in reality, and each time we complete an order, we add a record to the order.

We can go back and look at the characteristics of the fact table, where there is no actual content in the dimension table, which is a collection of primary keys that correspond to a record in the dimension tables, respectively.

2. Dimension table

Each dimension table contains a single primary key column. The primary key of the dimension table can be used as a foreign key to any fact table associated with it, and of course, the description environment for the dimension table row should correspond exactly to the fact table row. Dimension tables are generally wide and are flat, non-standard tables that contain a large number of low grained text attributes.

The user table, merchant table, and timesheet in our diagram are all part of the dimension table, which has a unique primary key and then holds detailed data information in the table. 0x03 Practice

Here we will take the electrical business as an example, in detail the modeling method of dimensional modeling, and for example if the use of this model (this is still very important). first, the business scene

Let's say we work on an electric dealer's website, like a treasure, a certain east. We need to model the business here. Below we analyze a few business scenarios: the most typical scenario of the electricity merchant's website is the user's purchase behavior. The initiation of a purchase requires the participation of these individuals: the purchaser, the merchant, the commodity, the time of purchase, and the amount of the order. A user can initiate a lot of buying actions.

Well, based on these points, let's design our model. second, the model design

The following is the data model that we designed, and basically the same as before, except for the English, mainly for the back of the time to write SQL. [Uploading p3_414705.png ...]

I'm not going to explain the effect of each table anymore, so I'm just going to say why.

First, let's think about what we would normally do if we didn't design this way.

If it is me, I will design the following table. Believe it or not, I can list 50 fields. In fact, I personally think how to design such a table has its rationality, whether we are right or wrong, to say the pros and cons of both. P3.png

First of all, our dimension model: Data redundancy is small (because a lot of specific information exists in the corresponding dimension table, for example, the user information is only one) structure clear (table structure at a glance) easy to do OLAP analysis (data analysis used to be very happy) increase the cost of use, such as query to correlate multiple table data inconsistencies, For example, when the user initiates the purchase behavior, the data is inconsistent with the data stored in our dimension table.

Besides, we have the advantage and disadvantage of this fat table: business intuitive, in doing business, this kind of table is particularly convenient, directly to the business. Easy to use, easy to write SQL. Data redundancy is huge, really big, in hundreds of millions of of the user's size, his order behavior will be very scary grain stiffness, anything written dead, this table is too low reusability. iii. use of examples

The data model must be built to serve a better application, so let me give you an example to actually feel how to use our model.

demand : Find the total price of LV branded goods purchased by male users in Royal Park in 2016.

Implementation :

  SELECT
    SUM (Order.money) from order
    ,
    product,
    date, address
    ,
    user,
  where
    date.year = ' 2016 '
    and user.sex = ' male ' and
    address.province = ' Royal Park ' and
    product.name = ' LV '
0xFF Summary

Dimensional modeling is a very good modeling method, he has a lot of advantages, but we in the actual work is also difficult to fully follow its way to achieve, will have a choice, for example, for the business we will still need some wide tables, and sometimes there will be a lot of data redundancy.

Author:Dantezhao , Jane Book | CSDN | GITHUB

Personal homepage: http://dantezhao.com
The article may reprint, but must indicate the article original source and the author information by the hyperlink form

Author: Mudongus
Link: https://www.jianshu.com/p/17baa9f96ca7
Source: Jianshu
Copyright belongs to the author. Commercial reprint please contact the author to obtain authorization, non-commercial reprint please indicate the source.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.