Data Modeling for Data Warehouses

Source: Internet
Author: User

What is the difference between OLTP and data warehouse?

In
In daily life, we need to use a large number of applications to generate new data, change data, and delete data. Of course, in most cases, we need to check and analyze data. Just imagine sending and receiving
Email
Simple application. We have stored the address information and may have stored some documents. You can decide whether to store emails that have been sent, but you may delete them after a period of time.
All emails sent. So how can we deal with the addresses that have been deleted or modified some time ago? We will never see them again.

Email
Most programs are not very complex databases, but they can be viewed as a single-user environment.
A simple example of OLTP (online transaction processing system. It uses all the so-called Data Access Operations
Crud (create, read, update, and delete ). When the data storage reaches a certain level, the size will almost remain unchanged, because expired data can be deleted from the storage.

A data warehouse is a different type of application. It is not used to run the current operation, such as sending an email. It is used to analyze data and discover new values from existing data. It is mainly used to predict future situations. A data warehouse is not a general structure for solving all problems. It must be concentrated in a certain problem area, such as aviation services and customer benefits.

The data warehouse also has an interesting aspect, that is, the database itself is growing steadily. The data is not deleted or changed. We do not need to place redundant data out of the database (because the data added to the warehouse has gone through the data purification process, this process checks the correctness of the data) to reduce complexity while enhancing the read operation performance.

To analyze data in a data warehouse, the data is stored in a multi-dimensional structure called the star schema. If the star mode is extended, the snowflake mode is obtained. This White Paper will explain how to use IBM Rational Rose for star mode modeling and snowflake mode modeling.

Flight service data mart example

To better explain how to model a data warehouse, this White Paper uses a simple data mart example (a data warehouse or a part of a data warehouse ), to analyze the behavior and satisfaction of passengers taking the Happy flying and landing flight.

We will store the passenger information and relevant data for each flight, the menus selected, and the satisfaction of the passengers with the flight.

 



Back to Top

Data Warehouse Glossary

The data warehouse introduces new terms and extends the Glossary for data modeling. To make the description of this article complete, I will introduce the most common terms below.

Data Warehouse

A data warehouse is a collection of data that supports management decisions. Data is subject-oriented, integrated, hard-to-lose, and time variable.

A data warehouse is a collection of snapshots for all operating environments and external data sources. It does not need to be very accurate, because it must be extracted from the operating environment at a specific time.

Data mart

Data Warehouses are only limited to regions with a single topic, such as customers, departments, and locations. A data mart can depend on a data warehouse when obtaining data from a data warehouse, or it does not depend on a data warehouse when obtaining data from an operating system.

Fact

Fact is an information unit in a data warehouse and a unit in a multi-dimensional space. It is restricted by analysis units.

Facts are stored in a table (when a relational database is used) or a unit in a multi-dimensional database.

Each fact includes basic information about facts (income, value, satisfaction record, etc.) and is related to dimensions.

In some cases, when all the necessary information is stored in dimensions, the fact is that the data warehouse has sufficient information. We will discuss the lack of facts later.

Dimension

A dimension is the axis of a coordinate system bound to a space defined by a coordinate system. The coordinate system in the data warehouse defines data units, including facts.

An example of a coordinate system is a Cartesian (Cartesian) Coordinate System with X and Y dimensions.

In a data warehouse, time is always one of the dimensions.

Data Mining

The process of discovering new information in the data warehouse is called data mining, which is not obtained from the operating system.

Analytic Space

An Analytic space is a quantitative amount of data in a data warehouse. It is used for data mining to discover new information and support management decisions.

Slice

A technology used to restrict an analysis space in a dimension to a data subset in a data warehouse.

Block Cutting

A technology used to limit the analytic space of multiple dimensions to a data subset in a data warehouse.

Star Mode

A multi-dimensional analysis space using relational databases is called the star mode.

The star schema will be further discussed later in this White Paper.

Snowflake Mode

For whatever reason, when dimensions in the star mode need to be normalized, the star mode evolves to the snowflake mode.

 



Back to Top

Star Schema modeling using IBM Rational Rose

The basic form of the star mode must implement multi-dimensional space (often called blocks) to use the basic functions of relational databases.

First, we need to understand the multi-dimensional space.

Multidimensional analysis space

The square in the ry refers to a three-dimensional space in which each dimension has the same size. Imagine a cube. Each dimension has three units. We can get 33 = 27 units with the same structure.

Figure 1 a square with the X, Y, and Z dimensions

The multi-dimensional analysis space (or data warehouse square) and the square in the geometric space only have the difference in details.

  • Dimensions are not limited to three dimensions. However, it is not easy to process cubes in many dimensions. This will limit most implementations to six or seven dimensions. Don't look forward to using a graph to represent more than 4 dimensions-if you are lucky to find a way, don't forget to tell me.
  • Dimensions do not have the same scale and unit. The scale ranges from several units to several millions of units. A unit can be a day, a customer, or a department.
  • Unit, which is equivalent to a sub-square (1 × 1 × 1) and contains facts.

Figure 2 A Three-dimensional data cube

Data cubes require a large amount of memory to store all the facts. No matter whether it contains facts, you must reserve units.

This is why relational databases and star models are used. They optimize storage and maintain the flexibility of the data structure.

Star Mode

The basic idea of the star schema is to maintain the multi-dimensional functions of cubes and increase the flexibility of small-scale data storage.

Figure 3 A Star Schema

In Figure 3, the star mode uses the fact flight to represent a four-dimensional block (passenger, menu, flight schedulet, and time ). Basically, a fact must specify a dimension to put it into the cube unit.

The dimensions in our example are:

  • Passenger, which describes each passenger in a flight, specified by the frequent flight number (frequent flyer number. Passengers who are not often on airplanes are not part of the data warehouse.
  • Flight Schedule refers to the schedules of all regular flights.
  • Menu is a menu for flight. Data Mining is important only when menus are classified.
  • Time refers to the flight time.

The fact that flight describes a passenger's choice of menu on the only one-way flight of time.

The analysis space can be a complete square, or we can split the analysis space into small pieces based on dimensions.

Each dimension is described based on an object, which can be represented by a class. These classes are the names of related business topics. This is important for the successful establishment of a data warehouse, because the warehouse users (managers, analysts, and markets) are not very familiar with the terms of information technology.

The fact itself is another object of business intelligence, which is still represented by classes.

Fact indicates each dimension. The association between facts and dimensions is often arbitrary. This means that each fact corresponds exactly to a unit in a single dimension, and each unit in a dimension (passenger, time, etc) can be associated with any number of facts (including 0 facts ).

Using Rational Rose to convert an object model to a data model completes the implementation of the star mode. Here we can see the converted result.

Figure 4 star schema using Rational Rose

In Figure 4, The automatically created primary key and foreign key constraints are not displayed.

The dimension in the star mode is an independent table. When the object model is converted to a data model, Rational Rose can generate a primary key for the dimension.

A fact table is a dimension in which keys are migrated from a dimension table. When a data model is generated, Rational Rose can generate a foreign key.

In star mode, slice and slice are restrictions on the dimension (select ). This is a runtime issue, not a modeling issue, but the model must distinguish its needs.

Snowflake Mode

The basic star schema does not meet all data mining needs. We need more complex dimensions, such as time. Analysts want to identify the Pattern Based on weeks, months, and quarters.

Dimensions must be normalized. We do not need redundant dimension tables, which only makes data slices more complex. The pattern we get in this process is called the snowflake pattern.

Let's look at a simple snowflake model example. We normalize the time dimension to weeks, months, and quarters.

Figure 5 normalized time dimension

We want to use the additional canonicalized dimension to slice cubes: weeks, months, and quarters. In this example, we assume that the quarter is a parallel level of the month, which means that we cannot assume that the quarter is an aggregation of several months. For this reason, we will use a standardized table (a simple addition to OLAP queries) to pre-select the time dimension.

The final snowflake mode adds a canonicalized dimension.

Figure 6 snowflake pattern with time and fact flight with fan dimension

Of course, all dimensions can be normalized as in the time example, which leads to the emergence of a more complex data mart model.

The implementation model (Data Model) developed by Rational Rose from the snowflake model is complete.

Figure 7 data model of snowflake pattern with fan time dimension

The created constraints are not shown in the figure.

In snowflake mode, slices can exist, not only in the Basic time dimension, but also in the normalized week, month, and quarter dimensions.

Many-to-many relationship

During a flight, we not only eat one meal. You may have to dine multiple times during a long flight. In this case, we believe that the fact that the flight and menu dimensions are not one-to-many associations. We must use many-to-many associations. However, this association cannot be implemented in the star mode.

A special form of snowflake pattern is to use a necessary data structure to meet this requirement.

First, we change the model to multiple-to-multiple associations between facts and dimensions. Use Rational Rose, which is only a change to the associated base.

Figure 8 multi-to-multi-dimensional star mode of menu

We cannot implement many-to-many associations in relational databases. Another snowflake mode is required for multi-to-many Association.

In, let's take a look at a part of the developed snowflake model, which handles multiple to multiple dimensions.

Figure 9 the snowflake mode solves the problem of multi-dimensional menu

Rational Rose generates an additional dimension table flightmenu, which refers to the menu dimension and flight fact.

The relationship is used to resolve multiple-to-multiple associations.

For an architect of the snowflake model, the most important thing is to identify many-to-many relationships. The simple object view may make the designer understand the concept, and the generated data view will help to further explore the implementation issues.

Level

Data mining can discover information from data hidden on the operating system surface. One problem we want to know is the dependency between the menu and passenger statistics.

The passenger statistical data can be built at the passenger dimension level. Passengers can group by zip code and then by country.

Figure 10 passenger hierarchy

Layers are specified by using aggregation. Aggregation defines the included content. Country contains the zip code, which contains multiple passphrases.

Finally, the aggregation is implemented by using a foreign key.

Figure 11 aggregation of passenger dimensions in Snowflake Mode

The generated constraints are still not shown in the figure.

With aggregation, dimensions can be used at any defined level. The analysis space can be sliced by passpsenger, zip code, or country.

Consistent dimension

As data warehouse architects constantly add details, the snowflake model becomes increasingly complex. Therefore, the design process must be stopped to some extent to keep the data warehouse running well.

The star or snowflake pattern is still concerned with the fact that flight is used in this example. What is the complex relationship?

We must design their own models for every fact. If we want to perform complex queries, they must have a common dimension-we call it a consistent dimension.

Let's use pilot as a dimension and pilotflight as a fact to define the second star schema. We also need to use the additional flight schedule and time dimensions.

Figure 12 pilot star mode

The second mode can be used independently or in combination with passenger mode to query passenger satisfaction Based on the pilot dimension using the same dimension.


Figure 13 consistent dimension Time and flight schedule

Even in the simple structure of a data warehouse with the same dimension, the relationship between pilot and paslot is also simple.

When developing a data model, the Data Warehouse combines a large number of small star models with the snowflake model to form a large data warehouse model.

Fact and Dimension Data

We want to evaluate the passenger satisfaction rate for flight. You can use a few unsatisfactory levels for evaluation. Rating records are stored in the fact table flight as an attribute (column ).

If we want to obtain an average record, we must define a value for the record for calculation. We can divide records into 0 to 10 levels. In this way, an average record can be obtained. The average value should be stored in the dimension table for simple slicing, where we only want to perform one-dimensional slicing.

Rational Rose generates implementation Attributes Based on the Data Type of the target database. The object model is used to define the data source of the database.

 



Back to Top

Conclusion

IBM Rational Rose is the best tool for designing data warehouses.

The object model defines the global structure of the schema and the entire data warehouse including the data source. It represents the view objects in the data warehouse and hides the implementation details.

A Data Model is an implementation model of a data warehouse. The data model can be generated from the object model, and vice versa.

Rational Rose is the ideal tool for designing the star and snowflake modes. Rational Rose is flexible and flexible enough to support any concept of a data warehouse. It also provides the features required by data architects and database administrators to adjust the data warehouse.

Rational Rose provides powerful motivation for analyzing business and developing data warehouse requirements.

References

  • For more information, see the original article on the developerworks global site.

About the author

Davor Gornik has authored this article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.