Data warehouses-fact tables and dimension tables

Source: Internet
Author: User

This article mainly refers to the following article:
Http://www.cnblogs.com/47613593/archive/2009/02/20/1394581.html
http://jackwxh.blog.51cto.com/2850597/827968

1. The difference between a data warehouse and an operational database

The physical model of a data warehouse differs greatly from the physical model of a common operational database. The most obvious difference is: the operation of the database is mainly used to support the real-time operation, the performance and quality requirements of the database is relatively high, in order to prevent "garbage in,garbage out", usually design the operational database to follow the constraints of several paradigms, unless in a few cases for performance compromises, Can be redundant, and data warehouse data is derived from the data generated by the immediate operation, rather than directly from the immediate operation, so its data quality is guaranteed by the operational system, rather than a few paradigms to ensure. And in order to better track historical information and produce reports faster, there are a lot of redundant fields in the physical model of the Data Warehouse.

2. Basic concepts of fact tables and dimension tables

Simply speaking, the dimension table is the angle (dimension) in which you observe the object, and the fact table is what you want to focus on.
For example, to analyze the sales of products, you can choose to analyze by product category, or by time, such as the press. The analysis forms a dimension. This will have two dimensions: Product category and time.
Here are the two dimension table structures:
Product dimension table: prod_id, Product_Name, Category, Color, Size, price
Time dimension table: TimeKey, Season, year, Month, Date

The fact table is a result table generated by a dimension after data aggregation, for example:

Sales Fact Sheet: prod_id (Reference Product dimension table), TimeKey (Reference Time dimension table), SalesAmount (total sales, in currency), Unit (volume)


The recognition of fact data and dimension data must be based on specific subject matter. A fact table, a measure (measure) that stores facts, and a foreign key value that points to each dimension. The dimension table is used to hold the metadata of the dimension, that is, the description information of the dimension, including the hierarchy of the dimension and the member category.

The resulting data, the fact table usually contains a large number of rows. The main feature of the fact table is the inclusion of numeric data (facts), which can be summarized to provide the relevant units as historical data, each fact table contains a multiple-part index that contains the primary key for the correlation latitude table as the foreign key, and the dimension table contains the attributes of the fact record. The fact table should not contain descriptive information, nor should it contain any data other than the numeric Measure field and the related indexed field that makes the corresponding item in the fact and latitude table.
There are two types of measures included in the fact table: one is a measure that can be accumulated, and the other is a non-cumulative measure. The most useful measure is a cumulative measure, and its cumulative number makes sense. Users can get summary information by accumulating measures, such as the ability to summarize sales of specific items for a set of stores within a specific time period. Non-accumulative measures can also be used in fact tables, and single summary results are generally meaningless, for example, when measuring temperatures at different locations in a building, it is meaningless to accumulate the temperature of all the different positions in the building, but averaging is meaningful.
In general, a fact table is associated with one or more latitude tables, and a user can use one or more dimension tables when creating a cube with a fact table.
A dimension table can be viewed as a window into which the user analyzes the data, and the latitude table contains the characteristics of fact records in the fact table, some features provide descriptive information, and some attributes specify how the fact table data is aggregated to provide useful information to the analyst, and the dimension table contains a hierarchy of attributes that help summarize the data. For example, a dimension table that contains product information typically contains hierarchies that divide products into categories such as food, beverages, non-consumer goods, and each of these products is subdivided more than once until each product reaches the lowest level.
In a dimension table, each table contains fact attributes that are independent of the other dimension tables, for example, the Customer dimension table contains data about the customer. Column fields in a dimension table can divide information into hierarchical levels of structure.

3. Design principles for fact tables and dimension tables


Fact tables are used to store the skeleton content of a theme. Take the daily workload as an example, the workload may have the following properties: Work date, staff, work hours, overtime, work nature, whether the field, work content, auditor. So what is the backbone content? It is easy to see the length of work, overtime is the backbone, that is, the basic content of the workload theme, then the date of work, staff, the nature of work, whether the field, work content is the backbone of information? Careful analysis of the characteristics will find, date, personnel, nature, whether the field can be classified, such as the date of the year-month-day level, the staff also have a subordinate relationship, the field and normal work is also two types of work attendance record, and the length of work and overtime is not such a sense. So we generally list the attributes that can be categorized, become dimension tables, and maintain the referential relationship between facts and dimensions in the fact table.

All in all, the same tables that establish the primary foreign key relationship are the same. But the establishment of a dimension table is a hierarchy (though not a must, but also a typical feature), and the fact table is set up for the fact that already happened, is the archive of historical data, that is to say, should not be modified. Take the test department test software bug as an example. Each bug is a fact. The state of this bug may be designed to be new, transferred, repaired, rejected, etc. in the data dictionary. Then there is a field status in the bug table in the fact table. When a tester or developer changes the value of this state, how is it updated in the fact table? Is it a direct update to the status or what other way? Obviously, in order to be able to track the history of this bug, it should be to reinsert a new record. So what's the difference between this and the previous database design? You can see that for the original record and the newly inserted records, the other fields are all the same, i.e. all redundant. If you use BugID as the primary key, you will find that the primary key is redundant (of course, you can only delete the primary key before inserting). So it can be seen that the fact table generally does not have a primary key. The quality of the data is fully grasped by the business system.

The dimension table is usually a primary key, representing a single entity of the substance, and the other fields are generally hierarchical. For example, February 19, 2009 is the primary key, then it will be a year-month-day such a level, for the convenience of Statistics, year and year will not be in the aggregation of the time to calculate, but in the maintenance of records have been calculated. Is it worth the redundancy of these fields? It can be explained that the data in the dimension table is generally relatively small, which is less of a relative fact table. Because the fact table is increasing, and the dimension table grows slowly, the absolute number is not too large. If you want to do a group by year (TimeKey), then when the fact table and dimension table make a connection query, it will produce the same amount of data as the fact table, if there is no hierarchy of these dimension table, then one will increase the calculation (need to take the year according to the Time field), and the second is because the calculation , the index is invalidated. This cost is much larger than the introduction of redundant fields.

The primary key of the dimension table generally takes the type of the flag column of the integer value, which is also to save the storage space of the fact table.
---------------------
Always practiced everywhere
Source: CSDN
Original: 25553117
Copyright NOTICE: This article is for bloggers original article, reprint please attach blog link!

Data warehouses-fact tables and dimension tables

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.