Model design of Data Warehouse
A. Data Modeling Methodology
The design of the Data warehouse model follows the design principle of "top-down and gradual refinement".
The design of the model is divided into three stages:
1, conceptual model
The scope and use of the business, from the height of the abstract summary, that is, to divide the subject domain.
Generally divided into 8 subject fields:
Customer, service, service use, accounts, settlement, resources, customer service, marketing
Why divide the Subject field?
The topic domain is divided according to the application and needs of the business, and is used to achieve the purpose of data and business tight coupling.
2 , the logical model
Refine the topics in the conceptual model to define the relationship between the entity and the entity, and the attributes of the entity.
That is, the table is defined as the table's role, the table's fields are constrained. Form ER diagram.
These entities are designed based on business rules, so to speak, this phase is primarily about business. " Business-driven modeling "
3 , the physical model
In accordance with the logical model, in the database to build tables, indexes and so on. Data Warehouse, in order to meet the requirements of high performance, can increase the redundant, hidden table constraints and other anti-third paradigm Operation .
This stage, mainly for the database, hardware, performance.
Paradigm :
First paradigm: The fields of a database table are single attributes and cannot be divided.
Second paradigm: A partial function dependency of a non-critical field on any of the candidate key fields does not exist in the database table.
(Part of a function dependency is when some of the fields in a composite keyword determine non-critical fields). That is, all properties are required to depend on the primary key.
Third Paradigm : a non-critical field in a database table does not have a transfer function dependency on any of the candidate key fields.
The paradigm is backwards compatible.
For example:
Student ID Student Name Student Department Course ID Course name Score
60100 School of three education, Department of Psychology, 1 classes english_1 English 180
1) violate the first paradigm. Because: The student department can be divided into: College, Department, class
2) violate the second paradigm. Because: The key field is the student ID and course ID, but there is a "course ID" that determines the course name and course credits.
3) violate the third paradigm. Because: The key field is the student ID, but there may be name and credit dependent "course ID".
Star model and Snowflake model
First, they are all made up of a fact table and a set of dimension tables.
Star model, also known as dimension modeling .
The difference is:
Star Model : The dimension table is connected directly to the fact table, and the pattern is like a star.
such as counties and cities as the same dimension are in the city table.
* Dimension preprocessing, the dimension will be pre-classification, sorting and other preprocessing.
Snowflake Model : Some dimension tables are not directly connected to the fact table, but are passed through the dimension table, and the shapes are like snowflakes.
For example:
Figure 1: Star-shaped model
Figure 2 Snowflake Model
From the performance point of view, Star Model query performance is good.
To improve performance, you can allow violations of the third paradigm, proper redundancy, and the hiding of constraints between tables
Dimension modeling
The business dimension is fused into the data model, and the name dimension is modeled.
Or, for analysis convenience (commercial application requirements), the different levels of the same dimension (such as the city ID, county ID) are fused into the fact table (such as the User wide table).
The dimension model is also a star model.
It emphasizes the preprocessing of dimensions, the collection of multiple dimensions into a fact table, and the formation of a wide table , such as the user unified view above. Contains more than 20 dimensions. This makes it possible to combine the dimensions to form a flexible report query
B. Layered design principles
The Data Warehouse in telecom industry adopts the principle of layered design.
In general, it is divided into three layers: interface layer, intermediate summary layer and application layer.
Application Tier Data Mart Market data mart, data mining
Application Layer KPI Report, Cagnos, topic Analysis, Index Library
Middle-level Depth summary layer Information aggregation: User Unified view, 3G user Unified view, fixed-line user Unified view
Business development: User behavior, Value-added business, group business, International Business
Light Summary Level inventory summary, user attribute aggregation, cost summary, group customer summary, etc.
Interface Layer storage Layer interface backup, incremental transfer total, reduced I/O (sub-standard data and historical data)
Interface Layer Day interface, monthly interface, incremental interface, full-volume interface
Special emphasis is placed on:
The middle tier is the most important layer of data warehousing. Directly determines the performance of the Data Warehouse.
The general practice is to:
1) Data summary. Summarize the underlying data in small granularity by dimension
2) Information aggregation. Aggregates information from multiple tables in a single table. The benefit of this is to avoid using table associations to improve query performance.
C. Topic Domain design methodology
If layered design is a horizontal design principle, then the subject domain is the vertical processing method.
The specific approach is to divide the data into different subject domains from the perspective of business, high abstraction and generalization.
The benefits of a domain: tight coupling, easy data development, and ease of use.
The domain is to have obvious table naming rules, such as:
User Information Domain--user
Communication Behavior--call
Data Service--gprs
Accounting--bill
Customer Service--serv
Data architecture diagram of XX sub-system:
Design of Data Warehouse model