First, what is a data warehouse?
The traditional concept is that a data warehouse is a structured data environment that serves as the data source for decision support systems (DSS) and online analytical applications. It is a strategic collection of all types of data that supports the decision-making process at every level of the enterprise.
In simple terms, data warehouses used to support only strategic decisions and are now shifting to support both strategic decisions and tactical decisions, such as real-time marketing and personalized services. A data warehouse that serves both strategic and tactical decisions is called a real-time active data warehouse (Real-Time Active Data Warehouse, RTADW).
A real-time active data warehouse integrates both real-time data and historical data. RTADW is a relational data warehouse environment that supports real-time data updates, fast response times, drill-through data queries, and dynamic interaction to support changing business needs.
The RTADW active service mechanism mainly generates and responds to events around data changes (both coarse-grained and fine-grained), conditions, and data content. The mechanism has three elements: event, condition, and action (Event-Condition-Action, ECA). On top of the data warehouse, these three elements are used to develop a service engine for active decision making. For such an engine, I personally believe in the martial-arts saying that speed alone cannot be beaten, so the ability to proactively analyze and process real-time events is a key consideration.
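The event, condition, and action elements can be illustrated with a very small rule engine. This is only a sketch under assumptions: the `Rule` and `RuleEngine` names, the `order_placed` event, and the 1000 threshold are all hypothetical and do not come from any real RTADW product.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical event shape: a dict with a "type" key plus payload fields.
Event = Dict[str, Any]


@dataclass
class Rule:
    """One ECA rule: when `event_type` occurs and `condition` holds, run `action`."""
    event_type: str
    condition: Callable[[Event], bool]
    action: Callable[[Event], str]


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def register(self, rule: Rule) -> None:
        self.rules.append(rule)

    def handle(self, event: Event) -> List[str]:
        """Check every rule against the event and collect the results of fired actions."""
        results = []
        for rule in self.rules:
            if rule.event_type == event["type"] and rule.condition(event):
                results.append(rule.action(event))
        return results


engine = RuleEngine()
# Illustrative tactical decision: large orders trigger a personalized offer.
engine.register(Rule(
    event_type="order_placed",
    condition=lambda e: e["amount"] >= 1000,
    action=lambda e: f"send VIP coupon to customer {e['customer_id']}",
))

fired = engine.handle({"type": "order_placed", "customer_id": 42, "amount": 1500})
ignored = engine.handle({"type": "order_placed", "customer_id": 7, "amount": 20})
```

Here `fired` holds one action result and `ignored` is empty, because the second event does not satisfy the condition. A real engine would also have to read conditions from warehouse data and respond fast enough for real-time use, which this sketch does not attempt.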
Second, the data warehouse architecture
Data warehouse architectures are distinguished mainly from the point of view of data modeling. Having said that, I'll briefly walk you through the common modeling approaches. You should have studied database normal forms in school. We only need to understand the second and third normal forms:
1, the second normal form (2NF): the table must first be in 1NF, and then meet two further requirements. One is that the table must have a primary key; the other is that every column not included in the primary key must depend on the whole primary key, not on just part of it.
2, the third normal form (3NF): the table must first be in 2NF, and in addition every non-primary-key column must depend directly on the primary key, with no transitive dependencies. That is, the following must not occur: non-primary-key column A depends on non-primary-key column B, and non-primary-key column B depends on the primary key.
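The two definitions above can be checked against sample rows. The sketch below assumes a hypothetical order-line table whose primary key is (`order_id`, `product_id`); the `determines` helper and all column names are invented for illustration. Note that sample data can only refute a dependency or show that it holds for these rows, it cannot prove it in general.

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]


def determines(rows: List[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if, in this sample, equal values of `lhs` always imply equal `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this key and returns it afterwards.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical table with the composite primary key (order_id, product_id).
order_lines = [
    {"order_id": "1", "product_id": "A", "product_name": "pen", "qty": "2",
     "warehouse_id": "W1", "warehouse_city": "Beijing"},
    {"order_id": "1", "product_id": "B", "product_name": "ink", "qty": "1",
     "warehouse_id": "W2", "warehouse_city": "Shanghai"},
    {"order_id": "2", "product_id": "A", "product_name": "pen", "qty": "5",
     "warehouse_id": "W2", "warehouse_city": "Shanghai"},
]

# 2NF violation: product_name depends on product_id alone, which is only part of the key.
partial = determines(order_lines, ["product_id"], "product_name")

# qty is fine for 2NF: product_id alone does not determine it, the whole key is needed.
qty_needs_full_key = not determines(order_lines, ["product_id"], "qty")

# 3NF violation: key -> warehouse_id -> warehouse_city is a transitive dependency.
transitive = determines(order_lines, ["warehouse_id"], "warehouse_city")
```

To reach 3NF in this example, `product_name` would move to a product table keyed by `product_id`, and `warehouse_city` to a warehouse table keyed by `warehouse_id`.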
Normal forms are generally more than enough for the design of application databases. A normal form is a data modeling standard, and its benefits are mainly the following three:
1. Reduce data redundancy (this is the main benefit, other benefits are the result of this)
2. Eliminate anomalies (insertion anomalies, update anomalies, deletion anomalies)
3. Make the data organization more harmonious ...
But normal forms also have drawbacks, as anything with advantages does:
1. Queries have to join multiple tables, which increases query complexity
2. Queries have to join multiple tables, which reduces database query performance
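The join cost can be seen in a small example using Python's built-in `sqlite3` module. The tables and data are hypothetical: the same question (total order amount per city) needs two joins in the normalized layout and none in a redundant wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (3NF-style) layout: each fact is stored once.
    CREATE TABLE city     (city_id INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city_id INTEGER);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO city     VALUES (1, 'Beijing'), (2, 'Shanghai');
    INSERT INTO customer VALUES (10, 1), (11, 2), (12, 1);
    INSERT INTO orders   VALUES (100, 10, 30.0), (101, 11, 20.0), (102, 12, 50.0);

    -- Denormalized layout: city_name is repeated on every order row.
    CREATE TABLE orders_wide (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                              city_name TEXT, amount REAL);
    INSERT INTO orders_wide VALUES (100, 10, 'Beijing', 30.0),
                                   (101, 11, 'Shanghai', 20.0),
                                   (102, 12, 'Beijing', 50.0);
""")

# Normalized: two joins are needed to get from an order to its city name.
normalized = conn.execute("""
    SELECT ci.city_name, SUM(o.amount)
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    JOIN city ci    ON ci.city_id = c.city_id
    GROUP BY ci.city_name ORDER BY ci.city_name
""").fetchall()

# Denormalized: the same answer from a single table, at the cost of redundancy.
denormalized = conn.execute("""
    SELECT city_name, SUM(amount) FROM orders_wide
    GROUP BY city_name ORDER BY city_name
""").fetchall()
```

Both queries return the same totals. The wide table trades redundant `city_name` values, and the update anomalies that come with them, for a simpler query.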
At present, the cost of disk space is negligible, so the problems caused by data redundancy are no longer a reason in themselves to apply normal forms.
Therefore, a higher normal form is not always better; it depends on the actual situation. The third normal form already greatly reduces data redundancy and largely removes insertion, update, and deletion anomalies. My personal view is that the third normal form is sufficient in most cases, and in some cases the second normal form is also acceptable.
In a data warehouse, modeling is a trade-off between data redundancy and query efficiency that has to be weighed against the actual application scenario. In practice, a data warehouse chooses between the second and third normal forms.
When the primary key consists of more than one field, the choice is directly between the first and second normal forms, which is the familiar dimensional modeling. Dimensional modeling is mainly divided into the star schema and the snowflake schema.
1, Star schema
-- Query performance advantage
-- The star schema is modeled after the business model, so it has the advantage of matching the business model
2, Snowflake schema
-- When a dimension has many attributes, the star schema is split into layers to form the snowflake schema; that is, the snowflake schema layers the star schema further and reduces data redundancy.
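The difference between the two schemas can be sketched with Python's built-in `sqlite3` module. All table and column names here (`fact_sales`, `dim_product_star`, `dim_product_snow`, `dim_category`) are hypothetical. The star version keeps the category name inside the product dimension, while the snowflake version moves it into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Shared fact table: one row per sale, keyed by product.
    CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 2, 15.0), (3, 3, 7.0);

    -- Star schema: one flat dimension; category_name is repeated per product.
    CREATE TABLE dim_product_star (product_id INTEGER PRIMARY KEY,
                                   product_name TEXT, category_name TEXT);
    INSERT INTO dim_product_star VALUES (1, 'pen', 'stationery'),
                                        (2, 'ink', 'stationery'),
                                        (3, 'tea', 'food');

    -- Snowflake schema: the category attribute is split into its own table.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product_snow (product_id INTEGER PRIMARY KEY,
                                   product_name TEXT, category_id INTEGER);
    INSERT INTO dim_category VALUES (1, 'stationery'), (2, 'food');
    INSERT INTO dim_product_snow VALUES (1, 'pen', 1), (2, 'ink', 1), (3, 'tea', 2);
""")

# Star: the fact table joins directly to one dimension table.
star = conn.execute("""
    SELECT d.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star d ON d.product_id = f.product_id
    GROUP BY d.category_name ORDER BY d.category_name
""").fetchall()

# Snowflake: one extra join, but each category name is stored only once.
snowflake = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c     ON c.category_id = p.category_id
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()
```

Both queries give the same totals per category. The star query needs one join fewer, which is where its query performance advantage comes from; the snowflake layout stores 'stationery' once instead of once per product.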
All of this comes down to answering two questions: repeatability and interactivity.
These two questions correspond to two entities: the data warehouse and the data mart. Let's compare the differences between the two.
After looking at the two figures together with the basics above, you should be able to understand the differences, so I will not say more.
The data warehouse world is divided into two main schools, in the great debate between Inmon and Kimball.
First, Bill Inmon defines a data warehouse as "a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process".
The corresponding data warehouse architecture is the integrated architecture, as shown in the following illustration:
Second, Ralph Kimball said that "the data warehouse is nothing more than the union of its data marts", and he believes that "the data warehouse can be built incrementally through a series of data marts with conformed dimensions".
The corresponding data warehouse architecture is the bus architecture, as shown in the following illustration:
The combination of the two architectures above is the hybrid architecture, as shown in the following illustration:
In simple terms, choosing an architecture comes down to choosing between third-normal-form modeling and dimensional modeling.
At present, in Internet companies, the business really does run ahead of the data warehouse. If the platform is built with the waterfall method, it has to be built fast enough to catch up with business changes, so we should concentrate all our strongest forces on a single breakthrough point. The iterative method, by contrast, needs a relatively stable team that shares one concept and sticks to it, and it is implemented through continuous improvement. Its implementation cycle is longer, its impact is smaller, and appropriate adjustments can be made in time. Therefore, for current Internet companies I recommend the bus architecture, that is, dimensional modeling, to build and implement the active data warehouse.
Share ppt:http://www.slideshare.net/limap52/ss-42553791