Four key features of data warehousing:
1. Thematic nature
Data Warehouse is generally from the user's actual needs, the different platform of the data source according to set the theme of the integration, and the traditional transaction-oriented operational database, with a high degree of abstraction. The topic-oriented data organization is a complete, uniform and consistent description of the analytical object data at a higher level, which can fully and uniformly characterize the various data related to the enterprise involved in each analysis object, as well as the connection between the data.
2. Integration of
Most of the data stored in the data warehouse originates from the traditional database, but it is not simple to import the original data directly, but it needs preprocessing. This is because data in transactional data is generally noisy, incomplete, and non-uniform in data form. The direct import of these "dirty data" will confuse data mining based on the Data warehouse. Dirty data must be extracted, purged, and transformed before it enters the data warehouse to generate a collection of data that is subject-oriented from transaction orientation. Data integration is the most important and complex step in the construction of Data warehouse.
3. Stability
Data in the data warehouse is mainly for decision-makers to provide data basis for analysis. The data that is based on the decision is not allowed to be modified. That is, after the data is saved to the Data Warehouse, the user can only query and analyze through the analysis tools, and cannot be modified. Data update upgrade is mainly done in the data integration process, the outdated data will be directly filtered in the Data warehouse.
4. Dynamic nature
Data Warehouse data is updated periodically over time and is not updatable for applications where data is not updated when user analytics is processed. After a fixed interval of time, the data generated in the running database system is extracted and converted into the Data warehouse. Over time, data is continuously synthesized at a higher level of synthesis to meet the requirements of trend analysis. Delete data from the Data warehouse when the data exceeds the storage age of the data warehouse, or when the analysis is useless. With regard to the structure and maintenance information of the data warehouse in the metadata (Metadata) of the Data Warehouse, the maintenance of the Data warehouse is performed automatically by the system according to its definition or maintained by the system administrator on a regular basis.
Several data models for the Data warehouse:
1. Star-shaped model
Star model is the recommended modeling method in Data mart dimension modeling. The star model is centered on the fact table, with all dimension tables connected directly to the fact table, like the stars. The star model is characterized by intuitive data organization and high execution efficiency. Because in the Data mart construction process, the data after the preprocessing, for example by the dimension summarizes, sorts and so on, the data quantity reduces, the execution efficiency is relatively high.
2. Snowflake model
Snowflake models are also an option in dimensional modeling. The dimension table of a snowflake model can have other dimension tables, although this model is more canonical than the star model, but because this model is not easy to understand, the maintenance cost is high, and the performance aspect needs to correlate the multi-layer dimension table, the performance is also lower than the star model. So it's usually not very common.
3. Paradigm Modeling
The modeling of the third paradigm is a modeling method used in database modeling, which is characterized by its systematic, good extensibility, avoiding redundancy and avoiding updating anomalies. Therefore, in the EDW layer modeling of Data Warehouse, we also advocate the use of the third paradigm modeling. However, the integration of data warehouse and the characteristics of historical changes mean that the data volume is very large, the correlation between table and table is inefficient, so sometimes fully canonical paradigm modeling is not the best choice, usually we will choose non-canonical processing, add some redundant fields to avoid the number of associations between tables, This will save a lot of time.
4. Constellation Model
The constellation model is an extension of the star model, and the star model is based on a fact table, and the constellation model is based on multiple fact tables and shares dimension information. It is also a good choice to build a constellation model by building a consistent dimension. For example, the details table and the summary table shared dimension of the same topic, and the fact tables of different topics can be added to each other on the dimension to generate the dimensions that can be shared.
Data Warehouse Basics