1. Introduction
Broadly speaking, a data warehouse is a type of database that is maintained separately from the organization's operational databases. A data warehouse system integrates data from various application systems and provides a solid platform for consolidated historical data analysis and information processing.
Data warehousing is a collection of decision-support technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.
A data warehouse is a "subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making" (William H. Inmon, 1996).
Be sure to distinguish a data warehouse from data warehousing (the process of building and using a data warehouse).
Four keywords, subject-oriented, integrated, time-varying, and non-volatile, distinguish data warehouses from other data storage systems.
Subject-oriented: a data warehouse focuses on modeling and analyzing data for decision makers. Integrated: a data warehouse is constructed by integrating multiple heterogeneous data sources. Time-varying: the stored data provides information from a historical perspective (such as the past 5 to 10 years). Non-volatile: a data warehouse is a physically separate store of data, so it requires only two data operations: the initial loading of data and access to data.
Data warehouses support on-line analytical processing (OLAP), which differs from the on-line transaction processing (OLTP) supported by operational databases.
Note: distinguish OLTP from OLAP:
The main task of an online operational database system is to execute online transactions and process queries. An OLTP system is therefore customer-oriented (for example, managing student scores). It usually manages current data and uses an entity-relationship (ER) model and an application-oriented database design. Access consists mainly of short atomic transactions.
An OLAP system is intended for knowledge workers and is used for data analysis. It manages a large amount of historical data and provides summarization and aggregation mechanisms. It generally uses a star or snowflake model and a subject-oriented database design. Most accesses to an OLAP system are read-only operations, so query throughput and response time matter more than transaction throughput.
To facilitate complex analysis and visualization, data in a data warehouse is usually modeled in multiple dimensions. Dimensions are hierarchical, such as day-month-quarter-year, and product-category-industry.
OLAP operations include rollup (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies, slice and dice (selection and projection), and pivot (re-orienting the multidimensional view of data).
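The operations above can be illustrated with a minimal Python sketch. The sales rows, the field names, and the helper functions `aggregate`, `slice_rows`, and `pivot` are all hypothetical and chosen only for illustration; a real OLAP server would perform these operations over a data cube, not a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, product, quarter, month, sales amount).
SALES = [
    ("Berlin", "laptop", "Q1", "Jan", 100),
    ("Berlin", "laptop", "Q1", "Feb", 150),
    ("Berlin", "phone",  "Q2", "Apr", 80),
    ("Paris",  "laptop", "Q1", "Jan", 120),
    ("Paris",  "phone",  "Q2", "May", 60),
]
FIELDS = ("city", "product", "quarter", "month", "amount")


def aggregate(rows, dims):
    """Sum the amount measure, grouped by the given dimension names."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_rows(rows, dim, value):
    """Selection: keep only the rows where one dimension equals a value."""
    i = FIELDS.index(dim)
    return [row for row in rows if row[i] == value]


def pivot(totals):
    """Re-orient a two-dimensional result by swapping its two axes."""
    return {(b, a): v for (a, b), v in totals.items()}


# Detailed view at the month level (what a drill-down from quarters returns).
by_month = aggregate(SALES, ["city", "quarter", "month"])
# Rollup along the time hierarchy: month -> quarter.
by_quarter = aggregate(SALES, ["city", "quarter"])
# Slice on product = 'laptop', then aggregate by city.
laptops = aggregate(slice_rows(SALES, "product", "laptop"), ["city"])
# Pivot: view the quarterly totals as (quarter, city) instead of (city, quarter).
by_quarter_city = pivot(by_quarter)
```

Rollup and drill-down here are the same grouping function applied at a coarser or finer level of the day-month-quarter-year hierarchy, which is why dimension hierarchies matter for OLAP.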
Data warehouses can be implemented on standard or extended relational database management systems; such servers are known as Relational OLAP (ROLAP) servers. In contrast, a multidimensional OLAP (MOLAP) server uses special data structures to store multidimensional data directly.
2. Architecture and End-to-End Processing
Figure 1: Data warehouse architecture
Figure 2: a readable data warehouse architecture
A three-tier architecture is usually used: front-end tools (top tier), OLAP server (middle tier), and data warehouse server (bottom tier).
The bottom-tier data warehouse server is usually a relational database system. The middle-tier OLAP server is typically implemented using a ROLAP or MOLAP model. The top tier is the front-end client for data analysis and mining (such as trend analysis and prediction).
3. Back-End Tools and Utilities
Back-end tools are used to extract, clean, load, and refresh data. Data extraction gathers data from multiple heterogeneous external sources. Data cleaning detects errors in the data and corrects them where possible. Data loading sorts, summarizes, and consolidates the data, computes views, checks integrity, and builds indexes and partitions. Refreshing propagates updates from the data sources to the data warehouse.
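The four back-end steps can be sketched as follows. This is a minimal illustration using Python's standard `sqlite3` module, not a description of any particular ETL product; the two sources, their field names, and the `sales` and `sales_by_city` tables are assumptions made up for the example.

```python
import sqlite3

# Two hypothetical heterogeneous sources with different field names and formats.
SOURCE_A = [{"id": 1, "city": " berlin ", "amount": "100.0"},
            {"id": 2, "city": "Paris", "amount": "n/a"}]  # invalid amount
SOURCE_B = [{"order_no": 3, "town": "PARIS", "total": 60}]


def extract():
    """Extraction: map each source's fields onto one common record layout."""
    for r in SOURCE_A:
        yield (r["id"], r["city"], r["amount"])
    for r in SOURCE_B:
        yield (r["order_no"], r["town"], r["total"])


def clean(records):
    """Cleaning: normalize city names and drop records whose amount is invalid."""
    for rid, city, amount in records:
        try:
            yield (rid, city.strip().title(), float(amount))
        except (TypeError, ValueError):
            continue  # a real tool might log or repair the record instead


def load(conn, records):
    """Loading: store the records, build an index, and recompute a summary table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", records)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_city ON sales (city)")
    conn.execute("DROP TABLE IF EXISTS sales_by_city")
    conn.execute("CREATE TABLE sales_by_city AS "
                 "SELECT city, SUM(amount) AS total FROM sales GROUP BY city")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, clean(extract()))
summary = dict(conn.execute("SELECT city, total FROM sales_by_city"))

# Refresh: propagate a later source update by re-running the load step.
load(conn, [(2, "Paris", 40.0)])
refreshed = dict(conn.execute("SELECT city, total FROM sales_by_city"))
```

In this sketch the refresh simply recomputes the whole summary table. That is the simplest choice for a small example; updating the summary incrementally is an alternative when the data is large.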
4. Conceptual Models and Front-End Tools
In a multidimensional data model, there is a set of numeric measures that are the objects of analysis. Examples of such measures are sales, budget, revenue, inventory, and ROI. Each of the numeric measures depends on a set of dimensions, which provide the context for the measure. For example, the dimensions associated with a sale amount can be the city, the product name, and the date when the sale was made. Each dimension is described by a set of attributes.
Figure 3: Multidimensional Data Model
5. Database Design Methods
Here we discuss the design of relational database schemas that reflect the multidimensional view of data. Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and one table for each dimension. Each row in the fact table holds a pointer (foreign key) to each of the dimension tables, together with the numeric measures. The columns of each dimension table correspond to the attributes of that dimension.
Figure 4: Star schema example
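A star schema of this kind can be sketched with Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_city`, `dim_product`, `dim_date`), their columns, and the sample rows are assumptions chosen for illustration and are not taken from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city    (city_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
-- Fact table: one foreign key per dimension plus the numeric measure.
CREATE TABLE fact_sales (
    city_id    INTEGER REFERENCES dim_city(city_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_city VALUES (1, 'Berlin', 'Germany'), (2, 'Paris', 'France');
INSERT INTO dim_product VALUES (1, 'laptop', 'computers'), (2, 'phone', 'telecom');
INSERT INTO dim_date VALUES (1, '2024-01-15', '2024-01', 2024),
                            (2, '2024-02-03', '2024-02', 2024);
INSERT INTO fact_sales VALUES (1, 1, 1, 100), (1, 2, 2, 80), (2, 1, 1, 120);
""")

# A typical star join: aggregate the measure by attributes of two dimensions.
rows = conn.execute("""
    SELECT c.country, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_city c    ON c.city_id = f.city_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY c.country, p.category
    ORDER BY c.country, p.category
""").fetchall()
```

Note that `country` is stored directly in `dim_city` and `category` directly in `dim_product`. Keeping such hierarchy attributes inside the dimension table is what makes this a star rather than a snowflake layout.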
The snowflake schema is a variant of the star schema in which some dimension tables are normalized, so their data is further decomposed into additional tables. Normalizing the dimension tables reduces redundancy, which makes them easier to maintain and saves storage space.
Figure 5: Snowflake schema example
Complex applications may require multiple fact tables that share dimension tables. Such a schema can be seen as a collection of star schemas and is therefore called a fact constellation.
6. Indexing Technology
A data warehouse may contain a large amount of data, so query response must be optimized. First, data warehouses use redundant structures such as indexes and materialized views. In addition, parallelism can be used to reduce query response time. OLAP data can be indexed with bitmap indexes and join indexes.
Index Structure
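In bitmap indexing, a bitmap is an alternative representation of the record ID (RID) list: each distinct value of the indexed column has a bit vector with one bit per record, and a bit is set when the corresponding record has that value. Join indexing became popular through its use in relational database query processing.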
Figure 6: bitmap index example
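The idea of a bitmap index can be sketched in a few lines of Python, using an integer as the bit vector. The column data and the helper names `build_bitmap_index` and `rids` are hypothetical; production systems typically use compressed bitmaps instead of plain integers.

```python
# Hypothetical column values, indexed by record id (RID) 0..5.
CITY    = ["Berlin", "Paris", "Berlin", "Rome", "Paris", "Berlin"]
PRODUCT = ["laptop", "laptop", "phone", "phone", "laptop", "laptop"]


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set iff record i has that value."""
    index = {}
    for rid, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << rid)
    return index


def rids(bitmap):
    """Decode a bitmap back into the sorted list of record ids it represents."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


city_idx = build_bitmap_index(CITY)
product_idx = build_bitmap_index(PRODUCT)

# city = 'Berlin' AND product = 'laptop' becomes a single bitwise AND.
matches = rids(city_idx["Berlin"] & product_idx["laptop"])
```

Because selection conditions on several columns reduce to bitwise AND and OR over the bit vectors, this representation suits the read-mostly, low-cardinality columns that are common in dimension data.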
Materialized views and OLAP index structures are designed to speed up data cube query processing.