1. Concept of Data Warehouse
Authoritative data warehouse field W. h. inmon provides a brief and comprehensive definition of a data warehouse: a data warehouse is a collection of theme-oriented, integrated, time-varying, and non-loss-prone data. It is a decision-making process supporting management departments. According to this definition, the Data Warehouse has the following four key features:
1.1 subject oriented)Data Set
Data Warehouses are generally organized around topics such as products, vendors, and consumers. Data Warehouses focus on data modeling and analysis by decision makers, rather than processing daily operations and transactions. Therefore, the data warehouse provides a concise view of a specific topic, eliminating data that is useless for decision making.
1.2 integrated data set
A data warehouse is usually composed of multiple heterogeneous data sources, which may include relational databases, object-oriented databases, text databases, web databases, and general files.
1.3Time-Varying(Time variant) Data Set
Data storage provides information from a historical perspective. A data warehouse contains time elements, and the information it provides is always associated with time. Data in a data mining warehouse stores data in a time period, not just data at a certain time point.
1.4 Nonvolatile Data Set
Data Warehouses are physically separated from real-time application data in the operating environment. Therefore, no transaction processing, recovery, and concurrency control mechanisms are required. Data in a data warehouse usually requires only two types of operations: initial loading and data access. Therefore, the data is relatively stable and rarely or never updated.
To sum up, a data warehouse is a semantic consistent data storage that acts as a physical implementation of decision-making support data models and stores the information required for strategic decision-making. A data warehouse is also often considered as an architecture. By integrating data from different data sources, it supports structured and specialized query, analysis report, and decision making.
2. Data Warehouse type
The data warehouse type can be divided into the following three types based on the data types managed by the Data Warehouse and the scope of the enterprise problems they solve (EDW), Operational database (ODS), and data marketplace (datamart ). ① An enterprise data warehouse is a general data warehouse. It contains both a large amount of detailed data and a large amount of cumbersome or aggregated data, which is not easy to change and oriented to history. This type of data warehouse is used to make strategic or tactical decisions covering a variety of enterprise fields. ② Operational databases can be used for decision-making and support for work data, and can also be used as a transitional area when data is loaded to a data warehouse. Compared with EDW, ODS has the following features: ODS is subject-oriented and comprehensive; ODS is changeable; ODS only contains the current and detailed data, it does not include accumulative and historical data. ③ Data marketplace is a type of Data Warehouse. It can contain lightly accumulated and historical Department data and is suitable for the needs of a department in a specific enterprise. Several data marketplaces can form oneEDW(This will be highlighted in later sections ). With the development of data warehouses, software tools are upgraded rapidly and new products are emerging. To facilitate tracking of its technological development and better selection of relevant tools, the Data Warehouse constructor should collect extensive files and data in this area to make the best choice.
3. Comparison between data warehouses and traditional databases
the traditional relational database RDB adopts a consistent relational model, where data (Records) are stored in tables, in addition, structual Query Language (SQL) can be used for data query. Therefore, its application is often called online transaction processing (OLTP). Its focus is on business processing, respond to customers in a timely manner. Relational databases can process large databases, but they cannot be simply stacked and used directly as data warehouses. A data warehouse is mainly used for multi-dimensional data. Therefore, it is also called a multi-dimensional database. Multi-dimensional database data is stored in arrays without uniform rules or uniform multi-dimensional models. It can only be classified by category. In terms of applications, multi-dimensional databases should have strong query capabilities. Multi-dimensional databases store a wide range of information, but because they complete online transaction analysis (OLAP ), therefore, we do not pursue instantaneous response time. Therefore, we will be recognized by the response in a limited amount of time. In fact, OLAP contains interactive data queries, along with multiple analysis methods, such as drill-down or successful drill-down to the bottom-layer details. Therefore, the information in the data warehouse can still be expressed in a specific table even though it is multidimensional. Although there is such a big difference between the data warehouse and the traditional database, the design of the data warehouse is not completely different, but the existing traditional data processing can be used to integrate the information from it, to construct a data warehouse meeting different needs. That is, data flows from dynamic and event-driven traditional work data to static and historical data warehouses. Theoretically, it can be done by strategically introducing expired data from work data. However, due to the limitation of actual storage capacity and technology, this is actually impossible. Therefore, data must be separated from the work data and filtered into the data warehouse. In view of the above factors, to ensure the performance of OLAP, data warehouses must be separated from traditional work data.