What is a data warehouse?
Bill Inmon, widely regarded as the father of the data warehouse, gave the most widely accepted definition in his book Building the Data Warehouse, published in 1991: a data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data used to support management decisions.
A data warehouse is a process rather than a project, and an environment rather than a product. It provides users with the current and historical data they need for decision support, data that is difficult or impossible to obtain from traditional operational databases. Data warehouse technology is a general term for the technologies and modules that integrate operational data into a unified environment in order to provide access to decision-support data. All of this work serves one purpose: making it easier to query the required information and to support decision-making.
Composition of Data Warehouses
1. Data Warehouse database
The database is the core of the entire data warehouse environment. It is where the data is stored, and it provides support for data retrieval. Compared with an operational database, its distinguishing features are support for massive data volumes and fast retrieval.
2. Data extraction tools
Data extraction tools pull data out of various kinds of storage, apply the necessary conversion and cleanup, and then load it into the data warehouse. The key requirement for such a tool is the ability to access many different kinds of data storage: it should be able to generate COBOL programs, MVS job control language (JCL), UNIX scripts, and SQL statements to reach the different data sources. Data conversion includes deleting data segments that are meaningless to decision-support applications, converting to uniform data names and definitions, calculating statistics and derived data, assigning default values to missing data, and unifying differing data definitions.
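The conversion steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the list of dropped fields, and the default values are all hypothetical, and a real extraction tool would read from actual source systems rather than in-memory lists.

```python
from statistics import mean

# Hypothetical mapping from source-specific field names to uniform names.
FIELD_MAP = {
    "cust_nm": "customer_name", "CustomerName": "customer_name",
    "amt": "amount", "Amount": "amount",
    "rgn": "region", "Region": "region",
}

# Fields assumed to be meaningless for decision support; dropped on load.
DROP_FIELDS = {"ui_session_id", "print_flag"}

# Default values assigned when a field is missing or empty (assumed).
DEFAULTS = {"customer_name": "UNKNOWN", "amount": 0.0, "region": "UNKNOWN"}


def transform(record):
    """Convert one source record to the uniform warehouse layout."""
    row = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # delete data that is meaningless for decisions
        row[FIELD_MAP.get(key, key)] = value  # unify names
    for name, default in DEFAULTS.items():
        if row.get(name) is None:
            row[name] = default  # fill missing values with defaults
    return row


def summarize(rows):
    """Calculate simple statistics (derived data) over transformed rows."""
    amounts = [r["amount"] for r in rows]
    return {
        "row_count": len(rows),
        "total_amount": sum(amounts),
        "mean_amount": mean(amounts),
    }


# Two invented sources that name the same fields differently.
source_a = [{"cust_nm": "Ann", "amt": 120.0, "rgn": "North", "ui_session_id": "x1"}]
source_b = [{"CustomerName": "Bob", "Amount": None, "Region": "South", "print_flag": 1}]

warehouse_rows = [transform(r) for r in source_a + source_b]
summary = summarize(warehouse_rows)
```

After the run, both records share the same three field names, and the missing amount from the second source has been replaced by the default.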
3. Metadata
Metadata is data that describes the data structures in the data warehouse and how they were built. By purpose, it falls into two categories: technical metadata and business metadata.
Technical metadata is used by the people who design and manage the data warehouse, for development and day-to-day management. It includes data source information; descriptions of data conversions; definitions of the objects and data structures in the data warehouse; the rules used for data cleaning and data updating; the mapping between source data and target data; user access permissions; and the data backup history, data import history, and information release history.
Business metadata describes the data in the data warehouse from a business perspective. It includes descriptions of the business subjects and of the data, queries, and reports that each subject contains.
Metadata provides an information directory for accessing the data warehouse. This directory fully describes what data exists in the data warehouse, how that data was obtained, and how to access it. It is the center of data warehouse operation and maintenance: the data warehouse server uses it to store and update data, and users rely on it to understand and access data.
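To make the idea of an information directory more concrete, here is a minimal sketch of one entry that carries both technical and business metadata. The class names, attribute names, and the sample column are assumptions made for illustration; they do not correspond to any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    # Technical metadata: source-to-target mapping and conversion rule.
    target: str                # warehouse column (hypothetical name)
    source: str                # field in the operational system
    conversion: str            # description of the data conversion
    # Business metadata: what the column means to a business user.
    business_description: str
    # Technical metadata: data import history.
    import_history: list = field(default_factory=list)


class InformationDirectory:
    """Toy information directory keyed by warehouse column name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.target] = entry

    def lookup(self, target):
        return self._entries[target]

    def record_import(self, target, import_date):
        self.lookup(target).import_history.append(import_date)

    def describe(self, target):
        e = self.lookup(target)
        return (f"{e.target}: {e.business_description} "
                f"(source: {e.source}; conversion: {e.conversion})")


directory = InformationDirectory()
directory.register(ColumnMetadata(
    target="fact_sales.amount",
    source="orders.amt",
    conversion="renamed; default 0.0 when missing",
    business_description="Order value in the reporting currency",
))
directory.record_import("fact_sales.amount", "2024-01-31")
description = directory.describe("fact_sales.amount")
```

A user would call `describe` to understand a column, while the load process would call `record_import` each time it refreshes the data.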
4. Access tools
Access tools provide the means for users to access the data warehouse. They include data query and reporting tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
5. Data marts
A data mart is a portion of the data separated out of the data warehouse for a specific purpose or scope of application; it is also called departmental data or subject data (subject area). When implementing a data warehouse, it is often possible to start with the data mart of a single department and later assemble several data marts into a complete data warehouse. Note that when different data marts are implemented, the definitions of fields that carry the same meaning must be compatible; otherwise, the later data warehouse implementation will run into serious trouble.
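One way to catch incompatible field definitions early is to compare the definitions that each data mart declares. The sketch below assumes that every mart can export its fields as a simple name-to-definition dictionary; the mart names, field names, and type strings are invented for the example.

```python
def find_conflicts(marts):
    """Return fields whose definitions differ between data marts.

    `marts` maps a mart name to a dict of {field name: definition}.
    The result maps each conflicting field to {mart name: definition}.
    """
    seen = {}       # field name -> (first mart defining it, its definition)
    conflicts = {}
    for mart, fields in marts.items():
        for name, definition in fields.items():
            if name not in seen:
                seen[name] = (mart, definition)
            elif seen[name][1] != definition:
                first_mart, first_def = seen[name]
                # Record the original definition once, then each deviation.
                conflicts.setdefault(name, {first_mart: first_def})[mart] = definition
    return conflicts


marts = {
    "sales": {"customer_id": "INTEGER", "amount": "DECIMAL(12,2)"},
    "finance": {"customer_id": "TEXT", "amount": "DECIMAL(12,2)"},
}
conflicts = find_conflicts(marts)
```

Here `customer_id` is reported because the two marts disagree on its type, while `amount` is compatible and is left out.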
Data warehouse management: security and privilege management; tracking data updates; data quality checks; managing and updating metadata; auditing and reporting on data warehouse usage and status; deleting data; copying, splitting, and distributing data; backup and recovery; and storage management.
Information publishing system: sends data from the data warehouse, or other related data, to different locations or users. A web-based information publishing system is the most effective way to handle access by many users.
The data warehouse arose, once large numbers of databases already existed, from the need to exploit data resources further and to support decision-making. It is by no means simply a "large database". Data warehouses did not emerge to replace databases, and most data warehouses are currently still managed by relational database management systems. Databases and data warehouses complement each other.
In short, databases are designed around transactions, while data warehouses are designed around subjects.
Databases generally store online transaction data, while data warehouses generally store historical data.
Database design avoids redundancy as far as possible and generally follows the rules of the normal forms. Data warehouse design deliberately introduces redundancy and adopts a denormalized design.
A database is designed to capture data, whereas a data warehouse is designed to analyze it. The two basic elements of a data warehouse are dimension tables and fact tables. A dimension is an angle from which a problem is viewed, such as time or department, and a dimension table holds the definitions of these things. The fact table holds the data to be queried together with the IDs of the dimensions.
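The relationship between dimension tables and a fact table can be illustrated with Python's built-in sqlite3 module. This is a small sketch, not a recommended warehouse design; the table names, column names, and sample rows are assumptions chosen to match the time and department dimensions mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: definitions of the things we analyze by.
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);

    -- Fact table: the data to be queried plus the dimension IDs.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        dept_id INTEGER REFERENCES dim_department(dept_id),
        amount  REAL
    );

    INSERT INTO dim_time VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO dim_department VALUES (10, 'Retail'), (20, 'Wholesale');
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 20, 250.0), (2, 10, 150.0);
""")

# Analyze the facts along the department dimension.
sales_by_department = conn.execute("""
    SELECT d.dept_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_department AS d ON d.dept_id = f.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
""").fetchall()
```

Swapping `dim_department` for `dim_time` in the join gives the same facts summarized by month instead, which is the point of keeping the dimensions separate from the facts.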
Put conceptually, this is somewhat abstract. Every technology serves applications, so it is easier to understand in the context of one. Take banking as an example. The database is the data platform of the transaction system: every transaction a customer makes at the bank is written to the database and recorded. Put simply, the database is used for bookkeeping. The data warehouse is the data platform of the analysis system: it obtains data from the transaction system, then summarizes and processes it to give decision makers a basis for their decisions. For example, how many transactions does a given branch handle in a month, and what is its current deposit balance? If deposits are high and consumer transactions are frequent, it makes sense to set up an ATM in that area.
Clearly, a bank's transaction volume is huge, usually measured in millions or even tens of millions of transactions. The transaction system works in real time and must respond promptly; a customer cannot be expected to wait tens of seconds to deposit a sum of money, so the database can hold data for only a short period. The analysis system works after the fact. It must provide all valid data for the time period of interest. That data is massive, and summarizing and calculating over it is slower, but as long as it delivers useful analytical data, the goal is achieved.
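The banking example can be sketched as follows: the transaction system records one row per transaction, and the analysis side summarizes those rows per branch and month after the fact. The branch names, dates, and amounts are invented, and the tuple layout is an assumption made for this illustration.

```python
from collections import defaultdict

# One tuple per transaction, as the transaction system might record it:
# (branch, date as "YYYY-MM-DD", kind, amount). All values are invented.
transactions = [
    ("Branch A", "2024-03-02", "deposit", 500.0),
    ("Branch A", "2024-03-15", "purchase", 80.0),
    ("Branch A", "2024-04-01", "deposit", 200.0),
    ("Branch B", "2024-03-09", "deposit", 50.0),
]


def monthly_summary(rows):
    """Summarize transaction count and deposit total per (branch, month)."""
    summary = defaultdict(lambda: {"count": 0, "deposits": 0.0})
    for branch, day, kind, amount in rows:
        key = (branch, day[:7])  # "YYYY-MM" identifies the month
        summary[key]["count"] += 1
        if kind == "deposit":
            summary[key]["deposits"] += amount
    return dict(summary)


summary = monthly_summary(transactions)
```

A decision maker would read `summary` rather than the raw rows, for instance to compare branch activity before deciding where to place an ATM.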
What, then, are the differences between a data warehouse and a traditional database? Look again at W. H. Inmon's definition of a data warehouse: a subject-oriented, integrated, time-variant, and non-volatile collection of data.
"Subject-oriented": a traditional database mainly processes data for applications and does not necessarily store data by subject. A data warehouse focuses on data analysis and stores data by subject. The difference resembles that between a traditional farmers' market and a supermarket: in the market, cabbage, radishes, and coriander sit on the same stall if they are sold by the same vendor, whereas in the supermarket cabbage, radishes, and coriander are each kept in their own place. In other words, the produce (data) in the market is grouped (stored) by vendor (application), while the supermarket groups it by type (subject).
"Time-variant": when a database saves information, it does not insist that time information be present. A data warehouse is different: because it serves decision-making, its data must carry a time attribute, and in decision-making the time attribute is very important. Consider two customers who have both bought Nine-car products: one bought them within the last three months, while the other has not bought any in the past year. To a decision maker, these two customers mean different things.
"Non-volatile": the data in the data warehouse is not the most current data; it comes from other data sources. The data warehouse reflects historical information rather than the day-to-day transaction data that many databases process (some databases, such as telecom billing databases, even process information in real time). Data in the data warehouse is therefore rarely or never modified. Adding data to the data warehouse is, of course, allowed.
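The time attribute and the rule that data is added but not modified can be shown together in a short sketch. The class below is hypothetical: it keeps dated purchase records, offers no way to edit them, and answers the kind of recency question raised in the customer example above. The customer names, dates, and amounts are invented.

```python
from datetime import date, timedelta


class PurchaseHistory:
    """Append-only store of dated purchase records (illustrative only)."""

    def __init__(self):
        # (customer, purchase_date, amount); rows are never edited in place.
        self._rows = []

    def append(self, customer, purchase_date, amount):
        """Adding data is allowed; there is deliberately no update method."""
        self._rows.append((customer, purchase_date, amount))

    def bought_within(self, customer, as_of, days):
        """True if the customer bought something in the `days` before `as_of`."""
        cutoff = as_of - timedelta(days=days)
        return any(c == customer and cutoff <= d <= as_of
                   for c, d, _ in self._rows)

    def total(self, customer):
        """Total amount purchased by the customer over all recorded time."""
        return sum(a for c, _, a in self._rows if c == customer)


history = PurchaseHistory()
history.append("customer_1", date(2023, 1, 5), 40.0)
history.append("customer_1", date(2024, 5, 20), 60.0)
history.append("customer_2", date(2023, 2, 10), 100.0)

as_of = date(2024, 6, 30)
recent_1 = history.bought_within("customer_1", as_of, days=90)
recent_2 = history.bought_within("customer_2", as_of, days=365)
```

Both customers have the same purchase total, yet only the first has bought recently. Without the date on each record, that difference would be invisible.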
Comparison between data warehouses and traditional databases
Traditional relational databases (RDB) follow a uniform relational model. Data (records) is stored in tables and can be queried with the unified Structured Query Language (SQL). Their typical application is online transaction processing (OLTP), where the focus is on completing business transactions and responding to customers promptly. Relational databases can handle large volumes of data, but they cannot simply be stacked together and used directly as a data warehouse.
A data warehouse mainly works with multidimensional data, so it is sometimes also called a multidimensional database. Data in a multidimensional database is stored in arrays; there are no uniform rules and no uniform multidimensional model, so such databases can only be grouped into categories. In terms of application, a multidimensional database needs strong query capabilities. It stores a broad range of information, but because its job is online analytical processing (OLAP), it does not aim for instantaneous response; a response within a bounded time is considered acceptable. OLAP packages provide interactive data queries together with a variety of analysis methods, such as drilling down, level by level, to the underlying details. The information in a data warehouse, although multidimensional, can therefore still be presented in specific tables.
Despite these large differences from traditional databases, a data warehouse is not designed entirely from scratch. Existing traditional data processing can be used, and information drawn from it and integrated, to build data warehouses that meet different needs. That is, data flows from the dynamic, event-driven operational data of traditional systems into the static, historical data warehouse. In theory, this could be done by strategically moving expired data out of the operational data. In practice, limits on storage capacity and technology make this impossible, so data must be separated from the operational data and filtered into the data warehouse. Given these factors, and to ensure OLAP performance, the data warehouse must be kept separate from the traditional operational data.
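Drilling down can be pictured as re-running the same aggregation at a finer level of the time dimension. The sketch below assumes facts keyed by an ISO date string and uses string prefixes to stand in for the year, month, and day levels; the function name, its parameters, and the sample values are all hypothetical.

```python
from collections import defaultdict

# Invented facts: (date as "YYYY-MM-DD", amount).
facts = [
    ("2023-12-31", 2.5),
    ("2024-01-05", 10.0),
    ("2024-01-20", 5.0),
    ("2024-02-03", 7.5),
]

# Number of leading characters of the date that identify each level.
LEVEL_WIDTH = {"year": 4, "month": 7, "day": 10}


def aggregate(rows, level, within=""):
    """Total the amounts at one level, restricted to dates starting with `within`."""
    width = LEVEL_WIDTH[level]
    totals = defaultdict(float)
    for day, amount in rows:
        if day.startswith(within):
            totals[day[:width]] += amount
    return dict(totals)


by_year = aggregate(facts, "year")                       # top-level view
months_2024 = aggregate(facts, "month", within="2024")   # drill into 2024
days_jan = aggregate(facts, "day", within="2024-01")     # drill into January
```

Each call narrows the scope and increases the detail, ending at the individual daily records that underlie the yearly total.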
Reprinted from: http://www.haogongju.net/art/522901
http://wiki.mbalib.com/wiki/%E6%95%B0%E6%8D%AE%E4%BB%93%E5%BA%93