In short, a database is designed around transactions, while a data warehouse is designed around themes (subjects).
Databases generally store online transaction data, while data warehouses typically store historical data.
Database design avoids redundancy as far as possible and generally follows the normal forms. Data warehouse design intentionally introduces redundancy and uses a denormalized design.
A database is designed to capture data, while a data warehouse is designed to analyze it. The two basic elements of a data warehouse are dimension tables and fact tables. A dimension is an angle from which a problem is viewed, such as time or department. A dimension table holds the definitions of these things, and a fact table holds the data to be queried together with the IDs of the dimensions.
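As a rough illustration of dimension and fact tables, the sketch below builds a tiny schema in an in-memory SQLite database. The table names (dim_date, dim_department, fact_sales), the columns, and the sample rows are all made up for this example; a real warehouse would have far more dimensions and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: define the angles of analysis (time, department).
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_department (dept_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table: the measure to be queried plus the IDs of the dimensions.
    CREATE TABLE fact_sales (
        date_id INTEGER REFERENCES dim_date(date_id),
        dept_id INTEGER REFERENCES dim_department(dept_id),
        amount  REAL
    );
""")

# Hypothetical sample data, invented for illustration only.
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, 2023, 1), (2, 2023, 2)])
conn.executemany("INSERT INTO dim_department VALUES (?, ?)",
                 [(10, "Retail"), (20, "Wholesale")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 100.0), (1, 20, 250.0), (2, 10, 50.0), (2, 10, 75.0)])


def sales_by_month_and_department(connection):
    """Join the fact table to its dimensions and total the amounts."""
    return connection.execute("""
        SELECT d.year, d.month, p.name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        JOIN dim_department AS p ON p.dept_id = f.dept_id
        GROUP BY d.year, d.month, p.name
        ORDER BY d.year, d.month, p.name
    """).fetchall()


rows = sales_by_month_and_department(conn)
for row in rows:
    print(row)
```

The query reads the facts through the dimensions: each result row is one combination of month and department, which is the "angle" the text describes.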
The concepts alone are somewhat abstract. Every technology serves an application, so it is easier to understand in the context of one. Take banking as an example. The database is the data platform of the transaction system: every transaction a customer makes at the bank is written to the database and recorded there, so the database can be thought of simply as bookkeeping. The data warehouse is the data platform of the analysis system: it obtains data from the transaction system, summarizes and processes it, and gives decision-makers a basis for their decisions. For example, how many transactions occurred at a given branch in one month, and what is that branch's current deposit balance? If deposits and consumer transactions are both high, the area may need an ATM.
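The summarizing step in the banking example can be sketched in a few lines of Python. The record layout (branch, date, signed amount), the branch names, and the function name summarize_by_branch_and_month are assumptions made for illustration, not part of any real banking system.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records as a transaction system might log them:
# (branch, transaction date, signed amount; deposits are positive).
transactions = [
    ("Branch A", date(2023, 5, 2), 500.0),
    ("Branch A", date(2023, 5, 17), -120.0),
    ("Branch B", date(2023, 5, 9), 300.0),
    ("Branch A", date(2023, 6, 1), 80.0),
]


def summarize_by_branch_and_month(records):
    """Count transactions and net the amounts per (branch, month)."""
    summary = defaultdict(lambda: {"count": 0, "net_amount": 0.0})
    for branch, day, amount in records:
        key = (branch, day.strftime("%Y-%m"))
        summary[key]["count"] += 1
        summary[key]["net_amount"] += amount
    return dict(summary)


monthly = summarize_by_branch_and_month(transactions)
for key in sorted(monthly):
    print(key, monthly[key])
```

The transaction system only ever appends individual rows like those in transactions; the analysis side works on the aggregated view in monthly.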
Clearly, a bank's transaction volume is huge, usually measured in the millions or more. The transaction system works in real time and must respond quickly: a customer waiting tens of seconds to make a deposit would be unacceptable, which is why the database stores only a short period of data. The analysis system works after the fact and must provide all valid data for the period of interest. That data is massive and aggregates are slower to compute, but the goal is met as long as the system delivers useful analytical data.
Data warehouses emerged once large numbers of databases already existed, in order to exploit data resources further and to support decision-making. A data warehouse is by no means just a so-called "large database." So what is the difference between a data warehouse and a traditional database? Consider W. H. Inmon's definition of a data warehouse: a theme-oriented (subject-oriented), integrated, time-related (time-variant), and non-modifiable (non-volatile) collection of data.
"Theme-oriented": traditional database is mainly for the application of data processing, not necessarily according to the same topic storage, data Warehouse focused on data analysis work, is based on the theme of storage. This is similar to the difference between traditional farmers ' markets and supermarkets-in the market, cabbage, radish, and parsley will be on a stall, if they are a small sale, and supermarkets, cabbage, radish, parsley are each. In other words, the market's food (data) is stored according to the Hawker (application), the supermarket is in accordance with the type of dish (the same subject) to the heap.
"Time-related": When the database saves information, it does not emphasize that there must be time information. Data Warehouse is different, for decision-making needs, data in the Data warehouse must be marked with TIME attributes. Time attributes are important in decision-making. The same is the cumulative purchase of nine car products customers, one is the last three months to buy nine cars, one is the last year has never bought, this is different for policymakers.
"Non-modifiable": Data in the Data warehouse is not up-to-date, but comes from other data sources. Data warehouses reflect historical information and are not the kind of daily transactional data that many databases deal with (some databases, such as telecom billing databases, even processing real-time information). As a result, data in the data warehouse is rarely or never modified; Of course, adding data to the Data warehouse is allowed.
The data warehouse did not appear in order to replace the database. At present, most data warehouses are still managed by relational database management systems. Databases and data warehouses can be said to complement each other.
To add to this, a data warehouse solution is built to serve as the basis for front-end queries and analysis. Because it carries a lot of redundancy, it also needs a lot of storage. To serve front-end applications well, a data warehouse must have the following strengths; otherwise the solution is a failure.
1. High enough efficiency. Customers generally want analysis data by day, week, month, quarter, and year. The daily cycle places the highest demand on efficiency: customers must be able to see the analysis of yesterday's data within 24 hours, or even 12 hours. Because some enterprises generate a very large volume of data every day, a poorly designed data warehouse often runs into problems and delivers data 1-3 days late, which is clearly unacceptable.
2. Data quality. Customers look at all kinds of information and expect accurate data. But a data warehouse pipeline usually has 3 stages and 2 ETL steps, and a complex architecture has even more layers, so dirty data in a source or careless code can distort the data. Customers who see wrong information may make wrong decisions based on the analysis, which causes losses rather than benefits.
3. Extensibility. Some large data warehouse systems have complex architectures because they are designed for growth over the next 3-5 years, so that customers do not have to pay to rebuild the system too soon and it can run stably. This shows mainly in sound data modeling: the solution adds intermediate layers so that massive data flows have enough buffering, and the system does not stop running just because the data volume grows considerably.
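The data-quality point in item 2 above can be illustrated with a minimal validation step that an ETL job might run before loading rows. The field names (branch, amount), the two rules, and the function names are hypothetical; real pipelines would have many more checks.

```python
def validate_row(row):
    """Return a list of problems found in one source row (empty if clean)."""
    problems = []
    if not row.get("branch"):
        problems.append("missing branch")
    amount = row.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount is not numeric")
    return problems


def split_clean_and_rejected(rows):
    """Separate loadable rows from dirty ones, keeping the reasons."""
    clean, rejected = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            clean.append(row)
    return clean, rejected


# Hypothetical source rows, two of them dirty.
source_rows = [
    {"branch": "Branch A", "amount": 500.0},
    {"branch": "", "amount": 120.0},
    {"branch": "Branch B", "amount": "n/a"},
]

clean, rejected = split_clean_and_rejected(source_rows)
print(len(clean), "clean;", len(rejected), "rejected")
for row, problems in rejected:
    print(row, problems)
```

Keeping the rejected rows together with their reasons, instead of silently dropping them, makes it easier to trace distorted figures back to the dirty source data.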
The difference between a data warehouse and a database