The purpose of data warehouse is to build an integrated data environment oriented to analysis, and to provide decision support for enterprises (Decision-support). In fact, the data warehouse itself does not "produce" any data, at the same time does not need to "consume" any data, data from the outside, and open to external applications, which is why called "Warehouse", not called "factory" reasons. Therefore, the basic structure of data warehouse mainly contains the process of data inflow and outflow, which can be divided into three layers--source data, data Warehouse, data application:
It can be seen from the figure that data warehouse data from different source data, and provide a variety of data applications, data from top to bottom into the data warehouse and open to the upper level, and the Data Warehouse is only a platform for intermediate integrated data management.
Data Warehouse from various data sources and data in the data Warehouse conversion and flow can be considered to be ETL (extraction extra, transformation transmits, load load) process, ETL is the Data Warehouse pipeline, can also be considered as Data Warehouse blood, It sustains the data in the Data Warehouse metabolism, and data warehouse day-to-day management and maintenance work of most of the effort is to maintain the normal and stable ETL.
The following is a brief introduction to the Data Warehouse architecture in the various modules, of course, the Data Warehouse described here is mainly refers to the Web site Data Warehouse.
Data Warehouse Data source
In fact, a previous article has introduced the data warehouse of various types of source data-Data Warehouse source data type, so this is no longer described in detail.
For Web site Data Warehouse, click Stream Log is a major source of data, it is the basis of the analysis of the site data; Of course, the site's database data is not small, its record of the operation of the site data and the results of various user operations, for analysis of the site outcome such data more accurate Others are documents that may be generated outside the site and other types of data that are useful to company decisions.
Data Storage
The source data is exported through the daily task scheduling of ETL, and is converted into the Data warehouse in the form of characteristics. In fact, this process has been a lot of controversy, that is, the data warehouse need not to store the details of the data, one side of the view of the data Warehouse-oriented analysis, so long as the storage of specific needs of the multidimensional Analysis model; The specific analysis model is then generated based on the requirements aggregation and processing detail data. I prefer the latter view: The Data Warehouse does not need to store all the raw data, but the data warehouse needs to store the detail data, and the imported data must be sorted and transformed to make it subject-oriented. Simply explained:
(1). Why don't you need all the raw data? The data warehouse is oriented to analytical processing, but some source data is not valuable for analysis or its potential value is much lower than the implementation and performance costs of the data warehouse needed to store the data. For example, we know that the user's province, the city is enough, as to where the user lives may only be concerned about the logistics business, or users in the blog comments may be only text mining will be necessary, but the lengthy comments on the text of the data warehouse will not outweigh the gains;
(2). Why do I need to save details? Detail data is required, Data Warehouse analysis requirements will change at all times, and with the details of the data can be quo, but if we only store the data model based on certain requirements, then obviously for the frequent change of demand will be unprepared;
(3). Why should it be subject-oriented? Theme-oriented is the first characteristic of data Warehouse, which means to organize the data rationally to realize the analysis. For the source data, its data organization is diverse, like the click Stream data format is not optimized, the data of the foreground database is based on the OLTP Operation organization optimization, these may not be suitable for analysis, and organized into a theme-oriented organizational form is really conducive to analysis, For example, the click Stream log will be sorted into pages (page), access (visit or session), User (Visitor) three themes, which can significantly improve the efficiency of the analysis.
The Data warehouse is based on the maintenance details data processing, so that it can be truly applied to analysis. Mainly includes three aspects:
Aggregation
of
data
Aggregated data here refers to a simple aggregation based on specific requirements (multidimensional data aggregation is embodied in the multidimensional data model), and simple aggregation can be aggregated data such as the total pageviews, Visits, Unique visitors of the site, or Avg. Time on the average data such as page, Avg. time on site, which can be directly displayed on the report.
Multidimensional Data Model
Multidimensional data model provides multi-angle and multi-level analysis application, such as sales star model and snowflake model based on time dimension and regional dimension, and can realize cross query in each time dimension and region dimension, and subdivision based on time and region dimension. Therefore, the application of multidimensional data model is based on On-line Analytical processing (online analytical process, OLAP), and the data mart oriented to the specific demand group will be built on the basis of multidimensional data model.
business model
The business model here refers to data models built on some data analysis and decision support, for example, I have introduced the user evaluation Model, association recommendation model, RFM analysis model, or decision support of the linear programming model, inventory model, etc. at the same time, data mining in the early data processing can also be completed here.
Data application
of
Data Warehouse
Previous article--the value of Data Warehouse introduces the value embodiment of the four characteristics of data warehouse, but the value of data warehouse is much more than that, and its value is reflected in the data application of Data Warehouse. Several of the applications listed in the diagram do not contain all the data-related extensibility applications can be implemented on a data warehouse basis.
Report Presentation
Reports are almost an essential type of data application for each data warehouse, presenting aggregated data and multidimensional analysis data to the report, providing the simplest and most intuitive data.
ad hoc query
Theoretically all data in the Data warehouse (including detail data, aggregated data, multidimensional data and analysis data) should be open to ad hoc query, Ad hoc query provides a flexible enough data access, users can query according to their own needs to obtain data, and provide export to Excel and other external files function.
Data Analysis
Most of data analysis can be based on the construction of business model, of course, can also use aggregated data for trend analysis, comparative analysis, correlation, and so on, and multidimensional data model provides a multidimensional analysis of the data base; At the same time from the detailed data to obtain some sample data for specific analysis is a more common way
Data Mining
Data mining uses advanced algorithms that allow data to show surprising results. Data mining can be based on the business model already built in data Warehouse, but most of the time data mining will start with the detail data, and the Data Warehouse provides data interface for mining tools such as SAS and SPSS.
Metadata Management
Metadata (meta Date), in fact, should be called explanatory data, that is, data. It mainly records the definition of the model in the Data Warehouse, the mapping relationship among each level, the data state of the Data Warehouse and the task running status of the ETL. Metadata repositories (Metadata Repository) are typically used to store and manage meta data uniformly, with the primary purpose of achieving synergy and consistency in the design, deployment, operation, and management of the Data Warehouse.
Finally do a ending, the data warehouse itself neither production data nor consumption data, but as an intermediate platform to store data integrated; The difficulty in realizing the data warehouse lies in the construction of the whole architecture and the design of ETL, which is also the key point in daily management and maintenance. And the real value of data Warehouse lies in its data application, if there is no valid data application, it loses the meaning of building data warehouse.
» This article uses the»in agreement, reprint please specify the Source: the Website Data analysis» "The basic structure of the Data Warehouse"