Data Warehouse
The purpose of a data warehouse is to build an integrated, analysis-oriented data environment that provides decision support for the enterprise. The data warehouse itself does not "produce" any data, and it does not "consume" any data either: its data comes from outside and is made available to external applications, which is why it is called a "warehouse" rather than a "factory". The basic architecture of a data warehouse therefore consists mainly of the processes by which data flows in and out, and can be divided into three layers: source data, data warehouse, and data applications.
As this structure shows, the data warehouse takes its data from different data sources and serves a variety of data applications. Data flows from top to bottom, into the data warehouse and then out to the applications, and the data warehouse is only an intermediate platform for integrated data management.
The data warehouse obtains its data from each data source, and the transformation and movement of data within the warehouse can all be regarded as ETL (Extract, Transform, Load). ETL is the pipeline of the data warehouse, or one may think of it as its blood: it sustains the metabolism of the data in the warehouse. Most of the effort in day-to-day management and maintenance of a data warehouse goes into keeping ETL running normally and stably.
The following briefly introduces the modules of the data warehouse architecture. The data warehouse described here is, of course, a website data warehouse.
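As a rough illustration of the three ETL stages, here is a minimal sketch in Python. The log format (tab-separated timestamp, visitor id, URL), the table name pageview, and the function names are all assumptions made for this example, not part of any particular warehouse product.

```python
import sqlite3

# Hypothetical raw clickstream lines: "timestamp<TAB>visitor_id<TAB>url"
RAW_LOG = [
    "2024-01-01T10:00:00\tv1\t/home",
    "2024-01-01T10:00:30\tv1\t/products?id=7",
    "bad line",
    "2024-01-01T10:05:00\tv2\t/home",
]

def extract(lines):
    """Extract: split each raw line into fields, skipping malformed lines."""
    for line in lines:
        parts = line.split("\t")
        if len(parts) == 3:
            yield parts

def transform(rows):
    """Transform: drop the query string so each URL maps to one page."""
    for ts, visitor, url in rows:
        yield ts, visitor, url.split("?", 1)[0]

def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pageview (ts TEXT, visitor_id TEXT, page TEXT)"
    )
    conn.executemany("INSERT INTO pageview VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(conn, lines):
    load(conn, transform(extract(lines)))

conn = sqlite3.connect(":memory:")  # in-memory database stands in for the warehouse
run_etl(conn, RAW_LOG)
loaded = conn.execute("SELECT visitor_id, page FROM pageview ORDER BY ts").fetchall()
```

A real ETL job would also need scheduling, logging, and error handling, which is where most of the maintenance effort mentioned above goes.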
Data sources for data warehouses
For a website data warehouse, the clickstream log is a main data source and the basic data for website analysis. The website's database data is also indispensable: it records the site's operational data and the results of users' actions, and it is the more accurate source for analyzing website outcomes. Other sources include documents generated outside the site and other kinds of data that are useful for company decisions.
Data storage for data warehouses
The source data is extracted by the daily ETL tasks, transformed, and stored in the data warehouse in a defined form. This process has long been the subject of a major controversy: does the data warehouse need to store detail data? One side holds that the data warehouse is analysis-oriented, so it only needs to store the multidimensional analysis models built for specific needs. The other side holds that the data warehouse should first establish and maintain detail data, then aggregate and process that detail data as requirements dictate to generate specific analysis models. I prefer the latter view: the data warehouse does not need to store all the raw data, but it does need to store detail data, and the imported data must be organized and transformed so that it is subject-oriented. A brief explanation follows:
(1). Why is not all the raw data needed? The data warehouse exists for analytical processing, but some source data has no value for analysis, or its potential value is far lower than the implementation and performance cost of storing it in the warehouse. For example, knowing a user's province and city is enough; the user's exact address is probably of interest only to the logistics side of the business. Likewise, users' blog comments may be needed only for text mining, and keeping lengthy comment text in the data warehouse is not worth the cost;
(2). Why must detail data be kept? Detail data is essential. The analysis requirements placed on a data warehouse change constantly, and with detail data on hand we can meet every change without changing what is stored. If we store only data models built for particular requirements, we will obviously be unprepared when requirements change frequently;
(3). Why subject-oriented? Subject orientation is the first characteristic of a data warehouse, and it mainly means organizing data sensibly so that it is easy to analyze. Source data is organized in many different ways: the clickstream log format is not optimized at all, and the front-end database is organized and optimized for OLTP operations. Neither may be suitable for analysis, whereas a subject-oriented organization is genuinely conducive to it. For example, organizing the clickstream log into three subjects, page, visit (or session), and visitor, can significantly improve the efficiency of analysis.
Building on the detail data it maintains, the data warehouse processes the data so that it can actually be used for analysis. This processing mainly covers three aspects:
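To make the page / visit / visitor organization concrete, here is a minimal Python sketch that groups clickstream records into those three subjects. The record layout, the 30-minute inactivity cutoff used to split visits, and the visit id scheme are assumptions chosen for the example; the article does not specify them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff between visits

# Hypothetical clickstream records: (timestamp, visitor_id, page)
CLICKS = [
    ("2024-01-01T10:00:00", "v1", "/home"),
    ("2024-01-01T10:05:00", "v1", "/products"),
    ("2024-01-01T11:00:00", "v1", "/home"),
    ("2024-01-01T10:02:00", "v2", "/home"),
]

def organize(clicks):
    """Group raw clicks into the three subjects: pages, visits, visitors."""
    pages = defaultdict(int)      # page -> number of pageviews
    visitors = defaultdict(list)  # visitor_id -> visit ids, in order
    visits = {}                   # visit_id -> pages viewed, in order
    last_seen = {}                # visitor_id -> (last timestamp, current visit id)
    for ts_text, visitor, page in sorted(clicks):  # ISO timestamps sort chronologically
        ts = datetime.fromisoformat(ts_text)
        pages[page] += 1
        previous = last_seen.get(visitor)
        if previous is None or ts - previous[0] > SESSION_GAP:
            # First click, or too long since the last one: start a new visit
            visit_id = f"{visitor}-{len(visitors[visitor]) + 1}"
            visitors[visitor].append(visit_id)
            visits[visit_id] = []
        else:
            visit_id = previous[1]
        visits[visit_id].append(page)
        last_seen[visitor] = (ts, visit_id)
    return dict(pages), visits, dict(visitors)

pages, visits, visitors = organize(CLICKS)
```

Once the data is held in this shape, questions about a page, a visit, or a visitor can each be answered from one structure instead of rescanning the raw log.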
Aggregation of data
Aggregated data here means simple aggregation for a specific need (aggregation across dimensions belongs to the multidimensional data model). Simple aggregates can be totals such as total pageviews, visits, and unique visitors, or averages such as Avg. time on page and Avg. time on site, all of which can be displayed directly in reports.
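The simple aggregates named above could be computed from detail rows roughly as follows. This is a sketch under assumptions: the row layout and the function name are hypothetical, and the averages are plain totals divided by counts, which is simpler than the way web analytics tools usually define time on page.

```python
# Hypothetical detail rows: (visit_id, visitor_id, page, seconds_on_page)
DETAIL = [
    ("s1", "v1", "/home", 30),
    ("s1", "v1", "/products", 90),
    ("s2", "v2", "/home", 60),
    ("s3", "v1", "/home", 20),
]

def simple_aggregates(rows):
    """Compute report-ready totals and averages from detail rows."""
    pageviews = len(rows)
    visits = len({visit for visit, _, _, _ in rows})
    unique_visitors = len({visitor for _, visitor, _, _ in rows})
    total_seconds = sum(seconds for _, _, _, seconds in rows)
    return {
        "pageviews": pageviews,
        "visits": visits,
        "unique_visitors": unique_visitors,
        # Simplified averages: total time divided by the relevant count
        "avg_time_on_page": total_seconds / pageviews if pageviews else 0.0,
        "avg_time_on_site": total_seconds / visits if visits else 0.0,
    }

summary = simple_aggregates(DETAIL)
```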
Multidimensional Data Model
A multidimensional data model supports analysis from multiple angles and at multiple levels. For example, a sales star schema or snowflake schema built on a time dimension and a region dimension supports cross-queries over time and region, as well as drill-down along each of those dimensions. Applications of the multidimensional data model are therefore generally based on online analytical processing (OLAP), and data marts for specific groups of users are also built on top of the multidimensional data model.
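A sales star schema with a time dimension and a region dimension might look like the sketch below, using Python's built-in sqlite3 module. The table names, column names, and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "angles" of analysis
CREATE TABLE dim_time   (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, country TEXT, city TEXT);
-- The fact table holds the measures, keyed by the dimensions
CREATE TABLE fact_sales (time_id INTEGER REFERENCES dim_time(time_id),
                         region_id INTEGER REFERENCES dim_region(region_id),
                         amount REAL);
INSERT INTO dim_time VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_region VALUES (1, 'CN', 'Beijing'), (2, 'CN', 'Shanghai');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 70.0), (2, 1, 30.0);
""")

# Cross-query: total sales for every (month, city) combination
CROSS_QUERY = """
SELECT t.month, r.city, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_time AS t ON t.time_id = f.time_id
JOIN dim_region AS r ON r.region_id = f.region_id
GROUP BY t.month, r.city
ORDER BY t.month, r.city
"""
by_month_city = conn.execute(CROSS_QUERY).fetchall()
```

Changing the GROUP BY columns (for example to t.year and r.country) gives a coarser or finer level of the same analysis, which is the kind of subdivision the multidimensional model is meant to support.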
Business model
The business model here means a data model built for data analysis and decision support, such as the user evaluation model I introduced before, the association-based recommendation model, and the RFM analysis model, or decision-support models such as linear programming and inventory models. The data processing required for data mining can also be done here.
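As one example of such a business model, the sketch below computes the raw RFM values (recency, frequency, monetary) per customer from order detail data. The order layout and the function name are assumptions, and the step of scoring or segmenting customers from these values is left out.

```python
from datetime import date

# Hypothetical orders: (customer_id, order_date, amount)
ORDERS = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 3, 1), 60.0),
    ("c2", date(2023, 11, 20), 200.0),
]

def rfm(orders, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    result = {}
    for customer, day, amount in orders:
        recency, frequency, monetary = result.get(customer, (None, 0, 0.0))
        days_ago = (today - day).days
        # Recency is the number of days since the most recent order
        recency = days_ago if recency is None else min(recency, days_ago)
        result[customer] = (recency, frequency + 1, monetary + amount)
    return result

scores = rfm(ORDERS, today=date(2024, 3, 31))
```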
Data application of Data Warehouse
A previous article, on the value of the data warehouse, described how that value is embodied in the four characteristics of a data warehouse. But the value of a data warehouse goes far beyond this, and it is truly realized in the data applications. The applications listed in the diagram are not exhaustive; in fact, any data-related application can be built on top of the data warehouse.
Report Presentation
Reports are an almost essential type of data application for every data warehouse. They present aggregated data and multidimensional analysis data, offering the simplest and most intuitive view of the data.
Ad hoc queries
In theory, all data in the data warehouse (detail data, aggregated data, multidimensional data, and analysis data) should be open to ad hoc queries. Ad hoc queries provide a flexible way to obtain data: users can query whatever they need and export the results to Excel and other external files.
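A minimal sketch of an ad hoc query with export is shown below. It writes the result as CSV text, which Excel can open; the helper name export_query and the sample table are assumptions for the example, and a production system would also need access control over which queries users may run.

```python
import csv
import io
import sqlite3

def export_query(conn, sql, params=()):
    """Run a query and return the result as CSV text with a header row."""
    cursor = conn.execute(sql, params)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])  # column names
    writer.writerows(cursor)                                       # data rows
    return buffer.getvalue()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageview (visitor_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO pageview VALUES (?, ?)",
    [("v1", "/home"), ("v2", "/home"), ("v1", "/cart")],
)
csv_text = export_query(
    conn,
    "SELECT page, COUNT(*) AS views FROM pageview "
    "WHERE visitor_id = ? GROUP BY page ORDER BY page",
    ("v1",),
)
```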
Data analysis
Most data analysis can be based on the business models already built. Aggregated data can also be used for trend analysis, comparative analysis, and similar work, while the multidimensional data model provides the data foundation for multidimensional analysis. Drawing sample data from the detail data for a specific analysis is also a common approach.
Data mining
Data mining applies advanced algorithms that can bring surprising results out of the data. It can be based on the business models already built in the data warehouse, but most of the time it starts directly from the detail data, and the data warehouse provides a data interface for mining tools such as SAS and SPSS.
Metadata management
Metadata is really explanatory data, that is, data about data. It mainly records the definitions of the models in the data warehouse, the mapping relationships between the layers, the monitored data state of the data warehouse, and the running state of ETL tasks. Metadata is typically stored and managed centrally in a metadata repository, whose primary purpose is to keep the design, deployment, operation, and management of the data warehouse coordinated and consistent.
To conclude: the data warehouse itself neither produces nor consumes data; it serves as an intermediate platform that integrates and stores data. The difficulty of implementing a data warehouse lies in building the overall architecture and designing the ETL, and this is also the key to its day-to-day management and maintenance. The real value of a data warehouse lies in its data applications: without effective data applications, building a data warehouse loses its meaning.
Website Data Warehouse Overall structure diagram and introduction