Analysis and discussion on the basic structure of website Data Warehouse

Source: Internet
Author: User
Keywords Warehouse you can


Intermediary transaction SEO diagnosis Taobao guest Cloud host technology Hall



The purpose of data warehouse is to build an integrated data environment oriented to analysis, and to provide decision support for enterprises (Decision Support). In fact, the data warehouse itself does not "produce" any data, at the same time does not need to "consume" any data, data from the outside, and open to external applications, which is why called "Warehouse", not called "factory" reasons. Therefore, the basic structure of data warehouse mainly contains the process of data inflow and outflow, which can be divided into three layers--source data, data Warehouse, data application:









It can be seen from the figure that data warehouse data from different source data, and provide a variety of data applications, data from top to bottom into the data warehouse and open to the upper level, and the Data Warehouse is only a platform for intermediate integrated data management.



Data Warehouse from various data sources and data in the data Warehouse conversion and flow can be considered to be ETL (extraction extra, transformation transmits, load load) process, ETL is the Data Warehouse pipeline, can also be considered as Data Warehouse blood, It sustains the data in the Data Warehouse metabolism, and data warehouse day-to-day management and maintenance work of most of the effort is to maintain the normal and stable ETL.



The following is a brief introduction to the Data Warehouse architecture in the various modules, of course, the Data Warehouse described here is mainly refers to the Web site Data Warehouse.



Data sources for the Data Warehouse



In fact, a previous article has introduced the data warehouse of various types of source data-Data Warehouse source data type, so this is no longer described in detail.



For Web site Data Warehouse, click Stream Log is a major source of data, it is the basis of the analysis of the site data, of course, the site's database data is not small, its record of the operation of the site data and the results of various user operations, for analysis of the site outcome such data more accurate; Others are documents that may be generated outside the site and other types of data that are useful to company decisions.



Data storage in a data warehouse



The source data is exported through the daily task scheduling of ETL, and is converted into the Data warehouse in the form of characteristics. In fact, this process has been a lot of controversy, that is, the data warehouse need not to store the details of the data, one side of the view of the data Warehouse-oriented analysis, so long as the storage of specific needs of the multidimensional Analysis model; The specific analysis model is then generated based on the requirements aggregation and processing detail data. I prefer the latter view: The Data Warehouse does not need to store all the raw data, but the data warehouse needs to store the detail data, and the imported data must be sorted and transformed to make it subject-oriented. Simply explained:



(1). Why don't you need all the raw data? The Data Warehouse is oriented to analytical processing, but some source data is not valuable for analysis or its potential value is much lower than the implementation and performance costs of the data warehouse needed to store the data. For example, we know that the user's province, the city is enough, as to where the user lives may only be concerned about the logistics business, or users in the blog comments may be only text mining will be necessary, but the lengthy comments on the text of the data warehouse will not outweigh the gains;



(2). Why do I need to save details? Detail data is required, Data Warehouse analysis requirements will change at all times, and with the details of the data can be quo, but if we only store the data model based on certain requirements, then obviously for frequent changes in demand will be overwhelmed;



(3). Why should we face the theme? topic-oriented is the first characteristic of data Warehouse, which mainly refers to the rational organization of data to achieve analysis. For the source data, its data organization is diverse, like the click Stream data format is not optimized, the foreground database data is based on the OLTP Operation organization optimization, these may not be suitable for analysis, and organized into a theme-oriented organizational form is really conducive to analysis, such as the click Stream Log into the page ( Page, Access (visit or session), User (Visitor) three themes, which can significantly improve the efficiency of the analysis.



The Data warehouse is based on the maintenance details data processing, so that it can be truly applied to analysis. Mainly includes three aspects:



Aggregation of data



Aggregated data here refers to a simple aggregation based on specific requirements (multidimensional data aggregation is embodied in the multidimensional data model), and simple aggregation can be aggregated data such as the total pageviews, Visits, Unique visitors of the site, or Avg. Time on the average data such as page, Avg. time on site, which can be directly displayed on the report.



Multidimensional Data Model



Multidimensional data model provides multi-angle and multi-level analysis application, such as sales star model and snowflake model based on time dimension and regional dimension, and can realize cross query in each time dimension and region dimension, and subdivision based on time and region dimension. Therefore, the application of multidimensional data model is based on On-line Analytical processing (online analytical process, OLAP), and the data mart oriented to the specific demand group will be built on the basis of multidimensional data model.



Business model



The business model here refers to data models built on some data analysis and decision support, for example, I have introduced the user evaluation Model, association recommendation model, RFM analysis model, or decision support of the linear programming model, inventory model, etc. at the same time, data mining in the early data processing can also be completed here.



Data application of Data Warehouse



Previous article--the value of Data Warehouse introduces the value embodiment of the four characteristics of data warehouse, but the value of data warehouse is much more than that, and its value is reflected in the data application of Data Warehouse. Several of the applications listed in the diagram do not contain all the data-related extensibility applications can be implemented on a data warehouse basis.



Report Show



Reports are almost an essential type of data application for each data warehouse, presenting aggregated data and multidimensional analysis data to the report, providing the simplest and most intuitive data.



Ad hoc query



Theoretically all data in the Data warehouse (including detail data, aggregated data, multidimensional data and analysis data) should be open to ad hoc query, Ad hoc query provides a flexible enough data access, users can query according to their own needs to data, and provide export to Excel and other external files function.



Data analysis



Most of data analysis can be based on the construction of business model, of course, can also use aggregated data for trend analysis, comparative analysis, correlation, and so on, and multidimensional data model provides a multidimensional analysis of the data base; At the same time from the detailed data to obtain some sample data for specific analysis is a more common way



Data mining



Data mining uses advanced algorithms that allow data to show surprising results. Data mining can be based on the business model already built in data Warehouse, but most of the time data mining will start with the detail data, and the Data Warehouse provides data interface for mining tools such as SAS and SPSS.



Meta-data management



Metadata (meta Date), in fact, should be called explanatory data, that is, data. It mainly records the definition of the model in the Data Warehouse, the mapping relationship among each level, the data state of the Data Warehouse and the task running status of the ETL. Metadata repositories (Metadata Repository) are typically used to store and manage meta data uniformly, with the primary purpose of achieving synergy and consistency in the design, deployment, operation, and management of the Data Warehouse.



Finally do a ending, the data warehouse itself neither production data nor consumption data, but as an intermediate platform to store data integrated, the difficulty in the implementation of the data warehouse is the overall architecture and ETL design, which is the day-to-day management and maintenance of the whole; The real value of data Warehouse is based on its data application, if there is no effective data application will lose the significance of building data warehouse.



» This article uses the»in agreement, reprint please indicate the Source: Website Data Analysis» "The basic structure of Data Warehouse"


Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.