The previous article, "Some Questions About Website Data Analysis 2", mainly collected BI-related questions; this one collects questions about data warehouses. I have recently been rereading data warehouse material and books, and I want to set what they say against the problems I am running into now (for other data warehouse posts on this blog, see the Website Data Warehouse category). Doing so also helps me reorganize and consolidate my own understanding of data warehousing. Besides, I have not posted anything new for a long time and should not let myself get too lazy.
I have read Inmon's "Building the Data Warehouse" and "DW 2.0". I never found time for "The Data Warehouse Lifecycle Toolkit" by Kimball, the other data warehouse master, until recently; having now read most of it, I could not wait to write something down. The field generally treats Inmon's and Kimball's theories as opposed: the difference in their approaches to building a data warehouse has long been debated, and neither side has convinced the other that its method is best. Below is a summary of the two views from my Evernote notes (I no longer remember when I clipped it). It is concise and clear:
Inmon vs Kimball
Kimball: Let everybody build what they want when they want it; we'll integrate it all when and if we need to. (Bottom-up approach)
Pros: fast to build, quick ROI, nimble
Cons: harder to maintain as an enterprise resource, often redundant, often difficult to integrate data marts
Inmon: Don't do anything until you've designed everything. (Top-down approach)
Pros: easy to maintain, tightly integrated
Cons: takes way too long to deliver first projects, rigid
After actually reading "The Data Warehouse Lifecycle Toolkit", I found that the two views do not differ that substantially; perhaps as data warehousing has developed, the two have slowly converged on the overall framework. Both aim at a unified enterprise data warehouse. Inmon leans toward starting from the integration of the underlying data, while Kimball leans toward starting from top-level requirements, which may reflect the projects each worked on and the position each held.
With such a good summary in hand, the first question (which approach do you prefer for building a data warehouse, bottom-up or top-down, and what are the advantages and disadvantages of each?) no longer needs asking.
Q1. What technical solutions are there for data warehouses, and where are their advantages and bottlenecks?
As data warehousing has developed and matured and the concept of "big data" has become popular, more and more related products have appeared. The most common technical solutions include Hadoop with Hive, Oracle, MySQL with Infobright, Greenplum, and NoSQL stores, or a combination of several.
These fall into two broad categories. The first manages data with a traditional RDBMS: Oracle, MySQL and the like are traditional relational databases. Their advantages are a more rigorous data structure and more standardized data management, so errors not caused by people are unlikely during processing; the standard SQL interface also lowers the cost of getting data and makes queries more flexible and efficient. The drawback is equally obvious: the capacity to process and store massive data is limited, and a clear bottleneck appears once the data volume reaches a certain level.
The second category is the distributed processing engine. As I group them, Hadoop, Greenplum and NoSQL stores all process and store data as text. Their advantages are strong data processing power, a distributed architecture that supports parallel computation, and very good scalability. The drawback is that the upper-level interface is inconvenient, which is why Hive on top of Hadoop and the PostgreSQL layer of Greenplum exist: both are there to solve the data interface problem. In addition, queries are hard to answer in real time and lack flexibility.
Q2. Should a data warehouse store only aggregated data and leave detail data out?
This question has basically reached consensus: if you are building an enterprise-level data warehouse, integrating and storing detail data is essential. In practice, though, there are many cases where data is aggregated straight from external sources and only the results are loaded into the warehouse. If the warehouse only serves lightweight applications, storing only aggregates is understandable; after all, nobody dictates what a data warehouse must look like, and the ultimate goal is to meet the need for data support.
For an enterprise's long-term development, however, storing detail data in the warehouse has advantages on two fronts. On the technical side, keeping detail data in the warehouse relieves query pressure on the front-end databases and brings text data and external documents under more standardized management; because a data warehouse keeps history and does not change it, information is not lost. On the usage side, the warehouse makes data easier to get and use: integrated detail data makes large volumes of text data queryable and joinable, subject-oriented design gives the presentation and analysis of data more direction and purpose, and detail data is necessary to support data analysis and data mining. So if the data warehouse is to keep generating more value, storing detail data is essential.
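To make the point about detail data concrete, here is a minimal Python sketch. The page-view records and the function names are made up for illustration and are not from any real system. It shows that daily page views and daily visitors can both be computed from detail rows, while a warehouse that kept only the daily page-view totals could not answer the visitor question later.

```python
from collections import defaultdict

# Hypothetical detail-level page-view records: (date, page, user_id).
detail_rows = [
    ("2013-01-01", "/home", "u1"),
    ("2013-01-01", "/home", "u2"),
    ("2013-01-01", "/cart", "u1"),
    ("2013-01-02", "/home", "u1"),
]


def daily_page_views(rows):
    """Aggregate detail rows into the number of page views per day."""
    counts = defaultdict(int)
    for date, _page, _user in rows:
        counts[date] += 1
    return dict(counts)


def daily_visitors(rows):
    """Count distinct users per day; this needs user_id, so it needs detail rows."""
    users = defaultdict(set)
    for date, _page, user in rows:
        users[date].add(user)
    return {date: len(ids) for date, ids in users.items()}


page_views = daily_page_views(detail_rows)
visitors = daily_visitors(detail_rows)
print(page_views)
print(visitors)
```

If only `page_views` had been loaded into the warehouse, `visitors` could not be derived from it, because summing is a one-way operation.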
Q3. How many layers would you divide a data warehouse into, and what is the role of the data in each layer?
There is no standard answer. Depending on the complexity of the data in the warehouse and how heavily the data needs to be used, a data warehouse can be layered in different ways.
I usually draw a data warehouse as three layers. The bottom layer is detail data, and its management strategy is storage optimization. It generally holds the imported raw data so that statistical summaries can be rolled up from it, and because the volume is large, storage needs optimizing. The middle layer is the multidimensional model, and its management strategy is optimizing structure and queries. The model is organized by subject and has to serve varied OLAP and query needs while keeping queries convenient. The key is the design of the dimension tables and fact tables and how dimensions and facts are selected and combined; the fact tables need particular attention to storage and index optimization. The top layer is presentation data, and its management strategy is optimizing efficiency. It usually stores the summary reports shown every day, or views assembled from the multidimensional model. This data must be displayed as fast as possible and is generally used for dashboards and reports on the BI platform.
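The three layers can be sketched with Python's built-in sqlite3 module. This is only a toy under stated assumptions: the table names (detail_orders, dim_product, fact_sales, report_daily_category), the columns and the sample rows are all invented for illustration, and a real warehouse would use a different engine and far more careful storage design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layer 1: detail data, the raw imported records (hypothetical schema).
CREATE TABLE detail_orders (
    order_id     INTEGER PRIMARY KEY,
    order_date   TEXT,
    product_name TEXT,
    category     TEXT,
    amount       REAL
);
INSERT INTO detail_orders VALUES
    (1, '2013-01-01', 'pen',      'stationery', 2.0),
    (2, '2013-01-01', 'notebook', 'stationery', 5.0),
    (3, '2013-01-02', 'mouse',    'hardware',   20.0);

-- Layer 2: multidimensional model, one dimension table and one fact table.
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT UNIQUE,
    category     TEXT
);
INSERT INTO dim_product (product_name, category)
    SELECT DISTINCT product_name, category
    FROM detail_orders
    ORDER BY product_name;

CREATE TABLE fact_sales (
    order_date TEXT,
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO fact_sales
    SELECT d.order_date, p.product_id, d.amount
    FROM detail_orders d
    JOIN dim_product p USING (product_name);
-- The fact table is where index optimization matters most.
CREATE INDEX idx_fact_sales_date ON fact_sales(order_date);

-- Layer 3: presentation, a precomputed summary for dashboards and reports.
CREATE TABLE report_daily_category AS
    SELECT f.order_date, p.category, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    GROUP BY f.order_date, p.category;
""")

# The BI layer reads the small precomputed table instead of scanning detail.
report = conn.execute(
    "SELECT order_date, category, total_amount "
    "FROM report_daily_category ORDER BY order_date, category"
).fetchall()
print(report)
```

Each layer is derived from the one below it, so the presentation table can always be rebuilt, while the detail table stays the source of record.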
Q4. What is the most complicated part of building a data warehouse, and what is most easily overlooked?
I have always felt that the core of a data warehouse is not data integration. Data integration is of course the precondition for a warehouse to deliver value, but the real value lies in applying the data effectively: data comes from the business and feeds back into the business. The core of building a data warehouse is its architecture and data model design, and the hard part of managing one is weighing data storage against data efficiency. Every data warehouse faces this difficulty, and big data makes the trade-off harder. Data integration and data quality control are the most complicated work in building a data warehouse, data cleansing in particular. I have written several articles on data quality control, but the real process is far more complex. Still, the accuracy and validity of the data delivered to the upper layers depend on it, so the work has to be done, and done as thoroughly as possible.
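As a small illustration of what data quality control involves at its very simplest, the Python sketch below checks a batch of rows for missing values, out-of-range values and duplicate keys. The function name `check_quality`, its parameters and the sample orders are hypothetical; real cleansing rules are far more numerous and business-specific.

```python
def check_quality(rows, key, required, valid_ranges):
    """Return (row_index, problem) tuples for a list of dict rows.

    key          -- field that should be unique across rows
    required     -- fields that must not be None or empty
    valid_ranges -- {field: (low, high)} inclusive bounds for numeric fields
    """
    problems = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Completeness: required fields must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        # Validity: numeric fields must fall inside their allowed range.
        for field, (low, high) in valid_ranges.items():
            value = row.get(field)
            if isinstance(value, (int, float)) and not (low <= value <= high):
                problems.append((i, f"{field} out of range"))
        # Uniqueness: the key must not repeat.
        if row.get(key) in seen_keys:
            problems.append((i, f"duplicate {key}"))
        seen_keys.add(row.get(key))
    return problems


# Hypothetical sample batch with one bad amount, one empty user and one repeated id.
orders = [
    {"order_id": 1, "amount": 10.0, "user_id": "u1"},
    {"order_id": 2, "amount": -5.0, "user_id": "u2"},
    {"order_id": 2, "amount": 7.0, "user_id": ""},
]
issues = check_quality(
    orders, key="order_id", required=["user_id"], valid_ranges={"amount": (0, 10000)}
)
print(issues)
```

Reporting problems rather than silently dropping rows keeps the decision about how to fix each one with the people who know the business.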
The thing most easily overlooked when building a data warehouse is metadata management. Few data warehouse teams have complete metadata; of course, the engineers who built the warehouse are living metadata themselves, but metadata is indispensable both for the people who need the data and for the data warehouse team. For those who need data, metadata serves as complete documentation for using the warehouse and helps them get data on their own. For the team, it frees members from explaining the data day after day; whether for later iteration and maintenance or for training new staff, it pays off. Metadata makes both the use and the maintenance of a data warehouse more efficient.
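Metadata does not have to start as a heavyweight system. The Python sketch below keeps a tiny catalog of table and column descriptions that data users could search by themselves. The classes `ColumnMeta` and `TableMeta`, the `catalog` dictionary and the sample entry are all hypothetical, meant only to show the kind of information worth recording.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    dtype: str
    description: str


@dataclass
class TableMeta:
    name: str
    layer: str    # e.g. "detail", "model" or "presentation"
    source: str   # where the data is loaded from
    owner: str    # who to ask about it
    columns: list = field(default_factory=list)

    def describe(self):
        """Render a plain-text description that data users can read on their own."""
        lines = [f"{self.name} ({self.layer} layer, source: {self.source}, owner: {self.owner})"]
        for col in self.columns:
            lines.append(f"  {col.name}: {col.dtype} - {col.description}")
        return "\n".join(lines)


catalog = {}


def register(table):
    """Add or replace a table's metadata in the catalog."""
    catalog[table.name] = table


def find_tables_with_column(column_name):
    """Return the names of all cataloged tables that contain the given column."""
    return sorted(
        t.name for t in catalog.values()
        if any(c.name == column_name for c in t.columns)
    )


# Hypothetical entry for a fact table.
register(TableMeta(
    name="fact_sales",
    layer="model",
    source="detail_orders",
    owner="dw-team",
    columns=[
        ColumnMeta("order_date", "TEXT", "date the order was placed"),
        ColumnMeta("amount", "REAL", "order amount in the local currency"),
    ],
))
print(catalog["fact_sales"].describe())
print(find_tables_with_column("amount"))
```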
A final note: the above is only my personal view. Criticism is welcome, and I hope experts will offer better answers in the comments; views and discussion from any angle are welcome, as a brainstorm.
Some Questions About Website Data Analysis 1
Some Questions About Website Data Analysis 2