Organizations grappling with the "big data" concept face a choice: build on the traditional data warehouse concept and their existing data warehouse architecture, adopt the increasingly popular open source Hadoop distributed processing platform, or combine the two.
The third option seems most plausible for businesses that want to move beyond simple BI reporting toward deep data mining and predictive analytics. James Kobielus, a senior analyst for data management at Forrester Research, recently spoke with us about how businesses can gain valuable insight from rapidly changing masses of data. In this article, you will learn how to get the most out of an existing data warehouse architecture, where Hadoop's strengths and weaknesses lie, and what data warehouse vendors are doing in the big data era.
I've seen a few different definitions of big data. How does Forrester understand this popular concept?
James Kobielus: Big data is really the concept of analytics at extreme scale; the term "extreme-scale analytics" captures, for me, the core of what people call big data. To an extent it can be summed up in three Vs: volume, the amount of data, which may run to terabytes, petabytes, or beyond; velocity, the speed of the data flow, with data acquired, transformed, queried, and accessed in real time; and variety, the range of data types, spanning structured, unstructured, and semi-structured data. On the analytics side, it covers any data set that can be mined for meaning.
How should an enterprise think about the data warehouse in order to make sense of big data?
Kobielus: I think the data warehouse can help an enterprise deal with the data problem in three ways. First, in an enterprise data warehouse you divide your data by subject area, and these subject areas tend to be stable, changing little over long periods; think of the OLAP cubes in a data warehouse architecture, whether they are partitioned physically or logically. In other words, your customer data sits in one partition, financial data in another, HR data in a third, and so on. The advantage is that it lets you match downstream applications and users to the data relevant to them. This is the core of data warehouse database management, and also the most important way a data warehouse helps you cope with big data.
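To make the subject-area idea concrete, here is a minimal sketch in Python, using the built-in sqlite3 module as a stand-in for a real warehouse; the table names, columns, and routing function are illustrative assumptions, not any vendor's schema.

```python
import sqlite3

# Stand-in warehouse: one table per subject area.
# Table names and columns are assumptions for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT, segment TEXT);
    CREATE TABLE finance  (id INTEGER, account TEXT, balance REAL);
    CREATE TABLE hr       (id INTEGER, employee TEXT, dept TEXT);
""")

# Route each incoming record to its subject-area partition, so that
# downstream users query only the area relevant to them.
def load(subject, row):
    placeholders = ",".join("?" * len(row))
    conn.execute(f"INSERT INTO {subject} VALUES ({placeholders})", row)

load("customer", (1, "Acme Corp", "enterprise"))
load("finance",  (1, "AR-1001", 25000.0))
conn.commit()

# A downstream sales application touches only the customer area.
print(conn.execute("SELECT name, segment FROM customer").fetchall())
```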
So what's the second way?
Kobielus: The second way is in-database analytics: using the data warehouse itself to perform data transformation, data cleansing, data mining, or regression analysis. In other words, the full data mining workflow runs inside the data warehouse. This helps you deal with the data because mining and regression analysis let you understand the fundamental patterns in a data set. Data mining and statistical modeling professionals can then use in-database data mining to populate downstream analytic data marts that visualize complex patterns; for example, they might use those patterns to identify likely high-value customers and target them for sales. With in-database analytics and techniques such as MapReduce, you can automate data mining within a highly concurrent, highly scalable database architecture.
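As a toy illustration of in-database analytics, the sketch below computes a simple least-squares regression entirely with SQL aggregates, so no raw rows ever leave the database; in a real warehouse the same query would be parallelized across the MPP grid. The sales table and its columns are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)])

# Ship the computation to the data: one aggregate query returns the
# sufficient statistics for a least-squares fit.
n, sx, sy, sxy, sxx = conn.execute("""
    SELECT COUNT(*), SUM(ad_spend), SUM(revenue),
           SUM(ad_spend * revenue), SUM(ad_spend * ad_spend)
    FROM sales
""").fetchone()

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(f"revenue ~ {slope:.2f} * ad_spend + {intercept:.2f}")
```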
How widely is in-database analytics being used today? Will every business adopt it?
Kobielus: Not everyone uses in-database analytics yet, but we see more and more enterprises taking a strong interest in it. If you do data mining at scale, in-database analytics is considered a best practice. As we all know, most production data warehouses today are oriented toward operational business intelligence; they mainly produce reports and execute ad hoc queries, and rarely do data mining. As data volumes grow, however, the need for data mining becomes more pressing and the value of in-database analytics becomes apparent. The goal of this technology is to speed up and scale out your data mining projects while keeping all mining consistent across the data warehouse, based on a common set of reference data.
What is the third best practice?
Kobielus: The third is to make the data warehouse the core of data governance, so that master data can be properly maintained there. When your data warehouse is the hub for data governance and data cleansing, it helps you keep all of that information straight. Across an enterprise architecture, hundreds of applications may feed data into the warehouse. The data flows in like a flood, in real time, and the data warehouse is the hub that ensures large data sets are consumed downstream reliably and appropriately.
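A hedged sketch of what "warehouse as cleansing hub" can mean in practice: inbound feeds are validated and standardized against a common set of reference data before anything flows downstream. The reference values and rules here are invented for illustration.

```python
# Illustrative cleansing hub: every feed is checked against shared
# reference data before reaching downstream consumers.
# The reference set and rules are assumptions for this sketch.
VALID_COUNTRIES = {"US", "DE", "CN", "JP"}   # shared reference data

def cleanse(record):
    """Standardize one inbound record; return None to reject it."""
    name = record.get("name", "").strip().title()
    country = record.get("country", "").strip().upper()
    if not name or country not in VALID_COUNTRIES:
        return None                           # quarantine bad data
    return {"name": name, "country": country}

feed = [{"name": "  acme corp ", "country": "us"},
        {"name": "Globex",       "country": "XX"}]   # XX is invalid

clean = [r for r in (cleanse(rec) for rec in feed) if r is not None]
print(clean)   # only the standardized Acme record survives
```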
With big data everywhere today, what have the traditional data warehouse vendors been doing?
Kobielus: Teradata, Oracle Exadata, IBM Netezza, HP Vertica, and others are all working on big data. Many data warehouse vendors can scale their products to the petabyte level using grid or cloud architectures, and a good portion of analytics can now be done in the database, that is, executed inside a massively parallel data warehouse grid or cloud environment. They can also support data transformation and data cleansing within the enterprise data warehouse.
Judging by most media coverage today, Hadoop seems to be the best way to tackle big data challenges. What do you think?
Kobielus: If you want to handle big data, you need a combination of an enterprise data warehouse and Hadoop. I disagree with those who see Hadoop as the only lifeline for big data problems; in fact, today's enterprise data warehouses can do essentially everything Hadoop can. Hadoop's advantage over a traditional enterprise data warehouse system is that it is open source and free, but enterprise users should not overlook the many hidden maintenance costs that come with open source Hadoop. That said, Hadoop will be the biggest force driving next-generation enterprise data warehouse development over the next five to ten years.
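For readers who want a feel for Hadoop's programming model, here is a minimal word count written for Hadoop Streaming, which lets any program that reads stdin and writes stdout serve as mapper and reducer; it is a sketch of the MapReduce style, not a tuned production job.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so counts for one word are contiguous and can be summed in a stream.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

The pair can be tested locally with a shell pipeline (cat input.txt | python3 mapper.py | sort | python3 reducer.py) before being submitted to a cluster via the standard hadoop-streaming jar; the sort step mimics the shuffle phase that Hadoop performs between map and reduce.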