There are often misconceptions about the relationship between big data and cloud computing, and the two are frequently mixed up. A one-sentence, admittedly simplified explanation: cloud computing is the virtualization of hardware resources, while big data is the efficient processing of massive amounts of data.
Although that sentence is not entirely precise, it helps you understand the difference between the two. To put it more vividly, cloud computing plays a role similar to our computer plus its operating system: it virtualizes a vast pool of hardware resources and then allocates them on demand. In this field the current leader is Amazon, which can be said to have set the standard for commercial cloud computing; VMware is also worth noting (which, incidentally, helps explain the relationship between cloud computing and virtualization), and the most active open-source cloud platform is OpenStack.
Big data, by analogy, is the "database" for massive data. Looking at how the field has developed, big data processing has been moving toward an experience similar to that of traditional databases. Hadoop made it possible to build stable clusters out of ordinary machines that can process terabytes of data, bringing the once expensive idea of parallel computing within reach. But raw MapReduce is not well suited to data analysts because of the complexity of its development model, so Pig Latin and Hive appeared (from Yahoo! and Facebook respectively). It is worth adding that in the big data field, cutting-edge internet companies such as Google, Facebook and Twitter have made very positive and powerful contributions. These SQL-like layers operate in much the same way as SQL, but their processing is very slow, far from the efficiency of a traditional database. So people began asking how big data processing could keep the SQL style of operation while also reaching "SQL-like" speed: Google brought us the Dremel/PowerDrill technology, and from Cloudera (the most commercialized Hadoop company, where Doug Cutting, the father of Hadoop, leads the technical direction) Impala also appeared.
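To make the "complexity of MapReduce development" concrete, here is a minimal sketch of what even a trivial word count looks like as a pair of Hadoop Streaming scripts in Python. The file names and the Hive table mentioned afterwards are hypothetical; this is only an illustration of the programming model, not anyone's production code.

    #!/usr/bin/env python3
    # mapper.py -- read raw text from stdin, emit "word<TAB>1" for every word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py -- Hadoop sorts the mapper output by key, so all counts for a
    # word arrive together and can be summed in a single pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Assuming the text has already been loaded into a one-column table called words, the same job in Hive collapses to a single query, SELECT word, COUNT(*) FROM words GROUP BY word, which is exactly why analysts gravitated toward these SQL-like layers.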
Overall, the future trend is that cloud computing serves as the underlying computing resource, supporting the big data processing built on top of it, while big data itself is developing toward real-time, interactive query efficiency and analysis capability. To use a phrase from one of Google's technical papers, you can "operate on PB-level data at the second level with a move of the mouse." Isn't that exciting?
When talking about big data, the first thing usually mentioned is its 4V characteristics: variety (complex types), volume (massive scale), velocity (speed) and value. IBM speaks of only 3Vs, leaving out value. In fact the 4V view is the more appropriate one: value is the ultimate goal of solving any big data problem, and the other three Vs are all in service of that goal. With the 4V concept in mind, it is easy to see the core of big data in simplified form: its overall architecture consists of three layers, namely data storage, data processing and data analysis. Variety and volume are handled by the storage layer, velocity and timeliness requirements are handled by the processing layer, and value is delivered by the analysis layer.
Data must first land in the storage layer; then, according to the data requirements and targets, a data model and a data analysis index system are established to produce value. The timeliness in between is achieved by the powerful parallel and distributed computing capability provided by the data processing layer. The three layers work together so that big data ultimately produces value.
Data storage layer
There are many ways to classify data: structured, semi-structured and unstructured, or metadata, master data and business data; by content it can also be divided into GIS, video, file, voice and business transaction data. A traditional structured (relational) database cannot satisfy this diversity of data on its own, so two additional storage types are placed alongside the RDBMS: HDFS, which can be applied directly to unstructured file storage, and NoSQL databases, which are used for structured and semi-structured data storage.
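As a rough sketch of the three storage paths, the snippet below writes a raw file to HDFS, a semi-structured document to a NoSQL store, and a structured row to a relational database. The host names, paths and schemas are hypothetical, and the hdfs and pymongo packages are assumed to be installed and pointed at real services.

    import sqlite3
    from hdfs import InsecureClient       # WebHDFS client
    from pymongo import MongoClient

    # 1. Unstructured data: push a raw log file into HDFS as-is.
    hdfs_client = InsecureClient("http://namenode:9870", user="etl")
    hdfs_client.upload("/raw/logs/app.log", "app.log", overwrite=True)

    # 2. Semi-structured data: store a JSON-like event document in a NoSQL store.
    mongo = MongoClient("mongodb://localhost:27017")
    mongo.bigdata.events.insert_one({"user": 42, "action": "click", "tags": ["promo"]})

    # 3. Structured data: keep transactional records in a relational database.
    rdbms = sqlite3.connect("business.db")
    rdbms.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
    rdbms.execute("INSERT INTO orders VALUES (?, ?)", (1, 99.5))
    rdbms.commit()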
Looking at the structure of the storage layer, three kinds of storage are therefore needed: relational databases, NoSQL databases and the HDFS distributed file system. Business applications choose among them according to their actual situation, but to make storing and reading data convenient for the business, we can further encapsulate the storage layer into a unified, shared storage service layer that simplifies these operations. From the user's point of view, the underlying storage details no longer matter; all that matters is the convenience of storing and reading data. Through this shared storage service layer, business applications can be completely decoupled from the underlying storage facilities.
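A minimal sketch of that "shared storage service layer" idea follows: callers see one put/get interface, and a facade routes each request to a backend by data kind. The backends here are in-memory stand-ins purely for illustration; in a real system each would wrap the corresponding HDFS, NoSQL or RDBMS client.

    from typing import Any, Dict

    class Backend:
        def __init__(self):
            self._data: Dict[str, Any] = {}
        def put(self, key: str, value: Any) -> None:
            self._data[key] = value
        def get(self, key: str) -> Any:
            return self._data[key]

    class SharedStorageService:
        """Routes requests so business code never touches storage details."""
        def __init__(self):
            self._backends = {
                "file": Backend(),        # would wrap an HDFS client
                "document": Backend(),    # would wrap a NoSQL client
                "record": Backend(),      # would wrap an RDBMS connection
            }
        def put(self, kind: str, key: str, value: Any) -> None:
            self._backends[kind].put(key, value)
        def get(self, kind: str, key: str) -> Any:
            return self._backends[kind].get(key)

    storage = SharedStorageService()
    storage.put("document", "event:1", {"user": 42, "action": "click"})
    print(storage.get("document", "event:1"))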
Data processing layer
The key problem for the data processing layer is the timeliness requirement that arises once massive amounts of data have been stored and complex processing is needed on top of them; that is the problem this layer has to solve.
In the traditional cloud-related technology stack, Hive, Pig and the Hadoop MapReduce framework can all be assigned to the data processing layer. I had thought about putting Hive into the data analysis layer, but that is not appropriate, because Hive's focus is on splitting the complex queries that actually need to be processed and on aggregating the query results, while MapReduce itself provides the true distributed processing capability.
MapReduce only implements the distributed computing framework and logic; splitting the real analysis requirements into jobs and aggregating and consolidating their results still needs the capabilities that the Hive layer integrates. The ultimate goal, simply put, is to meet the timeliness requirement through a distributed architecture.
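The split / aggregate pattern described above can be illustrated with a toy, single-process simulation: the input is cut into splits, each split is mapped and partially aggregated as a reducer would do on its own node, and the partial results are then consolidated, which is the step a layer like Hive drives. The data and field names are made up for the example.

    from collections import defaultdict
    from itertools import chain

    orders = [("north", 10.0), ("south", 5.0), ("north", 7.5), ("east", 3.0)]
    splits = [orders[i::2] for i in range(2)]          # pretend these live on 2 nodes

    def map_phase(split):
        # emit (key, value) pairs, like a MapReduce mapper
        return [(region, amount) for region, amount in split]

    def reduce_phase(pairs):
        # aggregate values per key, like a MapReduce reducer
        totals = defaultdict(float)
        for region, amount in pairs:
            totals[region] += amount
        return dict(totals)

    # each "node" maps and partially reduces its own split ...
    partials = [reduce_phase(map_phase(s)) for s in splits]

    # ... then the partial results are merged into the final answer
    final = reduce_phase(chain.from_iterable(p.items() for p in partials))
    print(final)   # e.g. {'north': 17.5, 'south': 5.0, 'east': 3.0}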
Data analysis layer
Finally, the analysis layer is where the real value of big data is mined, and the core of that value mining is data analysis and mining. The core of the data analysis layer is still the content of traditional BI analysis, including dimensional analysis, data slicing, drill-down and roll-up, cubes, and so on.
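A small pandas sketch of these classic BI operations, on a hypothetical sales table with region, product, month and amount columns, might look like this:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["north", "north", "south", "south"],
        "product": ["A", "B", "A", "B"],
        "month":   ["2024-01", "2024-01", "2024-02", "2024-02"],
        "amount":  [100.0, 80.0, 120.0, 60.0],
    })

    # slicing: fix one dimension and look at the rest
    north_only = sales[sales["region"] == "north"]

    # roll-up: aggregate away the product dimension
    by_region = sales.groupby("region")["amount"].sum()

    # drill-down: break a region total back out by product
    by_region_product = sales.groupby(["region", "product"])["amount"].sum()

    # a simple cube-style cross tabulation of region x month with totals
    cube = pd.pivot_table(sales, values="amount", index="region",
                          columns="month", aggfunc="sum", margins=True)
    print(cube)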
In data analysis I focus on only two things. The first is traditional data warehouse data modeling: the data model has to support the analysis methods and strategies described above. The second is establishing a KPI index system based on business objectives and business needs, together with the analysis models and methods that correspond to those indicators. Solving these two problems basically solves the problem of data analysis.
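As a minimal sketch of what a KPI index system can look like in practice, each indicator gets a name and a computation over the modeled data. The metrics and the orders table below are hypothetical, purely to show the structure of such a registry.

    import pandas as pd

    orders = pd.DataFrame({
        "customer": [1, 1, 2, 3],
        "amount":   [50.0, 20.0, 80.0, 30.0],
        "returned": [False, True, False, False],
    })

    kpi_registry = {
        "total_revenue":       lambda df: df.loc[~df["returned"], "amount"].sum(),
        "average_order_value": lambda df: df["amount"].mean(),
        "return_rate":         lambda df: df["returned"].mean(),
        "active_customers":    lambda df: df["customer"].nunique(),
    }

    kpis = {name: fn(orders) for name, fn in kpi_registry.items()}
    print(kpis)   # every KPI is computed from the same modeled data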
Traditional BI analysis extracts and centralizes data through a large amount of ETL work to form a complete data warehouse. BI analysis based on big data may not have a centralized data warehouse at all, or the data warehouse itself may be distributed. The basic methods and ideas of BI analysis have not changed, but the data storage and processing methods used to actually execute it have changed a great deal.
Having said all this, the core point is that big data rests on two pillars: cloud technology and BI. Without cloud technology, big data has no foundation and no way to land; without BI and value, big data degenerates into trifles and abandons its key goal. The simple conclusion is that the goal driving big data is BI, and the technology that lets big data land is cloud technology.