There are two classic ways to store data: databases and file systems, with object storage developed later; together they cover both structured and unstructured data. Databases were originally designed to store and share structured data, while file systems store and share large files and unstructured data such as pictures, documents, audio, and video. As data volumes grew, stand-alone storage could no longer meet the needs of either structured or unstructured data, so in the cloud computing era distributed storage and distributed database solutions emerged.
1. File systems and object storage
For unstructured data storage, several families of systems have emerged: distributed file systems (e.g. Lustre), distributed object storage systems (e.g. Ceph, S3), and peer-to-peer storage systems (e.g. OceanStore).
(1) Distributed file systems such as Lustre are widely used in the HPC field. They place relatively high demands on the underlying hardware, typically high-end storage such as SANs and disk arrays, so their IO performance is high but so is the cost. Such storage products are generally used in banking, securities, petroleum, aerospace, and similar fields.
(2) Object storage such as Ceph/S3 is currently a very popular way to store unstructured data. It is an approach adapted to the cloud computing wave: users no longer access storage through a POSIX file system interface but through REST and similar cloud data interfaces, and they do not need to maintain or manage any storage devices. The cloud provider offers SLAs to guarantee IO performance, data reliability, and availability, which is why this approach is popular with users. In the current wave of open Internet platforms, APIs of all kinds are the foundation on which users build applications, and the services behind those APIs run on cloud providers; cloud storage is among the most important of these cloud services. The cloud computing world promotes separating data from applications: Amazon, for example, recommends that EC2 users store data in S3 rather than on the local disks of their EC2 instances. Recently Shengda Cloud, the domestic provider most similar to Amazon, lost user data; on closer inspection, the data lost was on users' cloud hosts (similar to EC2 instances), which have no backup mechanism, so a disk failure on a cloud host leads directly to data loss. Judging from the websites of AWS (Amazon Web Services) and GCE (Google Compute Engine), neither of these top providers offers a local disk backup mechanism for cloud hosts either. We can understand it this way: in this computing model, the cloud host (EC2 instance) is equivalent to a computer's CPU and memory and has no persistence, while cloud storage (S3) provides durable storage and is equivalent to the computer's hard disk. You can therefore think of the whole cloud as a new computer architecture.
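To make the interface difference concrete, here is a minimal sketch of writing and reading an object through S3's REST-style API using the AWS SDK for Java; the region, bucket name, object key, and local file path are all made up for illustration.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

public class S3PutExample {
    public static void main(String[] args) {
        // Credentials come from the default provider chain (env vars, ~/.aws/credentials, ...).
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")              // hypothetical region
                .build();
        String bucket = "my-example-bucket";          // hypothetical bucket name

        // Write: under the hood this is an HTTP PUT against the REST API, not a POSIX write().
        s3.putObject(bucket, "reports/2013-01.csv", new File("/tmp/2013-01.csv"));

        // Read it back: an HTTP GET against the same key.
        String content = s3.getObjectAsString(bucket, "reports/2013-01.csv");
        System.out.println(content.length());
    }
}

The point is that the application only sees buckets, keys, and HTTP verbs; the provider handles durability and availability behind the SLA.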
(3) Peer-to-peer storage such as OceanStore is most mature in applications that share audio and video files; typical examples include eMule and Maze. Audio and video files are generally large, and this model effectively stores them in a huge WAN/LAN storage pool so that users can download from nearby peers, while each user in turn acts as a server providing download sources for others. This avoids downloading everything from a central server and relieves the bandwidth bottleneck. Such storage generally does not guarantee latency: downloading a movie may take a few minutes, and the user does not care much about peaks and troughs in the transfer rate along the way, only that the total bandwidth is high enough, since by "file size / bandwidth = transmission time" the goal is simply to minimize the transmission time. But I think that as cloud storage spreads, the major cloud providers are building storage and compute data centers across the country, dispersing bandwidth pressure away from a main data center and in effect forming a very coarse-grained peer-to-peer storage network. Combined with the development of CDN technology and continuously improving network bandwidth, the growth of the peer-to-peer model may be constrained.
2. Databases, data warehouses, and big data
Having covered file systems, it is time to talk about databases. RDBMSs were originally designed to store relational data, that is, data that strictly satisfies the various normal forms. But data in the real world is not so strictly normalized; in particular, more and more data generated by machines and by human social activity (web browsing logs, social network data, medical diagnostic data, traffic data, financial transaction data, e-commerce transaction data, and so on) is semi-structured or unstructured. This data also needs to be stored and analyzed, and it often contains great business value. Analyzing the relational data in a traditional RDBMS is the job of the data warehouse (DW), whose objects of analysis are all kinds of relational data. Analyzing unstructured data is not as simple: besides the usual DW functionality, analyzing this "big data" also calls for machine learning such as regression, clustering, classification, and association analysis, so an analysis platform in the big data era is more than a data warehouse.
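As an illustration of why such analysis goes beyond DW-style SQL aggregation, here is a minimal, self-contained sketch of k-means clustering on made-up 1-D data; unlike a typical reporting query, it has to sweep over the full data set repeatedly, which is exactly the iterative, compute-heavy pattern a big-data analysis platform must support.

import java.util.Arrays;

// Minimal 1-D k-means on toy data; the points and initial centers are invented.
public class KMeans1D {
    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 8.0, 8.3, 7.9, 15.0, 15.5};
        double[] centers = {0.0, 5.0, 20.0};
        int[] assign = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: each point goes to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(points[i] - centers[c]) < Math.abs(points[i] - centers[best])) {
                        best = c;
                    }
                }
                assign[i] = best;
            }
            // Update step: each center moves to the mean of its assigned points.
            for (int c = 0; c < centers.length; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assign[i] == c) { sum += points[i]; count++; }
                }
                if (count > 0) centers[c] = sum / count;
            }
        }
        System.out.println(Arrays.toString(centers));
    }
}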
The demand for storing and analyzing structured and unstructured data gave birth to NoSQL databases. Looking at NoSQL databases, we find that both they and RDBMSs face two kinds of requirements: OLTP and OLAP. Admittedly the two terms are not entirely appropriate, since most NoSQL databases do not provide the transactions implied by the "T" in OLTP; the words are only a figurative description and, strictly speaking, inaccurate. The common interpretation of the two requirements is: users' online storage and access of data, and offline analytical access of data by applications. The former is mainly CRUD operations on data, where users accessing data online care most about access latency while also wanting throughput to be as high as possible. The latter is mainly write-once, read-many access, where analytical applications do not care much about latency, only about throughput. We can map these two typical workloads onto the earlier distinction between the database (DB) and the data warehouse (DW). At the same time, in the "big data" era analytical applications are no longer confined to the DW; machine learning applications, represented by clustering, classification, and association analysis, are also an important requirement.
When it comes to storing and processing data in the "big data" era, that is, the Hadoop ecosystem, most people think of it as a substitute for the data warehouse DW in the big data field, and many do use Hadoop for analytical applications represented by data mining and machine learning. But I feel you can still see the database DB in this ecosystem. The following diagram shows the main components of the Hadoop ecosystem, and architecturally it is oriented toward analytical applications. Yet HBase is already used at some large companies to store real-time online data (Facebook's unified messaging system and Apple's iCloud). The main problems with HBase as an online data storage and access database are: the HA of the underlying HDFS is not yet stable; there is no mechanism to guarantee data integrity (such as transactions or something similar); and there is no unified access interface (like SQL). I believe that once these problems are solved, NoSQL databases represented by HBase will go further in online data storage and access.
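As a sketch of what "online data storage and access" looks like in HBase, here is a minimal example using the HBase Java client; the "messages" table, the "m" column family, and the row key scheme are invented for illustration. Note that the Put and Get below are single-row operations: there is no SQL and no multi-row transaction, which is exactly the limitation mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("messages"))) {

            // Write: a Put keyed by row, addressed by column family and qualifier.
            Put put = new Put(Bytes.toBytes("user42#2013-01-07"));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
            table.put(put);

            // Read: a Get for the same row key.
            Result result = table.get(new Get(Bytes.toBytes("user42#2013-01-07")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"))));
        }
    }
}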
In the analytical application market, Hadoop is unbeatable and has become the de facto standard for big data analysis. At present, the most common usage is to import relational data from an RDBMS into HDFS and then analyze it with MapReduce (Taobao imports users' transaction data from the RDBMS into HDFS and then uses MR for analysis and mining), or to put log data into HDFS and analyze it with MapReduce (Baidu's search log analysis). However, metadata management for this semi-structured or unstructured data is not yet very mature or unified; HCatalog (in the diagram) was developed to improve this part of the functionality, and with it Hive and Pig become more convenient to use. Another pattern is data generated in HBase being processed and analyzed directly with HBase MapReduce, which is a bit like the distributed database products built on RDBMSs such as Greenplum or Teradata. Since the Hadoop-based analysis platform carries many machine learning workloads, and many machine learning algorithms are compute-intensive or both compute- and data-intensive, the platform also has compute-intensive requirements. At the same time, because MapReduce was designed for offline analysis, it has no advantage for real-time analysis; for some data, timeliness matters greatly and real-time analysis is critical, so we also need a real-time computation engine.
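As a sketch of the log-analysis pattern mentioned above, here is a minimal MapReduce job that counts how often each query appears in search log lines; the tab-separated log format ("timestamp\tuser\tquery") and the input/output paths are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueryCount {
    public static class QueryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed log line format: timestamp \t user \t query
            String[] fields = value.toString().split("\t");
            if (fields.length >= 3) {
                context.write(new Text(fields[2]), ONE);   // emit (query, 1)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));      // emit (query, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "query count");
        job.setJarByClass(QueryCount.class);
        job.setMapperClass(QueryMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /logs/search/2013-01-07
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}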
Based on these requirements, later versions of Hadoop (YARN) will turn MapReduce's resource management into a unified resource management and task scheduling layer that supports computation models such as OpenMPI, Storm, S4, Spark, and MapReduce, spanning real-time data processing, offline data processing, and compute-intensive and data-intensive applications. At that point the whole Hadoop ecosystem will be a truly unified data storage and processing platform.