Big data is another hot topic that the IT industry must keep abreast of, alongside cloud computing. "Big data" refers to data sets so large and so difficult to collect, process, and analyze that they strain storage systems; this article introduces a few of the major problems.
"Big data" usually refers to data sets that are huge, difficult to collect, process, analyze, and those that are kept in the traditional infrastructure for long periods of time. The "big" here has several meanings, it can describe the size of the organization, and more importantly, it defines the size of the IT infrastructure in the enterprise. The industry has an infinite expectation of large data applications. The more value the business information accumulates, the more it's worth, but we need a way to dig it out.
Why do we need big data now?
In addition to being able to store more data, we face more data types than ever before. These data come from online transactions, social networks, automated sensors, mobile devices, scientific instruments, and other sources. Beyond these fixed sources of data production, transactions themselves accelerate the accumulation of data; for example, the explosive growth of social multimedia data stems from new online transaction and record-keeping practices. Data keeps growing, but the ability to store huge amounts of it is not enough by itself, because storage alone does not guarantee that we can successfully mine business value from it.
Data is an important factor of production
In the information age, data has become an important factor of production, like capital, labor, and raw materials, and as a general-purpose input it is no longer limited to a few special industries. Companies in every industry collect large amounts of data and use the results of analyzing it to cut costs, improve product quality, raise production efficiency, and create new products. For example, analyzing data collected directly from product test sites can help an enterprise improve its designs. A company can also surpass its rivals by analyzing customer behavior in depth against a large body of market data.
Storage technology must keep up
The explosive growth of big data applications has spawned unique architectures of its own and has directly driven the development of storage, networking, and compute. After all, the special requirements of processing big data pose a new challenge. Hardware development is ultimately driven by software requirements, and in this case it is clear that the requirements of big data analytics applications are shaping the evolution of the data storage infrastructure.
Looked at another way, this change is an opportunity for storage vendors and other IT infrastructure vendors. With the continued growth of structured and unstructured data and the diversification of data sources, storage systems as previously designed can no longer meet the needs of big data applications. Recognizing this, storage vendors have begun to modify the architecture of block- and file-based storage systems to accommodate these new requirements. Below we discuss the attributes of a storage infrastructure that matter for big data, and see how each meets the challenge.
Capacity issues
Here the "large capacity" can usually reach the PB-level data scale, therefore, the mass data storage system must have a corresponding level of expansion capabilities. At the same time, the expansion of the storage system must be simple, you can increase the capacity by adding modules or disk cabinets, and even do not need downtime. Based on this demand, customers are now increasingly favoring the storage of scale-out architectures. Scale-out cluster structure is characterized by a certain amount of storage capacity of each node, in addition to the internal data processing capacity and interconnection equipment, and traditional storage system chimney-style architecture is completely different, scale-out architecture can achieve seamless and smooth expansion to avoid storage islands.
Big data applications mean not only large data volumes but also a huge number of files. How to manage the metadata that accumulates at the file-system layer is therefore a hard problem: handled poorly, it limits the scalability and performance of the system, and it is the bottleneck of traditional NAS systems. Fortunately, object-based storage architectures do not suffer from this problem; a single system can manage files on the order of a billion without being plagued by metadata management the way traditional storage is. Object-based storage systems also offer wide-area scalability, so they can be deployed across several different locations to form a large, cross-region storage infrastructure.
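As a rough illustration of why object storage sidesteps the file-system metadata bottleneck, the sketch below talks to an S3-compatible object store through the boto3 client. The endpoint, bucket, and key are hypothetical, and credentials are assumed to be configured in the environment; the point is that objects live in a flat, key-addressed namespace with their metadata attached, rather than in a central directory tree.

```python
import boto3  # assumes an S3-compatible object store; endpoint and bucket are hypothetical

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")

# Objects are addressed by key in a flat namespace, so the store can hold
# billions of them without a file-system-style central metadata tree.
s3.put_object(
    Bucket="sensor-archive",
    Key="2013/07/device-42/reading-000001.json",
    Body=b'{"temp_c": 21.4}',
    Metadata={"device": "42", "schema": "v1"},  # metadata travels with each object
)

obj = s3.get_object(Bucket="sensor-archive", Key="2013/07/device-42/reading-000001.json")
print(obj["Body"].read())
```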
Latency issues
The "Big Data" application also has a real time problem. In particular, it involves applications related to online transactions or financial classes. For example, the online advertising service in the apparel sales industry requires real-time analysis of customer browsing records and accurate advertising. This requires the storage system to be able to support these features while maintaining a high response speed, as the result of response latency is that the system pushes "expired" advertising content to the customer. In this scenario, the Scale-out architecture's storage system can play an advantage because each of its nodes has a processing and interconnect component that can grow synchronously while increasing capacity. The object-based storage System can support concurrent data flow, and further improve data throughput.
There are many "big data" applications that require high IOPS performance, such as HPC high-performance computing. In addition, the popularity of server virtualization has led to a need for high iops, just as it has changed the traditional IT environment. In order to meet these challenges, various models of solid-state storage equipment emerged, small to simple within the server to do the cache, large to solid-state media, such as scalable storage system, etc. are booming.
Concurrent access
Once an enterprise recognizes the potential value of big data analytics, it will load more data sets into the system and allow more people to share and use the data. To create more business value, enterprises tend to analyze data objects drawn from different platforms together. A storage infrastructure that includes a global file system can help users solve the data-access problem: a global file system allows many users on many hosts to access file data concurrently, even when that data is stored on many different types of storage devices in many different locations.
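As a rough sketch of what a global file system has to coordinate, the Python below runs several concurrent writers appending to one shared path under a POSIX advisory lock. The path is a local stand-in for a global file system mount (on a real deployment the same path would be visible from every host), and the example assumes a Unix system, since fcntl is POSIX-only.

```python
import fcntl                 # POSIX advisory file locks (Unix-only)
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Local stand-in for a path on a global file system mount; in production the
# same namespace would be shared by many hosts across many storage devices.
SHARED = os.path.join(tempfile.gettempdir(), "dataset.csv")

def append_record(record: str) -> None:
    # Many clients may run this at once; the exclusive lock serializes writers
    # while every client still sees one consistent namespace.
    with open(SHARED, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        f.write(record + "\n")
        fcntl.flock(f, fcntl.LOCK_UN)

with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(8):
        pool.submit(append_record, f"event-{i}")
```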