"Big data" usually refers to data sets that are so large that they are difficult to collect, process, and analyze, and that must be kept in the traditional infrastructure for long periods of time. The "big" here carries several meanings: it can describe the size of the organization and, more importantly, the scale of the enterprise's IT infrastructure. The industry has enormous expectations for big-data applications. The more business information accumulates, the more it is worth, but we need a way to dig that value out.
People's impression of big data may come mainly from the cheapness of storage capacity, but in fact businesses create large, and ever-growing, amounts of data every day, and people are trying to extract valuable business intelligence from that flood of data. Users also keep data that has already been analyzed, because old data can be compared against data collected in the future and so retains potential value.
Why big data?
Besides the ability to store more data, we now face more data types than ever before. These data come from online transactions, social networks, automatic sensors, mobile devices, and scientific instruments, among other sources. Beyond these fixed sources, transactions of all kinds also speed up the accumulation of data; the explosive growth of social multimedia data, for example, stems from new online transactions and record-keeping practices. Data is always growing, but the ability to store huge amounts of it is not enough, because storage alone does not guarantee that we can successfully extract business value from it.
Data is an important factor of production
In the information age, data has become an important factor of production, like capital, labor, and raw materials, and as a general need it is no longer limited to a few special industries. Companies in every industry collect and analyze large amounts of data to cut costs, improve product quality, raise production efficiency, and create new products. For example, analyzing data collected directly from product test sites can help an enterprise improve its designs. Likewise, a company can surpass its rivals by analyzing customer behavior in depth against large amounts of market data.
Storage technology must keep up
The explosive growth of big-data applications has spawned its own distinctive architectures and directly driven the development of storage, networking, and computing. After all, the special demands of handling big data pose a new challenge. Hardware development is ultimately driven by software requirements, and in this case big-data analytics requirements are clearly shaping the evolution of the data storage infrastructure.
Seen another way, this change is an opportunity for storage vendors and other IT infrastructure vendors. With the continued growth of structured and unstructured data and the diversification of data sources, traditional storage system designs can no longer meet the needs of big-data applications. Realizing this, storage vendors have begun modifying the architecture of block- and file-based storage systems to accommodate these new requirements. Below we discuss the attributes relevant to a big-data storage infrastructure and how each meets the challenges of big data.
Capacity issues
Here, "large capacity" can reach the petabyte scale, so a mass storage system must offer a corresponding level of scalability. At the same time, expansion must be simple: capacity should grow by adding modules or disk shelves, ideally without downtime. Driven by this need, customers increasingly favor storage with a scale-out architecture. In a scale-out cluster, each node contributes not only storage capacity but also internal processing power and interconnect bandwidth. Unlike the chimney-style architecture of traditional storage systems, a scale-out architecture can expand seamlessly and smoothly, avoiding storage islands.
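The scale-out idea can be illustrated with a toy model (the `Node` and `Cluster` classes here are purely illustrative, not from any real product): because every node contributes both capacity and processing, the cluster's total capacity and throughput grow together simply by adding nodes, with no rebuild of existing data.

```python
# Toy model of a scale-out cluster: every node adds both capacity
# and processing power, so the system grows without downtime.
# Class names and figures are illustrative only.

class Node:
    def __init__(self, capacity_tb, iops):
        self.capacity_tb = capacity_tb
        self.iops = iops

class Cluster:
    def __init__(self):
        self.nodes = []

    def add_node(self, node):
        # Expansion is just adding a module; existing data stays in place.
        self.nodes.append(node)

    @property
    def capacity_tb(self):
        return sum(n.capacity_tb for n in self.nodes)

    @property
    def iops(self):
        return sum(n.iops for n in self.nodes)

cluster = Cluster()
for _ in range(4):
    cluster.add_node(Node(capacity_tb=100, iops=50_000))

print(cluster.capacity_tb)  # 400
print(cluster.iops)         # 200000
```

Contrast this with a chimney-style (scale-up) array, where capacity grows behind a fixed pair of controllers, so throughput stays flat as capacity rises.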
Besides sheer data volume, big-data applications also mean a huge number of files. Managing the metadata that accumulates at the filesystem layer is therefore a hard problem; handled poorly, it limits the scalability and performance of the system, and it is the classic bottleneck of traditional NAS. Object-based storage architectures do not suffer from this problem: they can manage on the order of a billion files in a single system without being plagued by metadata management the way traditional storage is. Object-based storage systems also scale across wide areas, so they can be deployed in several different locations and form a large, cross-regional storage infrastructure.
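A minimal sketch of why object stores sidestep the metadata bottleneck: objects live in a flat namespace addressed by key (here a content hash), so lookup is a single hash-table operation rather than a walk through a directory tree. This `ObjectStore` class is an assumption for illustration; real systems add replication, placement, and richer metadata.

```python
import hashlib

# Sketch of a flat object namespace: objects are addressed directly by
# key, so lookup cost does not grow with a directory hierarchy.
class ObjectStore:
    def __init__(self):
        self._objects = {}  # flat key -> data mapping, no directory tree

    def put(self, data: bytes) -> str:
        # Content-addressed key: the object's ID is a hash of its data.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

store = ObjectStore()
key = store.put(b"sensor reading 42")
assert store.get(key) == b"sensor reading 42"
```

Because the namespace is flat, adding a billionth object costs the same as adding the first; a POSIX filesystem, by contrast, must keep per-directory metadata consistent at every level of the tree.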
Latency issues
Big-data applications also have real-time requirements, especially those involving online transactions or finance. For example, an online advertising service in the apparel industry must analyze customers' browsing records in real time and place accurate ads. The storage system must support these features while maintaining fast response, because the consequence of response latency is that the system pushes "expired" advertising content to the customer. Here a scale-out storage architecture again shows its advantage: each node carries processing and interconnect components, so performance grows in step with capacity. Object-based storage systems can also support concurrent data streams, further improving data throughput.
Many "big data" applications require high IOPS performance, such as HPC (high-performance computing). In addition, the spread of server virtualization has created its own need for high IOPS, just as it has transformed the traditional IT environment. To meet these challenges, solid-state storage devices of every form have emerged and are booming, from simple in-server caches all the way up to scalable all-flash storage systems.
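The in-server SSD cache mentioned above can be sketched as an LRU read cache in front of slower backing storage. This is a simplification under stated assumptions: real caches operate at the block level and must also handle writes, but the hot-data principle is the same.

```python
from collections import OrderedDict

# Sketch of a read cache in front of slow backing storage, the role a
# server-side SSD plays: hot blocks are served from the fast medium.
class ReadCache:
    def __init__(self, backing, capacity):
        self.backing = backing        # dict-like stand-in for slow disk
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def read(self, block_id):
        if block_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(block_id)  # mark most recently used
            return self.cache[block_id]
        self.misses += 1
        data = self.backing[block_id]         # slow path: disk read
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return data

disk = {i: f"block-{i}" for i in range(10)}
cache = ReadCache(disk, capacity=3)
for i in [0, 1, 0, 1, 2, 0]:
    cache.read(i)
print(cache.hits, cache.misses)  # 3 3
```

With a skewed access pattern like this, half the reads never touch the slow tier, which is exactly how a small amount of flash raises effective IOPS.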
Concurrent access

Once an enterprise recognizes the potential value of big-data analytics, it brings more data sets into the system and allows more people to share and use the data. To create more business value, enterprises tend to jointly analyze data objects drawn from different platforms. A storage infrastructure that includes a global file system can help users solve this data-access problem: a global file system allows multiple users on multiple hosts to access file data concurrently, even when that data is stored on many different types of storage devices in multiple locations.
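The access pattern a global file system enables, many clients reading one shared namespace at the same time, can be sketched with threads standing in for hosts (the names here are illustrative; a real global file system spans machines and storage tiers, not threads in one process).

```python
import threading

# Sketch: many clients concurrently reading one shared namespace, the
# pattern a global file system supports across hosts. A lock protects
# the shared results list; the reads themselves proceed in parallel.
shared_files = {"/data/sales.csv": b"region,amount\nnorth,100\n"}
results = []
results_lock = threading.Lock()

def client(path):
    data = shared_files[path]      # concurrent read of shared data
    with results_lock:
        results.append(len(data))

threads = [threading.Thread(target=client, args=("/data/sales.csv",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # eight identical lengths: every client saw the same file
```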
Security issues
Some special fields, such as financial data, medical information, and government intelligence, have their own security standards and confidentiality requirements. These are nothing new to IT managers, and all of them must still be obeyed, but big-data analysis often requires multiple types of data to cross-reference one another, a kind of mixed access that did not exist before. Big-data applications therefore raise some new security issues to consider.
Cost issues
"Big" can also mean expensive. For companies running big-data environments, cost control is a key issue. Controlling cost means making each device more "efficient" while trimming the expensive parts. Technologies such as data deduplication have now entered the primary storage market and can handle more data types than before, bringing more value to big-data storage applications and improving storage efficiency. When data volumes keep growing, cutting back-end storage consumption by even a few percentage points yields a significant return on investment. Automated thin provisioning, snapshots, and cloning can also improve storage efficiency.
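The deduplication mentioned above can be sketched as chunking plus hashing: data is split into chunks, each unique chunk is stored once, and a file is kept as a "recipe" of chunk hashes. Fixed-size chunking is a simplification here; production systems often use variable-size, content-defined chunks.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use KB-scale chunks

def dedup_store(data: bytes, store: dict) -> list:
    """Split data into chunks, store each unique chunk once, and
    return the recipe (list of chunk hashes) that rebuilds the data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # a duplicate chunk costs nothing extra
        recipe.append(h)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    return b"".join(store[h] for h in recipe)

store = {}
recipe = dedup_store(b"AAAABBBBAAAABBBB", store)
assert restore(recipe, store) == b"AAAABBBBAAAABBBB"
print(len(recipe), len(store))  # 4 logical chunks, only 2 stored
```

Here a 50% reduction comes from just two repeated chunks; backup and virtual-machine data, which repeat heavily, are where such savings translate into real back-end capacity reduction.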
Many big-data storage systems include an archiving component, especially for organizations that need to analyze historical data or face long-term retention requirements. Measured by cost per unit of capacity, tape is still the most economical storage medium; in many enterprises, archive systems built on high-capacity, terabyte-class tape remain the de facto standard and practice.
The factor that most affects cost control is commodity hardware. Many first adopters, and those running the largest deployments, therefore build their own "hardware platforms" rather than buy off-the-shelf commercial products, a move that helps balance their cost-control strategy as the business expands. To meet this demand, more and more storage products are now delivered as pure software that can be installed directly on a user's existing, general-purpose, or off-the-shelf hardware. In addition, many storage software companies sell integrated appliances built around their software, or ally with hardware manufacturers to launch joint products.
Data accumulation
Many big-data applications involve compliance requirements that mandate keeping data for years or decades. For example, medical information is usually retained to safeguard patients, and financial information is typically kept for seven years. Some big-data users want to keep data even longer, because any data is part of the historical record and most analysis is done over time periods. Long-term retention requires storage vendors to provide continuous data-consistency checking and other features that guarantee long-term high availability, as well as the ability to update data in place.
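The continuous consistency checking described above is often implemented as a "scrub" pass: a checksum is recorded at write time and periodically recomputed to catch silent corruption. The `ArchiveStore` class below is a minimal sketch of that idea, not any vendor's implementation.

```python
import hashlib

# Sketch of a scrub pass for long-term retention: each object keeps a
# checksum from write time; a later pass recomputes and compares it.
class ArchiveStore:
    def __init__(self):
        self._data = {}       # name -> bytes
        self._checksums = {}  # name -> sha256 recorded at write time

    def write(self, name, data):
        self._data[name] = data
        self._checksums[name] = hashlib.sha256(data).hexdigest()

    def scrub(self):
        """Return the names of objects whose data no longer matches
        the checksum recorded when they were written."""
        return [name for name, data in self._data.items()
                if hashlib.sha256(data).hexdigest() != self._checksums[name]]

archive = ArchiveStore()
archive.write("record-2010", b"patient history")
assert archive.scrub() == []                        # data still intact
archive._data["record-2010"] = b"patient hist0ry"   # simulate bit rot
assert archive.scrub() == ["record-2010"]           # corruption detected
```

Over a decade-long retention window, scrubbing is what turns "the bits are on the media" into "the bits are still correct", and a detected mismatch can trigger repair from a replica.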
Flexibility
The infrastructure of a big-data storage system is usually large, so it must be designed carefully to remain flexible enough to scale and expand along with the analytics applications. In a big-data storage environment there should be no further need for data migration, because data is kept at multiple deployment sites simultaneously. Once a big-data storage infrastructure is in production it is hard to adjust, so it must accommodate many different application types and data scenarios from the start.
Application perception
The first wave of big-data users built infrastructure customized for their applications, such as systems developed for government projects and the purpose-built servers of large Internet service providers. Application-aware technology is now increasingly common in mainstream storage systems, where it is an important means of improving efficiency and performance, so it should also be applied in big-data storage environments.
What about small users?
Big data is not the preserve of a few especially large user groups. As a business need, small enterprises will certainly turn to big data in the future. Indeed, some storage vendors are already developing small "big data" storage systems aimed primarily at cost-sensitive users.