What is big data? The authoritative definition from IDC is as follows: big data is data that meets the four V indicators (variety, velocity, volume, value; that is, many types, high traffic, large capacity, and high value). IDC positions big data technology as a new class of architectures that use high-speed capture, discovery, and/or analysis to extract value from very large volumes of data. Big data mainly involves two distinct technical fields: one is building big data storage platforms that can scale to the PB or even EB level; the other is big data analysis, which focuses on processing large numbers of heterogeneous datasets in the shortest possible time. These two topics have already been discussed thoroughly, so I will not revisit them here. Instead, I will think about big data from another angle, one that is in fact closer to the storage platform side. The requirements and ideas below come from users' loosely stated needs, from discussions with storage peers, and from insights gained in storage practice.
1. Data Backup
Information is a core asset of modern enterprises. If data is damaged or lost, the result can be economic losses of varying degrees, so enterprises have to pay close attention to backing up important data. Before big data, the amount of data an enterprise needed to back up usually ranged from gigabytes to tens of terabytes, and few enterprises had hundreds of terabytes. This data was mostly structured data in databases such as Oracle, DB2, and SQL Server, plus unstructured data behind file-sharing services such as FTP, CIFS, and NFS; backup products from vendors such as Symantec, FalconStor, CommVault, EMC, and Eisoo can meet these common backup requirements. But can they still meet the backup requirements of big data? Big data easily reaches tens of terabytes, and cases of hundreds of terabytes or even petabytes are no longer rare; moreover, these new data types arrive at high rates. From the backup-technology perspective, the windows for full, incremental, and differential backups become very long, and a CDP system would need extremely strong concurrent I/O capture and processing capabilities; otherwise such volumes simply cannot be backed up. From the perspective of backup capacity, backup requires storage space at least twice the size of the primary data, which is a huge cost. Another important point is that big data is usually collected, stored, and processed in a distributed manner, so achieving unified backup poses a real technical challenge for the backup system. Perhaps big data is simply not suited to conventional backup technology and the problem needs to be addressed by the storage system itself, for example through multi-version and write-anywhere mechanisms, which provide natural snapshots.
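To make the "natural snapshot" idea concrete, here is a minimal Python sketch of a write-anywhere, multi-version store: writes never overwrite data in place, so any past point in time can be read back without a separate backup pass. The class and method names are illustrative, not taken from any real product.

    class VersionedStore:
        # Toy write-anywhere store: every write appends a new version
        # instead of overwriting, so snapshots come for free.
        def __init__(self):
            self._clock = 0        # monotonically increasing version counter
            self._versions = {}    # key -> list of (version, value)

        def write(self, key, value):
            self._clock += 1
            self._versions.setdefault(key, []).append((self._clock, value))

        def snapshot(self):
            # A "snapshot" is just the current clock value; nothing is copied.
            return self._clock

        def read(self, key, as_of=None):
            # Return the newest value whose version is <= as_of.
            as_of = self._clock if as_of is None else as_of
            for version, value in reversed(self._versions.get(key, [])):
                if version <= as_of:
                    return value
            raise KeyError(key)

    store = VersionedStore()
    store.write("user:42", "alice@old.example")
    snap = store.snapshot()                      # instant, zero-copy
    store.write("user:42", "alice@new.example")
    print(store.read("user:42"))                 # latest value
    print(store.read("user:42", as_of=snap))     # value as of the snapshot

Real systems apply the same idea at the block level; the point is that retention becomes a property of the data layout rather than a separate copy job.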
2. Long-Term Storage
Information has a life cycle. Much data, such as financial, commercial, communications, and legal records, must be retained in accordance with laws and regulations, and important scientific-experiment data and historical data must be preserved permanently. As an important asset of modern enterprises, big data generally needs long-term preservation, for example 10 to 20 years or even forever. Long-term storage looks simple, but in practice there are many problems to solve. For hundreds of terabytes or petabytes of inactive historical data, what media should be used: disk, tape, or optical disc? Offline or nearline? How do you monitor the state of such a large number of storage devices? How do you ensure the integrity of massive data, and find and fix problems during long-term storage? How can data be queried and retrieved easily and quickly when needed? On top of that, storage space and energy consumption must be considered. Faced with these questions, we find that the long-term storage of big data is itself a big challenge. On one hand, we need to improve the durability, intelligence, and reliability of the storage media; on the other, we need information lifecycle management systems to handle the monitoring and administration.
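One concrete piece of the integrity problem is fixity checking: record a cryptographic digest for every archived file, then re-verify periodically to catch silent corruption. Below is a minimal sketch in Python; the manifest file name and format are invented for the example.

    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path, chunk_size=1 << 20):
        # Stream the file so arbitrarily large archives fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(root, manifest="fixity.json"):
        # Record a digest for every file under `root` at archive time.
        digests = {str(p): sha256_of(p)
                   for p in Path(root).rglob("*") if p.is_file()}
        Path(manifest).write_text(json.dumps(digests, indent=2))

    def verify_manifest(manifest="fixity.json"):
        # Re-hash periodically; any mismatch means silent corruption.
        digests = json.loads(Path(manifest).read_text())
        for path, expected in digests.items():
            p = Path(path)
            actual = sha256_of(p) if p.exists() else None
            if actual != expected:
                print(f"INTEGRITY FAILURE: {path}")

In a real archive the manifest itself must also be protected, and any corrupted file repaired from a replica or from erasure-coded peers.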
3. Data Query
Data access is one of the most basic functions of a storage system. Traditional access methods locate and access data by file name. A file name carries some descriptive meaning, but it is far from adequate: it is hard to understand the content and characteristics of data from its name alone, and the query semantics are very poor, since you must supply an exact file name or you cannot locate the file at all. As the number of files grows, this makes data access very difficult for users. In the real world, people remember and distinguish things mainly by their characteristics, not by bare names. If practical applications could offer data access based on file attributes and content, the richer semantics would make data far more self-describing, greatly ease use, and improve access efficiency. On the Internet, search engines such as Google and Baidu let users enter content keywords to find the data they want; in database systems, SQL lets users query records and filter them by specified conditions. Compared with traditional access methods, access based on data content and attributes therefore has strong semantics, can markedly improve data locating and access efficiency, and can greatly reduce complexity for users; it suits all kinds of storage systems, especially distributed ones. Both natural language processing and the Semantic Web have now developed considerably. Implementing semantics-based data access in big data management would not only improve query efficiency but also match the way people think and provide a much friendlier data access interface.
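As a rough illustration of attribute- and content-based access, here is a tiny inverted index over file metadata and keywords. All names and fields are invented for the example; a production system would sit on a proper search engine or database.

    from collections import defaultdict

    class MetadataIndex:
        # Minimal inverted index over file attributes and content keywords,
        # so files can be found by what they are, not just by name.
        def __init__(self):
            self._index = defaultdict(set)  # term -> set of file ids
            self._attrs = {}                # file id -> attribute dict

        def add(self, file_id, attrs, keywords):
            self._attrs[file_id] = attrs
            for term in keywords:
                self._index[term.lower()].add(file_id)

        def search(self, keywords=(), **attr_filters):
            # Intersect keyword postings, then filter on attributes.
            hits = None
            for term in keywords:
                postings = self._index.get(term.lower(), set())
                hits = postings if hits is None else hits & postings
            hits = set(self._attrs) if hits is None else hits
            return [fid for fid in hits
                    if all(self._attrs[fid].get(k) == v
                           for k, v in attr_filters.items())]

    idx = MetadataIndex()
    idx.add("/data/q3.pdf", {"owner": "finance", "type": "pdf"},
            ["quarterly", "revenue"])
    idx.add("/data/logo.png", {"owner": "design", "type": "png"}, ["logo"])
    print(idx.search(keywords=["revenue"], owner="finance"))  # ['/data/q3.pdf']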
4. Green Archiving
Due to regulatory compliance or long-term retention requirements, data can be archived according to lifecycle-management needs, for example onto tape, disk, optical disc, or a CAS system. Big data is, by definition, large: if disk media are used for archiving, a huge number of disks is needed, and the energy consumption under normal operation is considerable. To cut energy consumption and achieve green archiving while also extending disk service life, the relevant efficient-storage technologies must be considered, including MAID, SemiRAID, data compression, deduplication, and thin provisioning. These technologies work on two fronts. The first is reducing the amount of data so as to consume less disk media, as compression, deduplication, and thin provisioning do; the second is controlling disk state (high speed, low speed, stopped) or reducing the number of active disks to lower energy use and prolong service life, as MAID and SemiRAID do. SNIA has groups dedicated to green storage technologies, including those mentioned above.
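Deduplication is the easiest of these techniques to show in miniature. The sketch below stores fixed-size chunks once, addressed by SHA-256 digest; real archival systems usually use variable-size, content-defined chunking, and the class here is purely illustrative.

    import hashlib

    class DedupStore:
        # Fixed-size-chunk deduplication: identical chunks are stored
        # once and addressed by their SHA-256 digest.
        CHUNK = 4096

        def __init__(self):
            self.chunks = {}   # digest -> chunk bytes (stored once)
            self.recipes = {}  # object name -> ordered list of digests

        def put(self, name, data):
            recipe = []
            for i in range(0, len(data), self.CHUNK):
                chunk = data[i:i + self.CHUNK]
                digest = hashlib.sha256(chunk).hexdigest()
                self.chunks.setdefault(digest, chunk)  # skip duplicates
                recipe.append(digest)
            self.recipes[name] = recipe

        def get(self, name):
            # Rebuild the object from its recipe of chunk digests.
            return b"".join(self.chunks[d] for d in self.recipes[name])

    store = DedupStore()
    store.put("backup-mon", b"A" * 10000)
    store.put("backup-tue", b"A" * 10000)   # fully deduplicated
    print(len(store.chunks))                # 2 unique chunks, not 6

Two identical 10,000-byte objects end up occupying the space of one, which is exactly the effect that matters at archive scale.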
5. Unified Storage
Big data comes in many forms, including structured data, unstructured data, and object data, accessed respectively through block interfaces, file interfaces, and object interfaces. At present most enterprises have not unified the three: they run different storage systems to manage each kind of data. Under the pressure of rapid big data growth, this brings a series of problems: low storage utilization, high management complexity, rising costs, and poor resource consolidation. Driven by these factors, the concept of unified storage has been revived. SAN/NAS unified storage products have been promoted and launched by various storage vendors, and object storage is expected to be folded into unified storage as well. With unified storage, big data can be managed in one place, resources can be planned and consolidated uniformly, storage utilization improves, management is simplified, and overall costs fall.
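The idea can be sketched as one capacity pool exposed through three access personalities. The toy class below is an illustration under stated assumptions, not any vendor's API; a real unified array shares the pool at the block layer and layers file and object services on top of it.

    class UnifiedStore:
        # One storage pool, three access interfaces:
        # block (LBA), file (path), and object (key + metadata).
        BLOCK = 512

        def __init__(self, n_blocks=1024):
            self._pool = bytearray(n_blocks * self.BLOCK)  # shared capacity
            self._files = {}    # path -> bytes
            self._objects = {}  # key -> (data, metadata)

        # --- block interface (SAN-style) ---
        def write_block(self, lba, data):
            assert len(data) == self.BLOCK
            self._pool[lba * self.BLOCK:(lba + 1) * self.BLOCK] = data

        def read_block(self, lba):
            return bytes(self._pool[lba * self.BLOCK:(lba + 1) * self.BLOCK])

        # --- file interface (NAS-style) ---
        def write_file(self, path, data):
            self._files[path] = data

        def read_file(self, path):
            return self._files[path]

        # --- object interface ---
        def put_object(self, key, data, metadata=None):
            self._objects[key] = (data, metadata or {})

        def get_object(self, key):
            return self._objects[key]

    us = UnifiedStore()
    us.write_block(0, b"\x00" * UnifiedStore.BLOCK)
    us.write_file("/share/report.txt", b"quarterly numbers")
    us.put_object("bucket/report", b"quarterly numbers", {"owner": "finance"})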
6. Storage Media Life Management
A big data storage system contains thousands of disks, possibly spanning FC, SAS, and SATA drives as well as SSDs, tapes, and other media. With so many devices, the probability that one or two disks fail on any given day is high, and uncontrolled failures disrupt the big data applications in front of them. Every type of storage medium has a rated service life, so that life cycle can be managed: adjust expectations to the actual operating environment, and analyze the medium's running state to predict failures. When a medium approaches the end of its service life, or a failure is predicted, the administrator is notified to replace it and the system automatically rebuilds the data. In this way the randomness of media failures is effectively reduced, faults become manageable, and, combined with planned manual scheduling, the impact of failures on big data applications can be reduced or avoided.
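A minimal sketch of such life management, assuming SMART-style counters are available per disk; the thresholds and the five-year rated life are illustrative assumptions, not vendor figures.

    from dataclasses import dataclass

    @dataclass
    class DiskHealth:
        # SMART-style counters for one disk.
        disk_id: str
        power_on_hours: int
        reallocated_sectors: int
        rated_life_hours: int = 5 * 365 * 24  # assumed 5-year rated life

    def assess(disk, wear_warning=0.8, sector_limit=50):
        # Flag disks near rated end-of-life or showing early failure
        # signs, so replacement can be scheduled rather than improvised.
        wear = disk.power_on_hours / disk.rated_life_hours
        if disk.reallocated_sectors > sector_limit:
            return "REPLACE NOW: failure likely (reallocated sectors high)"
        if wear >= wear_warning:
            return f"SCHEDULE REPLACEMENT: {wear:.0%} of rated life used"
        return "OK"

    fleet = [
        DiskHealth("sas-07", power_on_hours=39000, reallocated_sectors=3),
        DiskHealth("sata-21", power_on_hours=12000, reallocated_sectors=90),
    ]
    for d in fleet:
        print(d.disk_id, "->", assess(d))

The output of such an assessment feeds the notification-and-rebuild workflow described above, turning random failures into scheduled maintenance.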
7. Tape Storage
There have always been predictions that tape is dead, but so far they have not come true. Compared with disk, tape has advantages in cost, lifespan, and energy consumption, and tape technology keeps evolving: LTO-5, for example, raised the native write speed to 140 MB/s and the native capacity to 1.5 TB. This keeps tape the most suitable medium for long-term data archiving, in ways disk cannot replace. The most typical use of tape for big data is archiving, as in the long-term preservation and green archiving discussed above, where the data is rarely if ever accessed. Another form is hierarchical storage management (HSM), in which tape, disk, SSD, and memory form a four-tier hierarchy and data flows between tiers according to how active it is, achieving high cost-effectiveness. Data on the tape tier of an HSM system is still accessed, but at very low frequency and probability. Given tape's advantages and continued development, it may not merely survive, but win a new life in the big data era.
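To illustrate the HSM idea, here is a toy policy that maps access frequency to one of the four tiers and emits migration orders; the cut-off values and the catalog format are invented for the example.

    TIERS = ["memory", "ssd", "disk", "tape"]  # hot -> cold

    def target_tier(accesses_per_day):
        # Map access frequency to a tier; cut-offs are illustrative.
        if accesses_per_day >= 100:
            return "memory"
        if accesses_per_day >= 10:
            return "ssd"
        if accesses_per_day >= 0.1:
            return "disk"
        return "tape"

    def rebalance(catalog):
        # Emit migration orders for data whose heat no longer matches
        # its tier. catalog: name -> (current_tier, accesses per day).
        for name, (tier, heat) in catalog.items():
            target = target_tier(heat)
            if target != tier:
                print(f"migrate {name}: {tier} -> {target}")

    rebalance({
        "sensor-2011-q1": ("disk", 0.01),   # gone cold: disk -> tape
        "index-shard-3":  ("disk", 250),    # running hot: disk -> memory
    })

A real HSM system would also weigh migration cost and recall latency, but the core loop is the same: measure activity, compare against the tier, and move data toward the cheapest medium that still meets its access pattern.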