Aiming at the problem of low storage efficiency for small files in HDFS-based cloud storage systems, this paper designs an optimization scheme for small-file storage in cloud storage systems using sequence file technology. Based on multi-attribute decision-making theory, the scheme combines the indexes of file reading time, file merging time, and saved memory space to obtain the best way of merging small files, achieving a balance between the time consumed and the memory space saved. A system load forecasting algorithm based on AHP is designed to predict the system load, and sequence file technology is used to merge small files in a way that achieves load balancing.
The experimental results show that the proposed scheme improves the storage efficiency of small files. HDFS (Hadoop Distributed File System) is a highly fault-tolerant distributed file system that can be deployed on commodity machines or on virtual machines supporting a Java runtime environment, without affecting the running state of the storage system. It provides high-throughput data access and is well suited for deploying cloud storage platforms.
HDFS uses a master/slave architecture: one name node (NameNode) and several data nodes (DataNodes) form an HDFS cluster. The single-NameNode design greatly simplifies the structure of the file system, but it also raises the problem of low storage efficiency for small files in HDFS. Because the metadata of every directory and file in HDFS is stored in the NameNode's memory, a large number of small files in the system (files much smaller than the HDFS block size, 64 MB by default) will undoubtedly reduce the storage efficiency and storage capacity of the entire storage system.
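To make this memory pressure concrete: a rule of thumb often cited for HDFS (an assumption here, not a figure from this paper) is that each file, directory, or block object occupies roughly 150 bytes of NameNode heap. Under that assumption, storing $10^7$ small files, each occupying its own block, costs about

$$10^7 \times (150 + 150)\,\mathrm{B} \approx 3\,\mathrm{GB}$$

of NameNode memory for metadata alone, regardless of how little data the files actually hold.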
Such small files exist in large numbers in a variety of storage systems. A 2007 study by the Pacific Northwest National Laboratory showed that there were 12 million files in their system, of which 94% were smaller than 64 MB and 58% were smaller than 64 kB. Some specialized research computing environments also produce large numbers of small files; for example, some biology computations may produce up to 30 million files with an average size of only 190 kB.
The main idea for solving the small-file storage efficiency problem in HDFS-based storage systems is to merge or combine small files into large files. There are two kinds of methods: one uses Hadoop archive (Hadoop Archive, HAR) technology to merge small files; the other combines files in an application-specific way.
Mackey et al. used HAR technology to merge small files, which improved the storage efficiency of metadata in HDFS. For a WebGIS application, the HDWebGIS prototype system was developed with Hadoop as the storage platform; exploiting the characteristics of the WebGIS access pattern, small files are combined into large files with a global index, which improves small-file storage efficiency. Dong et al. [4], targeting the storage of PPT courseware in the BlueSky system, proposed merging small files into large files and using a prefetching mechanism to improve the efficiency of storing and accessing small files. Liu Likun optimized concurrent access to small files in a distributed storage system.
The above work solves the small-file storage efficiency problem by merging or combining files, but two problems remain. First, a complete system should take the load status of the system into account while improving small-file storage efficiency, because both file combination and file merging are additional operations for HDFS. Second, the size of the merged file has not been studied; that is, it has not been determined how many small files should be merged into one large file to achieve optimal system performance.
Based on the above two points, this paper proposes an optimization scheme for the storage efficiency of small files in an HDFS-based cloud storage system: sequence file technology is used to merge small files into large files; multi-attribute decision-making theory combined with experiments yields the best way of merging files; and a system load forecasting algorithm based on the analytic hierarchy process (Analytic Hierarchy Process, AHP) realizes load balancing.
1 Small file storage efficiency optimization scheme design
In the constructed cloud storage system, a multiway tree structure is used to build the file index. When a user uploads files to the cloud storage system, the system automatically establishes the corresponding multiway tree index according to the organization of the user's files.
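As an illustration, a minimal Java sketch of such a multiway-tree index node might look as follows (all names are hypothetical; the paper does not give an implementation):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of a multiway-tree file index: each user directory
 * becomes an inner node, each uploaded file a leaf.
 */
class IndexNode {
    final String name;        // directory or file name
    final boolean isFile;
    final long indexNumber;   // for files: the key later used in the SequenceFile
    final List<IndexNode> children = new ArrayList<>();

    IndexNode(String name, boolean isFile, long indexNumber) {
        this.name = name;
        this.isFile = isFile;
        this.indexNumber = indexNumber;
    }

    /** Attach a child node, mirroring the user's directory layout. */
    IndexNode addChild(IndexNode child) {
        children.add(child);
        return child;
    }
}
```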
1.1 Sequence file merging technology
A sequence file (SequenceFile) is a binary file technology provided by HDFS that serializes records directly into a file and can compress them per record or per data block during serialization. In the cloud storage system, small files are merged into large files with SequenceFile technology: the index number of a small file is the key, its content is the value, and block-based compression is applied during merging. This saves both the memory space of the NameNode and the disk space of the DataNodes.
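A minimal Java sketch of this merging step, using the standard Hadoop SequenceFile.Writer API with block compression (the file layout is illustrative, and the file path stands in for the paper's index number as the key):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
    /** Merge local small files into one SequenceFile with block compression. */
    public static void merge(List<String> localFiles, String target) throws IOException {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(target)),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    // block-based compression, as in the scheme above
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
            for (String f : localFiles) {
                byte[] content = Files.readAllBytes(Paths.get(f));
                // key: the small file's index number (its path here); value: its bytes
                writer.append(new Text(f), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```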
1.2 Small file storage efficiency optimization scheme
In the HDFS-based cloud storage system, the small-file storage efficiency optimization scheme is shown in Figure 1. To improve the processing efficiency of small files, the system establishes three queues for each user: the first is the sequence file queue (SequenceFile Queue, SFQ); the second is the sequence file operation queue (SequenceFile Operation Queue, SFOQ); the third is the backup queue (Backup Queue, BQ). SFQ is used for merging small files, SFOQ for operations on merged small files, and BQ for small files whose number exceeds the length of SFQ or SFOQ. The three queues have the same length, and the optimal queue length can be obtained by experiment, as sketched below. The specific processing flow is then described.
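A minimal sketch of the three per-user queues, assuming a fixed queue length (the concrete length would come from the experiments mentioned above; all names are hypothetical):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch of the three fixed-length queues kept for each user. */
class UserQueues {
    static final int QUEUE_LEN = 1000; // placeholder; tuned experimentally

    final BlockingQueue<Long> sfq  = new ArrayBlockingQueue<>(QUEUE_LEN); // files awaiting merge
    final BlockingQueue<Long> sfoq = new ArrayBlockingQueue<>(QUEUE_LEN); // ops on merged files
    final BlockingQueue<Long> bq   = new ArrayBlockingQueue<>(QUEUE_LEN); // overflow from SFQ/SFOQ

    /** Enqueue a small file's index number; overflow spills into the backup queue. */
    void enqueueSmallFile(long indexNumber) {
        if (!sfq.offer(indexNumber)) {
            bq.offer(indexNumber); // SFQ full: the controller also receives a QF signal
        }
    }
}
```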
Figure 1 Small file storage efficiency optimization scheme
As shown in Figure 1, the user uploads local files to the cloud storage server (process 1), and the server then judges the file type (process 2); if it is a small file, its index number is put into the SFQ (process 3). When the SFQ is full, a "queue full" (QF) signal is sent to the controller, as shown by dashed line a in the figure; when the timer reaches its timing point, a "time up" (TU) signal is sent to the controller, as shown by dashed line b. After receiving a QF or TU signal, the controller reads the SFQ information (process 4.1), calculates the system load (process 4.2; the specific algorithm is described in Section 2), and decides whether to merge the small files (process 5). After the files are merged, the mapping between the small files and the large file is recorded (process 6). The controller's concrete processing logic is shown in Figure 2.
When the controller receives a signal, it first judges the signal type. If it is QF, it calls the AHP-based system load forecasting algorithm to predict the load. If the predicted system load is below the threshold set by the system, the controller starts merging the files (including SFQ and BQ) and cancels the TU signal in the system. If the system load is above the threshold, the controller further checks the number of BQs: if the number of BQs is less than a certain value (for example, 3), it creates a new BQ, transfers the SFQ contents into the BQ, postpones the merge operation (the delay time, in minutes, is set by the system), and sets the TU signal; if the number of BQs is greater than 3, the small files in the BQs are merged and the TU signal is cancelled.
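The branching logic described above can be summarized in the following sketch (all identifiers, the threshold value, and the stubs are hypothetical; the paper specifies this logic only at the level of Figure 2):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch of the controller's handling of a "queue full" (QF) signal. */
public class MergeController {
    static final int MAX_BQ = 3;             // BQ limit used in the paper's example
    double loadThreshold = 0.8;              // system-set threshold (assumed value)

    Deque<Long> sfq = new ArrayDeque<>();    // small-file index numbers awaiting merge
    List<Deque<Long>> backupQueues = new ArrayList<>();

    void onQueueFull() {
        double load = predictLoadWithAhp();          // AHP-based forecast (Section 2)
        if (load < loadThreshold) {
            merge(sfq);                              // merge the SFQ ...
            backupQueues.forEach(this::merge);       // ... together with any BQ contents
            backupQueues.clear();
            cancelTimerSignal();                     // TU no longer needed
        } else if (backupQueues.size() < MAX_BQ) {
            backupQueues.add(new ArrayDeque<>(sfq)); // new BQ takes over the SFQ contents
            sfq.clear();
            setTimerSignal();                        // postpone the merge; retry on TU
        } else {
            backupQueues.forEach(this::merge);       // too many BQs: merge them now
            backupQueues.clear();
            cancelTimerSignal();
        }
    }

    double predictLoadWithAhp() { return 0.5; }      // stub
    void merge(Deque<Long> queue) { queue.clear(); } // stub for the SequenceFile merge
    void setTimerSignal() {}                         // stub
    void cancelTimerSignal() {}                      // stub
}
```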