1. Overview
A small file is a file significantly smaller than the HDFS block size. Such files cause serious scalability and performance problems for Hadoop. First, in HDFS every file, block, and directory is kept in the NameNode's memory as an object of roughly 150 bytes. If there are 10,000,000 small files, each occupying its own block, the NameNode needs about 2 GB of memory; storing 100,000,000 such files requires about 20 GB (see references [1][4][5]). NameNode memory capacity therefore severely limits cluster growth. Second, accessing a large number of small files is far less efficient than accessing a few large files. HDFS was originally designed for streaming access to large files; reading many small files forces the client to hop constantly from one DataNode to another, which hurts performance badly. Finally, processing a large number of small files is far slower than processing large files of the same total size. Each small file occupies a task slot, and the overhead of starting and releasing tasks can consume most of a job's running time.
This article first introduces Hadoop's own solutions to the small-file problem (provided as tools): Hadoop Archive, SequenceFile, and CombineFileInputFormat. It then presents two papers that attack the HDFS small-file problem at the system level: one published by the Chinese Academy of Sciences in 2009 on storing small geographic-information files on HDFS, the other published by IBM in 2009 on storing small PPT files on HDFS.
2. HDFS file read and write process
Before formally introducing HDFS small-file storage schemes, let us first review the basic flow of file access on the current HDFS.
(1) File read process
1) The client sends a read request to the NameNode. If the file does not exist, an error is returned; otherwise, the blocks of the file and their DataNode locations are sent back to the client.
2) After receiving the block locations, the client establishes socket connections to the different DataNodes and retrieves the data in parallel.
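In client code, both steps are hidden behind the FileSystem API. A minimal read sketch (the path /user/zoo/sample.txt is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Step 1 happens inside open(): the NameNode returns the block
        // locations of the file (or an error if it does not exist).
        FSDataInputStream in = fs.open(new Path("/user/zoo/sample.txt"));
        try {
            // Step 2: the stream fetches the block data from the DataNodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}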
(2) File write process
1) The client sends a write request to the NameNode. The NameNode checks whether the file already exists; if it does, an error is returned, otherwise the NameNode sends the client a set of available DataNodes.
2) The client splits the file into blocks and stores them in parallel on different DataNodes; when transmission finishes, the client notifies the NameNode and the DataNodes.
3) After receiving the client's notification, the NameNode sends a confirmation message to the DataNodes.
4) Once the DataNodes have received the confirmations from both the NameNode and the client, they commit the write.
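Again the handshake is hidden behind the FileSystem API. A minimal write sketch (the path is illustrative; create is told not to overwrite, matching step 1 above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Step 1 happens inside create(): with overwrite=false, the NameNode
        // rejects the request if the file already exists, and otherwise
        // hands back DataNodes to write to.
        FSDataOutputStream out = fs.create(new Path("/user/zoo/sample.txt"), false);
        try {
            // Steps 2-4: data is split into blocks and shipped to the
            // DataNodes; close() below completes the commit handshake.
            out.write("hello hdfs".getBytes("UTF-8"));
        } finally {
            out.close();
        }
    }
}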
3. Hadoop's built-in solutions
For the small-file problem, Hadoop itself offers three solutions: Hadoop Archive, SequenceFile, and CombineFileInputFormat.
(1) Hadoop Archive
Hadoop Archive (HAR) is a file archiving tool that packs small files into HDFS blocks efficiently. It bundles many small files into a single HAR file, giving transparent access to the original files while reducing NameNode memory usage.
To archive all small files under the directory /foo/bar into /outputdir/zoo.har:

hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
You can also specify the HAR block size (via -Dhar.block.size).
HAR is a file system layered on top of HDFS, so all FS shell commands work with HAR files; only the path format differs. A HAR path can take either of two forms:

har://scheme-hostname:port/archivepath/fileinarchive
har:///archivepath/fileinarchive (uses the default file system)
You can list the files inside a HAR archive like this:

hadoop dfs -ls har:///user/zoo/foo.har
Output:
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
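HAR paths also work through the ordinary FileSystem API. A minimal sketch that lists the archive programmatically, assuming the foo.har archive from the example above exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A har:// path resolves through HarFileSystem transparently,
        // so listing and reading work exactly as on plain HDFS.
        Path har = new Path("har:///user/zoo/foo.har");
        FileSystem fs = har.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(new Path(har, "hadoop"))) {
            System.out.println(status.getPath());
        }
    }
}

Because path resolution is transparent, existing programs can read archived files without code changes; only the path changes.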
Two points to note when using HAR: first, archiving the small files does not delete the originals automatically; users must delete them themselves. Second, creating a HAR file actually runs a MapReduce job, so a running Hadoop cluster is required.
HAR also has drawbacks: first, once created, an archive cannot be modified; to add or remove files, you must recreate it. Second, file names to be archived must not contain spaces, or an exception is thrown (spaces can be replaced with another character via the -Dhar.space.replacement.enable=true and -Dhar.space.replacement parameters).
(2) SequenceFile
A SequenceFile consists of a series of binary key/value pairs. If the key is the file name and the value is the file content, large batches of small files can be merged into one large file.
Hadoop 0.21.0 ships SequenceFile with Writer, Reader, and Sorter classes for writing, reading, and sorting. For Hadoop versions below 0.21.0, see [3] for an implementation approach.
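As a sketch of this idea, the following packs a local directory of small files into one SequenceFile, with the file name as key and the raw bytes as value (the paths are illustrative, and the older createWriter signature is used to match that era's API):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/zoo/smallfiles.seq");
        // Key = original file name, value = raw file content.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (File f : new File("/tmp/smallfiles").listFiles()) {
                byte[] content = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try {
                    in.readFully(content);   // small files fit in memory by definition
                } finally {
                    in.close();
                }
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        } finally {
            writer.close();
        }
    }
}

The matching SequenceFile.Reader iterates back over the (name, content) pairs; because the container is one large file, the NameNode tracks only the container's blocks instead of one object per small file.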
(3) CombineFileInputFormat
CombineFileInputFormat is a newer InputFormat that merges multiple files into a single split, taking into account where the data is stored.
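CombineFileInputFormat is abstract: users subclass it and supply a RecordReader that reads each file chunk inside a combined split. A minimal sketch against the newer mapreduce API, reading every packed file line by line (the class names CombineTextInputFormat and ChunkReader are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombineTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader walks the chunks of the combined split,
        // instantiating one ChunkReader per file via reflection.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, ChunkReader.class);
    }

    // Reads the idx-th file of a combined split with a plain LineRecordReader.
    public static class ChunkReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader reader = new LineRecordReader();
        private final CombineFileSplit split;
        private final int idx;

        // This exact constructor signature is required by CombineFileRecordReader.
        public ChunkReader(CombineFileSplit split, TaskAttemptContext ctx, Integer idx) {
            this.split = split;
            this.idx = idx;
        }

        @Override
        public void initialize(InputSplit ignored, TaskAttemptContext ctx)
                throws IOException, InterruptedException {
            // Carve the idx-th chunk out of the combined split.
            reader.initialize(new FileSplit(split.getPath(idx),
                    split.getOffset(idx), split.getLength(idx), null), ctx);
        }

        @Override public boolean nextKeyValue() throws IOException { return reader.nextKeyValue(); }
        @Override public LongWritable getCurrentKey() { return reader.getCurrentKey(); }
        @Override public Text getCurrentValue() { return reader.getCurrentValue(); }
        @Override public float getProgress() throws IOException { return reader.getProgress(); }
        @Override public void close() throws IOException { reader.close(); }
    }
}

In the job driver, job.setInputFormatClass(CombineTextInputFormat.class) together with a cap on the combined split size (the subclass can call the inherited setMaxSplitSize(...); the equivalent configuration property name varies across Hadoop versions) lets many small files share one map task.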
4. System-level solutions to the small-file problem
The schemes in the previous section require users to write their own programs and run them periodically to merge small files and keep their number down. A natural question is whether a small-file processing module can be embedded directly in HDFS so that uploaded small files are detected and merged automatically.
This section describes two papers that try to solve the HDFS small-file problem at the system level. They target different applications, but the underlying idea is the same: add a small-file processing module on top of the original HDFS. When a file arrives, the system checks whether it is a small file; if so, it is handed to the small-file module, otherwise to the general file-processing module. The small-file module merges many small files into one large file and builds an index over them so that individual small files can still be located and read quickly.
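Neither paper publishes code, but the shared merge-and-index idea can be illustrated with a toy sketch: append small files into one large HDFS file while recording each file's (offset, length) in an in-memory index (all names here are hypothetical, not the papers' implementation):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeWithIndex {
    // file name -> {offset, length} inside the merged file
    private final Map<String, long[]> index = new HashMap<String, long[]>();

    public void merge(File[] smallFiles, FileSystem fs, Path merged) throws Exception {
        FSDataOutputStream out = fs.create(merged, false);
        long offset = 0;
        for (File f : smallFiles) {
            byte[] content = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            in.readFully(content);
            in.close();
            out.write(content);
            index.put(f.getName(), new long[] { offset, content.length });
            offset += content.length;
        }
        out.close();
    }

    // Random access to one small file via the index: a positioned, bounded read.
    public byte[] read(FileSystem fs, Path merged, String name) throws Exception {
        long[] entry = index.get(name);
        byte[] buf = new byte[(int) entry[1]];
        FSDataInputStream in = fs.open(merged);
        in.readFully(entry[0], buf);   // positioned read at the recorded offset
        in.close();
        return buf;
    }
}

The real systems persist the index alongside the merged file rather than keeping it in client memory; paper [4], for instance, uses a fixed-length hash index for this.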
The paper [4] proposes a solution for storing small files on HDFS tailored to the characteristics of WebGIS systems. WebGIS combines the Web with geographic information systems (GIS). To minimize the amount of data transferred between browser and server, WebGIS data is typically stored in a distributed file system as small files of a few kilobytes. Exploiting the spatial correlation of WebGIS data, the paper merges small files containing adjacent geographic information into one large file and builds an index over the small files for access.
The paper treats files smaller than 16 MB as small files and merges them into 64 MB units (the default block size); the index structure and file layout are shown in a figure in the original paper, not reproduced here. The index is a general fixed-length hash index.
The paper [5] proposes a solution for storing small files on HDFS tailored to the BlueSky system (http://www.bluesky.cn/), an e-learning resource sharing system in China that stores PPT files and videos on HDFS. Each courseware item in the system consists of a PPT file and preview snapshots of that file. When a user requests one PPT page, related PPT files are likely to be viewed soon afterwards, so file access is correlated and exhibits locality. The paper contributes two ideas. First, files belonging to the same courseware are merged into one large file, improving small-file storage efficiency. Second, a two-level prefetching mechanism, index-file prefetching plus data-file prefetching, improves small-file read efficiency. Index-file prefetching means that when a user accesses a file, the index file of the block containing it is loaded into memory, so later accesses to files in that block need no further interaction with the NameNode. Data-file prefetching means that when a user accesses a file, all files of the courseware it belongs to are loaded into memory, so the user's subsequent accesses to the other files are much faster.
(Figure omitted: the process of uploading a file in BlueSky.)
(Figure omitted: the process of reading a file in BlueSky.)
5. Summary
Hadoop currently has no system-level solution to the HDFS small-file problem. Its three built-in tools, Hadoop Archive, SequenceFile, and CombineFileInputFormat, all require users to write programs tailored to their own needs, and the schemes described in Section 4 target specific applications; a more general technical solution does not yet exist.