Small files are files that are significantly smaller than the HDFS block size (64 MB by default). If you are storing small files in HDFS, you almost certainly have a lot of them (otherwise you wouldn't be using Hadoop), and the problem is that HDFS cannot handle a large number of small files efficiently.
Every file, directory, and block in HDFS is represented as an object in the namenode's memory, and each object occupies roughly 150 bytes. So 10 million files, each occupying one block, consume about 3 GB of namenode memory (10 million file objects plus 10 million block objects, at about 150 bytes each). Scaling much beyond that quickly exceeds what current hardware can support.
Beyond that, HDFS is simply not designed to handle large numbers of small files efficiently; it is designed primarily for streaming access to large files. Reading through small files causes a large number of seeks and a lot of hopping from datanode to datanode to retrieve each file, which is a very inefficient access pattern.
Problems with a large number of small files in MapReduce
Map tasks typically process one block of input at a time (using the default FileInputFormat). If the files are very small and there are many of them, each map task processes only a tiny amount of input, and the job spawns far more map tasks, each of which carries its own bookkeeping overhead. Compare a single 1 GB file, which at the default 64 MB block size needs only about 16 map tasks, with the same 1 GB spread across 10,000 files of roughly 100 KB each: the latter needs one map task per small file, and the job can easily run ten times slower or more.
There are a couple of Hadoop features that help mitigate this problem. Task JVM reuse lets multiple map tasks run in a single JVM, reducing JVM startup overhead (set via the mapred.job.reuse.jvm.num.tasks property; the default is 1, and -1 means no limit), as sketched below. The other is MultiFileInputSplit, which allows a single map to handle more than one split.
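As a minimal sketch of the JVM-reuse setting (using the old mapred API of that era; the driver class and the rest of the job setup are placeholders, not part of the original post):

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        // Hypothetical job configuration; only the JVM-reuse property matters here.
        JobConf conf = new JobConf(JvmReuseExample.class);

        // -1 lets a task JVM be reused for an unlimited number of map tasks from
        // the same job; the default of 1 starts a fresh JVM for every task.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        // ... set input/output paths, mapper, reducer, etc., then submit the job.
    }
}
```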
Why are so many small files produced?
There are at least two common situations in which large numbers of small files are generated:
1. The small files are pieces of a larger logical file. Since HDFS has only recently supported appends, a common pattern for saving unbounded files such as log files has been to write the data to HDFS in many small chunks.
2. The files are inherently small, for example many small image files. Each image is a separate file, and there is no efficient way to combine them into one larger file.
These two cases call for different approaches. In the first case, the file is made up of records, so the problem can be avoided by frequently calling HDFS's sync() method as you write (in conjunction with append), as sketched below. Alternatively, you can write a program to merge the small files together (see Nathan Marz's post about a tool called the Consolidator, which does exactly this).
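A minimal sketch of the frequent-sync pattern, assuming a Hadoop version whose FSDataOutputStream still exposes sync() (later releases renamed this to hflush()/hsync()); the namenode address, file path, batch size, and record source below are all invented for illustration:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Keep appending records to one large file instead of writing many small ones.
        FSDataOutputStream out = fs.create(new Path("/logs/app.log"));
        int written = 0;
        for (String record : fetchRecords()) {   // fetchRecords() stands in for your real log source
            out.writeBytes(record + "\n");
            if (++written % 1000 == 0) {
                out.sync();  // flush buffered data to the datanodes so readers can see it
            }
        }
        out.close();
        fs.close();
    }

    // Placeholder for however records are actually produced.
    private static Iterable<String> fetchRecords() {
        return java.util.Arrays.asList("record1", "record2", "record3");
    }
}
```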
For the second case, some form of container is needed to group these files in some way. Hadoop offers several options:
HAR files
Hadoop Archives (HAR files) were introduced in version 0.18.0 to alleviate the problem of large numbers of small files consuming namenode memory. HAR files work by building a layered file system on top of HDFS. A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the archive. To a client, nothing changes: all of the original files remain visible and accessible (albeit via a har:// URL), but the number of files stored in HDFS is reduced.
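A hedged sketch of what reading through a HAR looks like from a client; the archive path /user/alice/files.har is invented, and the exact har:// URI form may vary with how the archive was created:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The har:// scheme layers the archive's own namespace over HDFS.
        // Assumes an archive at /user/alice/files.har on the default file system.
        URI harUri = URI.create("har:///user/alice/files.har");
        FileSystem harFs = FileSystem.get(harUri, conf);

        // The original files are still visible and readable, even though HDFS
        // itself now stores only a handful of index and data files.
        for (FileStatus status : harFs.listStatus(new Path(harUri))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}
```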
Reading files through a HAR is no more efficient than reading them directly from HDFS, and may in fact be slightly slower, since each HAR file access requires reading two index files as well as the data file itself. And although HAR files can be used as input to MapReduce jobs, there is no special handling that lets maps treat the files packed inside a HAR as co-located HDFS blocks. It should be possible to build an input format that takes advantage of the improved locality of files inside HARs, but no such input format exists yet. Note that MultiFileInputSplit, even with the HADOOP-4565 improvement (choosing files in a split that are node-local), still needs a seek per small file.
Sequence Files
The usual response to "the small files problem" is: use a SequenceFile. The idea is to use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100 KB files, you can write a program to put these small files into a single SequenceFile (see the sketch below), and then process that SequenceFile in a streaming fashion, directly or with MapReduce. As a bonus, SequenceFiles are splittable, so MapReduce can break them into chunks and process each chunk independently. Unlike HARs, they also support compression. Block compression is the best option in most cases, since it compresses blocks of several records together rather than one record at a time.
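A minimal sketch of packing a local directory of small files into one SequenceFile, with the filename as the key and the raw bytes as the value; the input directory and output path are invented, and this uses the older SequenceFile.createWriter signature from that era:

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("/data/smallfiles.seq");  // hypothetical output path

        // Block compression groups many records into one compressed block,
        // which is usually the best choice for lots of tiny values.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);

        try {
            for (File f : new File("/tmp/smallfiles").listFiles()) {  // hypothetical input directory
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}
```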
Converting existing data into SequenceFiles can be slow. However, it is entirely possible to create a collection of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile; tools like this are very useful.) Going forward, it is best to design your data pipeline to write data directly into a SequenceFile at the source, rather than writing small files as an intermediate step.
Link: http://nicoleamanda.blog.163.com/blog/static/749961072009111805538447/
Original link: http://www.cloudera.com/blog/2009/02/the-small-files-problem/