Hadoop Small File Optimization

Source: Internet
Author: User

This is a reprint of a good article about Hadoop small file optimization.

Original English from: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

Translation Source: http://nicoleamanda.blog.163.com/blog/static/749961072009111805538447/

After reading this article and combining it with my own earlier experience packaging small files into HAR files, my feeling is that neither packaging HAR files with the archive command nor assembling SequenceFiles is particularly efficient; both require an extra processing step. If you can solve the small file problem at all, it is best to merge the files at the point where the data is produced, so the problem is dealt with before the data ever enters HDFS.

-----------------------------------below is the original--------------------------------------------

First, let's be clear about what a small file is in Hadoop: a small file is one that is significantly smaller than the HDFS block size (the default block size is 64MB in Hadoop 1.x, settable via dfs.block.size; in Hadoop 2.x the default is 128MB, settable via dfs.blocksize). If you are storing small files, then you probably have a lot of them (otherwise you would not be turning to Hadoop), and the problem is that HDFS cannot handle large numbers of small files efficiently.
In HDFS, every file, directory, and block is represented as an object in the namenode's memory, and each object occupies roughly 150 bytes. So if there are 10 million files, each occupying one block, that is about 20 million objects (one for each file plus one for each block), and roughly 3GB of namenode memory is consumed just to hold this metadata. Scaling much beyond that level exceeds what current hardware can handle.
Not only that, but HDFS is not designed to handle large numbers of small files efficiently. It is designed primarily for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each file, which is a very inefficient access pattern.
Problems with large numbers of small files in MapReduce
Map tasks usually process one block of input at a time (using the default FileInputFormat). If the files are very small and there are a lot of them, each map task processes only a tiny amount of input, and the job produces many more map tasks, each of which carries its own bookkeeping overhead. Compare a single 1GB file with a default block size of 64MB against roughly 1GB of data stored as 100KB files: the latter uses one map task per small file, and the job can easily run ten times slower or worse than the former.
Hadoop has a couple of features that help mitigate this problem. Task JVM reuse lets multiple map tasks run in a single JVM, which reduces JVM startup overhead (set the mapred.job.reuse.jvm.num.tasks property; the default is 1, and -1 means no limit). The other option is MultiFileInputSplit, which lets a single map task process multiple splits.
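As a minimal sketch of the JVM-reuse setting (using the old mapred API; the class name is only illustrative, not from the original article):

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static JobConf withJvmReuse(JobConf conf) {
        // Default is 1 (a fresh JVM per task); -1 removes the limit, so one JVM
        // can run many map tasks and the startup cost is paid only once per slot.
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        return conf;
    }
}
```

Reuse helps most when individual tasks are short-lived, which is exactly the situation a flood of small files creates.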
Why do large numbers of small files arise? There are at least two common scenarios:
1. The small files are pieces of a larger logical file. Since HDFS has only recently supported appending to files, a very common pattern for saving unbounded files (such as log files) has been to write them to HDFS in many small chunks.
2. The files are inherently small, for example a large collection of small image files. Each image is a separate file, and there is no natural way to combine them into one larger file.

These two cases call for different solutions. In the first case, where the file is made up of a large number of records, the problem can be avoided by calling HDFS's sync() method periodically (used in conjunction with append) so that large files are written continuously. Alternatively, you can write a program to concatenate the small files together (see Nathan Marz's post about a tool called the Consolidator, which does exactly this).
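As a minimal sketch of the first approach (the HDFS path, record loop, and flush interval are illustrative assumptions; sync() has since been superseded by hflush()/hsync() in newer Hadoop releases):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ContinuousLogWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path logFile = new Path("/logs/app.log"); // hypothetical destination

        // Keep one large, growing file open instead of dropping many small chunk files.
        FSDataOutputStream out = fs.create(logFile);
        for (int i = 0; i < 1000000; i++) {
            out.writeBytes("record " + i + "\n");
            if (i % 10000 == 0) {
                // Push buffered records out to the datanodes so readers can see them
                // without closing the file and starting a new one.
                out.sync();
            }
        }
        out.close();
    }
}
```

The point is simply that one long-lived writer producing a single large file replaces thousands of small chunk files.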

For the second case, some kind of container is needed to group the files in some way. Hadoop offers a few options:

HAR files

Hadoop Archives (HAR files) were introduced in version 0.18.0 to alleviate the problem of lots of small files putting pressure on the namenode's memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the archive. For a client using a HAR file nothing changes: all of the original files remain visible and accessible (through a har:// URL). However, the number of files stored in HDFS is reduced.
Reading files through a HAR is no more efficient than reading files directly from HDFS, and may in fact be slightly slower, since every access to a file in a HAR requires reading two index files as well as the data file itself (see the figure below). And although HAR files can be used as input to a MapReduce job, there is no special way for maps to treat the files packaged in a HAR as if they were a single HDFS file. It should be possible to build an input format that takes advantage of HAR files to improve MapReduce efficiency, but no one has written such an input format yet. Note that MultiFileInputSplit, even with the HADOOP-4565 improvement (choosing files in a split that are node-local), still needs one seek per small file.
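As a minimal sketch of the "nothing changes for the client" point (the archive name and the paths inside it are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A file stored inside an archive, addressed with a har:// URL.
        Path fileInHar = new Path("har:///user/hadoop/archives/files.har/logs/part-00000");

        FileSystem fs = fileInHar.getFileSystem(conf); // resolves to the HAR filesystem
        // Listing the archive's contents looks just like listing an ordinary directory.
        for (FileStatus status : fs.listStatus(new Path("har:///user/hadoop/archives/files.har/logs"))) {
            System.out.println(status.getPath());
        }
        // Reading is also transparent, although internally it goes through the HAR index first.
        FSDataInputStream in = fs.open(fileInHar);
        IOUtils.copyBytes(in, System.out, conf, true);
    }
}
```

The archive itself would be created beforehand with an invocation of the archive command mentioned above, for example `hadoop archive -archiveName files.har -p /user/hadoop logs /user/hadoop/archives` (an illustrative command line, not one from the original article).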

[Figure: HAR file layout]


Sequence Files
A common response to questions about the "small files problem" is: use a SequenceFile. The idea is to use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then process the SequenceFile in a streaming fashion (directly, or using MapReduce). Not only that, SequenceFiles are splittable, so MapReduce can break them into chunks and process each chunk independently. Unlike HAR files, they also support compression. Block compression is the best option in most cases, because it compresses several records together rather than one record at a time.
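As a minimal sketch of that packing step (the local source directory and the HDFS output path are hypothetical):

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("/data/smallfiles.seq"); // hypothetical output path

        // Filename as the key, raw file contents as the value, block compression on.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            for (File f : new File("/local/smallfiles").listFiles()) { // hypothetical source dir
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}
```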

[Figure: SequenceFile layout]


Converting an existing collection of small files into SequenceFiles may be slow. However, it is perfectly possible to create a set of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile; tools like this are very useful.) Going one step further, it is best, if possible, to design your data pipeline to write the data directly into a SequenceFile at the source.
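Once the data is in a SequenceFile, whether converted after the fact or written directly by the pipeline, it can be consumed in the streaming fashion described above. A minimal sketch, reusing the same hypothetical path as the writer example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/data/smallfiles.seq"); // same hypothetical path as above

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
        try {
            Text filename = new Text();
            BytesWritable contents = new BytesWritable();
            // Stream through every (filename, contents) record in order.
            while (reader.next(filename, contents)) {
                System.out.println(filename + ": " + contents.getLength() + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}
```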
