Hadoop issues and solutions for handling large numbers of small files
Source: Internet
Author: User
A small file is one that is significantly smaller than the HDFS block size (64MB by default). If you are storing small files in HDFS, you almost certainly have a lot of them (otherwise you would not be turning to Hadoop), and the problem is that HDFS cannot handle large numbers of small files efficiently.

Every file, directory, and block in HDFS is represented as an object in the namenode's memory, each of which occupies about 150 bytes. So 10 million files, each using a block, would consume roughly 3GB of namenode memory just to hold the metadata; scale much beyond that and you run into the limits of current hardware. Beyond the memory issue, HDFS is simply not designed to handle large numbers of small files: it is designed primarily for streaming access to large files. Reading small files causes a large number of seeks and a lot of hopping from datanode to datanode to retrieve each file, which is a very inefficient access pattern.

Problems with large numbers of small files in MapReduce

A map task normally processes one block of input at a time (using the default FileInputFormat). If the files are very small and there are many of them, each map task processes only a tiny amount of input, and the job produces a large number of map tasks, each of which carries some bookkeeping overhead. Compare a single 1GB file stored in 64MB blocks with the same 1GB split into roughly 10,000 files of 100KB each: the latter needs one map task per small file, and the job can easily run ten times or more slower than the former.

Hadoop has a couple of features that help mitigate this problem. You can enable task JVM reuse, so that several map tasks run in one JVM and some of the JVM startup cost is amortized (set the mapred.job.reuse.jvm.num.tasks property; the default is 1, and -1 means unlimited reuse), as in the configuration sketch below. Another option is MultiFileInputSplit, which lets one map process more than one split.
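As a rough sketch of the JVM-reuse setting for the old mapred API mentioned above (the job name and the input/output paths here are invented for illustration):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JvmReuseExample.class);
            conf.setJobName("small-files-job");

            // -1 means "reuse the task JVM an unlimited number of times";
            // the default of 1 starts a fresh JVM for every task.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            // Hypothetical input/output paths, just to make the sketch complete.
            FileInputFormat.setInputPaths(conf, new Path("/data/small-files"));
            FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

            JobClient.runJob(conf);
        }
    }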
Why do large numbers of small files occur?

There are at least two common scenarios:

1. The small files are pieces of a larger logical file. Because HDFS has only recently supported appending to files, a common pattern for saving unbounded files (for example, log files) has been to write the data to HDFS in many chunks.

2. The files are inherently small, for example a large collection of small image files. Each image is a separate file, and there is no efficient way to merge them into one large file.

These two cases call for different approaches. In the first case, where the file is made up of a large number of records, the problem can be avoided by calling HDFS's sync() method at regular intervals (in combination with append). Alternatively, you can write a program to merge the small files together (see Nathan Marz's post about a tool called the Consolidator, which does exactly this).

For the second case, some form of container is needed to group the files in some way. Hadoop offers a few choices.

HAR files

Hadoop Archives (HAR files) were introduced in version 0.18.0 to alleviate the problem of many small files consuming namenode memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the archive. Using a HAR file is transparent to the client: all of the original files remain visible and accessible (using a har:// URL), while the number of files stored in HDFS is reduced. A small access sketch follows below.
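As a hedged sketch of how archived files stay reachable through the ordinary FileSystem API: the archive name and paths below are hypothetical, and the exact syntax of the archive command varies a little between Hadoop versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHarContents {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical archive, created beforehand with something like:
            //   hadoop archive -archiveName files.har -p /user/me small-files /user/me/archives
            Path har = new Path("har:///user/me/archives/files.har/small-files");

            // The har:// scheme is handled by HarFileSystem, so the normal
            // FileSystem API still works and the archived files stay readable.
            FileSystem fs = har.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(har)) {
                System.out.println(status.getPath());
            }
        }
    }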
Reading a file through a HAR is no more efficient than reading it directly from HDFS, and may in fact be slightly slower, since each file access in a HAR requires reading two index files as well as the data file itself. And although HAR files can be used as input to a MapReduce job, there is no mechanism that lets map tasks treat the files packed inside a HAR as if they were a single HDFS file. It should be possible to write an InputFormat that takes advantage of the HAR layout to improve MapReduce efficiency, but no one has written such an InputFormat yet. Note that MultiFileInputSplit, even with the HADOOP-4565 improvement (choosing files in a split that are node-local), still needs one seek per small file.

Sequence files

The usual answer to "the small files problem" is to use a SequenceFile: use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to pack them into a single SequenceFile and then process that SequenceFile in a streaming fashion (directly or with MapReduce). Better still, SequenceFiles are splittable, so MapReduce can break them into chunks and process each chunk independently. Unlike HAR files, this approach also supports compression; block compression is the best choice in most cases, since it compresses several records at a time rather than one record at a time. Converting many existing small files into SequenceFiles can be slow, but it is entirely possible to create a series of SequenceFiles in parallel. (Stuart Sierra has written a very useful post about converting a tar file into a SequenceFile; tools like this are very useful.) Going a step further, if possible, design your data pipeline to write the data directly into SequenceFiles. A minimal packing sketch follows below.
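A minimal sketch of the filename-as-key, contents-as-value approach described above; the input directory and output path are hypothetical, and a production version would be run in parallel rather than as a single loop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical locations: a directory full of small files and the
            // single SequenceFile they are packed into.
            Path inputDir = new Path("/data/small-files");
            Path output = new Path("/data/packed.seq");

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, output, Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.BLOCK); // block compression, as discussed above
            try {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDir()) {            // isDirectory() on newer APIs
                        continue;
                    }
                    byte[] contents = new byte[(int) status.getLen()];
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        in.readFully(contents);      // small file, so reading it whole is fine
                    } finally {
                        in.close();
                    }
                    // Key: the file name; value: the raw file contents.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            } finally {
                writer.close();
            }
        }
    }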