A Small-File Solution Based on Hadoop SequenceFile


I. Overview
A small file is a file significantly smaller than the HDFS block size. Large numbers of such files cause serious scalability and performance problems for Hadoop. First, in HDFS every file, directory, and block is held in the NameNode's memory as an object, and each object occupies roughly 150 bytes. If there are 10 million small files and each occupies its own block, the NameNode needs on the order of 3 GB of memory; at 100 million files that grows to about 30 GB. NameNode memory capacity therefore severely limits cluster growth. Second, reading a large number of small files is far slower than reading a few large files of the same total size: HDFS was designed for streaming access to large files, and accessing many small files forces constant hops from one DataNode to another, seriously hurting performance. Finally, MapReduce processing of many small files is much slower than processing one large file of the same size, because each small file occupies a task slot, and starting and releasing all those tasks can consume most of the job's time.
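The NameNode memory figures above follow from simple arithmetic. A minimal sketch, assuming each small file costs one file object plus one block object (the two-objects-per-file assumption and the function name are illustrative, not Hadoop internals):

```python
# Rough NameNode memory estimate for the small-file problem described above.
# ~150 bytes per metadata object, per the text; each small file is assumed to
# cost one file object plus one block object (an assumption for illustration).
BYTES_PER_OBJECT = 150

def namenode_memory_bytes(num_files, objects_per_file=2):
    """Approximate NameNode heap consumed by file and block metadata."""
    return num_files * objects_per_file * BYTES_PER_OBJECT

gb = 1024 ** 3
print(round(namenode_memory_bytes(10_000_000) / gb, 2))   # about 3 GB for 10 million files
print(round(namenode_memory_bytes(100_000_000) / gb, 2))  # about 30 GB for 100 million files
```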

II. Hadoop's Built-in Solutions

For small files, Hadoop itself provides several solutions: Hadoop Archive, SequenceFile, and CombineFileInputFormat.

(1) Hadoop Archive

Hadoop Archive (HAR) is a file archiving tool that efficiently packs small files into HDFS blocks. It packages multiple small files into a single HAR file, which reduces the NameNode's memory usage while still allowing transparent access to the original files.

Two points should be noted when using HAR. First, the original small files are not deleted automatically after archiving. Second, creating a HAR file actually runs a MapReduce job (e.g. `hadoop archive -archiveName files.har -p /input /output`), so a running Hadoop cluster is required.

This solution requires manual maintenance and is best suited to administrative operations. Once created, a HAR archive cannot be modified, so it is not suitable for online scenarios where many users keep adding files.

(2) SequenceFile

A SequenceFile consists of a series of binary key/value pairs. Using each small file's name as the key and its contents as the value, a large number of small files can be merged into one large file.

Hadoop 0.21.0 ships SequenceFile with Writer, Reader, and Sorter classes for write, read, and sort operations. If your Hadoop version is earlier than 0.21.0, see [3].

This solution allows flexible access to the small files and places no limit on the number of users or files. However, a SequenceFile cannot be appended to, so it is best suited to writing a large batch of small files in one pass.
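The real Writer/Reader API is Java, but the key/value packing idea can be sketched in a few lines of self-contained Python. The `pack`/`unpack` names are hypothetical, and this illustrates only the record layout, not the actual SequenceFile binary format (which has headers, sync markers, and optional compression):

```python
# Sketch of the SequenceFile idea: pack many small files into one container
# as (file name, file contents) records with length-prefixed fields.
import struct

def pack(records):
    """records: iterable of (name, bytes) -> one packed bytes blob."""
    out = bytearray()
    for name, data in records:
        key = name.encode("utf-8")
        out += struct.pack(">II", len(key), len(data))  # key length, value length
        out += key + data
    return bytes(out)

def unpack(blob):
    """Yield (name, bytes) records back out of a packed blob."""
    pos = 0
    while pos < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + klen].decode("utf-8")
        pos += klen
        value = blob[pos:pos + vlen]
        pos += vlen
        yield key, value

packed = pack([("a.txt", b"hello"), ("b.txt", b"world")])
print(list(unpack(packed)))  # [('a.txt', b'hello'), ('b.txt', b'world')]
```

Note that, as with a real SequenceFile, reading a specific file back requires scanning records in order unless a separate index is kept.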

(3) CombineFileInputFormat

CombineFileInputFormat is a newer InputFormat that merges multiple files into a single split, and it also takes data locality into account when doing so.

This solution is relatively old and little information about it is available online; judging from what does exist, it is probably not as good as the second solution.

III. A Small-File Solution

Add a small-file processing module on top of the existing HDFS. The procedure is as follows:

  1. When a user uploads a file, determine whether it is a small file. If so, hand it to the small-file processing module; otherwise, hand it to the general file processing module.
  2. The small-file module runs a task that waits until the total size of the buffered files exceeds the HDFS block size, then uses the SequenceFile component to write them to HDFS in one pass, with each file name as the key and the file contents as the value.
  3. Delete the processed small files and record the write results in a database.
  4. For read operations, locate and read a file based on the result records in the database.
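The four steps above can be sketched end to end in self-contained Python, with a plain dict standing in for the database and an in-memory list standing in for HDFS. All names here (`SmallFileModule`, `SMALL_FILE_THRESHOLD`, etc.) are hypothetical illustrations, not Hadoop APIs:

```python
# End-to-end sketch of the proposed small-file module.
SMALL_FILE_THRESHOLD = 64  # stand-in for the HDFS block size, in bytes

class SmallFileModule:
    def __init__(self):
        self.pending = {}   # buffered small files not yet flushed (step 1)
        self.hdfs = []      # each entry stands in for one merged file on HDFS
        self.index = {}     # the "database": name -> (blob id, offset, length)

    def upload(self, name, data):
        if len(data) >= SMALL_FILE_THRESHOLD:
            return "general"        # step 1: large files take the normal path
        self.pending[name] = data   # step 1: small files enter this module
        if sum(len(d) for d in self.pending.values()) > SMALL_FILE_THRESHOLD:
            self._flush()           # step 2: merge once total exceeds block size
        return "small"

    def _flush(self):
        blob, blob_id = bytearray(), len(self.hdfs)
        for name, data in self.pending.items():
            # Step 3: record where each file landed in the "database".
            self.index[name] = (blob_id, len(blob), len(data))
            blob += data
        self.hdfs.append(bytes(blob))
        self.pending.clear()        # step 3: drop the originals after merging

    def read(self, name):
        if name in self.pending:    # not flushed yet
            return self.pending[name]
        blob_id, off, length = self.index[name]  # step 4: look up the database
        return self.hdfs[blob_id][off:off + length]

m = SmallFileModule()
m.upload("a", b"x" * 40)
m.upload("b", b"y" * 40)            # total 80 > 64, triggers a flush
print(m.read("a") == b"x" * 40, m.read("b") == b"y" * 40)  # True True
```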
