A Small-File Solution Based on Hadoop SequenceFile


I. Overview
A small file is a file significantly smaller than the HDFS block size. Large numbers of such files cause serious scalability and performance problems for Hadoop. First, in HDFS every file, directory, and block is held in the NameNode's memory as an object, and each object occupies roughly 150 bytes. If there are 10 million small files and each occupies its own block, the NameNode needs on the order of 3 GB of memory; at 100 million files that grows to about 30 GB. NameNode memory capacity therefore severely limits cluster growth. Second, reading a large number of small files is far slower than reading a few large files of the same total size: HDFS was designed for streaming access to large files, and accessing many small files forces constant hops from one DataNode to another, seriously hurting performance. Finally, MapReduce processing of many small files is much slower than processing one large file of the same size, because each small file occupies a task slot, and starting and releasing all those tasks can consume most of the job's time.
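The NameNode memory figures above follow from simple arithmetic. A minimal sketch, assuming each small file costs one file object plus one block object (the two-objects-per-file assumption and the function name are illustrative, not Hadoop internals):

```python
# Rough NameNode memory estimate for the small-file problem described above.
# ~150 bytes per metadata object, per the text; each small file is assumed to
# cost one file object plus one block object (an assumption for illustration).
BYTES_PER_OBJECT = 150

def namenode_memory_bytes(num_files, objects_per_file=2):
    """Approximate NameNode heap consumed by file and block metadata."""
    return num_files * objects_per_file * BYTES_PER_OBJECT

gb = 1024 ** 3
print(round(namenode_memory_bytes(10_000_000) / gb, 2))   # about 3 GB for 10 million files
print(round(namenode_memory_bytes(100_000_000) / gb, 2))  # about 30 GB for 100 million files
```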

II. Hadoop's Built-in Solutions

For small files, Hadoop itself provides several solutions: Hadoop Archive, SequenceFile, and CombineFileInputFormat.

(1) Hadoop Archive

Hadoop Archive (HAR) is a file archiving tool that efficiently packs small files into HDFS blocks. It packages multiple small files into a single HAR file, which reduces the NameNode's memory usage while still allowing transparent access to the original files.

Two points should be noted when using HAR. First, the original small files are not deleted automatically after archiving. Second, creating a HAR file actually runs a MapReduce job (e.g. `hadoop archive -archiveName files.har -p /input /output`), so a running Hadoop cluster is required.

This solution requires manual maintenance and is best suited to administrative operations. Once created, a HAR archive cannot be modified, so it is not suitable for online scenarios where many users keep adding files.

(2) SequenceFile

A SequenceFile consists of a series of binary key/value pairs. Using each small file's name as the key and its contents as the value, a large number of small files can be merged into one large file.

Hadoop 0.21.0 ships SequenceFile with Writer, Reader, and Sorter classes for write, read, and sort operations. If your Hadoop version is earlier than 0.21.0, see [3].

This solution allows flexible access to the small files and places no limit on the number of users or files. However, a SequenceFile cannot be appended to, so it is best suited to writing a large batch of small files in one pass.
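The real Writer/Reader API is Java, but the key/value packing idea can be sketched in a few lines of self-contained Python. The `pack`/`unpack` names are hypothetical, and this illustrates only the record layout, not the actual SequenceFile binary format (which has headers, sync markers, and optional compression):

```python
# Sketch of the SequenceFile idea: pack many small files into one container
# as (file name, file contents) records with length-prefixed fields.
import struct

def pack(records):
    """records: iterable of (name, bytes) -> one packed bytes blob."""
    out = bytearray()
    for name, data in records:
        key = name.encode("utf-8")
        out += struct.pack(">II", len(key), len(data))  # key length, value length
        out += key + data
    return bytes(out)

def unpack(blob):
    """Yield (name, bytes) records back out of a packed blob."""
    pos = 0
    while pos < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + klen].decode("utf-8")
        pos += klen
        value = blob[pos:pos + vlen]
        pos += vlen
        yield key, value

packed = pack([("a.txt", b"hello"), ("b.txt", b"world")])
print(list(unpack(packed)))  # [('a.txt', b'hello'), ('b.txt', b'world')]
```

Note that, as with a real SequenceFile, reading a specific file back requires scanning records in order unless a separate index is kept.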

(3) CombineFileInputFormat

CombineFileInputFormat is a newer InputFormat that merges multiple files into a single split, and it also takes data locality into account when doing so.

This solution is relatively old and little information about it is available online; judging from what does exist, it is probably not as good as the second solution.

III. A Small-File Solution

Add a small-file processing module on top of the existing HDFS. The procedure is as follows:

  1. When a user uploads a file, determine whether it is a small file. If so, hand it to the small-file processing module; otherwise, hand it to the general file processing module.
  2. The small-file module runs a task that waits until the total size of the buffered files exceeds the HDFS block size, then uses the SequenceFile component to write them to HDFS in one pass, with each file name as the key and the file contents as the value.
  3. Delete the processed small files and record the write results in a database.
  4. For read operations, locate and read a file based on the result records in the database.
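The four steps above can be sketched end to end in self-contained Python, with a plain dict standing in for the database and an in-memory list standing in for HDFS. All names here (`SmallFileModule`, `SMALL_FILE_THRESHOLD`, etc.) are hypothetical illustrations, not Hadoop APIs:

```python
# End-to-end sketch of the proposed small-file module.
SMALL_FILE_THRESHOLD = 64  # stand-in for the HDFS block size, in bytes

class SmallFileModule:
    def __init__(self):
        self.pending = {}   # buffered small files not yet flushed (step 1)
        self.hdfs = []      # each entry stands in for one merged file on HDFS
        self.index = {}     # the "database": name -> (blob id, offset, length)

    def upload(self, name, data):
        if len(data) >= SMALL_FILE_THRESHOLD:
            return "general"        # step 1: large files take the normal path
        self.pending[name] = data   # step 1: small files enter this module
        if sum(len(d) for d in self.pending.values()) > SMALL_FILE_THRESHOLD:
            self._flush()           # step 2: merge once total exceeds block size
        return "small"

    def _flush(self):
        blob, blob_id = bytearray(), len(self.hdfs)
        for name, data in self.pending.items():
            # Step 3: record where each file landed in the "database".
            self.index[name] = (blob_id, len(blob), len(data))
            blob += data
        self.hdfs.append(bytes(blob))
        self.pending.clear()        # step 3: drop the originals after merging

    def read(self, name):
        if name in self.pending:    # not flushed yet
            return self.pending[name]
        blob_id, off, length = self.index[name]  # step 4: look up the database
        return self.hdfs[blob_id][off:off + length]

m = SmallFileModule()
m.upload("a", b"x" * 40)
m.upload("b", b"y" * 40)            # total 80 > 64, triggers a flush
print(m.read("a") == b"x" * 40, m.read("b") == b"y" * 40)  # True True
```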
