Introduction
HDFS is not good at storing small files: every file occupies at least one block, and the metadata for each block takes up memory on the NameNode, so a large number of small files will eat up a large amount of NameNode memory. Hadoop Archives (har files) can help with this problem. The archive tool packs many files into a single archive, every file inside the archive can still be accessed transparently, and the archive can be used as input to a MapReduce job.
Usage
Hadoop archives are created with the archive tool. Like distcp, described earlier, the archive tool runs as a MapReduce job. First, let's look at my directory structure:
$ hadoop fs -lsr
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:37 /user/hadoop/har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/input
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/input/1901
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/input/1902
-rw-r--r--   2 hadoop supergroup        293 2013-06-02 17:44 /user/hadoop/news.txt
Now let's archive this directory with the archive tool:
hadoop archive -archiveName input.har -p /user/hadoop/ input har
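As an aside, the general form of the command is sketched below (the placeholder names are mine, not part of the original example):

# <parent> is the directory the source paths are relative to,
# <src>* is one or more files or directories under <parent>,
# and <dest> is the directory where <name>.har will be created.
hadoop archive -archiveName <name>.har -p <parent> <src>* <dest>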
-archiveName specifies the name of the archive file, and -p specifies the parent directory; more than one file or directory can be put into the same archive. Let's look at the har file that was just created.
$ hadoop fs -ls har
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har
$ hadoop fs -ls har/input.har
Found 4 items
-rw-r--r--   2 hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har/_SUCCESS
-rw-r--r--   5 hadoop supergroup        272 2013-06-20 12:38 /user/hadoop/har/input.har/_index
-rw-r--r--   5 hadoop supergroup            2013-06-20 12:38 /user/hadoop/har/input.har/_masterindex
-rw-r--r--   2 hadoop supergroup    1777168 2013-06-20 12:38 /user/hadoop/har/input.har/part-0
Here you can see that the har file consists of two index files and one or more part files (only one part file is shown here). A part file is simply the concatenation of the original files, and each original file is located through the index files. If you access the archive through a har URI, these internal files are hidden and only the original files are visible:
$ hadoop fs -lsr har:///user/hadoop/har/input.har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
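As an aside, the _index file is what makes this transparent listing possible: it records, for each archived path, which part file holds its bytes and at what offset and length. You can dump it directly if you are curious (a minimal sketch; the exact output format varies between Hadoop versions and is not shown here):

hadoop fs -cat /user/hadoop/har/input.har/_index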
You can also access files one level down inside the archive, just as with a normal file system:
$ hadoop fs -lsr har:///user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
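Reads go through the archive transparently as well, not just listings. For example, the contents of one of the archived files can be printed with a plain cat (a minimal sketch, output omitted):

hadoop fs -cat har:///user/hadoop/har/input.har/input/1901 | head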
If you want to access the archive remotely, you can use a fully qualified URI:
$ hadoop fs -lsr har://hdfs-namenode:9000/user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
The har:// scheme at the beginning identifies the HAR file system, and hdfs-<namenode>:<port> names the underlying file system and where to find it. The HAR file system translates everything up to and including the .har file into a path on that underlying file system, in this example hdfs://namenode:9000/user/hadoop/har/input.har, and the remaining part (input) is then resolved inside the archive. Deleting a har file is straightforward, but it must be deleted recursively, otherwise an error is reported:
$ hadoop fs -rmr har/input.har
Deleted hdfs://192.168.115.5:9000/user/hadoop/har/input.har
Limitations
Archive files have some limitations:
1. Creating an archive file consumes about as much disk space as the original files.
2. Archive files do not support compression, even though a har file may look as if it were compressed.
3. Archive files are immutable once created; if you want to add or change anything, you have to re-create the archive.
4. Although the archive relieves the NameNode's memory pressure, when a har file is used as MapReduce input there is no archive-aware InputFormat that packs the many small files into a single split, so processing them is still inefficient (see the sketch after this list).
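A har file can be used directly as MapReduce input by passing a har URI as the input path. A minimal sketch using the stock wordcount example, assuming the archive still exists and that the examples jar path and output directory match your installation:

# The har URI is used directly as the job's input path.
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
    har:///user/hadoop/har/input.har/input \
    /user/hadoop/wordcount-out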
For another way to address the NameNode memory problem, see the earlier article on HDFS Federation.