Hadoop in Detail (V): Archives


Introduction

As we saw in Hadoop in Detail (I): HDFS Introduction, HDFS is not good at storing small files: every file occupies at least one block, and the metadata for every block is held in the namenode's memory. A large number of small files therefore eats up a large amount of memory on the namenode.
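
To put a rough number on it, using the commonly cited estimate of about 150 bytes of namenode heap per file system object: 10 million small files, each occupying its own block, amount to roughly 20 million objects (a file entry plus a block entry each), or around 3 GB of namenode memory, before a single byte of data is ever read.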

Hadoop archives address this problem: the archive tool packs many files into a single file (a har file), every file inside the archive can still be accessed transparently, and the archive can be used directly as input to a MapReduce job.

Usage

Hadoop archives are created with the archive tool; like distcp, covered in the previous article, archive runs as a MapReduce job. Let's first look at my directory structure:

[hadoop@namenode ~]$ hadoop fs -lsr
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:37 /user/hadoop/har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/input
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/input/1901
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/input/1902
-rw-r--r--   2 hadoop supergroup        293 2013-06-02 17:44 /user/hadoop/news.txt

We archive the input directory with the archive tool:

hadoop archive -archiveName input.har -p /user/hadoop/ input har

-archiveName specifies the name of the archive file, and -p specifies the parent directory relative to which the source paths are interpreted; you can list several files or directories to be archived. Let's look at the har file that was created.
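
The general form of the command is:

hadoop archive -archiveName <name>.har -p <parent> [<src> ...] <dest>

The <src> paths are interpreted relative to <parent>, and <dest> is the directory the har file is written into, har in our example.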

[hadoop@namenode ~]$ hadoop fs -ls har
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har
[hadoop@namenode ~]$ hadoop fs -ls har/input.har
Found 4 items
-rw-r--r--   2 hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har/_SUCCESS
-rw-r--r--   5 hadoop supergroup        272 2013-06-20 12:38 /user/hadoop/har/input.har/_index
-rw-r--r--   5 hadoop supergroup            2013-06-20 12:38 /user/hadoop/har/input.har/_masterindex
-rw-r--r--   2 hadoop supergroup    1777168 2013-06-20 12:38 /user/hadoop/har/input.har/part-0

Here you can see that the har file contains two index files and one or more part files (only one here, part-0). A part file is the concatenation of several original files, and the index files record where each original file sits inside the part files.
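
If you are curious how the index maps original files to byte ranges in the part file, you can simply cat it; in the classic har layout _index is a plain-text file whose lines hold a URL-encoded path followed by, for files, the part file name, start offset, and length (the exact fields vary slightly across versions, so treat this description as approximate):

[hadoop@namenode ~]$ hadoop fs -cat har/input.har/_index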

If the archive is accessed through a har URI, these internal files are hidden and only the original files are shown:

[hadoop@namenode ~]$ hadoop fs -lsr har:///user/hadoop/har/input.har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901

You can also drill into paths below the archive root, just as in a normal file system:

[hadoop@namenode ~]$ hadoop fs -lsr har:///user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
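
The har URI works with the other fs shell commands as well; for example, a single archived file can be copied back out to the local disk (the destination path /tmp/1901 is just an illustration):

[hadoop@namenode ~]$ hadoop fs -get har:///user/hadoop/har/input.har/input/1901 /tmp/1901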

If you want to name the underlying file system and namenode explicitly, for example when accessing the archive from another cluster, you can use the following form:


[hadoop@namenode ~]$ hadoop fs -lsr har://hdfs-namenode:9000/user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901

Note how the har URI is parsed: the har:// scheme is followed by the name of the underlying file system (hdfs) and its authority, joined by a hyphen. Everything up to and including the .har suffix is translated into a path on the underlying file system, hdfs://namenode:9000/user/hadoop/har/input.har in this example; the remaining part of the path, input, is then resolved inside the archive.
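
Broken into its components, the URI from the example reads as follows:

har://hdfs-namenode:9000/user/hadoop/har/input.har/input
  har://                       archive file system scheme
  hdfs-                        underlying file system scheme (hdfs)
  namenode:9000                authority (host:port) of the underlying file system
  /user/hadoop/har/input.har   path of the archive on the underlying file system
  /input                       path resolved inside the archive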

Deleting an archive is straightforward, but it must be deleted recursively, because to HDFS the archive is an ordinary directory; a plain delete would raise an error:

[hadoop@namenode ~]$ hadoop fs -rmr har/input.har
Deleted hdfs://192.168.115.5:9000/user/hadoop/har/input.har
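
Note that -rmr has since been deprecated; assuming a Hadoop 2 or later installation, the equivalent command is:

[hadoop@namenode ~]$ hadoop fs -rm -r har/input.har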

Limitations

Archive files have some limitations:

1. Creating an archive consumes as much disk space as the original files; the originals are copied into the part files, not moved.

2. Archives do not support compression; although a har file looks like a packaged archive, its contents are stored uncompressed.

3. An archive is immutable once created; to add or remove files you have to recreate the whole archive.

4. Although the namenode memory problem is solved, MapReduce efficiency is not: InputFormats are unaware of archives, so each small file inside the har still becomes its own input split, which is just as inefficient as processing the small files directly (see the example after this list).
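
A har path can nevertheless be handed to a MapReduce job like any other input path. As a minimal sketch, running the bundled wordcount example over the archived files would look like this (the examples jar name and location vary by distribution, so the path below is an assumption):

[hadoop@namenode ~]$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
    har:///user/hadoop/har/input.har/input output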

For more on the namenode memory problem itself, see the discussion of HDFS federation in the earlier article.
