Hadoop in Detail (V): Archives


Introduction

As we saw in Hadoop in Detail (I): HDFS Introduction, HDFS is not good at storing small files: every file occupies at least one block, and the metadata for every block is held in the namenode's memory. A large number of small files therefore eats up a large amount of memory on the namenode.
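
To put a rough number on it, using the commonly cited estimate of about 150 bytes of namenode heap per file system object: 10 million small files, each occupying its own block, amount to roughly 20 million objects (a file entry plus a block entry each), or around 3 GB of namenode memory, before a single byte of data is ever read.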

Hadoop archives address this problem: the archive tool packs many files into a single file (a har file), every file inside the archive can still be accessed transparently, and the archive can be used directly as input to a MapReduce job.

Usage

Hadoop archives are created with the archive tool; like distcp, covered in the previous article, archive runs as a MapReduce job. Let's first look at my directory structure:

[hadoop@namenode ~]$ hadoop fs -lsr
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:37 /user/hadoop/har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/input
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/input/1901
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/input/1902
-rw-r--r--   2 hadoop supergroup        293 2013-06-02 17:44 /user/hadoop/news.txt

We archive the input directory with the archive tool:

hadoop archive -archiveName input.har -p /user/hadoop/ input har

-archiveName specifies the name of the archive file, and -p specifies the parent directory relative to which the source paths are interpreted; you can list several files or directories to be archived. Let's look at the har file that was created.
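
The general form of the command is:

hadoop archive -archiveName <name>.har -p <parent> [<src> ...] <dest>

The <src> paths are interpreted relative to <parent>, and <dest> is the directory the har file is written into, har in our example.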

[hadoop@namenode ~]$ hadoop fs -ls har
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har
[hadoop@namenode ~]$ hadoop fs -ls har/input.har
Found 4 items
-rw-r--r--   2 hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har/_SUCCESS
-rw-r--r--   5 hadoop supergroup        272 2013-06-20 12:38 /user/hadoop/har/input.har/_index
-rw-r--r--   5 hadoop supergroup            2013-06-20 12:38 /user/hadoop/har/input.har/_masterindex
-rw-r--r--   2 hadoop supergroup    1777168 2013-06-20 12:38 /user/hadoop/har/input.har/part-0

Here you can see that the har file contains two index files and one or more part files (only one here, part-0). A part file is the concatenation of several original files, and the index files record where each original file sits inside the part files.
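
If you are curious how the index maps original files to byte ranges in the part file, you can simply cat it; in the classic har layout _index is a plain-text file whose lines hold a URL-encoded path followed by, for files, the part file name, start offset, and length (the exact fields vary slightly across versions, so treat this description as approximate):

[hadoop@namenode ~]$ hadoop fs -cat har/input.har/_index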

If the archive is accessed through a har URI, these internal files are hidden and only the original files are shown:

[hadoop@namenode ~]$ hadoop fs -lsr har:///user/hadoop/har/input.har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901

You can also drill into paths below the archive root, just as in a normal file system:

[hadoop@namenode ~]$ hadoop fs -lsr har:///user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
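
The har URI works with the other fs shell commands as well; for example, a single archived file can be copied back out to the local disk (the destination path /tmp/1901 is just an illustration):

[hadoop@namenode ~]$ hadoop fs -get har:///user/hadoop/har/input.har/input/1901 /tmp/1901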

If you want to name the underlying file system and namenode explicitly, for example when accessing the archive from another cluster, you can use the following form:


[hadoop@namenode ~]$ hadoop fs -lsr har://hdfs-namenode:9000/user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901

Note how the har URI is parsed: the har:// scheme is followed by the name of the underlying file system (hdfs) and its authority, joined by a hyphen. Everything up to and including the .har suffix is translated into a path on the underlying file system, hdfs://namenode:9000/user/hadoop/har/input.har in this example; the remaining part of the path, input, is then resolved inside the archive.
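
Broken into its components, the URI from the example reads as follows:

har://hdfs-namenode:9000/user/hadoop/har/input.har/input
  har://                       archive file system scheme
  hdfs-                        underlying file system scheme (hdfs)
  namenode:9000                authority (host:port) of the underlying file system
  /user/hadoop/har/input.har   path of the archive on the underlying file system
  /input                       path resolved inside the archive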

Deleting an archive is straightforward, but it must be deleted recursively, because to HDFS the archive is an ordinary directory; a plain delete would raise an error:

[hadoop@namenode ~]$ hadoop fs -rmr har/input.har
Deleted hdfs://192.168.115.5:9000/user/hadoop/har/input.har
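
Note that -rmr has since been deprecated; assuming a Hadoop 2 or later installation, the equivalent command is:

[hadoop@namenode ~]$ hadoop fs -rm -r har/input.har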

Limitations

Archive files have some limitations:

1. Creating an archive consumes as much disk space as the original files; the originals are copied into the part files, not moved.

2. Archives do not support compression; although a har file looks like a packaged archive, its contents are stored uncompressed.

3. An archive is immutable once created; to add or remove files you have to recreate the whole archive.

4. Although the namenode memory problem is solved, MapReduce efficiency is not: InputFormats are unaware of archives, so each small file inside the har still becomes its own input split, which is just as inefficient as processing the small files directly (see the example after this list).
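
A har path can nevertheless be handed to a MapReduce job like any other input path. As a minimal sketch, running the bundled wordcount example over the archived files would look like this (the examples jar name and location vary by distribution, so the path below is an assumption):

[hadoop@namenode ~]$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
    har:///user/hadoop/har/input.har/input output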

For more on the namenode memory problem itself, see the discussion of HDFS federation in the earlier article.
