Hadoop Archives Guide
Overview
Hadoop Archives (HAR) are an archive format; according to the official documentation, a Hadoop archive maps to a file system directory. Why do we need Hadoop Archives? Because HDFS is not good at storing small files. Files are stored as blocks on HDFS, and the metadata for every file and block is kept in the namenode, which loads it into memory when it starts. If there are a large number of small files (files smaller than the block size), this metadata adds up quickly. For example, with a 128MB block size, a single 128MB file needs only one block (assuming a single replica), so the namenode stores only one set of metadata; 128 files of 1MB each, however, require 128 metadata entries in the namenode. Clearly, a small number of large files consumes far less namenode memory than a large number of small files.
Hadoop Archives address this: multiple files are archived into a single file, each archived file can still be accessed transparently, and the archive can be used as input to a MapReduce job, thereby reducing the namenode's memory consumption. Hadoop archives are similar to tar on Linux; they are used to archive files on HDFS and carry the .har extension. A Hadoop archive contains metadata (in the form of _index and _masterindex files) and data (part-) files. The _index file records the names of the archived files and their locations within the part files.
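To see these internal files, you can list the .har directory as an ordinary HDFS path. A minimal sketch, assuming an archive named foo.har already exists under /user/zoo (the same archive used in the example later in this guide):
hdfs dfs -ls /user/zoo/foo.har
The listing typically shows the _index and _masterindex metadata files alongside one or more part-N data files (and a _SUCCESS marker left by the MapReduce job that built the archive).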
Create Archive
hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>
-archiveName name: Specifies the name of the archive, for example test.har
-p <parent>: Specifies the parent directory; the <src> paths that follow are interpreted relative to this directory. For example, with
-p /foo/bar a/b/c e/f/g
the parent directory is /foo/bar and the archived sources are /foo/bar/a/b/c and /foo/bar/e/f/g (see the full command after this option list).
-r <replication factor>: Specifies the replication factor of the archived files; if it is not specified, the default is 10.
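Putting the options together, a full command using the parent-directory example above could look like the following (the archive name test.har and the output directory /outputdir are illustrative):
hadoop archive -archiveName test.har -p /foo/bar -r 3 a/b/c e/f/g /outputdir
This archives /foo/bar/a/b/c and /foo/bar/e/f/g into /outputdir/test.har.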
Another thing to note is that the hadoop archive tool runs a MapReduce job, so first make sure the cluster is able to run MapReduce.
The following example archives a single directory, /foo/bar:
hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir
This creates an archive named zoo.har under /outputdir, containing the contents of /foo/bar, stored with a replication factor of 3.
View created Archive
The archive itself is exposed to the outside world as a file system layer, so all of the hadoop fs shell commands work on the archive. The difference is that you can no longer use an hdfs://host:8020 URI; you have to use the archive file system URI instead. Note also that archives are immutable, so you cannot delete or rename files within an archive. The URI for Hadoop Archives is:
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided, the underlying file system is assumed, i.e. har:///archivepath/fileinarchive.
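For instance, to read an archived file through an explicit HDFS namenode, a command might look like the following (the hostname, port, and file name are illustrative):
hadoop fs -cat har://hdfs-namenode:8020/user/zoo/foo.har/dir1/somefile.txt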
Unarchive
Because all of the fs shell commands work on the archive file system and access the archived files transparently, unarchiving is simply a matter of copying the files out.
For example, to unarchive sequentially:
hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir
Or, to unarchive in parallel, use distcp (which runs a MapReduce job to do the copy):
hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir

A complete example (from the official documentation)
Create Archive
hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo
The command above uses /user/hadoop as the parent (relative) directory to create an archive called foo.har under /user/zoo, containing /user/hadoop/dir1 and /user/hadoop/dir2. The command does not delete /user/hadoop/dir1 or /user/hadoop/dir2; if you want to remove the input files after creating the archive (to reduce namespace usage in the namenode), you have to delete them yourself.
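For example, if you do want to reclaim the namespace after the archive has been created, the input directories can be removed explicitly (a sketch; double-check the paths before deleting anything):
hadoop fs -rm -r /user/hadoop/dir1 /user/hadoop/dir2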
View Archive
hadoop fs -ls har:///user/zoo/foo.har
The output is as follows:
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
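You can also list the files inside one of the archived directories in the same way; for example (the contents shown depend on what was archived into dir1):
hadoop fs -ls har:///user/zoo/foo.har/dir1
The paths returned can be fed directly to MapReduce jobs as input, just like ordinary HDFS paths.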