Hadoop Archives Guide

Overview

Hadoop archives are an archive format. According to the official documentation, a Hadoop archive corresponds to a file system directory. So why do we need Hadoop Archives? Because HDFS is not good at storing small files. Files are stored on HDFS as blocks, and the metadata for each file and its blocks is kept in the NameNode, which loads it into memory at startup. If there are a large number of small files (files smaller than the block size), this metadata adds up quickly. For example, with a 128 MB block size, a single 128 MB file needs only one block (assuming one replica), so the NameNode stores only one block's worth of metadata; 128 files of 1 MB each, however, require 128 metadata entries in the NameNode. Clearly, a small number of large files consumes far less NameNode memory than a large number of small files.

Hadoop Archives exist for exactly this reason: multiple files are archived into a single file, each archived file can still be accessed transparently, and the archive can be used as input to a MapReduce job, all while reducing the NameNode's memory consumption. Hadoop archives are similar to tar on Linux; they are used to archive files on HDFS and carry the .har extension. A Hadoop archive contains metadata (in the form of _index and _masterindex files) and data (part-*) files. The _index file records the name of each file in the archive and its location within the part files.
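To make the layout concrete, here is a minimal sketch of what listing an archive through the underlying file system shows, assuming an archive already exists at the hypothetical path /user/zoo/foo.har:

hdfs dfs -ls /user/zoo/foo.har
# Typical contents (sizes and number of part files vary):
#   /user/zoo/foo.har/_index
#   /user/zoo/foo.har/_masterindex
#   /user/zoo/foo.har/part-0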

Create Archive

hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>

-archiveName name: Specifies the name of the archive, for example test.har

-p <parent>: Specifies the parent directory; the src and dest paths that follow are interpreted relative to it. For example,

-p /foo/bar a/b/c e/f/g

Here you specify /foo/bar/a/b/c and /foo/bar/e/f/g, with /foo/bar as the parent directory.

-r: Specifies the replication factor; if not specified, the default is 10.

Another thing to note is that the hadoop archive tool runs a MapReduce job, so first make sure the cluster is able to run MapReduce.
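You can verify this beforehand with one of the bundled example jobs; a minimal sketch, assuming the standard examples jar location, which varies by installation:

# Quick smoke test that MapReduce jobs can run on the cluster:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10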

The following example shows how to archive a single directory, /foo/bar:

hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir


View created Archive

The archive exposes itself to the outside world as a file system layer, so all of the hadoop fs shell commands can be run on it. The difference is that you can no longer use an hdfs://host:8020 URI; you must use the archive's file system URI instead. Also note that an archive is immutable, so you cannot delete or rename files within it. The URI for Hadoop Archives is:

har://scheme-hostname:port/archivepath/fileinarchive
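For instance, reading a single archived file works like any other fs command. This is a minimal sketch assuming the archive /user/zoo/foo.har contains a hypothetical file dir1/hello.txt:

# Read an archived file transparently through the har:// scheme
# (no host given, so the default underlying file system is assumed):
hdfs dfs -cat har:///user/zoo/foo.har/dir1/hello.txt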

Unarchive

Because all fs shell commands work on the archive file system and access the archived files transparently, unarchiving is simply a copy. For example:

hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir

Or, to copy in parallel, use distcp:

hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir

A Complete Example (from the official documentation)

Create Archive

hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo

The above command uses /user/hadoop as the parent directory to create an archive named foo.har in /user/zoo; the archive contains /user/hadoop/dir1 and /user/hadoop/dir2. The command does not delete /user/hadoop/dir1 or /user/hadoop/dir2. If you want to remove the input files after creating the archive (to reduce namespace usage), you must delete them manually.

View Archive

hadoop fs -ls har:///user/zoo/foo.har

The output is as follows:

har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
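As noted in the overview, an archive can also serve as MapReduce input, since a har path behaves like any other input path. A minimal sketch (the examples jar location and the output path /user/zoo/wc-out are assumptions that vary by installation):

# Run the bundled wordcount example directly over an archived directory:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wc-out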