Hadoop Archives Guide
Overview
Hadoop Archives (HAR) are an archive format; according to the official documentation, a Hadoop archive maps to a file system directory. Why do we need Hadoop Archives? Because HDFS is not good at storing small files. Files are stored as blocks on HDFS, and the metadata for every file and block is kept in the namenode, which loads it into memory when it starts. If there are a large number of small files (files smaller than the block size), this metadata adds up quickly. For example, with a 128MB block size, a single 128MB file needs only one block (assuming a single replica), so the namenode stores only one set of metadata; 128 files of 1MB each, however, require 128 metadata entries in the namenode. Clearly, a small number of large files consumes far less namenode memory than a large number of small files.
Hadoop Archives address this: multiple files are archived into a single file, each archived file can still be accessed transparently, and the archive can be used as input to a MapReduce job, thereby reducing the namenode's memory consumption. Hadoop archives are similar to tar on Linux; they are used to archive files on HDFS and carry the .har extension. A Hadoop archive contains metadata (in the form of _index and _masterindex files) and data (part-) files. The _index file records the names of the archived files and their locations within the part files.
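To see these internal files, you can list the .har directory as an ordinary HDFS path. A minimal sketch, assuming an archive named foo.har already exists under /user/zoo (the same archive used in the example later in this guide):
hdfs dfs -ls /user/zoo/foo.har
The listing typically shows the _index and _masterindex metadata files alongside one or more part-N data files (and a _SUCCESS marker left by the MapReduce job that built the archive).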
Create Archive
hadoop archive -archiveName name -p <parent> [-r <replication factor>] <src>* <dest>
-archiveName name: Specifies the name of the archive, for example test.har
-p <parent>: Specifies the parent directory; the <src> paths that follow are interpreted relative to this directory. For example, with
-p /foo/bar a/b/c e/f/g
the parent directory is /foo/bar and the archived sources are /foo/bar/a/b/c and /foo/bar/e/f/g (see the full command after this option list).
-r <replication factor>: Specifies the replication factor of the archived files; if it is not specified, the default is 10.
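Putting the options together, a full command using the parent-directory example above could look like the following (the archive name test.har and the output directory /outputdir are illustrative):
hadoop archive -archiveName test.har -p /foo/bar -r 3 a/b/c e/f/g /outputdir
This archives /foo/bar/a/b/c and /foo/bar/e/f/g into /outputdir/test.har.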
Another thing to note is that the hadoop archive tool runs a MapReduce job, so first make sure the cluster is able to run MapReduce.
The following example archives a single directory, /foo/bar:
hadoop archive -archiveName zoo.har -p /foo/bar -r 3 /outputdir
This creates an archive named zoo.har under /outputdir, containing the contents of /foo/bar, stored with a replication factor of 3.
View created Archive
The archive itself is exposed to the outside world as a file system layer, so all of the hadoop fs shell commands work on the archive. The difference is that you can no longer use an hdfs://host:8020 URI; you have to use the archive file system URI instead. Note also that archives are immutable, so you cannot delete or rename files within an archive. The URI for Hadoop Archives is:
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided, the underlying file system is assumed, i.e. har:///archivepath/fileinarchive.
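For instance, to read an archived file through an explicit HDFS namenode, a command might look like the following (the hostname, port, and file name are illustrative):
hadoop fs -cat har://hdfs-namenode:8020/user/zoo/foo.har/dir1/somefile.txt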
Unarchive
Because all of the fs shell commands work on the archive file system and access the archived files transparently, unarchiving is simply a matter of copying the files out.
For example, to unarchive sequentially:
hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir
Or, to unarchive in parallel, use distcp (which runs a MapReduce job to do the copy):
hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir

A complete example (from the official documentation)
Create Archive
hadoop archive -archiveName foo.har -p /user/hadoop -r 3 dir1 dir2 /user/zoo
The command above uses /user/hadoop as the parent (relative) directory to create an archive called foo.har under /user/zoo, containing /user/hadoop/dir1 and /user/hadoop/dir2. The command does not delete /user/hadoop/dir1 or /user/hadoop/dir2; if you want to remove the input files after creating the archive (to reduce namespace usage in the namenode), you have to delete them yourself.
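For example, if you do want to reclaim the namespace after the archive has been created, the input directories can be removed explicitly (a sketch; double-check the paths before deleting anything):
hadoop fs -rm -r /user/hadoop/dir1 /user/hadoop/dir2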
View Archive
hadoop fs -ls har:///user/zoo/foo.har
The output is as follows:
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
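You can also list the files inside one of the archived directories in the same way; for example (the contents shown depend on what was archived into dir1):
hadoop fs -ls har:///user/zoo/foo.har/dir1
The paths returned can be fed directly to MapReduce jobs as input, just like ordinary HDFS paths.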