Hadoop Archives (har): archiving history files (small files)

Source: Internet
Author: User
Tags: hadoop, fs

Application Scenarios

Keeping a large number of small files in HDFS (of course, not producing small files in the first place is the best practice) puts heavy pressure on the NameNode's namespace. The namespace holds the inode information for every HDFS file, so the more files there are, the more NameNode memory is needed, and that memory is finite (this is a long-standing Hadoop weakness).
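To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (inode or block); that figure is an assumption for illustration, not a measured value.

```java
public class NamenodeHeapEstimate {
    // Assumption: each namespace object (inode or block) costs on the
    // order of 150 bytes of NameNode heap.
    static final long BYTES_PER_OBJECT = 150L;

    // One inode plus one block object per small file
    // (each small file fits in a single block).
    static long heapBytesForSmallFiles(long fileCount) {
        return fileCount * 2 * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long files = 10_000_000L; // ten million small files
        // ~3 GB of NameNode heap just for the namespace entries
        System.out.println(heapBytesForSmallFiles(files) / (1024 * 1024) + " MB");
    }
}
```

Under this estimate, ten million small files consume on the order of 3 GB of NameNode heap before any data is read, which is why consolidating them into a handful of HAR part files helps.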

The following image shows the structure of a HAR file. A HAR file is generated through a MapReduce job, and the source files are not deleted after the job ends.


HAR Command Description

1. Archive command

(1). What are Hadoop Archives?
Hadoop Archives are a special file format. A Hadoop archive maps to a file system directory and always has the extension *.har. An archive contains metadata (in the form of _index and _masterindex files) and data files (part-*). The _index file records the name and location of each file inside the archive.

(2). How do I create an archive?
Usage: hadoop archive -archiveName NAME <src>* <dest>
Command options:
-archiveName  name of the archive to be created.
src           path(s) of the source files, on the source file system.
dest          destination directory in which to save the archive.
Example:
Example 1. Archive /user/hadoop/dir1 and /user/hadoop/dir2 into the file system directory /user/zoo/, producing /user/zoo/foo.har.
hadoop archive -archiveName foo.har /user/hadoop/dir1 /user/hadoop/dir2 /user/zoo/
After the archive is created, the source files are neither changed nor deleted.

(3). How do I view the files in an archive?
An archive is exposed to the outside world as a file system layer, so all FS shell commands can be run on it, just with a different URI. Note that an archive is immutable: create, rename, and delete operations all return errors. The URI for Hadoop Archives is har://scheme-hostname:port/archivepath/fileinarchive.
If no scheme-hostname is provided, the default file system is used; in that case the URI takes the form har:///archivepath/fileinarchive.
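The two URI forms above can be illustrated with a tiny helper; harUri is a hypothetical name for illustration, not part of the Hadoop API.

```java
public class HarUriForms {
    // Build a har:// URI. Pass null for hostPort to fall back to the
    // default file system (producing the har:///... form).
    static String harUri(String hostPort, String archivePath, String fileInArchive) {
        String authority = (hostPort == null) ? "" : hostPort;
        return "har://" + authority + archivePath + "/" + fileInArchive;
    }

    public static void main(String[] args) {
        // With an explicit underlying scheme-hostname:port
        System.out.println(harUri("hdfs-namenode:9000", "/user/hadoop/foo.har", "dir/fileA"));
        // Without one: the default file system is used
        System.out.println(harUri(null, "/user/hadoop/foo.har", "dir/fileA"));
    }
}
```

The second call prints har:///user/hadoop/foo.har/dir/fileA, showing where the triple slash comes from: an empty authority between scheme and path.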
Example:
Example 1. The archive input is /dir, which contains the files fileA and fileB; /dir is archived to /user/hadoop/foo.har.
hadoop archive -archiveName foo.har /dir /user/hadoop
Example 2. Get the list of files in the created archive:
hadoop fs -lsr har:///user/hadoop/foo.har
Example 3. View the fileA file inside the archive:
hadoop fs -cat har:///user/hadoop/foo.har/dir/fileA

Generate a HAR file
    • Single src folder:
hadoop archive -archiveName 419.har -p /fc/src/20120116/ 419 /user/heipark
    • Multiple src folders:
hadoop archive -archiveName combine.har -p /fc/src/20120116/ 419 334 /user/heipark
    • No src path specified: the parent path itself is archived (in this example "/fc/src/20120116/" is the input; "/user/heipark" is still the output path). This trick comes from the source code.
hadoop archive -archiveName combine.har -p /fc/src/20120116/ /user/heipark
    • Pattern-matched src paths: the following example archives the data in the folders for October, November, and December. This trick also comes from the source code.
hadoop archive -archiveName combine.har -p /fc/src/2011 1[0-2] /user/heipark

View the HAR file:
hadoop fs -ls har:///user/heipark/20120108_15.har/
# Output:
drw-r--r--   - hdfs hadoop          0 2012-01-17 16:30 /user/heipark/20120108_15.har/2025
drw-r--r--   - hdfs hadoop          0 2012-01-17 16:30 /user/heipark/20120108_15.har/2029

# View the HAR file via the HDFS file system
hadoop fs -ls /user/yue.zhang/20120108_15.har/
# Output:
-rw-r--r--   2 hdfs hadoop        0 2012-01-17 16:30 /user/heipark/20120108_15.har/_SUCCESS
-rw-r--r--   5 hdfs hadoop     2411 2012-01-17 16:30 /user/heipark/20120108_15.har/_index
-rw-r--r--   5 hdfs hadoop          2012-01-17 16:30 /user/heipark/20120108_15.har/_masterindex
-rw-r--r--   2 hdfs hadoop   191963 2012-01-17 16:30 /user/heipark/20120108_15.har/part-0

HAR Java API (HarFileSystem)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://xxx.xxx.xxx.xxx:9000");
    HarFileSystem fs = new HarFileSystem();
    fs.initialize(new URI("har:///user/heipark/20120108_15.har"), conf);
    FileStatus[] listStatus = fs.listStatus(new Path("sub_dir"));
    for (FileStatus fileStatus : listStatus) {
        System.out.println(fileStatus.getPath().toString());
    }
}

  
