Hadoop Archives (har): archiving history files (small files)

Source: Internet
Author: User
Tags: hadoop, fs

Application Scenarios

Keeping a large number of small files in HDFS (of course, not producing small files in the first place is the best practice) puts heavy pressure on the NameNode's namespace. The namespace holds the inode information for every HDFS file, so the more files there are, the more NameNode memory is needed, and that memory is finite (this is a long-standing Hadoop weakness).
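To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (inode or block); that figure is an assumption for illustration, not a measured value.

```java
public class NamenodeHeapEstimate {
    // Assumption: each namespace object (inode or block) costs on the
    // order of 150 bytes of NameNode heap.
    static final long BYTES_PER_OBJECT = 150L;

    // One inode plus one block object per small file
    // (each small file fits in a single block).
    static long heapBytesForSmallFiles(long fileCount) {
        return fileCount * 2 * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long files = 10_000_000L; // ten million small files
        // ~3 GB of NameNode heap just for the namespace entries
        System.out.println(heapBytesForSmallFiles(files) / (1024 * 1024) + " MB");
    }
}
```

Under this estimate, ten million small files consume on the order of 3 GB of NameNode heap before any data is read, which is why consolidating them into a handful of HAR part files helps.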

The following image shows the structure of a HAR file. A HAR file is generated through a MapReduce job, and the source files are not deleted after the job ends.


HAR Command Description

1. Archive command

(1). What are Hadoop Archives?
Hadoop Archives are a special file format. A Hadoop archive maps to a file system directory and always has the extension *.har. An archive contains metadata (in the form of _index and _masterindex files) and data files (part-*). The _index file records the name and location of each file inside the archive.

(2). How do I create an archive?
Usage: hadoop archive -archiveName NAME <src>* <dest>
Command options:
-archiveName  name of the archive to be created.
src           path(s) of the source files, on the source file system.
dest          destination directory in which to save the archive.
Example:
Example 1. Archive /user/hadoop/dir1 and /user/hadoop/dir2 into the file system directory /user/zoo/, producing /user/zoo/foo.har.
hadoop archive -archiveName foo.har /user/hadoop/dir1 /user/hadoop/dir2 /user/zoo/
After the archive is created, the source files are neither changed nor deleted.

(3). How do I view the files in an archive?
An archive is exposed to the outside world as a file system layer, so all FS shell commands can be run on it, just with a different URI. Note that an archive is immutable: create, rename, and delete operations all return errors. The URI for Hadoop Archives is har://scheme-hostname:port/archivepath/fileinarchive.
If no scheme-hostname is provided, the default file system is used; in that case the URI takes the form har:///archivepath/fileinarchive.
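The two URI forms above can be illustrated with a tiny helper; harUri is a hypothetical name for illustration, not part of the Hadoop API.

```java
public class HarUriForms {
    // Build a har:// URI. Pass null for hostPort to fall back to the
    // default file system (producing the har:///... form).
    static String harUri(String hostPort, String archivePath, String fileInArchive) {
        String authority = (hostPort == null) ? "" : hostPort;
        return "har://" + authority + archivePath + "/" + fileInArchive;
    }

    public static void main(String[] args) {
        // With an explicit underlying scheme-hostname:port
        System.out.println(harUri("hdfs-namenode:9000", "/user/hadoop/foo.har", "dir/fileA"));
        // Without one: the default file system is used
        System.out.println(harUri(null, "/user/hadoop/foo.har", "dir/fileA"));
    }
}
```

The second call prints har:///user/hadoop/foo.har/dir/fileA, showing where the triple slash comes from: an empty authority between scheme and path.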
Example:
Example 1. The archive input is /dir, which contains the files fileA and fileB; /dir is archived to /user/hadoop/foo.har.
hadoop archive -archiveName foo.har /dir /user/hadoop
Example 2. Get the list of files in the created archive:
hadoop fs -lsr har:///user/hadoop/foo.har
Example 3. View the fileA file inside the archive:
hadoop fs -cat har:///user/hadoop/foo.har/dir/fileA

Generate a HAR file
    • Single src folder:
hadoop archive -archiveName 419.har -p /fc/src/20120116/ 419 /user/heipark
    • Multiple src folders:
hadoop archive -archiveName combine.har -p /fc/src/20120116/ 419 334 /user/heipark
    • No src path specified: the parent path itself is archived (in this example "/fc/src/20120116/" is the input; "/user/heipark" is still the output path). This trick comes from the source code.
hadoop archive -archiveName combine.har -p /fc/src/20120116/ /user/heipark
    • Pattern-matched src paths: the following example archives the data in the folders for October, November, and December. This trick also comes from the source code.
hadoop archive -archiveName combine.har -p /fc/src/2011 1[0-2] /user/heipark

View the HAR file:
hadoop fs -ls har:///user/heipark/20120108_15.har/
# Output:
drw-r--r--   - hdfs hadoop          0 2012-01-17 16:30 /user/heipark/20120108_15.har/2025
drw-r--r--   - hdfs hadoop          0 2012-01-17 16:30 /user/heipark/20120108_15.har/2029

# View the HAR file via the HDFS file system
hadoop fs -ls /user/yue.zhang/20120108_15.har/
# Output:
-rw-r--r--   2 hdfs hadoop        0 2012-01-17 16:30 /user/heipark/20120108_15.har/_SUCCESS
-rw-r--r--   5 hdfs hadoop     2411 2012-01-17 16:30 /user/heipark/20120108_15.har/_index
-rw-r--r--   5 hdfs hadoop          2012-01-17 16:30 /user/heipark/20120108_15.har/_masterindex
-rw-r--r--   2 hdfs hadoop   191963 2012-01-17 16:30 /user/heipark/20120108_15.har/part-0

HAR Java API (HarFileSystem)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://xxx.xxx.xxx.xxx:9000");
    HarFileSystem fs = new HarFileSystem();
    fs.initialize(new URI("har:///user/heipark/20120108_15.har"), conf);
    FileStatus[] listStatus = fs.listStatus(new Path("sub_dir"));
    for (FileStatus fileStatus : listStatus) {
        System.out.println(fileStatus.getPath().toString());
    }
}

  
