Archiving history files with Hadoop har (the small-files problem)

Source: Internet
Author: User
Tags: hadoop, fs
Application Scenarios

Keeping a large number of small files in HDFS (although not producing small files in the first place is of course the best practice) puts heavy pressure on the NameNode's namespace. The namespace holds the inode information for every HDFS file, so the more files there are, the more memory the NameNode needs; and memory is finite after all (this is a long-standing weak point of Hadoop).
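To make the memory pressure concrete, here is a back-of-envelope sketch. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (inode or block); the file count is an illustrative number, not a measurement:

```shell
# Hedged estimate: ~150 bytes of NameNode heap per namespace object
# (inode or block) is a rule of thumb, not an exact figure.
FILES=100000000          # 100 million small files (illustrative)
BYTES_PER_OBJECT=150
# each small file costs at least one inode plus one block object
HEAP_GIB=$(( FILES * 2 * BYTES_PER_OBJECT / 1024 / 1024 / 1024 ))
echo "${HEAP_GIB} GiB of NameNode heap, give or take"
```

Tens of gigabytes of heap just for bookkeeping is exactly the situation har archives are meant to relieve: many small files collapse into a few large HDFS objects.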

The following figure shows the structure of a har file: index files pointing into the part files that hold the data. The har file is generated by a MapReduce job, and the source files are not deleted after the job finishes.


hadoop archive command

Usage:

    hadoop archive -archiveName <name> -p <parent path> <src>* <dest>

The -p parameter gives the common prefix of the src paths, and more than one src path may be listed.

Single src folder:

    hadoop archive -archiveName 419.har -p /fc/src/20120116/ 419 /user/heipark

Multiple src folders:

    hadoop archive -archiveName combine.har -p /fc/src/20120116/ 419 334 /user/heipark

No src path at all: the parent path itself is archived (in this example "/fc/src/20120116/"; "/user/heipark" is still the output path). This trick was dug out of the source code:

    hadoop archive -archiveName combine.har -p /fc/src/20120116/ /user/heipark

Pattern matching on the src path: the following example archives the folders 10, 11 and 12 (October through December). This trick also comes from the source code:

    hadoop archive -archiveName combine.har -p /fc/src/2011 1[0-2] /user/heipark

Viewing a har file

    hadoop fs -ls har:///user/heipark/20120108_15.har/

# Output:
drw-r--r--   - hdfs hadoop          0 2012-01-17 16:30 /user/heipark/20120108_15.har/2025
drw-r--r--   - hdfs hadoop          0 2012-01-17 16:30 /user/heipark/20120108_15.har/2029

 
# Viewing the har's internal files through the HDFS file system
    hadoop fs -ls /user/heipark/20120108_15.har/

# Output:
-rw-r--r--   2 hdfs hadoop          0 2012-01-17 16:30 /user/heipark/20120108_15.har/_SUCCESS
-rw-r--r--   5 hdfs hadoop       2411 2012-01-17 16:30 /user/heipark/20120108_15.har/_index
-rw-r--r--   5 hdfs hadoop            2012-01-17 16:30 /user/heipark/20120108_15.har/_masterindex
-rw-r--r--   2 hdfs hadoop     191963 2012-01-17 16:30 /user/heipark/20120108_15.har/part-0
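The archived files stay readable without unpacking anything, as a quick sketch shows; the paths below reuse the archive listed above, while /user/heipark/restored is a hypothetical destination of my choosing:

```shell
# The archive stays readable (read-only) through the har:// scheme;
# /user/heipark/restored below is a hypothetical destination path.
hadoop fs -ls har:///user/heipark/20120108_15.har/2025

# There is no unarchive command; to restore files, copy them back out
# (DistCp can do the same copy in parallel for large archives):
hadoop fs -cp har:///user/heipark/20120108_15.har/2025 /user/heipark/restored
```

Because a har is read-only, adding or changing files means building a new archive.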

har Java API (HarFileSystem)

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.HarFileSystem;
    import org.apache.hadoop.fs.Path;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://xxx.xxx.xxx.xxx:9000");

        // Open the archive through HarFileSystem; relative paths are
        // resolved against the archive root.
        HarFileSystem fs = new HarFileSystem();
        fs.initialize(new URI("har:///user/heipark/20120108_15.har"), conf);
        FileStatus[] listStatus = fs.listStatus(new Path("sub_dir"));
        for (FileStatus fileStatus : listStatus) {
            System.out.println(fileStatus.getPath().toString());
        }
    }

