One solution to the Hadoop small files problem: Hadoop Archives


Introduction

HDFS is not good at storing small files: every file occupies at least one block, and the metadata for each file and block is held in memory on the NameNode. As a rough rule of thumb, each file, directory, or block object costs on the order of 150 bytes of NameNode heap, so a very large number of small files will eat up a large amount of the NameNode's memory. Hadoop Archives can handle this problem effectively: they pack many files into a single archive file, the archived files can still be accessed transparently, and the archive can be used as the input of a MapReduce job.

Usage

A Hadoop archive is created with the archive tool; like the distcp tool covered earlier, archive runs as a MapReduce job. First, let's look at my directory structure:

    $ hadoop fs -lsr
    drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:37 /user/hadoop/har
    drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/input
    -rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/input/1901
    -rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/input/1902
    -rw-r--r--   2 hadoop supergroup        293 2013-06-02 17:44 /user/hadoop/news.txt
We archive this directory with the archive tool:
    hadoop archive -archiveName input.har -p /user/hadoop input har
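For reference, the general form of the command takes a parent directory with -p and one or more source paths relative to it, followed by a destination directory. A minimal sketch of archiving two sources at once (the docs directory here is purely hypothetical):

    # hadoop archive -archiveName <name>.har -p <parent dir> <src>* <dest>
    hadoop archive -archiveName all.har -p /user/hadoop input docs /user/hadoop/har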
-archiveName specifies the file name of the archive and -p specifies the parent directory; more than one directory or file can be placed into the same archive. Let's look at the har file that was just created:
    $ hadoop fs -ls har
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har
    $ hadoop fs -ls har/input.har
    Found 4 items
    -rw-r--r--   2 hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har/_SUCCESS
    -rw-r--r--   5 hadoop supergroup        272 2013-06-20 12:38 /user/hadoop/har/input.har/_index
    -rw-r--r--   5 hadoop supergroup            2013-06-20 12:38 /user/hadoop/har/input.har/_masterindex
    -rw-r--r--   2 hadoop supergroup    1777168 2013-06-20 12:38 /user/hadoop/har/input.har/part-0
Here you can see what the har file contains: two index files and one or more part files (only one part file appears here). A part file is the concatenation of several original files, and the index files are used to locate each original file inside it. If you access the archive through the har URI scheme, these internal files are hidden and only the original files are shown:
    $ hadoop fs -lsr har:///user/hadoop/har/input.har
    drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/har/input.har/input
    -rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
    -rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
You can also access files inside the archive just as you would in a normal file system:
    $ hadoop fs -lsr har:///user/hadoop/har/input.har/input
    -rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
    -rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
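Other file system shell commands work through the har scheme in the same way. A small sketch, reading one of the archived files and copying it back out into plain HDFS (the destination path here is arbitrary):

    # read an archived file through the har:// scheme
    hadoop fs -cat har:///user/hadoop/har/input.har/input/1901 | head
    # copy a file back out of the archive
    hadoop fs -cp har:///user/hadoop/har/input.har/input/1901 /user/hadoop/1901-copy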
If you want to access the archive remotely, you can use a command like the following:
    $ hadoop fs -lsr har://hdfs-namenode:9000/user/hadoop/har/input.har/input
    -rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
    -rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
The har at the beginning names the har file system; hdfs-<host>:<port> tells it which underlying file system to use; and the path up to the end of the .har file is translated to that underlying file system, so in this example it becomes hdfs://namenode:9000/user/hadoop/har/input.har. The remaining part of the path (input) is then resolved inside the archive. Deleting a har file is straightforward, but it must be deleted recursively, otherwise the command reports an error:
    $ hadoop fs -rmr har/input.har
    Deleted hdfs://192.168.115.5:9000/user/hadoop/har/input.har
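As mentioned in the introduction, an archive can also be used directly as the input of a MapReduce job simply by passing a har URI as the input path. A minimal sketch using the stock wordcount example (the jar name and output path are placeholders for whatever your installation provides):

    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount \
        har:///user/hadoop/har/input.har/input \
        /user/hadoop/har-wordcount-out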
Limitations

Archive files have some limitations:

1. Creating an archive file consumes as much disk space as the original files, since the originals are copied into the archive (see the quick check sketched after this list).

2. Archive files do not support compression, even though an archive may look like a compressed file.

3. An archive file cannot be changed once it has been created; if you need to modify anything in it, you have to recreate the archive.

4. Although the NameNode memory problem is alleviated, when a MapReduce job uses the archive as input the many small files are still handed to MapReduce as individual splits, which is obviously inefficient.
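A quick way to see limitation 1 in practice is to compare the space used by the original directory and by the archive (a simple sanity check; output not shown):

    hadoop fs -dus /user/hadoop/input
    hadoop fs -dus /user/hadoop/har/input.har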

For another approach to the NameNode memory problem, see the earlier article on HDFS Federation.

