Introduction
HDFS is not good at storing small files: every file occupies at least one block, and the metadata for each block takes up memory on the NameNode, so a large number of small files will eat up a large amount of NameNode memory. Hadoop Archives (har files) can help with this problem. The archive tool packs many files into a single archive, every file inside the archive can still be accessed transparently, and the archive can be used as input to a MapReduce job.
Usage
Hadoop archives are created with the archive tool. Like distcp, described earlier, the archive tool runs as a MapReduce job. First, let's look at my directory structure:
$ hadoop fs -lsr
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:37 /user/hadoop/har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/input
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/input/1901
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/input/1902
-rw-r--r--   2 hadoop supergroup        293 2013-06-02 17:44 /user/hadoop/news.txt
Now let's archive this directory with the archive tool:
hadoop archive -archiveName input.har -p /user/hadoop/ input har
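As an aside, the general form of the command is sketched below (the placeholder names are mine, not part of the original example):

# <parent> is the directory the source paths are relative to,
# <src>* is one or more files or directories under <parent>,
# and <dest> is the directory where <name>.har will be created.
hadoop archive -archiveName <name>.har -p <parent> <src>* <dest>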
-archiveName specifies the name of the archive file, and -p specifies the parent directory; more than one file or directory can be put into the same archive. Let's look at the har file that was just created.
$ hadoop fs -ls har
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har
$ hadoop fs -ls har/input.har
Found 4 items
-rw-r--r--   2 hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har/_SUCCESS
-rw-r--r--   5 hadoop supergroup        272 2013-06-20 12:38 /user/hadoop/har/input.har/_index
-rw-r--r--   5 hadoop supergroup            2013-06-20 12:38 /user/hadoop/har/input.har/_masterindex
-rw-r--r--   2 hadoop supergroup    1777168 2013-06-20 12:38 /user/hadoop/har/input.har/part-0
Here you can see that the har file consists of two index files and one or more part files (only one part file is shown here). A part file is simply the concatenation of the original files, and each original file is located through the index files. If you access the archive through a har URI, these internal files are hidden and only the original files are visible:
$ hadoop fs -lsr har:///user/hadoop/har/input.har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
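As an aside, the _index file is what makes this transparent listing possible: it records, for each archived path, which part file holds its bytes and at what offset and length. You can dump it directly if you are curious (a minimal sketch; the exact output format varies between Hadoop versions and is not shown here):

hadoop fs -cat /user/hadoop/har/input.har/_index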
You can also access files one level down inside the archive, just as with a normal file system:
$ hadoop fs -lsr har:///user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
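Reads go through the archive transparently as well, not just listings. For example, the contents of one of the archived files can be printed with a plain cat (a minimal sketch, output omitted):

hadoop fs -cat har:///user/hadoop/har/input.har/input/1901 | head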
If you want to access the archive remotely, you can use a fully qualified URI:
$ hadoop fs -lsr har://hdfs-namenode:9000/user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
The har:// scheme at the beginning identifies the HAR file system, and hdfs-<namenode>:<port> names the underlying file system and where to find it. The HAR file system translates everything up to and including the .har file into a path on that underlying file system, in this example hdfs://namenode:9000/user/hadoop/har/input.har, and the remaining part (input) is then resolved inside the archive. Deleting a har file is straightforward, but it must be deleted recursively, otherwise an error is reported:
$ hadoop fs -rmr har/input.har
Deleted hdfs://192.168.115.5:9000/user/hadoop/har/input.har
Limitations
Archive files have some limitations:
1. Creating an archive file consumes about as much disk space as the original files.
2. Archive files do not support compression, even though a har file may look as if it were compressed.
3. Archive files are immutable once created; if you want to add or change anything, you have to re-create the archive.
4. Although the archive relieves the NameNode's memory pressure, when a har file is used as MapReduce input there is no archive-aware InputFormat that packs the many small files into a single split, so processing them is still inefficient (see the sketch after this list).
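A har file can be used directly as MapReduce input by passing a har URI as the input path. A minimal sketch using the stock wordcount example, assuming the archive still exists and that the examples jar path and output directory match your installation:

# The har URI is used directly as the job's input path.
hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount \
    har:///user/hadoop/har/input.har/input \
    /user/hadoop/wordcount-out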
For another way to address the NameNode memory problem, see the earlier article on HDFS Federation.