Brief introduction
As discussed in the earlier HDFS introduction, HDFS is not good at storing large numbers of small files: each file occupies at least one block, and the metadata for every block is held in the NameNode's memory, so a large number of small files will consume a great deal of NameNode memory.
Hadoop Archives (HAR files) address this problem. An archive packs many files into a single file while still allowing transparent access to each original file, and an archive can also be used as input to a MapReduce job.
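To see why small files hurt the NameNode, here is a rough back-of-the-envelope sketch. It assumes the commonly cited figure of roughly 150 bytes of NameNode heap per metadata object (file or block); that constant is an approximation for illustration, not a number from this article.

```python
# Rough estimate of NameNode heap consumed by small files, assuming
# the commonly quoted ~150 bytes of heap per metadata object (one
# object for the file itself plus one per block).  The constant is
# illustrative, not exact.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # each file contributes one file object plus its block objects
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million small files, one block each -> heap usage in GiB
print(namenode_heap_bytes(10_000_000) / 2**30)
```

With these assumptions, ten million one-block files already cost the NameNode close to 3 GB of heap, which is why packing them into an archive matters.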
Usage
Hadoop archives are created with the archive tool. Like distcp in the previous article, archive runs as a MapReduce job. Let's first look at my directory structure:
[hadoop@namenode ~]$ hadoop fs -lsr
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:37 /user/hadoop/har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/input
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/input/1901
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/input/1902
-rw-r--r--   2 hadoop supergroup        293 2013-06-02 17:44 /user/hadoop/news.txt
We now archive this directory with the archive tool:
[hadoop@namenode ~]$ hadoop archive -archiveName input.har -p /user/hadoop input har
-archiveName specifies the archive file name and -p specifies the parent path; multiple files and directories under that parent can be archived in one command. Let's look at the newly created HAR file:
[hadoop@namenode ~]$ hadoop fs -ls har
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har
[hadoop@namenode ~]$ hadoop fs -ls har/input.har
Found 4 items
-rw-r--r--   2 hadoop supergroup          0 2013-06-20 12:38 /user/hadoop/har/input.har/_SUCCESS
-rw-r--r--   5 hadoop supergroup        272 2013-06-20 12:38 /user/hadoop/har/input.har/_index
-rw-r--r--   5 hadoop supergroup            2013-06-20 12:38 /user/hadoop/har/input.har/_masterindex
-rw-r--r--   2 hadoop supergroup    1777168 2013-06-20 12:38 /user/hadoop/har/input.har/part-0
Here you can see that the HAR file contains two index files and one or more part files (only one in this case). A part file is the concatenation of several original files; the index files record where each original file lives inside the part files.
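Conceptually, a HAR lookup resolves a logical file name through the index to a (part file, offset, length) triple and then performs a positioned read. The sketch below illustrates that idea only; the real _index format used by HarFileSystem is more involved, and the names here are invented for the example.

```python
# Toy illustration of HAR-style lookup: the index maps each logical
# file name to (part file, start offset, length), and a read becomes
# a positioned read inside the part file.  This is a conceptual
# sketch, not the actual _index format used by Hadoop.

def build_part(files):
    """Concatenate original files into one part file, recording offsets."""
    index = {}
    part = b""
    for name, data in files.items():
        index[name] = ("part-0", len(part), len(data))
        part += data
    return index, part

def read_from_har(index, parts, name):
    """Resolve a logical name through the index and read its byte range."""
    part_name, start, length = index[name]
    return parts[part_name][start:start + length]

files = {"input/1901": b"weather record 1901",
         "input/1902": b"weather record 1902"}
index, part0 = build_part(files)
parts = {"part-0": part0}
print(read_from_har(index, parts, "input/1901"))  # -> b'weather record 1901'
```

This also makes the archive's transparency plausible: the client never needs the original files, only the index and the right byte range of a part file.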
If the archive is accessed through a har URI, the internal index and part files are hidden and only the original files are shown:
[hadoop@namenode ~]$ hadoop fs -lsr har:///user/hadoop/har/input.har
drwxr-xr-x   - hadoop supergroup          0 2013-05-23 11:35 /user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
You can also address paths inside the archive just as in a normal file system:
[hadoop@namenode ~]$ hadoop fs -lsr har:///user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
To access the archive remotely, name the underlying file system in the URI:
[hadoop@namenode ~]$ hadoop fs -lsr har://hdfs-namenode:9000/user/hadoop/har/input.har/input
-rw-r--r--   2 hadoop supergroup     888978 2013-05-23 11:35 /user/hadoop/har/input.har/input/1902
-rw-r--r--   2 hadoop supergroup     888190 2013-05-23 11:35 /user/hadoop/har/input.har/input/1901
A word on how the har URI is resolved: in har://hdfs-namenode:9000/..., the authority encodes the underlying file system as scheme-host:port. The URI is translated to the underlying file system up to the end of the .har path; in this example that yields hdfs://namenode:9000/user/hadoop/har/input.har, and the remaining component, input, is then resolved inside the archive.
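The translation described above can be sketched as a small string-manipulation function. This mirrors the behaviour of Hadoop's HarFileSystem only loosely and makes simplifying assumptions (a single ".har" component, an authority of the form scheme-host:port); it is an illustration, not the real implementation.

```python
# Hedged sketch of har:// URI resolution: the authority
# "hdfs-namenode:9000" encodes the underlying scheme plus host:port,
# and everything up to the ".har" component addresses the archive
# file itself.  Assumes exactly one ".har" in the path.

def resolve_har_uri(uri):
    """Split a har:// URI into (underlying URI of the .har file, path inside it)."""
    assert uri.startswith("har://")
    rest = uri[len("har://"):]
    authority, _, path = rest.partition("/")
    # "hdfs-namenode:9000" -> scheme "hdfs", host:port "namenode:9000"
    scheme, _, hostport = authority.partition("-")
    # everything up to and including ".har" is the archive file
    head, _, tail = path.partition(".har")
    underlying = "%s://%s/%s.har" % (scheme, hostport, head)
    inner = tail.lstrip("/")
    return underlying, inner

print(resolve_har_uri("har://hdfs-namenode:9000/user/hadoop/har/input.har/input"))
# -> ('hdfs://namenode:9000/user/hadoop/har/input.har', 'input')
```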
Deleting an archive is straightforward, but it must be deleted recursively, otherwise an error is reported:
[hadoop@namenode ~]$ hadoop fs -rmr har/input.har
Deleted hdfs://192.168.115.5:9000/user/hadoop/har/input.har
Limitations
Archive files have some limitations:
1. Creating an archive consumes as much disk space as the original files, since the originals are copied into the archive.
2. Archive files are not compressed, even though the name "archive" may suggest otherwise.
3. An archive is immutable once created; to add, remove, or change files you must re-create the archive.
4. Although the NameNode memory problem is mitigated, MapReduce still treats each original small file inside the archive as a separate input split, which is clearly inefficient.
For the NameNode memory problem itself, also see HDFS federation in the previous article.