Hadoop HDFS (4) hadoop Archives


Storing small files in HDFS is uneconomical, because each file occupies a block and the metadata of every block is held in the namenode's memory. A large number of small files therefore consumes a great deal of namenode memory. (Note: a small file occupies one block, but the block does not consume a fixed amount of disk. For example, even with the block size set to 128 MB, a 1 MB file stored in a block uses only 1 MB of datanode disk, not 128 MB. So "uneconomical" here means consuming a large amount of namenode memory, not a large amount of datanode disk.)
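To make the memory cost concrete, here is a back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of namenode heap per object (file, directory, or block); the file count is invented for illustration:

```shell
# Rule-of-thumb estimate: ~150 bytes of namenode heap per object
# (file, directory, or block). The figures below are illustrative only.
FILES=1000000                 # one million small files, one block each
BYTES_PER_OBJECT=150
# each small file costs one file object plus one block object
TOTAL_BYTES=$(( FILES * 2 * BYTES_PER_OBJECT ))
echo "~$(( TOTAL_BYTES / 1024 / 1024 )) MB of namenode heap"   # prints ~286 MB of namenode heap
```

On this estimate, a million small files cost hundreds of megabytes of namenode heap regardless of how little datanode disk they use.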
Hadoop Archives (HAR files) are a file-packing tool that packs files already stored in HDFS so that blocks are used more efficiently, thereby reducing namenode memory usage. At the same time, Hadoop Archives lets clients access the files inside a HAR package transparently, as conveniently as accessing files in an ordinary directory. More importantly, a HAR file can also be used as MapReduce input.
How to Use Hadoop Archives
$ hadoop fs -ls -R /user/Norris/
lists all files under the /user/Norris/ directory; -R lists the files in subdirectories recursively. Then we can run the following command:
$ hadoop archive -archiveName files.har -p /user/Norris /user/Norris/har
This command packs everything under /user/Norris/ into files.har and puts the archive under /user/Norris/har. -p specifies the parent directory. Then:
$ hadoop fs -ls /user/Norris/har/
shows a file named files.har in the /user/Norris/har/ directory.
$ hadoop fs -ls /user/Norris/har/files.har
shows that the files.har package consists of two index files and a set of part files. The part files concatenate the contents of all the archived files; the index files record each file's starting offset and length. To look inside the HAR file, use the har URI scheme:
$ hadoop fs -ls -R har:///user/Norris/har/files.har
lists the files and directories inside the archive. The HAR file system sits on top of the underlying file system (here, HDFS).
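The part/index layout can be mimicked locally without a cluster. The following sketch (the /tmp path and file names are invented for illustration, and this is not the real HAR on-disk format) concatenates two small files into a single "part" file, records each file's offset and length in an index, and then cuts one file back out, which is essentially the lookup the HAR index files enable:

```shell
# Local sketch of the HAR idea: one part file plus an index of
# (name, offset, length) entries. Paths and names are hypothetical.
mkdir -p /tmp/har_demo && cd /tmp/har_demo
printf 'alpha' > a.txt
printf 'bravo-bravo' > b.txt
: > part-0
: > _index
for f in a.txt b.txt; do
  offset=$(( $(wc -c < part-0) ))      # current end of the part file
  length=$(( $(wc -c < "$f") ))
  cat "$f" >> part-0                   # concatenate file contents
  echo "$f $offset $length" >> _index  # record where it landed
done
# recover b.txt by seeking into the part file with its index entry
offset=$(awk '$1 == "b.txt" { print $2 }' _index)
length=$(awk '$1 == "b.txt" { print $3 }' _index)
dd if=part-0 bs=1 skip="$offset" count="$length" 2>/dev/null   # prints bravo-bravo
```

Because the index gives a byte range per file, a reader can serve any archived file without scanning the whole part file, which is what makes transparent access through the har:// scheme possible.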
To delete a HAR file, use:
$ hadoop fs -rm -r /user/Norris/har/files.har
The -r option is needed because, from the underlying file system's point of view, a .har file is actually a directory.
Restrictions on Using Hadoop Archives
1. Creating a HAR file produces a copy the same size as the source files, so make sure that much disk space is available before creating one. After the HAR file is created, the original files can be deleted. Hadoop Archives only packs the directory; it does not compress it.
2. Once created, a HAR file is immutable: files cannot be added or removed. In practice, archives are usually built from files that never change once generated; for example, each day's log files can be packed into one archive at the end of the day.
3. As mentioned above, a HAR file can be used as MapReduce input, but packing many small files into an archive does not make them any more efficient as MapReduce input than feeding them in individually. Other solutions to the many-small-files input problem will be discussed later.
4. If namenode memory is still insufficient after the number of small files has been reduced, consider HDFS Federation, which we mentioned before: http://blog.csdn.net/norriszhang/article/details/39178041
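For the daily log-packing case in point 2, the archiving step might look like the following command fragment. The paths, the date-stamped archive name, and the cleanup step are assumptions for illustration, not part of the original text; this is a sketch to adapt, not a tested script, and it requires a running cluster:

```shell
# Hypothetical end-of-day archival of an immutable log directory.
DAY=2014-09-10                      # e.g. from: date +%F -d yesterday
hadoop archive -archiveName logs-$DAY.har \
    -p /logs $DAY /archived-logs
# once the archive is verified, the originals can be removed
hadoop fs -ls -R har:///archived-logs/logs-$DAY.har
hadoop fs -rm -r /logs/$DAY
```

Note that `hadoop archive` runs as a MapReduce job, so it needs a cluster with MapReduce available, and per restriction 1 the originals should only be deleted after the archive has been checked.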





