[Hive LanguageManual] Archiving for File Count Reduction


Archiving for File Count Reduction

Note: archiving should be considered an advanced command due to the caveats involved.

    • Archiving for File Count Reduction
      • Overview
      • Settings
      • Usage
        • Archive
        • Unarchive
      • Cautions and limitations
      • Under the Hood

Overview

Due to the design of HDFS, the number of files in the filesystem directly affects the memory consumption in the NameNode. While this normally isn't a problem for small clusters, memory usage may hit the limits of accessible memory on a single machine when there are more than 50-100 million files. In such situations, it is advantageous to have as few files as possible.

The use of Hadoop Archives is one approach to reducing the number of files in partitions. Hive has built-in support to convert files in existing partitions to a Hadoop Archive (HAR) so that a partition that may once have consisted of hundreds of files can occupy just a few files (depending on settings). However, the trade-off is that queries may be slower due to the additional overhead in reading from the HAR.

Note that archiving does not compress the files – HAR is analogous to the Unix tar command (it packages only, without compression).


tar -zcvf /tmp/etc.tar.gz /etc     <== create a gzip-compressed archive
tar -jcvf /tmp/etc.tar.bz2 /etc    <== create a bzip2-compressed archive
tar -zxvf /tmp/etc.tar.gz          <== extract a gzip-compressed archive
tar -jxvf /tmp/etc.tar.bz2         <== extract a bzip2-compressed archive

Settings

There are three settings that should be configured before archiving is used. (Example values are shown.)

hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;

hive.archive.enabled controls whether archiving operations are enabled.

hive.archive.har.parentdir.settable informs Hive whether the parent directory can be set while creating the archive. In recent versions of Hadoop the -p option can specify the root directory of the archive. For example, if /dir1/dir2/file is archived with /dir1 as the parent directory, then the resulting archive file would contain the directory structure dir2/file. In older versions of Hadoop, this option was not available, and therefore Hive must be configured to accommodate this limitation.
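As an illustration of the parent-directory behavior (a sketch only; the actual path handling is done by Hadoop's archive tool, and entry_in_archive is a hypothetical helper), the entry stored in the archive is the file's path relative to the chosen parent:

```python
import os

def entry_in_archive(parent_dir: str, file_path: str) -> str:
    # With parent_dir as the archive's root (the -p option), a file is
    # stored in the HAR under its path relative to that parent.
    return os.path.relpath(file_path, parent_dir)

print(entry_in_archive("/dir1", "/dir1/dir2/file"))  # dir2/file
```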

har.partfile.size controls the size of the files that make up the archive. The archive will contain size_of_partition / har.partfile.size files, rounded up. Higher values mean fewer files, but will result in longer archiving times due to the reduced number of mappers.
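To make the rounding concrete, here is a quick sketch of the resulting part-file count (sizes in bytes, using the 1 TiB example value above; the exact packing is up to the archive tool):

```python
import math

def har_part_files(size_of_partition: int, partfile_size: int) -> int:
    # The archive contains size_of_partition / har.partfile.size files,
    # rounded up.
    return math.ceil(size_of_partition / partfile_size)

TIB = 1099511627776  # the har.partfile.size example value above

# A 3 TiB partition yields 3 part files; one extra byte rounds up to 4.
print(har_part_files(3 * TIB, TIB))      # 3
print(har_part_files(3 * TIB + 1, TIB))  # 4
```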

Usage

Archive

Once the configuration values are set, a partition can be archived with the command:

ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)

For example:

ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')

Once the command is issued, a MapReduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate progress.

Unarchive

The partition can be reverted back to its original files with the unarchive command:

ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')

Cautions and Limitations
    • In some older versions of Hadoop, HAR had a few bugs that could cause data loss or other errors. Be sure that these patches are integrated into your version of Hadoop:

https://issues.apache.org/jira/browse/HADOOP-6591 (fixed in Hadoop 0.21.0)

https://issues.apache.org/jira/browse/MAPREDUCE-1548 (fixed in Hadoop 0.22.0)

https://issues.apache.org/jira/browse/MAPREDUCE-2143 (fixed in Hadoop 0.22.0)

https://issues.apache.org/jira/browse/MAPREDUCE-1752 (fixed in Hadoop 0.23.0)

    • The HarFileSystem class still has a bug that has yet to be fixed:

https://issues.apache.org/jira/browse/MAPREDUCE-1877 (moved to https://issues.apache.org/jira/browse/HADOOP-10906 in 2014)

Hive comes with the HiveHarFileSystem class, which addresses some of these issues and is by default the value for fs.har.impl. Keep this in mind if you're rolling your own version of HarFileSystem:

    • The default HiveHarFileSystem.getFileBlockLocations() has no locality. That means it may introduce higher network loads or reduced performance.
    • Archived partitions cannot be overwritten with INSERT OVERWRITE. The partition must be unarchived first.
    • If two processes attempt to archive the same partition at the same time, bad things could happen. (Concurrency support needs to be implemented.)

Under the Hood

Internally, when a partition is archived, a HAR is created using the files from the partition's original location (such as /warehouse/table/ds=1). The parent directory of the partition is specified to be the same as the original location, and the resulting archive is named 'data.har'. The archive is moved under the original directory (such as /warehouse/table/ds=1/data.har), and the partition's location is changed to point to the archive.
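The location change described above can be sketched as plain path manipulation (a hypothetical helper; the real work is done by Hive and the HAR tool, and the har:// URI Hive records may also embed the filesystem authority, which is omitted here):

```python
def archived_location(original_location: str) -> tuple:
    # The HAR built from the partition's files is moved under the original
    # directory as 'data.har'; the partition's location is then pointed at
    # the archive via the har:// scheme (authority omitted in this sketch).
    archive_path = original_location.rstrip("/") + "/data.har"
    new_location = "har://" + archive_path
    return archive_path, new_location

print(archived_location("/warehouse/table/ds=1"))
# ('/warehouse/table/ds=1/data.har', 'har:///warehouse/table/ds=1/data.har')
```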
