Small file merging in Hive compression

Source: Internet
Author: User

Hive compression, part two: small file merging

Research background

When Hive's input consists of many small files, each file launches its own map task. If the files are too small, starting and initializing a map task takes longer than the actual processing, which wastes resources and can even cause out-of-memory (OOM) errors. So when a job has little input data but too many tasks, consider merging the input on the map side. Likewise, when writing data to a table, pay attention to the size of the output files.

Input Merge

Merging small input files to reduce the number of map tasks

The number of map tasks is mainly determined by the total number of input files, the size of each input file, and the block size configured for the cluster.

Example:
a) Suppose the input directory contains one file, a, of 780 MB. Hadoop splits it into 7 blocks (six 128 MB blocks and one 12 MB block), which results in 7 map tasks.

b) Suppose the input directory contains three files, a, b, and c, of 10 MB, 20 MB, and 130 MB. Hadoop splits them into 4 blocks (10 MB, 20 MB, 128 MB, and 2 MB), which results in 4 map tasks. In other words, a file larger than the block size (128 MB) is split, while a file smaller than the block size is treated as a single block.
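The arithmetic in examples a) and b) can be checked with a minimal Python sketch. It models only the simplified per-block rule described above, not Hadoop's full split logic; the function name `count_blocks` is a hypothetical helper, not part of any Hadoop or Hive API.

```python
import math

def count_blocks(file_sizes_mb, block_size_mb=128):
    """Count blocks under the simplified rule from the text:
    a file larger than the block size is cut into block-sized pieces,
    and a smaller file occupies exactly one block."""
    # max(1, ...) keeps even an empty file at one block in this simplified model.
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

# Example a): one 780 MB file
print(count_blocks([780]))
# Example b): three files of 10 MB, 20 MB, and 130 MB
print(count_blocks([10, 20, 130]))
```

Under this model the first call gives 7 and the second gives 4, matching the two examples.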

Suppose a SQL task:

select count(1) from mb_test where date = '2015-01-19' and hour = 0;

This task reads the directory /user/hive/warehouse/naga.db/mb_test/date=2015-01-19/hour=0, which contains 10 files. Each file is 3.15 MB, far smaller than the 128 MB block size, for a total of 31.5 MB. Without merging, the job would use 10 map tasks.

In fact, only one map task was generated, with the following settings:

set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

The conclusion is that the merge here is a logical one: Hive already merges input by default, so we don't need to change many settings.
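To illustrate why the 10 small files end up in a single map task, here is a rough Python sketch of a logical merge. It is a simplified greedy packing that assumes files are grouped in order until mapred.max.split.size would be exceeded; the real CombineHiveInputFormat also takes node and rack locality into account, which this sketch ignores. The function name `combine_into_splits` is hypothetical.

```python
def combine_into_splits(file_sizes, max_split_size=100_000_000):
    """Greedily pack file sizes (in bytes) into splits no larger than
    max_split_size. No data is moved; files are only grouped logically."""
    splits = []
    current, current_size = [], 0
    for size in file_sizes:
        # Close the current split when adding this file would overflow it.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Ten files of about 3.15 MB each, as in the mb_test partition above.
files = [int(3.15 * 1024 * 1024)] * 10
print(len(combine_into_splits(files)))
```

Because the ten files total about 31.5 MB, well under the 100000000-byte limit, the sketch groups them into one split, which corresponds to one map task.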

Output merge

Merging small output files. If the output contains too many small files, each small file occupies its own block, and the metadata for every block is kept in the NameNode. Too many blocks therefore bloat the NameNode's block table. As the cluster grows and jobs run more often, the table that tracks blocks becomes very large and severely degrades NameNode performance.
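A small Python sketch can make the effect on the NameNode concrete. It only counts the file and block entries that have to be tracked for the same amount of data stored at different file sizes; the function name `namenode_objects` and the 1 TiB data volume are assumptions chosen for illustration.

```python
def namenode_objects(total_bytes, file_size, block_size=128 * 1024 * 1024):
    """Count file entries plus block entries needed to store total_bytes
    as equally sized files of file_size bytes."""
    files = -(-total_bytes // file_size)                   # ceiling division
    blocks_per_file = max(1, -(-file_size // block_size))  # a small file still takes one block
    return files + files * blocks_per_file

TIB = 1024 ** 4
MIB = 1024 ** 2

small = namenode_objects(TIB, 1 * MIB)    # 1 TiB stored as 1 MiB files
large = namenode_objects(TIB, 128 * MIB)  # 1 TiB stored as 128 MiB files
print(small, large, small // large)
```

With these inputs the small-file layout needs 128 times as many entries as the block-sized layout for the same data.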


-- Merge small files at the end of a map-only task
set hive.merge.mapfiles=true;
-- Merge small files at the end of a map-reduce task
set hive.merge.mapredfiles=true;
-- Target size of the merged files
set hive.merge.size.per.task=256000000;
-- When the average output file size is below this value, launch a separate map-reduce task to merge files
set hive.merge.smallfiles.avgsize=16000000;

We execute the following statement:

insert overwrite table naga.mb_time select ticks, starttime, endtime from naga.mb_pageinfo where starttime % 10 = 7;

The statement filters the large table with the condition starttime % 10 = 7 and writes three columns into the small table. We observe the number of map tasks executed and the number of target files:

The whole job launched two map tasks

and produced only one target file, /user/hive/warehouse/naga.db/mb_time/000000_0.

This shows that our settings took effect: the job did not produce two small files.

Parameter summary

Parameter | Description | Default | Recommended |
hive.merge.mapfiles | Whether to merge small files at the end of a map-only task | true | true | Hive None
hive.merge.mapredfiles | Whether to merge small files at the end of a map-reduce task | false | true | Hive None
hive.merge.size.per.task | Target size of the merged files | 256000000 | 256000000 | Hive None
hive.merge.smallfiles.avgsize | Launch a separate map-reduce task to merge files when the average output file size is below this value | 16000000 | 5000000 | Hive None

All we have to do is set hive.merge.smallfiles.avgsize. The recommended value is 5000000 (5 MB): when the average size of the output files is below this value, Hive launches a separate map-reduce task to merge them.
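The merge decision described above can be sketched in Python as follows. This is only an illustration of the rule as stated in the text, not Hive's actual implementation; the function name `plan_merge` is hypothetical, and the estimated number of merged files assumes that each merged file is filled up to hive.merge.size.per.task.

```python
import math

def plan_merge(file_sizes, merge_enabled=True,
               avg_size_threshold=16_000_000,
               size_per_task=256_000_000):
    """Return an estimated number of merged files, or None if no merge
    job would be launched under the rule described in the text."""
    if not merge_enabled or not file_sizes:
        return None
    total = sum(file_sizes)
    average = total / len(file_sizes)
    # A merge job starts only when the average output file is below the threshold.
    if average >= avg_size_threshold:
        return None
    return max(1, math.ceil(total / size_per_task))

# Two small outputs with the default threshold: a merge is planned.
print(plan_merge([3_000_000, 4_000_000]))
# The same check with the recommended 5 MB threshold and larger files: no merge.
print(plan_merge([6_000_000, 7_000_000], avg_size_threshold=5_000_000))
```

Lowering the threshold to 5000000 means the extra merge job runs only when the output files are very small on average.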
