Recently, after running a complicated SQL statement, the output ended up as a pile of small files:
To sum up why small files need to be merged: when the number of files is too large, the pressure on the NameNode increases, because the metadata of every file is kept in the NameNode. Therefore, we need to reduce the number of small files.
Merging also reduces the number of map tasks launched by the next job, since MapReduce starts one map per small file, which increases JVM pressure.
Hive's final output file size can be controlled from two aspects:
(1) Control the file size from the input side, that is, control the number of maps.
MapReduce does not let you set the number of maps directly; instead, you set the amount of data each map processes. The number of reduces, by contrast, can be set directly.
Set the mapred parameters that control map splitting:
set mapred.max.split.size=256000000; -- the maximum data size processed by each map, in bytes
set mapred.min.split.size.per.node=1024000000; -- the minimum split size that can be processed on a node
set mapred.min.split.size.per.rack=1024000000; -- the minimum split size that can be processed in a rack
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
The first parameter determines the split size handled by each map and can be adjusted dynamically. The second parameter merges the files on a single node into splits of at least that size; the third does the same for the remaining files across a rack. Setting hive.input.format to CombineHiveInputFormat makes Hive combine small input files before starting maps.
The three parameters must satisfy the following order:
mapred.max.split.size <= mapred.min.split.size.per.node <= mapred.min.split.size.per.rack
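As a concrete sketch, the map-side settings above could be issued together in a Hive session; the values here are illustrative, not recommendations, and they respect the required ordering:

```sql
-- Illustrative values only; they must satisfy
-- mapred.max.split.size <= mapred.min.split.size.per.node <= mapred.min.split.size.per.rack
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;            -- at most ~256 MB per map
set mapred.min.split.size.per.node=512000000;   -- combine leftover data on a node up to ~512 MB
set mapred.min.split.size.per.rack=1024000000;  -- combine leftover data in a rack up to ~1 GB
```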
(2) The parameters above only control the number of maps; they do not control the size of the data files produced by the reduce side. Therefore, we also need to merge files on the reduce side.
Method 1: set mapred.reduce.tasks=10; -- set the number of reduces directly
Method 2: set hive.exec.reducers.bytes.per.reducer=1073741824; -- the data volume processed by each reduce; the default is 1 GB
In other words, you can either set the number of reduces directly to control the number of output files, or set the data volume entering each reduce so that Hive derives the number of reduces from the input size.
Besides controlling the number of reduce tasks with these parameters, we also need to control the size of the files formed at the reduce end, so that no small files are produced.
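To make the two reduce-side options concrete, here is a sketch of how each would be set in a Hive session (the values are examples only):

```sql
-- Option 1: fix the number of reduces (and thus reduce output files) explicitly
set mapred.reduce.tasks=10;

-- Option 2: let Hive derive the reduce count from the data volume;
-- with 1 GB per reducer, roughly 20 GB of input yields about 20 reduces
set mapred.reduce.tasks=-1;  -- -1 restores Hive's automatic calculation
set hive.exec.reducers.bytes.per.reducer=1073741824;
```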
You can merge the map and reduce result files by configuring the following parameters, eliminating these effects:
- hive.merge.size.per.task -- the target size of the merged file produced by each merge task (256000000 by default)
- hive.merge.smallfiles.avgsize -- the threshold below which Hadoop considers an output file a small file (16000000 by default)
- hive.merge.mapfiles -- whether to merge the output files of map-only jobs (true by default)
- hive.merge.mapredfiles -- whether to merge the output files of map-reduce jobs (false by default)
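Putting the four merge parameters together, a session that enables merging for both map-only and map-reduce jobs might look like this (a sketch; the sizes are examples):

```sql
set hive.merge.mapfiles=true;                -- merge outputs of map-only jobs (default true)
set hive.merge.mapredfiles=true;             -- also merge outputs of map-reduce jobs (default false)
set hive.merge.smallfiles.avgsize=16000000;  -- outputs averaging under ~16 MB trigger a merge
set hive.merge.size.per.task=256000000;      -- merge them into files of ~256 MB
```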
I ran some experiments with the parameters above. The number of maps can be set directly with set; my experiment is as follows:
create table loan_base_copy as select i.* from loan_base c left join loan_special_repayment i on i.loan_id = c.id;
Executing the statement above on the Hive command line converts it into a MapReduce job. For this SQL statement, we want to control the size of the files output at the reduce end. Here I set hive.merge.smallfiles.avgsize to 256 MB (the default is 16 MB).
This means that after the job finishes, any reduce output file smaller than that threshold is merged, and Hive merges files up to hive.merge.size.per.task, which determines the size of each merged file. Here we set it to 512 MB.
The execution result and process are as follows:
To merge the reduce output, Hive starts a separate job that reads the data back and merges it. For this data set, with the merge size set to 512 MB, the final file does not come out at exactly 512 MB; in any case, it is no longer a small file.
(To be clear: only reduce output files smaller than hive.merge.smallfiles.avgsize are merged, and the size of the merged file is hive.merge.size.per.task.)
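Summing up, the settings used in the experiment above can be reproduced in one session like this (a sketch of the described setup, not a verified transcript):

```sql
-- merge reduce output files whose average size is below 256 MB
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=268435456;  -- 256 MB threshold
set hive.merge.size.per.task=536870912;       -- merge into files of ~512 MB

create table loan_base_copy as
select i.*
from loan_base c
left join loan_special_repayment i
  on i.loan_id = c.id;
```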
That is how map input files and reduce output files are merged in Hive.