Hadoop programming tips (6) --- using CombineFileInputFormat to process a large number of small data files

Code test environment: Hadoop 2.4

Application scenario: when a job has to process many small data files, this technique processes them efficiently instead of paying per-file overhead.

Principle: CombineFileInputFormat merges multiple small data files into a single input split. Since each split is handed to its own Mapper, a Mapper that processes very little data is inefficient: most of its runtime is task-startup overhead. By default, Hadoop treats each input file as (at least) one split, so when the input files are small, efficiency suffers; combining them into one split avoids this.
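A side note not in the original post: if the combined input is itself large, you usually want to cap how much data goes into one combined split. In Hadoop 2.x this can be done through the standard split-size property, which CombineFileInputFormat honors when no explicit maximum has been set. A minimal sketch, assuming that property name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitCapSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap each combined split at 64 MB so a large input still fans
        // out across several Mappers instead of collapsing into one.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
        Job job = Job.getInstance(conf, "combine-small-files");
        System.out.println("split cap set for job: " + job.getJobName());
    }
}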

Example:

Refer to the previous blog: hadoop programming tips (5) --- custom input file format class InputFormat. This example uses the same setup, but with two input files, both of them small.

Custom input file format class, CustomCombineFileInputFormat:

package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

/**
 * Defines the input format that combines small files into one split.
 * @author fansy
 */
public class CustomCombineFileInputFormat extends CombineFileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // Hand each file inside the combined split to a CustomCombineReader.
        return new CombineFileRecordReader<Text, Text>(
                (CombineFileSplit) split, context, CustomCombineReader.class);
    }
}

Custom record reader class, CustomCombineReader:

package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Wraps the previous post's CustomReader; only the initialization
 * function changes, so one reader instance handles one file of the
 * combined split.
 * @author fansy
 */
public class CustomCombineReader extends RecordReader<Text, Text> {

    private int index;
    private CustomReader in;

    // CombineFileRecordReader instantiates this class reflectively,
    // passing the index of the file (within the combined split) to read.
    public CustomCombineReader(CombineFileSplit split, TaskAttemptContext cxt, Integer index) {
        this.index = index;
        this.in = new CustomReader();
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        CombineFileSplit cfsplit = (CombineFileSplit) split;
        // Build a FileSplit covering only the index-th file of the combined split.
        FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
                cfsplit.getOffset(index), cfsplit.getLength(index),
                cfsplit.getLocations());
        in.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return in.nextKeyValue();
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return in.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return in.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return in.getProgress();
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

As you can see, this class reuses the CustomReader class from the previous blog; only the initialization function is modified, so that several small files can be packed into one split. For the details of CustomReader, refer to the previous blog: hadoop programming tips (5) --- custom input file format class InputFormat.
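The previous post's CustomReader is not repeated here. For readers who do not have it at hand, a minimal stand-in is sketched below: it wraps LineRecordReader and splits each line at the first comma into a <Text, Text> pair. The class name matches this series, but the delimiter and the splitting logic are assumptions, not the original code:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

/**
 * Hypothetical stand-in for the previous post's CustomReader:
 * reads lines via LineRecordReader and splits each line at the
 * first comma into key and value (delimiter is an assumption).
 */
public class CustomReader extends RecordReader<Text, Text> {

    private final LineRecordReader in = new LineRecordReader();
    private final Text key = new Text();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        in.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!in.nextKeyValue()) {
            return false;
        }
        String line = in.getCurrentValue().toString();
        int comma = line.indexOf(',');
        if (comma < 0) {            // no delimiter: whole line becomes the key
            key.set(line);
            value.set("");
        } else {
            key.set(line.substring(0, comma));
            value.set(line.substring(comma + 1));
        }
        return true;
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException { return in.getProgress(); }

    @Override
    public void close() throws IOException { in.close(); }
}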

In the main class (again, see the previous blog), only one line needs to change:

job.setInputFormatClass(CustomCombineFileInputFormat.class);
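For context, a minimal driver might look like the sketch below; the job name, paths, and the identity Mapper are placeholders, not the original post's code:

package fz.combineinputformat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Hypothetical driver sketch: everything except the
 * setInputFormatClass line is placeholder scaffolding.
 */
public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine small files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // The single line this post changes relative to the previous post:
        job.setInputFormatClass(CustomCombineFileInputFormat.class);
        // For the TextInputFormat comparison run, swap in:
        // job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);

        job.setMapperClass(Mapper.class);   // identity Mapper, map-only job
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}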

Two experiments were conducted: the first read the input with CombineFileInputFormat, the second with TextInputFormat.

Results:

First, the terminal output (screenshot not reproduced here) shows that, given the same two input files, task 096 generates only one split, while task 097 generates two.

The change in the number of Mappers is also visible on the job monitoring page (screenshot not reproduced here).
Conclusion: CombineFileInputFormat is well worth knowing, and it brings a real efficiency gain when the input consists of a large number of small files. In typical big-data workloads, however, the input files are usually large to begin with, so this technique applies mainly to that special small-files case.


Share, grow, and be happy

When reprinting, please credit the original blog: http://blog.csdn.net/fansy1990

