Code test environment: Hadoop 2.4
Application scenario: when a job has to process many small input files, this technique lets Hadoop handle them efficiently.
Principle: by default, Hadoop treats each input file as at least one split, so every small file gets its own Mapper, and a Mapper that processes only a little data is inefficient. CombineFileInputFormat solves this by merging multiple small files into a single split when splits are computed, so one Mapper handles several files at once.
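As an aside (not part of the original example), recent Hadoop 2.x releases already ship a concrete subclass, CombineTextInputFormat, that packs small text files into shared splits. A minimal driver sketch, where the 64 MB cap and the class name CombineDemo are illustrative choices of mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-demo");
        // Use the stock combining input format instead of a custom one.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Keep packing small files into one split until it reaches ~64 MB.
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}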
Example:
This example builds on the previous post, Hadoop programming tips (5) --- custom input file format class InputFormat, except that here the job reads two input files, both of them small.
Custom input file format class CustomCombineFileInputFormat:
package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

/**
 * Defines the input format that combines small files into one split.
 * @author fansy
 */
public class CustomCombineFileInputFormat extends CombineFileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // Hand each file chunk inside the combined split to a CustomCombineReader.
        return new CombineFileRecordReader<Text, Text>(
                (CombineFileSplit) split, context, CustomCombineReader.class);
    }
}
Custom record reader class CustomCombineReader:
package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Wraps the CustomReader of the previous post; only the initialization
 * logic changes.
 * @author fansy
 */
public class CustomCombineReader extends RecordReader<Text, Text> {

    private int index;
    private CustomReader in;

    public CustomCombineReader(CombineFileSplit split, TaskAttemptContext cxt,
            Integer index) {
        this.index = index;
        this.in = new CustomReader();
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Carve this reader's file chunk out of the combined split and
        // pass it to the underlying reader as an ordinary FileSplit.
        CombineFileSplit cfsplit = (CombineFileSplit) split;
        FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
                cfsplit.getOffset(index), cfsplit.getLength(index),
                cfsplit.getLocations());
        in.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return in.nextKeyValue();
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return in.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return in.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return in.getProgress();
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
As you can see, this class reuses the CustomReader class from the previous post and changes only the initialization logic: initialize() takes the path, offset, and length of this reader's chunk out of the CombineFileSplit and wraps them in an ordinary FileSplit, which is how several small files end up sharing one split. For details on CustomReader, see the previous post: Hadoop programming tips (5) --- custom input file format class InputFormat.
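For readers without that post at hand, here is a hypothetical sketch of such a reader, assuming it wraps a LineRecordReader and splits each line into key and value on a comma; the actual implementation in the previous post may differ:

package fz.combineinputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical stand-in for the CustomReader of the previous post.
public class CustomReader extends RecordReader<Text, Text> {

    private final LineRecordReader in = new LineRecordReader();
    private final Text key = new Text();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        in.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (!in.nextKeyValue()) {
            return false;
        }
        // Assumed record layout: "key,value" per line.
        String[] parts = in.getCurrentValue().toString().split(",", 2);
        key.set(parts[0]);
        value.set(parts.length > 1 ? parts[1] : "");
        return true;
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException {
        return in.getProgress();
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}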
In the main (driver) class, only one line needs to change (again, see the previous post):
job.setInputFormatClass(CustomCombineFileInputFormat.class);
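For completeness, a hypothetical sketch of the surrounding driver (the real main class is in the previous post; the class name CombineDriver, the identity Mapper, and the path arguments are placeholders, not taken from the original):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineDriver.class);
        // The one line this post changes relative to the previous one:
        job.setInputFormatClass(CustomCombineFileInputFormat.class);
        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setNumReduceTasks(0);         // map-only, just to observe the splits
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}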
Two experiments were run: the first read the input with CombineFileInputFormat, the second with TextInputFormat.
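Switching between the two runs presumably only means swapping the input format in the driver; the second experiment would use the stock org.apache.hadoop.mapreduce.lib.input.TextInputFormat:

job.setInputFormatClass(TextInputFormat.class);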
View the results:
First, from the terminal output we can see that, for the same two input files, job 096 (the CombineFileInputFormat run) produced only one split, while job 097 (the TextInputFormat run) produced two splits.
The change in the number of Mappers can also be seen on the job monitoring page.
Conclusion: CombineFileInputFormat is genuinely useful and brings a clear efficiency gain when processing large numbers of small files. That said, in most big-data applications the input files are already large, so this technique applies only in those particular small-file scenarios.
Share, grow, and be happy
When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990