Well, I admit it's cool to use Hadoop to handle big data, but sometimes I get frustrated when working on a data-marshalling project.
Many times we use a join in a MapReduce job, so the entire job's input may consist of two or more files (in other words, the mappers have to process more than one input file).
There are two ways to handle multiple inputs on the mapper side:
Multiple mappers: each mapper processes its corresponding input file (see https://github.com/zhouhao/Hadoop_Project1/blob/master/MapReduceQueries/Query3/query3.java):
MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, CustomerMap.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, TransactionMap.class);
FileOutputFormat.setOutputPath(conf, new Path(args[2]));
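For context, here is a minimal, self-contained driver sketch using the old org.apache.hadoop.mapred API (which the snippet above targets). The CustomerMap and TransactionMap bodies and the comma-separated field layouts are assumptions for illustration, not the exact code from the linked query:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class JoinJob {
    // Mapper for the customers file: emits (customerId, "C:" + name)
    public static class CustomerMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String[] f = value.toString().split(",");   // assumed layout: id,name,...
            output.collect(new Text(f[0]), new Text("C:" + f[1]));
        }
    }

    // Mapper for the transactions file: emits (customerId, "T:" + amount)
    public static class TransactionMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String[] f = value.toString().split(",");   // assumed layout: txId,custId,amount,...
            output.collect(new Text(f[1]), new Text("T:" + f[2]));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(JoinJob.class);
        conf.setJobName("customer-transaction-join");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // each input path gets its own mapper class
        MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, CustomerMap.class);
        MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, TransactionMap.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));

        // no reducer set: the default identity reducer just groups the tagged
        // values by customer id; a real join would combine "C:" and "T:" values here
        JobClient.runJob(conf);
    }
}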
One mapper: a single mapper processes all the different files (in the code snippet below, inside the mapper we can detect which file the current record comes from, and then process it accordingly):
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // get the name of the file this split came from via the Reporter
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        String line = value.toString();
        output.collect(new Text(fileName), value);
    }
}
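To make "process it accordingly" concrete, inside map() we can branch on the file name once we have it. A minimal sketch, where the customers/transactions file-name prefixes and the field positions are illustrative assumptions:

// inside map(), after fileName and line have been extracted as above
String[] fields = line.split(",");
if (fileName.startsWith("customers")) {
    // assumed layout: customerId,name,... -> tag with "C:" for the reducer
    output.collect(new Text(fields[0]), new Text("C:" + fields[1]));
} else if (fileName.startsWith("transactions")) {
    // assumed layout: txId,customerId,amount,... -> tag with "T:"
    output.collect(new Text(fields[1]), new Text("T:" + fields[2]));
}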
PS: the mapper input can also be a folder: FileInputFormat.setInputPaths(conf, new Path("/tmp/"));