Hadoop: using MultipleInputs or a custom MultiInputFormat to implement a MapReduce job that reads files in different formats

Hadoop provides MultipleOutputFormat to write data out to different directories, and FileInputFormat can read from multiple directories at once, but by default a job can only call Job.setInputFormatClass once, so all of its input is processed with a single InputFormat. If a job needs to read files in different formats from different directories at the same time, you have to implement a MultiInputFormat of your own to read the differently formatted files (Hadoop also already provides MultipleInputs for this, covered at the end).
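
For contrast, this is what the standard single-format wiring looks like; a minimal fragment, assuming a Configuration named conf, the usual org.apache.hadoop.mapreduce imports, and hypothetical MyDriver/MyMapper class names:

Job job = Job.getInstance(conf, "single-format job");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);              // one mapper for every input
job.setInputFormatClass(TextInputFormat.class);  // one InputFormat for the whole job
FileInputFormat.addInputPath(job, new Path("/test/input_txt"));
FileInputFormat.addInputPath(job, new Path("/test/input_xml")); // would also be read as plain text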


For example, suppose a MapReduce job needs to read two formats of data at the same time: one is a plain text file, read line by line with LineRecordReader, and the other is a pseudo-XML file, read with a custom AJoinRecordReader.


A simple hand-rolled MultiInputFormat looks like this:


import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // Pick a RecordReader based on the path of the file backing this split.
        String inputFile = ((FileSplit) split).getPath().toString();
        String xmlPath = context.getConfiguration().get("xml_prefix");
        String textPath = context.getConfiguration().get("text_prefix");

        RecordReader<LongWritable, Text> reader;
        if (xmlPath != null && inputFile.indexOf(xmlPath) != -1) {
            reader = new AJoinRecordReader();   // custom reader for the pseudo-XML files
        } else if (textPath != null && inputFile.indexOf(textPath) != -1) {
            reader = new LineRecordReader();
        } else {
            reader = new LineRecordReader();    // fall back to plain-text reading
        }
        return reader;
    }
}
In fact, the principle is very simple: inside createRecordReader, ((FileSplit) split).getPath().toString() returns the path of the file currently being processed, and the matching RecordReader is then chosen based on its features. The xml_prefix and text_prefix values can be passed into the Configuration with -D options when the program is launched.
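
To make the -D mechanism concrete, here is a minimal driver sketch (the MultiFormatDriver name and the output wiring are illustrative, not from the original post). Launching through ToolRunner lets GenericOptionsParser copy the -D options into the job's Configuration before run() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultiFormatDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already parsed -D xml_prefix=... and
        // -D text_prefix=... into the Configuration returned by getConf().
        Job job = Job.getInstance(getConf(), "multi-format job");
        job.setJarByClass(MultiFormatDriver.class);
        job.setInputFormatClass(MultiInputFormat.class);
        // setMapperClass and output key/value types omitted for brevity.
        FileInputFormat.addInputPaths(job, args[0]);      // comma-separated input dirs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Example launch:
        //   hadoop jar job.jar MultiFormatDriver \
        //       -D xml_prefix=hdfs://.../test/input_xml \
        //       -D text_prefix=hdfs://.../test/input_txt \
        //       /test/input_xml,/test/input_txt /test/output
        System.exit(ToolRunner.run(new Configuration(), new MultiFormatDriver(), args));
    }
}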


For example, printing these values during one run gives:


inputfile=hdfs://test042092.sqa.cm4:9000/test/input_xml/common-part-00068
xmlpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_xml
textpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_txt
Here the matching is done with a simple path-prefix comparison; more elaborate rules, such as matching on the file name or the file suffix, work just as well.
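
For instance, a suffix-based check inside createRecordReader might look like this (an illustrative variation, not from the original code):

String fileName = ((FileSplit) split).getPath().getName();
if (fileName.endsWith(".xml")) {
    reader = new AJoinRecordReader();
} else {
    reader = new LineRecordReader();
}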


Then, in the map class, you can likewise branch on the file path and process each format differently (assuming textPath and xmlPath were read from the Configuration beforehand, e.g. in setup(), as sketched after the snippet):


@Override
public void map(LongWritable offset, Text inValue, Context context)
        throws IOException, InterruptedException {

    String inputFile = ((FileSplit) context.getInputSplit()).getPath()
            .toString();

    if (inputFile.indexOf(textPath) != -1) {
        ......
    } else if (inputFile.indexOf(xmlPath) != -1) {
        ......
    } else {
        ......
    }
}
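
The textPath and xmlPath fields are not shown in the original snippet; presumably they are read from the Configuration once per task, for example in setup() (a hypothetical sketch):

private String textPath;
private String xmlPath;

@Override
protected void setup(Context context) {
    xmlPath = context.getConfiguration().get("xml_prefix");
    textPath = context.getConfiguration().get("text_prefix");
}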
This approach is rather crude, though. Hadoop already provides MultipleInputs, which lets each input directory be bound to its own InputFormat and its own map class:


MultipleInputs.addInputPath(job, new Path("/foo"), TextInputFormat.class,
        MapClass.class);
MultipleInputs.addInputPath(job, new Path("/bar"),
        KeyValueTextInputFormat.class, MapClass2.class);
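
With MultipleInputs each path is bound to its own map class, so the per-file branching inside map() disappears entirely. A sketch of the two mapper skeletons (names and logic are illustrative; each class would normally live in its own file):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Handles plain-text lines from /foo.
class MapClass extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... plain-text handling only
    }
}

// Handles key/value pairs from /bar; KeyValueTextInputFormat splits each
// line at the first tab into a (Text, Text) pair.
class MapClass2 extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... key/value handling only
    }
}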