Hadoop: using MultipleInputs or a custom MultiInputFormat to implement a MapReduce job that reads files in different formats

Hadoop provides MultipleOutputFormat to write data out to different directories, and FileInputFormat can read from multiple directories at once, but by default a job can only call Job.setInputFormatClass once, so all of its input is processed with a single InputFormat. If a job needs to read files in different formats from different directories at the same time, you have to implement a MultiInputFormat of your own to read the differently formatted files (Hadoop also already provides MultipleInputs for this, covered at the end).
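
For contrast, this is what the standard single-format wiring looks like; a minimal fragment, assuming a Configuration named conf, the usual org.apache.hadoop.mapreduce imports, and hypothetical MyDriver/MyMapper class names:

Job job = Job.getInstance(conf, "single-format job");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);              // one mapper for every input
job.setInputFormatClass(TextInputFormat.class);  // one InputFormat for the whole job
FileInputFormat.addInputPath(job, new Path("/test/input_txt"));
FileInputFormat.addInputPath(job, new Path("/test/input_xml")); // would also be read as plain text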


For example, suppose a MapReduce job needs to read two formats of data at the same time: one is a plain text file, read line by line with LineRecordReader, and the other is a pseudo-XML file, read with a custom AJoinRecordReader.


A simple hand-rolled MultiInputFormat looks like this:


import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // Pick a RecordReader based on the path of the file backing this split.
        String inputFile = ((FileSplit) split).getPath().toString();
        String xmlPath = context.getConfiguration().get("xml_prefix");
        String textPath = context.getConfiguration().get("text_prefix");

        RecordReader<LongWritable, Text> reader;
        if (xmlPath != null && inputFile.indexOf(xmlPath) != -1) {
            reader = new AJoinRecordReader();   // custom reader for the pseudo-XML files
        } else if (textPath != null && inputFile.indexOf(textPath) != -1) {
            reader = new LineRecordReader();
        } else {
            reader = new LineRecordReader();    // fall back to plain-text reading
        }
        return reader;
    }
}
In fact, the principle is very simple: inside createRecordReader, ((FileSplit) split).getPath().toString() returns the path of the file currently being processed, and the matching RecordReader is then chosen based on its features. The xml_prefix and text_prefix values can be passed into the Configuration with -D options when the program is launched.
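
To make the -D mechanism concrete, here is a minimal driver sketch (the MultiFormatDriver name and the output wiring are illustrative, not from the original post). Launching through ToolRunner lets GenericOptionsParser copy the -D options into the job's Configuration before run() is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MultiFormatDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already parsed -D xml_prefix=... and
        // -D text_prefix=... into the Configuration returned by getConf().
        Job job = Job.getInstance(getConf(), "multi-format job");
        job.setJarByClass(MultiFormatDriver.class);
        job.setInputFormatClass(MultiInputFormat.class);
        // setMapperClass and output key/value types omitted for brevity.
        FileInputFormat.addInputPaths(job, args[0]);      // comma-separated input dirs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Example launch:
        //   hadoop jar job.jar MultiFormatDriver \
        //       -D xml_prefix=hdfs://.../test/input_xml \
        //       -D text_prefix=hdfs://.../test/input_txt \
        //       /test/input_xml,/test/input_txt /test/output
        System.exit(ToolRunner.run(new Configuration(), new MultiFormatDriver(), args));
    }
}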


For example, printing these values during one run gives:


inputfile=hdfs://test042092.sqa.cm4:9000/test/input_xml/common-part-00068
xmlpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_xml
textpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_txt
Here the matching is done with a simple path-prefix comparison; more elaborate rules, such as matching on the file name or the file suffix, work just as well.
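
For instance, a suffix-based check inside createRecordReader might look like this (an illustrative variation, not from the original code):

String fileName = ((FileSplit) split).getPath().getName();
if (fileName.endsWith(".xml")) {
    reader = new AJoinRecordReader();
} else {
    reader = new LineRecordReader();
}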


Then, in the map class, you can likewise branch on the file path and process each format differently (assuming textPath and xmlPath were read from the Configuration beforehand, e.g. in setup(), as sketched after the snippet):


@Override
public void map(LongWritable offset, Text inValue, Context context)
        throws IOException, InterruptedException {

    String inputFile = ((FileSplit) context.getInputSplit()).getPath()
            .toString();

    if (inputFile.indexOf(textPath) != -1) {
        ......
    } else if (inputFile.indexOf(xmlPath) != -1) {
        ......
    } else {
        ......
    }
}
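
The textPath and xmlPath fields are not shown in the original snippet; presumably they are read from the Configuration once per task, for example in setup() (a hypothetical sketch):

private String textPath;
private String xmlPath;

@Override
protected void setup(Context context) {
    xmlPath = context.getConfiguration().get("xml_prefix");
    textPath = context.getConfiguration().get("text_prefix");
}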
This approach is rather crude, though. Hadoop already provides MultipleInputs, which lets each input directory be bound to its own InputFormat and its own map class:


MultipleInputs.addInputPath(job, new Path("/foo"), TextInputFormat.class,
        MapClass.class);
MultipleInputs.addInputPath(job, new Path("/bar"),
        KeyValueTextInputFormat.class, MapClass2.class);
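
With MultipleInputs each path is bound to its own map class, so the per-file branching inside map() disappears entirely. A sketch of the two mapper skeletons (names and logic are illustrative; each class would normally live in its own file):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Handles plain-text lines from /foo.
class MapClass extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... plain-text handling only
    }
}

// Handles key/value pairs from /bar; KeyValueTextInputFormat splits each
// line at the first tab into a (Text, Text) pair.
class MapClass2 extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... key/value handling only
    }
}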