MapReduce in Practice: I/O



This article deepens our understanding of the MapReduce programming model and walks through the common input formats and output formats it provides. We can also extend these formats ourselves: for example, to use MongoDB data as input, we could extend InputFormat and provide our own InputSplit implementation.


An in-depth look at the MapReduce model


We already know that the inputs and outputs of the map and reduce functions are key-value pairs; let's now look at the model in more depth.

First, let's analyze a default MapReduce job.


(1) One of the simplest MapReduce programs

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), getClass());
        // Only the input and output paths are set; everything else is
        // left at its default value.
        FileInputFormat.addInputPath(conf, new Path("/test/input/t"));
        FileOutputFormat.setOutputPath(conf, new Path("/test/output/t"));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
        System.exit(exitCode);
    }
}

(2) The same job as above, with the default values set explicitly

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunner;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduceWithDefaults extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), getClass());
        FileInputFormat.addInputPath(conf, new Path("/test/input/t"));
        FileOutputFormat.setOutputPath(conf, new Path("/test/output/t"));

        // These settings are exactly the defaults the minimal job relies on.
        conf.setInputFormat(TextInputFormat.class);
        conf.setNumMapTasks(1);
        conf.setMapperClass(IdentityMapper.class);
        conf.setMapRunnerClass(MapRunner.class);
        conf.setMapOutputKeyClass(LongWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setPartitionerClass(HashPartitioner.class);
        conf.setNumReduceTasks(1);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
        System.exit(exitCode);
    }
}


Input splits


An input split is a chunk of the input that is processed by a single map task.

MapReduce application developers do not have to deal with InputSplit directly, because splits are created by the InputFormat.

The InputFormat is responsible for generating the input splits and dividing them into records.


How to control the split size
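
In the old mapred API, the split size is determined by three values: the minimum split size, the maximum split size, and the HDFS block size. A minimal sketch of the calculation, assuming the standard FileInputFormat formula and old-API property names:

long minSize = 1;                  // mapred.min.split.size (default 1)
long maxSize = Long.MAX_VALUE;     // mapred.max.split.size (default Long.MAX_VALUE)
long blockSize = 64 * 1024 * 1024; // dfs.block.size of the input file

// Splits are block-sized unless the configured bounds force otherwise.
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

With the defaults, the split size equals the block size; raising the minimum above the block size produces larger splits, and lowering the maximum below it produces smaller ones.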



Avoid splitting


import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NoSplittableTextInputFormat extends TextInputFormat {

    // Returning false ensures each file is processed whole by one mapper.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

Treat the entire file as a record


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader
        implements RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit fileSplit, Configuration conf) {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public NullWritable createKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (!processed) {
            // Read the whole file into a single BytesWritable value.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }
}

Input format


Hierarchy of the InputFormat class




The FileInputFormat class


FileInputFormat is the base class for all InputFormat implementations that use files as their data source. It provides two things: a way to define which files are included in a job's input, and an implementation that generates splits for the input files. The job of dividing the splits into records is left to its subclasses.
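
For the first of these, FileInputFormat offers several path-setting methods; a minimal sketch (the paths are hypothetical placeholders):

// Add paths (files or directories) to the existing job input.
FileInputFormat.addInputPath(conf, new Path("/data/2013"));
FileInputFormat.addInputPaths(conf, "/data/2014,/data/2015");
// Or replace the job input with exactly these paths.
FileInputFormat.setInputPaths(conf, new Path("/data/all"));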


TextInputFormat


TextInputFormat is the default InputFormat. Each record is a line of input.

The key, of type LongWritable, is the byte offset of the line within the file. The value, of type Text, is the content of the line, excluding any line terminators (newline or carriage return).
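
As a hypothetical illustration (the sample file below is made up), a file containing the two lines

On the top of the Crumpetty Tree
The Quangle Wangle sat,

is presented to the mapper as the records

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)

where 33 is the byte offset of the second line: the 32 characters of the first line plus its newline.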


KeyValueTextInputFormat


Often each line of a file is itself a key-value pair, separated by a delimiter such as a tab. KeyValueTextInputFormat interprets such lines; the delimiter can be specified with the key.value.separator.in.input.line property, and its default value is a tab character.
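
A minimal sketch of switching to this format and changing the separator to a comma (old mapred API):

conf.setInputFormat(KeyValueTextInputFormat.class);
// Each line is split at the first comma: the left part becomes the key
// and the rest becomes the value (both of type Text).
conf.set("key.value.separator.in.input.line", ",");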


NLineInputFormat


If you want each mapper to receive a fixed number of lines of input, use NLineInputFormat.

As with TextInputFormat, the key is the byte offset of the line within the file and the value is the line itself. The mapred.line.input.format.linespermap property controls the value of N; the default is 1.
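
For example, a minimal sketch that sends 1000 lines to each mapper (old mapred API):

conf.setInputFormat(NLineInputFormat.class);
// N = 1000: each input split, and hence each mapper, gets 1000 lines.
conf.setInt("mapred.line.input.format.linespermap", 1000);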


Binary input


SequenceFileInputFormat reads sequence files. SequenceFileAsTextInputFormat converts a sequence file's keys and values to Text objects, and SequenceFileAsBinaryInputFormat retrieves them as opaque binary (BytesWritable) objects.
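
A minimal sketch of using a sequence file as job input (old mapred API):

// The file itself determines the record types: the mapper must declare
// the same key and value classes the sequence file was written with.
conf.setInputFormat(SequenceFileInputFormat.class);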


Multiple inputs


With multiple inputs, you can specify a different InputFormat and mapper for each input path; alternatively, you can use multiple input formats with just a single mapper, as in the sketch below.
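
A minimal sketch using the old API's MultipleInputs class; the paths and mapper classes are hypothetical placeholders:

// Each input path gets its own InputFormat and its own mapper; both
// mappers must emit the same output key/value types.
MultipleInputs.addInputPath(conf, new Path("/input/plain"),
        TextInputFormat.class, PlainTextMapper.class);
MultipleInputs.addInputPath(conf, new Path("/input/tsv"),
        KeyValueTextInputFormat.class, TsvMapper.class);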


Output format


Hierarchy of the OutputFormat class




Corresponding to the input formats, there are several kinds of output:

Text output (TextOutputFormat, the default), binary output (the SequenceFile output formats), multiple outputs (MultipleOutputs), lazy output (LazyOutputFormat), and database output.
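
For example, a minimal sketch of the default text output with a custom separator (old mapred API):

conf.setOutputFormat(TextOutputFormat.class);
// Keys and values are written in their string forms, separated by a tab
// by default; the separator is configurable via this property.
conf.set("mapred.textoutputformat.separator", ",");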
