MapReduce in Practice: I/O



This article deepens our understanding of the MapReduce programming model and walks through the common input formats and output formats it provides. We can also extend these formats ourselves: for example, to use MongoDB data as input, we could extend InputFormat and provide our own InputSplit implementation.


An in-depth look at the MapReduce model


We already know that the inputs and outputs of the map and reduce functions are key-value pairs; let's now look at the model in more depth.

First, let's analyze a default MapReduce job.


(1) One of the simplest MapReduce programs

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), getClass());
        // Only the input and output paths are set; everything else is
        // left at its default value.
        FileInputFormat.addInputPath(conf, new Path("/test/input/t"));
        FileOutputFormat.setOutputPath(conf, new Path("/test/output/t"));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
        System.exit(exitCode);
    }
}

(2) The same job as above, with the default values set explicitly

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunner;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduceWithDefaults extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), getClass());
        FileInputFormat.addInputPath(conf, new Path("/test/input/t"));
        FileOutputFormat.setOutputPath(conf, new Path("/test/output/t"));

        // These settings are exactly the defaults the minimal job relies on.
        conf.setInputFormat(TextInputFormat.class);
        conf.setNumMapTasks(1);
        conf.setMapperClass(IdentityMapper.class);
        conf.setMapRunnerClass(MapRunner.class);
        conf.setMapOutputKeyClass(LongWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setPartitionerClass(HashPartitioner.class);
        conf.setNumReduceTasks(1);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
        System.exit(exitCode);
    }
}


Input splits


An input split is a chunk of the input that is processed by a single map task.

MapReduce application developers do not have to deal with InputSplit directly, because splits are created by the InputFormat.

The InputFormat is responsible for generating the input splits and dividing them into records.


How to control the split size
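
In the old mapred API, the split size is determined by three values: the minimum split size, the maximum split size, and the HDFS block size. A minimal sketch of the calculation, assuming the standard FileInputFormat formula and old-API property names:

long minSize = 1;                  // mapred.min.split.size (default 1)
long maxSize = Long.MAX_VALUE;     // mapred.max.split.size (default Long.MAX_VALUE)
long blockSize = 64 * 1024 * 1024; // dfs.block.size of the input file

// Splits are block-sized unless the configured bounds force otherwise.
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

With the defaults, the split size equals the block size; raising the minimum above the block size produces larger splits, and lowering the maximum below it produces smaller ones.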



Avoid splitting


import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NoSplittableTextInputFormat extends TextInputFormat {

    // Returning false ensures each file is processed whole by one mapper.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

Treat the entire file as a record


import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader
        implements RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit fileSplit, Configuration conf) {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public NullWritable createKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (!processed) {
            // Read the whole file into a single BytesWritable value.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }
}

Input format


Hierarchy of the InputFormat class




The FileInputFormat class


FileInputFormat is the base class for all InputFormat implementations that use files as their data source. It provides two things: a way to define which files are included in a job's input, and an implementation that generates splits for the input files. The job of dividing the splits into records is left to its subclasses.
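
For the first of these, FileInputFormat offers several path-setting methods; a minimal sketch (the paths are hypothetical placeholders):

// Add paths (files or directories) to the existing job input.
FileInputFormat.addInputPath(conf, new Path("/data/2013"));
FileInputFormat.addInputPaths(conf, "/data/2014,/data/2015");
// Or replace the job input with exactly these paths.
FileInputFormat.setInputPaths(conf, new Path("/data/all"));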


TextInputFormat


TextInputFormat is the default InputFormat. Each record is a line of input.

The key, of type LongWritable, is the byte offset of the line within the file. The value, of type Text, is the content of the line, excluding any line terminators (newline or carriage return).
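
As a hypothetical illustration (the sample file below is made up), a file containing the two lines

On the top of the Crumpetty Tree
The Quangle Wangle sat,

is presented to the mapper as the records

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)

where 33 is the byte offset of the second line: the 32 characters of the first line plus its newline.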


KeyValueTextInputFormat


Often each line of a file is itself a key-value pair, separated by a delimiter such as a tab. KeyValueTextInputFormat interprets such lines; the delimiter can be specified with the key.value.separator.in.input.line property, and its default value is a tab character.
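
A minimal sketch of switching to this format and changing the separator to a comma (old mapred API):

conf.setInputFormat(KeyValueTextInputFormat.class);
// Each line is split at the first comma: the left part becomes the key
// and the rest becomes the value (both of type Text).
conf.set("key.value.separator.in.input.line", ",");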


NLineInputFormat


If you want each mapper to receive a fixed number of lines of input, use NLineInputFormat.

As with TextInputFormat, the key is the byte offset of the line within the file and the value is the line itself. The mapred.line.input.format.linespermap property controls the value of N; the default is 1.
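
For example, a minimal sketch that sends 1000 lines to each mapper (old mapred API):

conf.setInputFormat(NLineInputFormat.class);
// N = 1000: each input split, and hence each mapper, gets 1000 lines.
conf.setInt("mapred.line.input.format.linespermap", 1000);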


Binary input


SequenceFileInputFormat reads sequence files. SequenceFileAsTextInputFormat converts a sequence file's keys and values to Text objects, and SequenceFileAsBinaryInputFormat retrieves them as opaque binary (BytesWritable) objects.
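
A minimal sketch of using a sequence file as job input (old mapred API):

// The file itself determines the record types: the mapper must declare
// the same key and value classes the sequence file was written with.
conf.setInputFormat(SequenceFileInputFormat.class);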


Multiple inputs


With multiple inputs, you can specify a different InputFormat and mapper for each input path; alternatively, you can use multiple input formats with just a single mapper, as in the sketch below.
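
A minimal sketch using the old API's MultipleInputs class; the paths and mapper classes are hypothetical placeholders:

// Each input path gets its own InputFormat and its own mapper; both
// mappers must emit the same output key/value types.
MultipleInputs.addInputPath(conf, new Path("/input/plain"),
        TextInputFormat.class, PlainTextMapper.class);
MultipleInputs.addInputPath(conf, new Path("/input/tsv"),
        KeyValueTextInputFormat.class, TsvMapper.class);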


Output format


Hierarchy of the OutputFormat class




Corresponding to the input formats, there are several kinds of output:

Text output (TextOutputFormat, the default), binary output (the SequenceFile output formats), multiple outputs (MultipleOutputs), lazy output (LazyOutputFormat), and database output.
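
For example, a minimal sketch of the default text output with a custom separator (old mapred API):

conf.setOutputFormat(TextOutputFormat.class);
// Keys and values are written in their string forms, separated by a tab
// by default; the separator is configurable via this property.
conf.set("mapred.textoutputformat.separator", ",");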
