This article deepens our understanding of the MapReduce model and demonstrates how MapReduce I/O works in practice. The MapReduce programming model ships with common input formats and output formats, and we can extend them with input formats of our own. For example, to use MongoDB data as input, we can extend InputFormat and implement a corresponding InputSplit.
In-depth understanding of the MapReduce model
We already know that the inputs and outputs of the map and reduce functions are key-value pairs; let's start with an in-depth look at the model.
First, let's analyze a default MapReduce job.
(1) One of the simplest MapReduce programs
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduce extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Only the input and output paths are set; everything else
        // (input format, mapper, reducer, ...) falls back to the defaults.
        JobConf conf = new JobConf(getConf(), getClass());
        FileInputFormat.addInputPath(conf, new Path("/test/input/t"));
        FileOutputFormat.setOutputPath(conf, new Path("/test/output/t"));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
        System.exit(exitCode);
    }
}
(2) The same function as above, but with the default values set explicitly
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunner;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduceWithDefaults extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), getClass());
        FileInputFormat.addInputPath(conf, new Path("/test/input/t"));
        FileOutputFormat.setOutputPath(conf, new Path("/test/output/t"));

        // The explicit equivalents of the defaults used by MinimalMapReduce.
        conf.setInputFormat(TextInputFormat.class);
        conf.setNumMapTasks(1);
        conf.setMapperClass(IdentityMapper.class);
        conf.setMapRunnerClass(MapRunner.class);
        conf.setMapOutputKeyClass(LongWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setPartitionerClass(HashPartitioner.class);
        conf.setNumReduceTasks(1);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
        System.exit(exitCode);
    }
}
Input splits
An input split is the chunk of the input that is processed by a single map task.
The MapReduce application developer rarely has to deal with InputSplit directly, because splits are created by the InputFormat.
The InputFormat is responsible for generating the input splits and for dividing them into records.
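For reference, this is the contract the old org.apache.hadoop.mapred API defines for InputFormat (its two methods):

public interface InputFormat<K, V> {
    // Divide the job input into logical splits, one per map task.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    // Create the RecordReader that turns a split into key-value records.
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException;
}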
How to control the size of a split
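In the old mapred API, FileInputFormat computes the split size as max(minSize, min(goalSize, blockSize)), where minSize comes from the mapred.min.split.size property, goalSize is the total input size divided by the number of map tasks requested, and blockSize is the HDFS block size of the file. A minimal sketch of influencing that computation (the 128 MB value and the task count are illustrative):

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Raise the lower bound: every split will be at least 128 MB.
        conf.setLong("mapred.min.split.size", 128 * 1024 * 1024L);
        // Request more maps; this shrinks goalSize but is only a hint.
        conf.setNumMapTasks(4);
    }
}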
Avoid splitting
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NoSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
Treat the entire file as a record
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit fileSplit, Configuration conf) {
        this.fileSplit = fileSplit;
        this.conf = conf;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public NullWritable createKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (!processed) {
            // Read the entire file into a single BytesWritable record.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }
}
Input format
Hierarchy of the InputFormat class
FileInputFormat class
FileInputFormat is the base class for all InputFormat implementations that use files as their data source. It provides two things: a way to define which files are included in a job's input, and an implementation that generates splits for the input files. The job of dividing those splits into records is left to its subclasses.
TextInputFormat
TextInputFormat is the default InputFormat. Each record is a line of input.
The key, of type LongWritable, is the byte offset of the start of the line within the file. The value, of type Text, is the content of the line, excluding the line terminators (line feed and carriage return).
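For example, suppose a file's entire contents are the following two lines:

hello world
goodbye

TextInputFormat delivers two records, (0, "hello world") and (12, "goodbye"): "hello world" plus its trailing line feed occupies the first 12 bytes of the file, so the second line starts at offset 12.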
KeyValueTextInputFormat
Often each line of a file is itself a key-value pair, separated by a delimiter such as a tab character. KeyValueTextInputFormat interprets lines this way. The delimiter can be specified via the key.value.separator.in.input.line property; its default value is a tab.
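A minimal sketch (old mapred API; the ':' separator and the sample line are illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class KeyValueInputExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setInputFormat(KeyValueTextInputFormat.class);
        // Split each line at the first ':' instead of the default tab.
        conf.set("key.value.separator.in.input.line", ":");
        // A line such as "line1:hello world" then becomes
        // key "line1" (Text) and value "hello world" (Text).
    }
}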
NLineInputFormat
If you want each mapper to receive a fixed number of lines of input, use NLineInputFormat.
As with TextInputFormat, the key is the byte offset of the line within the file and the value is the line itself. The mapred.line.input.format.linespermap property controls the value of N; the default is 1.
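A minimal sketch (old mapred API; N = 10 is illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setInputFormat(NLineInputFormat.class);
        // Each map task receives 10 input lines (the last split may hold fewer).
        conf.setInt("mapred.line.input.format.linespermap", 10);
    }
}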
Binary input
SequenceFileInputFormat reads sequence files using their stored key and value types; SequenceFileAsTextInputFormat converts the keys and values to Text; SequenceFileAsBinaryInputFormat retrieves them as opaque BytesWritable objects.
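A minimal sketch (old mapred API) of selecting the basic sequence-file input:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class SequenceFileInputExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // The mapper's input key/value types must match the types
        // the sequence file was written with.
        conf.setInputFormat(SequenceFileInputFormat.class);
    }
}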
Multiple inputs
With multiple inputs you can specify a different InputFormat and Mapper for each input path; of course, you can also have several input formats share a single mapper. A sketch follows below.
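A minimal sketch using MultipleInputs from the old mapred API (the paths /input/a and /input/b and the classes MapperA and MapperB are hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultipleInputsExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Each path gets its own InputFormat and its own (hypothetical) Mapper;
        // both mappers must emit the same output key/value types.
        MultipleInputs.addInputPath(conf, new Path("/input/a"),
                TextInputFormat.class, MapperA.class);
        MultipleInputs.addInputPath(conf, new Path("/input/b"),
                KeyValueTextInputFormat.class, MapperB.class);
    }
}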
Output format
Hierarchy of the OutputFormat class
Corresponding to the input side, the output comes in several kinds:
text output, binary output, multiple outputs, lazy (deferred) output, and database output.
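As one example from this list, a minimal sketch of lazy output with the old mapred API; it wraps the real output format so that empty part files are not created for tasks that write no records:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.LazyOutputFormat;

public class LazyOutputExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Output files are created only when the first record is written.
        LazyOutputFormat.setOutputFormatClass(conf, TextOutputFormat.class);
    }
}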