In MRAppMaster, when a MapReduce job is initialized, the work is done by the transition() method of InitTransition in the job state machine JobImpl. Among other things, it performs the following steps:
1. Call the createSplits() method to create the splits and obtain the array of task split metadata, taskSplitMetaInfo (of type TaskSplitMetaInfo[]);
2. Determine the number of map tasks, numMapTasks: it is the length of the split metadata array, that is, there are as many map tasks as there are splits;
3. Determine the number of reduce tasks, numReduceTasks: it is taken from the job parameter mapreduce.job.reduces and defaults to 0 if the parameter is not configured;
4. Compute the input length inputLength, that is, the job size, from the split metadata;
5. Pass the job size inputLength to the job's makeUberDecision() method, which decides whether the job runs in Uber mode or Non-Uber mode.
The relevant key code is as follows:
    // Call createSplits() to create the splits and obtain the array of
    // task split metadata, taskSplitMetaInfo
    TaskSplitMetaInfo[] taskSplitMetaInfo = createSplits(job, job.jobId);

    // Number of map tasks: the length of the split metadata array,
    // that is, one map task per split
    job.numMapTasks = taskSplitMetaInfo.length;

    // Number of reduce tasks: taken from mapreduce.job.reduces,
    // defaults to 0 if the parameter is not configured
    job.numReduceTasks = job.conf.getInt(MRJobConfig.NUM_REDUCES, 0);

    // ... part of the code omitted ...

    // Compute the input length inputLength (the job size) from the split metadata
    long inputLength = 0;
    for (int i = 0; i < job.numMapTasks; ++i) {
      inputLength += taskSplitMetaInfo[i].getInputDataLength();
    }

    // Based on the job size inputLength, call the job's makeUberDecision()
    // method to decide between Uber mode and Non-Uber mode
    job.makeUberDecision(inputLength);
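To make the size calculation concrete, here is a minimal self-contained sketch. The class JobSizeSketch and its nested SplitMeta type are hypothetical stand-ins invented for illustration (SplitMeta models only the input length of a real TaskSplitMetaInfo); the sketch does not use the Hadoop classes.

```java
import java.util.Arrays;
import java.util.List;

final class JobSizeSketch {
    /** Hypothetical stand-in for TaskSplitMetaInfo; only the input length is modeled. */
    static final class SplitMeta {
        private final long inputDataLength;

        SplitMeta(long inputDataLength) {
            this.inputDataLength = inputDataLength;
        }

        long getInputDataLength() {
            return inputDataLength;
        }
    }

    /** One map task per split. */
    static int numMapTasks(List<SplitMeta> splits) {
        return splits.size();
    }

    /** The job size is the sum of the input lengths of all splits. */
    static long inputLength(List<SplitMeta> splits) {
        long total = 0;
        for (SplitMeta split : splits) {
            total += split.getInputDataLength();
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<SplitMeta> splits = Arrays.asList(
                new SplitMeta(64 * mb), new SplitMeta(64 * mb), new SplitMeta(10 * mb));
        System.out.println(numMapTasks(splits) + " map tasks, "
                + inputLength(splits) + " input bytes");
    }
}
```

The total computed here plays the role of the inputLength value that is handed to makeUberDecision().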
As we can see, whether a job runs in Uber mode or Non-Uber mode is decided by the job's makeUberDecision() method, which receives the job size inputLength. That is the subject of this article: how is the job's run mode, Uber or Non-Uber, determined?
In the article "YARN Source Analysis: MRAppMaster Run Modes Local, Uber, and Non-Uber" we explained what the Uber and Non-Uber run modes mean:
1. Uber mode: a mode designed to reduce the latency of small jobs. All tasks, whether map tasks or reduce tasks, are executed sequentially in the same container, which is in fact the container in which MRAppMaster itself runs;
2. Non-Uber mode: for long-running large jobs. Resources are requested for the map tasks first, and once a certain percentage of the map tasks have completed, resources are requested for the reduce tasks.
With the goal clear, let's look at the job's makeUberDecision() method. It is implemented in the JobImpl class, and its code is as follows:
    /**
     * Decide whether job can be run in uber mode based on various criteria.
     * @param dataInputLength Total length for all splits
     */
    private void makeUberDecision(long dataInputLength) {
      //FIXME: need new memory criterion for uber-decision (oops, too late here;
      // until AM-resizing supported,
      // must depend on job client to pass fat-slot needs)
      // these are no longer "system" settings, necessarily; user may override

      // Maximum number of map tasks allowed in Uber mode, sysMaxMaps:
      // parameter mapreduce.job.ubertask.maxmaps, defaults to 9
      int sysMaxMaps = conf.getInt(MRJobConfig.JOB_UBERTASK_MAXMAPS, 9);

      // Maximum number of reduce tasks allowed in Uber mode, sysMaxReduces:
      // parameter mapreduce.job.ubertask.maxreduces, defaults to 1
      int sysMaxReduces = conf.getInt(MRJobConfig.JOB_UBERTASK_MAXREDUCES, 1);

      // Maximum number of input bytes allowed in Uber mode, sysMaxBytes:
      // parameter mapreduce.job.ubertask.maxbytes, defaults to the default block
      // size of the file system holding the remote job submission directory
      // remoteJobSubmitDir
      long sysMaxBytes = conf.getLong(MRJobConfig.JOB_UBERTASK_MAXBYTES,
          fs.getDefaultBlockSize(this.remoteJobSubmitDir)); // FIXME: this is wrong; get FS from
                                     // [File?]InputFormat and default block size
                                     // from that

      // Memory slot size available to Uber mode, sysMemSizeForUberSlot:
      // parameter yarn.app.mapreduce.am.resource.mb, defaults to 1536 MB
      long sysMemSizeForUberSlot =
          conf.getInt(MRJobConfig.MR_AM_VMEM_MB,
              MRJobConfig.DEFAULT_MR_AM_VMEM_MB);

      // CPU slot size available to Uber mode, sysCPUSizeForUberSlot:
      // parameter yarn.app.mapreduce.am.resource.cpu-vcores, defaults to 1
      long sysCPUSizeForUberSlot =
          conf.getInt(MRJobConfig.MR_AM_CPU_VCORES,
              MRJobConfig.DEFAULT_MR_AM_CPU_VCORES);

      // Whether Uber mode is allowed at all, uberEnabled:
      // parameter mapreduce.job.ubertask.enable, defaults to false (not enabled)
      boolean uberEnabled =
          conf.getBoolean(MRJobConfig.JOB_UBERTASK_ENABLE, false);

      // Does the number of map tasks satisfy the Uber mode limit?
      boolean smallNumMapTasks = (numMapTasks <= sysMaxMaps);

      // Does the number of reduce tasks satisfy the Uber mode limit?
      boolean smallNumReduceTasks = (numReduceTasks <= sysMaxReduces);

      // Does the amount of input data satisfy the Uber mode limit?
      boolean smallInput = (dataInputLength <= sysMaxBytes);

      // ignoring overhead due to UberAM and statics as negligible here:

      // Memory required by a map task, requiredMapMB:
      // parameter mapreduce.map.memory.mb, defaults to 0 here
      long requiredMapMB = conf.getLong(MRJobConfig.MAP_MEMORY_MB, 0);

      // Memory required by a reduce task, requiredReduceMB:
      // parameter mapreduce.reduce.memory.mb, defaults to 0 here
      long requiredReduceMB = conf.getLong(MRJobConfig.REDUCE_MEMORY_MB, 0);

      // Memory required by a task, requiredMB: the larger of
      // requiredMapMB and requiredReduceMB
      long requiredMB = Math.max(requiredMapMB, requiredReduceMB);

      // CPU cores required by a map task, requiredMapCores:
      // parameter mapreduce.map.cpu.vcores, defaults to 1
      int requiredMapCores = conf.getInt(
          MRJobConfig.MAP_CPU_VCORES,
          MRJobConfig.DEFAULT_MAP_CPU_VCORES);

      // CPU cores required by a reduce task, requiredReduceCores:
      // parameter mapreduce.reduce.cpu.vcores, defaults to 1
      int requiredReduceCores = conf.getInt(
          MRJobConfig.REDUCE_CPU_VCORES,
          MRJobConfig.DEFAULT_REDUCE_CPU_VCORES);

      // CPU cores required by a task, requiredCores: the larger of
      // requiredMapCores and requiredReduceCores
      int requiredCores = Math.max(requiredMapCores, requiredReduceCores);

      // Special case: with 0 reduce tasks (a map-only job), the required
      // memory and CPU cores are those of the map task
      if (numReduceTasks == 0) {
        requiredMB = requiredMapMB;
        requiredCores = requiredMapCores;
      }

      // smallMemory is true when the memory required by a task, requiredMB, is
      // less than or equal to sysMemSizeForUberSlot, or when
      // sysMemSizeForUberSlot is set to "unlimited"
      boolean smallMemory =
          (requiredMB <= sysMemSizeForUberSlot)
          || (sysMemSizeForUberSlot == JobConf.DISABLED_MEMORY_LIMIT);

      // smallCpu is true when the CPU cores required by a task, requiredCores,
      // are less than or equal to sysCPUSizeForUberSlot
      boolean smallCpu = requiredCores <= sysCPUSizeForUberSlot;

      // notChainJob is true for a non-chained job, false for a chained job
      boolean notChainJob = !isChainJob(conf);

      // User has overall veto power over uberization, or user can modify
      // limits (overriding system settings and potentially shooting
      // themselves in the head).  Note that ChainMapper/Reducer are
      // fundamentally incompatible with MR-1220; they employ a blocking
      // queue between the maps/reduces and thus require parallel execution,
      // while "uber-AM" (MR AM + LocalContainerLauncher) loops over tasks
      // and thus requires sequential execution.

      // The job runs in Uber mode only if all seven conditions hold:
      // 1. mapreduce.job.ubertask.enable is true;
      // 2. the number of map tasks is <= mapreduce.job.ubertask.maxmaps (default 9);
      // 3. the number of reduce tasks is <= mapreduce.job.ubertask.maxreduces (default 1);
      // 4. the input size is <= mapreduce.job.ubertask.maxbytes (default: the
      //    default block size of the file system holding remoteJobSubmitDir);
      // 5. requiredMB <= sysMemSizeForUberSlot, or sysMemSizeForUberSlot is unlimited;
      // 6. requiredCores <= sysCPUSizeForUberSlot;
      // 7. the job is not a chained job.
      isUber = uberEnabled && smallNumMapTasks && smallNumReduceTasks
          && smallInput && smallMemory && smallCpu
          && notChainJob;

      if (isUber) {
        // The job runs in Uber mode: set the necessary parameters
        LOG.info("Uberizing job " + jobId + ": " + numMapTasks + "m+"
            + numReduceTasks + "r tasks (" + dataInputLength
            + " input bytes) will run sequentially on single node.");

        // make sure reduces are scheduled only after all map are completed
        // (mapreduce.job.reduce.slowstart.completedmaps is set to 1, so
        // resources for reduce tasks are assigned only after all maps finish)
        conf.setFloat(MRJobConfig.COMPLETED_MAPS_FOR_REDUCE_SLOWSTART,
            1.0f);

        // uber-subtask attempts all get launched on same node; if one fails,
        // probably should retry elsewhere, i.e., move entire uber-AM:  ergo,
        // limit attempts to 1 (or at most 2?  probably not...)
        conf.setInt(MRJobConfig.MAP_MAX_ATTEMPTS, 1);
        conf.setInt(MRJobConfig.REDUCE_MAX_ATTEMPTS, 1);

        // disable speculation for map and reduce tasks
        conf.setBoolean(MRJobConfig.MAP_SPECULATIVE, false);
        conf.setBoolean(MRJobConfig.REDUCE_SPECULATIVE, false);
      } else {
        // The job runs in Non-Uber mode: log why Uber mode was rejected,
        // based on the seven flags above
        StringBuilder msg = new StringBuilder();
        msg.append("Not uberizing ").append(jobId).append(" because:");
        if (!uberEnabled)          // the Uber mode switch is off
          msg.append(" not enabled;");
        if (!smallNumMapTasks)     // too many map tasks
          msg.append(" too many maps;");
        if (!smallNumReduceTasks)  // too many reduce tasks
          msg.append(" too many reduces;");
        if (!smallInput)           // input too large
          msg.append(" too much input;");
        if (!smallCpu)             // needs too much CPU
          msg.append(" too much CPU;");
        if (!smallMemory)          // needs too much memory
          msg.append(" too much RAM;");
        if (!notChainJob)          // chained job, incompatible with Uber mode
          msg.append(" chainjob;");
        LOG.info(msg.toString());
      }
    }
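The memory and CPU checks are the least obvious part of the method, so here is a small standalone sketch of just that logic. The class UberResourceSketch and its method names are hypothetical; the constant mirrors JobConf.DISABLED_MEMORY_LIMIT, which is -1 in Hadoop, and the 1536 MB and 1 vcore figures in main are the defaults quoted in the comments above.

```java
final class UberResourceSketch {
    /** Mirrors JobConf.DISABLED_MEMORY_LIMIT (-1 means "no limit"). */
    static final long DISABLED_MEMORY_LIMIT = -1L;

    /** A map-only job needs only the map task's memory; otherwise take the larger value. */
    static long requiredMB(long mapMB, long reduceMB, int numReduceTasks) {
        return numReduceTasks == 0 ? mapMB : Math.max(mapMB, reduceMB);
    }

    /** Same rule for CPU cores. */
    static int requiredCores(int mapCores, int reduceCores, int numReduceTasks) {
        return numReduceTasks == 0 ? mapCores : Math.max(mapCores, reduceCores);
    }

    /** True if the task fits into the AM's memory slot, or the slot is unlimited. */
    static boolean smallMemory(long requiredMB, long amSlotMB) {
        return requiredMB <= amSlotMB || amSlotMB == DISABLED_MEMORY_LIMIT;
    }

    /** True if the task fits into the AM's CPU slot. */
    static boolean smallCpu(int requiredCores, long amSlotCores) {
        return requiredCores <= amSlotCores;
    }

    public static void main(String[] args) {
        long amSlotMB = 1536;   // default of yarn.app.mapreduce.am.resource.mb
        long amSlotCores = 1;   // default of yarn.app.mapreduce.am.resource.cpu-vcores

        // A job with reducers that need 2048 MB does not fit into the AM slot...
        System.out.println(smallMemory(requiredMB(1024, 2048, 1), amSlotMB));
        // ...but the same settings on a map-only job do, because only the map side counts.
        System.out.println(smallMemory(requiredMB(1024, 2048, 0), amSlotMB));
        System.out.println(smallCpu(requiredCores(1, 2, 0), amSlotCores));
    }
}
```

The point of the map-only special case is that a large reduce-side setting should not veto Uber mode for a job that never runs a reducer.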
The logic of the makeUberDecision() method is clear, but it involves many conditions and parameters. In general, whether a MapReduce job runs in Uber mode or Non-Uber mode depends on the following 7 conditions, and all of them must hold for Uber mode:
1. The parameter mapreduce.job.ubertask.enable is set to true, that is, Uber mode is allowed. This is the on/off switch for Uber mode;
2. The number of map tasks satisfies the Uber mode limit: it is less than or equal to the value of mapreduce.job.ubertask.maxmaps, or less than or equal to 9 if the parameter is not configured;
3. The number of reduce tasks satisfies the Uber mode limit: it is less than or equal to the value of mapreduce.job.ubertask.maxreduces, or less than or equal to 1 if the parameter is not configured;
4. The amount of input data satisfies the Uber mode limit: it is less than or equal to the value of mapreduce.job.ubertask.maxbytes, or, if the parameter is not configured, less than or equal to the default block size of the file system that holds the remote job submission directory remoteJobSubmitDir;
5. The memory required by a task in the MR job, requiredMB, is less than or equal to the memory slot size available to Uber mode, sysMemSizeForUberSlot, or sysMemSizeForUberSlot is set to unlimited;
6. The number of CPU cores required by a task in the MR job, requiredCores, is less than or equal to the CPU slot size available to Uber mode, sysCPUSizeForUberSlot;
7. The job is not a chained job.
The first 6 conditions are described clearly above and in the makeUberDecision() code and its comments, so readers can work through them on their own.
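As a compact illustration of how the seven conditions combine, here is a minimal sketch. UberDecisionSketch and vetoReasons() are hypothetical names; the limits 9 and 1 are the defaults named above, and the byte limit is passed in because its real default depends on the file system.

```java
import java.util.ArrayList;
import java.util.List;

final class UberDecisionSketch {
    // Defaults named in the article for mapreduce.job.ubertask.maxmaps / maxreduces.
    static final int DEFAULT_MAX_MAPS = 9;
    static final int DEFAULT_MAX_REDUCES = 1;

    /**
     * Returns the reasons that rule out Uber mode, in the order the seven
     * conditions are checked. An empty list means the job can run in Uber mode.
     */
    static List<String> vetoReasons(boolean uberEnabled, int numMaps, int numReduces,
            long inputBytes, long maxBytes, boolean smallMemory, boolean smallCpu,
            boolean chainJob) {
        List<String> reasons = new ArrayList<>();
        if (!uberEnabled) reasons.add("not enabled");
        if (numMaps > DEFAULT_MAX_MAPS) reasons.add("too many maps");
        if (numReduces > DEFAULT_MAX_REDUCES) reasons.add("too many reduces");
        if (inputBytes > maxBytes) reasons.add("too much input");
        if (!smallCpu) reasons.add("too much CPU");
        if (!smallMemory) reasons.add("too much RAM");
        if (chainJob) reasons.add("chainjob");
        return reasons;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size, for illustration only

        List<String> small = vetoReasons(true, 3, 1, 10_000L, blockSize, true, true, false);
        System.out.println("uber = " + small.isEmpty());

        List<String> large = vetoReasons(true, 20, 4, 10 * blockSize, blockSize, true, true, false);
        System.out.println("uber = " + large.isEmpty() + ", because: " + large);
    }
}
```

Collecting the reasons rather than returning a bare boolean follows the same idea as the "Not uberizing ... because:" log line: a single failed condition is enough to veto Uber mode, and it is useful to know which one it was.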
Below we focus on the 7th condition: the job is not a chained job. How is this determined? By the isChainJob() method, whose code is as follows:
    /**
     * ChainMapper and ChainReducer must execute in parallel, so they're not
     * compatible with uberization/LocalContainerLauncher (100% sequential).
     */
    private boolean isChainJob(Configuration conf) {
      boolean isChainJob = false;
      try {
        // Get the map class name mapClassName from parameter mapreduce.job.map.class
        String mapClassName = conf.get(MRJobConfig.MAP_CLASS_ATTR);
        if (mapClassName != null) {
          // Load the map class mapClass by its name
          Class<?> mapClass = Class.forName(mapClassName);
          // Use Class.isAssignableFrom() to check whether mapClass is ChainMapper
          // or a subclass of it; if so, this is a chained job
          if (ChainMapper.class.isAssignableFrom(mapClass))
            isChainJob = true;
        }
      } catch (ClassNotFoundException cnfe) {
        // don't care; assume it's not derived from ChainMapper
      } catch (NoClassDefFoundError ignored) {
      }
      try {
        // Get the reduce class name reduceClassName from parameter mapreduce.job.reduce.class
        String reduceClassName = conf.get(MRJobConfig.REDUCE_CLASS_ATTR);
        if (reduceClassName != null) {
          // Load the reduce class reduceClass by its name
          Class<?> reduceClass = Class.forName(reduceClassName);
          // Use Class.isAssignableFrom() to check whether reduceClass is ChainReducer
          // or a subclass of it; if so, this is a chained job
          if (ChainReducer.class.isAssignableFrom(reduceClass))
            isChainJob = true;
        }
      } catch (ClassNotFoundException cnfe) {
        // don't care; assume it's not derived from ChainReducer
      } catch (NoClassDefFoundError ignored) {
      }
      return isChainJob;
    }
In essence, it checks whether the map class is ChainMapper or a direct or indirect subclass of it, or whether the reduce class is ChainReducer or a direct or indirect subclass of it. The class names are read from the parameters mapreduce.job.map.class and mapreduce.job.reduce.class, the Class objects are obtained with Class.forName(), and Class's isAssignableFrom() method performs the check. If either check succeeds, the job is a chained job. It is that simple.
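The Class.forName() plus isAssignableFrom() pattern can be tried out without Hadoop. The sketch below uses JDK collection classes in place of ChainMapper and the configured map class; AssignableSketch and derivesFrom() are hypothetical names used only for this illustration.

```java
import java.util.AbstractList;
import java.util.ArrayList;

final class AssignableSketch {
    /**
     * Returns true if the class named className is base itself or a direct or
     * indirect subclass of base. A missing name or an unloadable class counts
     * as "not derived", as in isChainJob().
     */
    static boolean derivesFrom(String className, Class<?> base) {
        if (className == null) {
            return false;
        }
        try {
            return base.isAssignableFrom(Class.forName(className));
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ArrayList extends AbstractList, so this prints true.
        System.out.println(derivesFrom("java.util.ArrayList", AbstractList.class));
        // A class is assignable from itself, so this prints true as well.
        System.out.println(derivesFrom("java.util.ArrayList", ArrayList.class));
        // String is unrelated to AbstractList: false.
        System.out.println(derivesFrom("java.lang.String", AbstractList.class));
        // An unknown class name is swallowed and treated as false.
        System.out.println(derivesFrom("no.such.ClassName", AbstractList.class));
    }
}
```

Note the direction of the call: the base class is the receiver and the candidate class is the argument, exactly as in ChainMapper.class.isAssignableFrom(mapClass).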
So the questions are: what is a chained job, and why can't a job run in Uber mode if its map or reduce class is derived from ChainMapper or ChainReducer? Let's answer them in turn.
First, what is a chained job? Sometimes a single map step and a single reduce step cannot fulfill your business needs, and the data has to pass through several processing steps. When multiple Mappers, optionally together with a Reducer, are strung into a chain that runs as part of one MapReduce job, we have a chained job. As isChainJob() shows, the defining condition is that the job's map class is ChainMapper or a subclass of it, or that its reduce class is ChainReducer or a subclass of it.
So why can't such a job run in Uber mode? The answer lies in how ChainMapper and ChainReducer work. We give only the most direct answer here; for more detail, see articles dedicated to ChainMapper and ChainReducer.
First look at the implementation of ChainMapper. It has a member variable chain of type Chain, which is declared as a field and initialized in the setup() method as follows:
    private Chain chain;

    protected void setup(Context context) {
      chain = new Chain(true);
      chain.setup(context.getConfiguration());
    }
Chain has two key member variables, the Mapper list mappers and the thread list threads:
    private List<Mapper> mappers = new ArrayList<Mapper>();
    private List<Thread> threads = new ArrayList<Thread>();
In ChainMapper's run() method, each Mapper in the chain's mappers list is registered through Chain's addMapper() method. addMapper() essentially creates one MapRunner thread per Mapper and adds it to the threads list. ChainMapper then starts all threads in threads through chain.startAllThreads() and waits for them with chain.joinAllThreads(). The key code is as follows:
ChainMapper's run() method:
    public void run(Context context) throws IOException, InterruptedException {

      setup(context);

      int numMappers = chain.getAllMappers().size();
      if (numMappers == 0) {
        return;
      }

      ChainBlockingQueue<Chain.KeyValuePair<?, ?>> inputqueue;
      ChainBlockingQueue<Chain.KeyValuePair<?, ?>> outputqueue;
      if (numMappers == 1) {
        chain.runMapper(context, 0);
      } else {
        // add all the mappers with proper context
        // add first mapper
        outputqueue = chain.createBlockingQueue();
        chain.addMapper(context, outputqueue, 0);
        // add other mappers
        for (int i = 1; i < numMappers - 1; i++) {
          inputqueue = outputqueue;
          outputqueue = chain.createBlockingQueue();
          chain.addMapper(inputqueue, outputqueue, context, i);
        }
        // add last mapper
        chain.addMapper(outputqueue, context, numMappers - 1);
      }

      // start all threads
      chain.startAllThreads();

      // wait for all threads
      chain.joinAllThreads();
    }
One of Chain's addMapper() methods:
    /**
     * Add mapper that reads and writes from/to the queue
     */
    @SuppressWarnings("unchecked")
    void addMapper(ChainBlockingQueue<KeyValuePair<?, ?>> input,
        ChainBlockingQueue<KeyValuePair<?, ?>> output,
        TaskInputOutputContext context, int index) throws IOException,
        InterruptedException {
      Configuration conf = getConf(index);
      Class<?> keyClass = conf.getClass(MAPPER_INPUT_KEY_CLASS, Object.class);
      Class<?> valueClass = conf.getClass(MAPPER_INPUT_VALUE_CLASS, Object.class);
      Class<?> keyOutClass = conf.getClass(MAPPER_OUTPUT_KEY_CLASS, Object.class);
      Class<?> valueOutClass = conf.getClass(MAPPER_OUTPUT_VALUE_CLASS,
          Object.class);
      RecordReader rr = new ChainRecordReader(keyClass, valueClass, input, conf);
      RecordWriter rw = new ChainRecordWriter(keyOutClass, valueOutClass, output,
          conf);
      MapRunner runner = new MapRunner(mappers.get(index), createMapContext(rr,
          rw, context, getConf(index)), rr, rw);
      threads.add(runner);
    }
As you can see, ChainMapper runs multiple Mappers rather than one. The map work is no longer done by a single Mapper in a single thread: each Mapper in the chain runs in its own MapRunner thread, and neighboring Mappers exchange records through blocking queues, so they have to execute in parallel. That makes ChainMapper unsuitable for Uber mode, in which the uber AM loops over the tasks and executes them one after another.
The same is true of ChainReducer, so it needs no separate explanation.
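To see why stages connected by a blocking queue need parallel execution, consider the following sketch built on java.util.concurrent. It is not Hadoop's Chain implementation: ChainQueueSketch, the two toy stages, the end-of-stream marker, and the queue capacity of 1 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class ChainQueueSketch {
    // Unique end-of-stream marker, compared by identity (hypothetical convention).
    private static final String END = new String("<end>");

    /** Runs two stages in parallel threads connected by a bounded blocking queue. */
    static List<String> runChain(List<String> input) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        List<String> output = new ArrayList<>();

        // Stage 1: upper-cases each record and hands it to the queue.
        Thread first = new Thread(() -> {
            try {
                for (String record : input) {
                    queue.put(record.toUpperCase());
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Stage 2: drains the queue and appends "!" to each record.
        Thread second = new Thread(() -> {
            try {
                for (String r = queue.take(); r != END; r = queue.take()) {
                    output.add(r + "!");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for stages", e);
        }
        return output;
    }

    /**
     * Runs only the first stage, with nobody draining the queue, and reports
     * how many records it could hand over. offer() with a timeout is used
     * instead of put() so the sketch returns instead of blocking forever.
     */
    static int recordsAcceptedWithoutConsumer(List<String> input) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        int accepted = 0;
        try {
            for (String record : input) {
                if (!queue.offer(record, 50, TimeUnit.MILLISECONDS)) {
                    break; // queue is full: a real put() would block here
                }
                accepted++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> input = java.util.Arrays.asList("a", "b", "c");
        System.out.println(runChain(input));                       // [A!, B!, C!]
        System.out.println(recordsAcceptedWithoutConsumer(input)); // 1
    }
}
```

With both stages running concurrently, all records flow through. When the first stage runs alone, it stalls as soon as the queue is full, which is what would happen if chained stages were forced into the strictly sequential execution used by Uber mode.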
YARN Source Analysis: How Is the Job's Run Mode Determined, Uber or Non-Uber?