Hadoop MapReduce Job Submission (client)

Hadoop MapReduce Jar File Upload
When submitting a job we usually execute a command similar to hadoop jar wordcount.jar test.WordCount and then wait for the job to complete to see the results. During job execution the client uploads the jar file to HDFS, the JobTracker (JT) then initializes the job and hands the individual tasks to the TaskTrackers (TT). Here we mainly look at the client's side of that process, to understand how we could build a more convenient way of submitting jobs. hadoop itself is a shell script that executes different branches depending on its arguments; when a job is submitted it eventually calls the main function of the org.apache.hadoop.util.RunJar class.
RunJar's main function accepts, in the simplest case, the jar file and the main class as parameters. Before running that class, the Hadoop script does the usual initialization work: setting the relevant directories, building the CLASSPATH, initializing system properties, and so on.
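Since RunJar (shown below) does little more than unpack the jar and reflectively call the driver's main method, a job can also be submitted from any running JVM without going through the hadoop script, as long as the framework is told explicitly which jar to upload. The following is only a minimal sketch of that idea; the jar path, the class name SubmitFromCode and the WordCount classes it references are illustrative, not taken from the original:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitFromCode {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // what the hadoop script plus setJarByClass normally arrange for us:
    // tell the framework which jar to upload to HDFS (illustrative path)
    conf.set("mapred.jar", "/path/to/wordcount.jar");

    Job job = new Job(conf, "wordcount-from-code");
    job.setMapperClass(WordCountMapper.class);   // classes assumed to live in the jar
    job.setReducerClass(WordCountReduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/tmp/a.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/output"));

    // submit() returns immediately; waitForCompletion(true) would block and
    // print progress, exactly as in the WordCount driver further down
    job.submit();
  }
}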



RunJar's main function checks whether the parameters were passed in correctly, unpacks the jar file, and then enters the job's own main function:
public static void main(String[] args) throws Throwable {
    String usage = "RunJar jarFile [mainClass] args...";
    // check the number of arguments
    if (args.length < 1) {
      System.err.println(usage);
      System.exit(-1);
    }

    int firstArg = 0;
    String fileName = args[firstArg++];
    File file = new File(fileName);
    String mainClassName = null;

    JarFile jarFile;
    try {
      // open the jar file
      jarFile = new JarFile(fileName);
    } catch (IOException io) {
      throw new IOException("Error opening job jar: " + fileName).initCause(io);
    }
    // read the Main-Class attribute from the jar manifest
    Manifest manifest = jarFile.getManifest();
    if (manifest != null) {
      mainClassName = manifest.getMainAttributes().getValue("Main-Class");
    }
    jarFile.close();

    if (mainClassName == null) {
      if (args.length < 2) {
        System.err.println(usage);
        System.exit(-1);
      }
      mainClassName = args[firstArg++];
    }
    mainClassName = mainClassName.replaceAll("/", ".");

    // temporary directory used for unpacking the jar
    File tmpDir = new File(new Configuration().get("hadoop.tmp.dir"));
    tmpDir.mkdirs();
    if (!tmpDir.isDirectory()) {
      System.err.println("Mkdirs failed to create " + tmpDir);
      System.exit(-1);
    }
    // create the working directory inside the temporary directory
    final File workDir = File.createTempFile("hadoop-unjar", "", tmpDir);
    workDir.delete();
    workDir.mkdirs();
    if (!workDir.isDirectory()) {
      System.err.println("Mkdirs failed to create " + workDir);
      System.exit(-1);
    }

    // register a shutdown hook that deletes the working directory when done
    Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
          try {
            FileUtil.fullyDelete(workDir);
          } catch (IOException e) { }
        }
      });

    // unpack the jar file
    unJar(file, workDir);

    // build the classpath; note that classes/ and lib/ are added as well
    ArrayList<URL> classPath = new ArrayList<URL>();
    classPath.add(new File(workDir + "/").toURL());
    classPath.add(file.toURL());
    classPath.add(new File(workDir, "classes/").toURL());
    File[] libs = new File(workDir, "lib").listFiles();
    if (libs != null) {
      for (int i = 0; i < libs.length; i++) {
        classPath.add(libs[i].toURL());
      }
    }
    ClassLoader loader = new URLClassLoader(classPath.toArray(new URL[0]));

    // look up the main method via reflection and invoke it
    Thread.currentThread().setContextClassLoader(loader);
    Class<?> mainClass = Class.forName(mainClassName, true, loader);
    Method main = mainClass.getMethod("main", new Class[] {
        Array.newInstance(String.class, 0).getClass() });
    String[] newArgs = Arrays.asList(args).subList(firstArg, args.length).toArray(new String[0]);
    try {
      // start the call: this enters the main function of our job
      main.invoke(null, new Object[] { newArgs });
    } catch (InvocationTargetException e) {
      throw e.getTargetException();
    }
  }
Here is a piece of our own main function code; these drivers are basically all the same, so only part of it is shown:
public class WordCount {
  public static void main(String[] args) throws Exception {
    // create the job
    Configuration conf = new Configuration();
    Job job = new Job(conf, "WordCount");
    job.setJarByClass(WordCount.class);

    // set the output key/value types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // set the map and reduce classes
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReduce.class);

    // set the input and output paths
    FileInputFormat.addInputPath(job, new Path("/tmp/a.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/output"));

    // wait for the job to complete
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
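The driver above references WordCountMapper and WordCountReduce, which the original text omits. For completeness, here is a minimal sketch of what they typically look like (my assumption of a standard word count, each class in its own source file):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // emit (word, 1) for every token of the input line
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public class WordCountReduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // sum the counts emitted for each word
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}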
The following function shows the client's flow clearly: first the job is submitted, then the job status is collected continuously, and finally the job status is returned. Here we focus on the submit function:
  public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
      submit();                                  // submit the job
    }
    if (verbose) {
      jobClient.monitorAndPrintJob(conf, info);  // monitor and print progress
    } else {
      info.waitForCompletion();                  // just wait for completion
    }
    return isSuccessful();                       // return the final status
  }
In the submit function the client connects to the JobTracker and then submits the job:
  public void submit() throws IOException, InterruptedException,
                              ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();

    // connect to the JobTracker and submit the job
    connect();
    info = jobClient.submitJobInternal(conf);
    super.setJobID(info.getID());
    state = JobState.RUNNING;
  }
The emphasis is on the submitJobInternal function, in which the input and output specifications of the MR job are validated, the splits and the numbers of map and reduce tasks are computed, the configuration file and the jar file are uploaded, and the split files are written:
  public RunningJob submitJobInternal(final JobConf job
                                      ) throws FileNotFoundException,
                                               ClassNotFoundException,
                                               InterruptedException,
                                               IOException {
    /*
     * configure the command line options correctly on the submitting dfs
     */
    return ugi.doAs(new PrivilegedExceptionAction<RunningJob>() {
      public RunningJob run() throws FileNotFoundException,
          ClassNotFoundException, InterruptedException, IOException {
        JobConf jobCopy = job;
        Path jobStagingArea = JobSubmissionFiles.getStagingDir(JobClient.this,
            jobCopy);
        JobID jobId = jobSubmitClient.getNewJobId();
        Path submitJobDir = new Path(jobStagingArea, jobId.toString());
        jobCopy.set("mapreduce.job.dir", submitJobDir.toString());
        JobStatus status = null;
        try {
          populateTokenCache(jobCopy, jobCopy.getCredentials());

          // upload of the jar file and the configuration/cache files happens here
          copyAndConfigureFiles(jobCopy, submitJobDir);

          // get delegation token for the dir
          TokenCache.obtainTokensForNamenodes(jobCopy.getCredentials(),
                                              new Path[] { submitJobDir },
                                              jobCopy);

          Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
          int reduces = jobCopy.getNumReduceTasks();
          InetAddress ip = InetAddress.getLocalHost();
          if (ip != null) {
            job.setJobSubmitHostAddress(ip.getHostAddress());
            job.setJobSubmitHostName(ip.getHostName());
          }
          JobContext context = new JobContext(jobCopy, jobId);

          // check whether the output directory specification is valid
          if (reduces == 0 ? jobCopy.getUseNewMapper()
                           : jobCopy.getUseNewReducer()) {
            org.apache.hadoop.mapreduce.OutputFormat<?,?> output =
              ReflectionUtils.newInstance(context.getOutputFormatClass(),
                  jobCopy);
            output.checkOutputSpecs(context);
          } else {
            jobCopy.getOutputFormat().checkOutputSpecs(fs, jobCopy);
          }

          jobCopy = (JobConf) context.getConfiguration();

          // write the split files: job.split and job.splitmetainfo
          FileSystem fs = submitJobDir.getFileSystem(jobCopy);
          LOG.debug("Creating splits at " + fs.makeQualified(submitJobDir));
          int maps = writeSplits(context, submitJobDir);
          jobCopy.setNumMapTasks(maps);          // set the number of map tasks

          // write "queue admins of the queue to which job is being submitted"
          // to the job file; the default queue is used if none was set
          String queue = jobCopy.getQueueName();
          AccessControlList acl = jobSubmitClient.getQueueAdmins(queue);
          jobCopy.set(QueueManager.toFullPropertyName(queue,
              QueueACL.ADMINISTER_JOBS.getAclName()), acl.getACLString());

          // write the configuration file job.xml
          FSDataOutputStream out =
            FileSystem.create(fs, submitJobFile,
                new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
          try {
            jobCopy.writeXml(out);
          } finally {
            out.close();
          }

          //
          // Now, actually submit the job (using the submit name)
          //
          printTokens(jobId, jobCopy.getCredentials());
          // the real submission to the JobTracker
          status = jobSubmitClient.submitJob(
              jobId, submitJobDir.toString(), jobCopy.getCredentials());
          JobProfile prof = jobSubmitClient.getJobProfile(jobId);
          if (status != null && prof != null) {
            return new NetworkedJob(status, prof, jobSubmitClient);
          } else {
            throw new IOException("Could not launch job");
          }
        } finally {
          if (status == null) {
            LOG.info("Cleaning up the staging area " + submitJobDir);
            if (fs != null && submitJobDir != null)
              fs.delete(submitJobDir, true);
          }
        }
      }
    });
  }
Here is a look at how the jar and configuration files are copied:
  private void copyAndConfigureFiles(JobConf job, Path submitJobDir,
      short replication) throws IOException, InterruptedException {
    ...  // a series of checks on the job comes first; below, the jar is uploaded to HDFS
    String originalJarPath = job.getJar();
    if (originalJarPath != null) {           // copy jar to JobTracker's fs
      if ("".equals(job.getJobName())) {     // use jar name if job is not named
        job.setJobName(new Path(originalJarPath).getName());
      }
      // target path of the jar inside the submit directory
      Path submitJarFile = JobSubmissionFiles.getJobJar(submitJobDir);
      job.setJar(submitJarFile.toString());
      // the actual copy to HDFS starts here
      fs.copyFromLocalFile(new Path(originalJarPath), submitJarFile);
      // set the jar's replication factor: more replicas on a large cluster
      // reduce the time each TaskTracker needs to copy the file locally
      fs.setReplication(submitJarFile, replication);
      // set the file permissions
      fs.setPermission(submitJarFile,
          new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
    } else {
      LOG.warn("No job jar file set.  User classes may not be found. " +
               "See JobConf(Class) or JobConf#setJar(String).");
    }
  }
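As an aside that is not spelled out in the original: the replication value passed into copyAndConfigureFiles is read from the client configuration, in Hadoop 1.x from the mapred.submit.replication property (10 by default, to my knowledge), so on a large cluster it can simply be raised before submitting, for example:

    // assumption: mapred.submit.replication controls how many replicas of the
    // submission files (job.jar, job.split, ...) are written to the staging dir
    Configuration conf = new Configuration();
    conf.setInt("mapred.submit.replication", 20);
    Job job = new Job(conf, "WordCount");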
In addition, the size of each map's input split is also computed on the client. The parameters involved are mapred.min.split.size and mapred.max.split.size, and the split size is calculated as follows:
  protected long computeSplitSize(long blockSize, long minSize,
                                  long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }
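A quick illustration of the formula (the numbers are chosen here, not taken from the article), assuming the usual 64 MB HDFS block size:

    long blockSize = 64L * 1024 * 1024;                                      // 64 MB HDFS block
    // defaults: mapred.min.split.size = 1, mapred.max.split.size = Long.MAX_VALUE
    long s1 = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);               // 64 MB: the block size
    long s2 = computeSplitSize(blockSize, 1L, 32L * 1024 * 1024);            // 32 MB: maxSize caps it
    long s3 = computeSplitSize(blockSize, 128L * 1024 * 1024, Long.MAX_VALUE); // 128 MB: minSize raises it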
So the actual split size is not necessarily the value we set. Of course, this only covers FileInputFormat, the input format we use most often; other input types split their input differently. FileInputFormat splits as follows:
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    // fixed lower and upper bounds for the split size
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate the splits; a job may have several input files, wildcards included
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {              // iterate over each input file
      Path path = file.getPath();                // file path
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();               // file length
      // block locations of the file
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      // check whether the file can be split
      if ((length != 0) && isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        // compute the split size with the formula shown above
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        // split the file logically; note that splitting only continues while the
        // remainder is more than SPLIT_SLOP (1.1, hard-coded in 1.x) times the split size
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
          // index of the block holding the current split offset
          int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
          splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }
        // handle the tail of the file
        if (bytesRemaining != 0) {
          splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length - 1].getHosts()));
        }
      } else if (length != 0) {
        // the file cannot be split, so create a single split
        splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
      } else {
        // create empty hosts array for zero-length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }

    // save the number of input files
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    LOG.debug("Total # of splits: " + splits.size());
    return splits;
  }

