MapReduce Scheduling and Execution Principles: Series Articles

Transferred from: http://blog.csdn.net/jaytalent?viewmode=contents

Articles in the MapReduce Scheduling and Execution Principles series:

I. MapReduce Scheduling and Execution Principles: Job Submission

II. MapReduce Scheduling and Execution Principles: Job Initialization

III. MapReduce Scheduling and Execution Principles: Task Scheduling

IV. MapReduce Scheduling and Execution Principles: Task Scheduling (cont.)

The purpose of this article is to trace the entire life cycle of a MapReduce job in Hadoop after it is submitted to the framework, and to summarize it for later reference; corrections are welcome. The article does not address Hadoop's overall architecture design; please refer to the relevant books and literature if you are interested. Along the way, parts of the source code of interest are examined in order to strengthen the fundamentals.
Author: jaytalent
Start date: September 9, 2013

References:
[1] Hadoop Technology Insider: In-Depth Analysis of MapReduce Architecture Design and Implementation Principles, Dong Xicheng
[2] Hadoop 1.0.0 source code
[3] Hadoop Technology Insider: In-Depth Analysis of Hadoop Common and HDFS Architecture Design and Implementation Principles, Cai Bin, Chen Yiping

The life cycle of a MapReduce job is broadly divided into five stages [1]:
1. Job submission and initialization
2. Task scheduling and monitoring
3. Task runtime environment preparation
4. Task execution
5. Job completion

Because job submission is done on the client side while initialization is done in the JobTracker, this article covers only the former; initialization is left to the next article.

I. Job Submission and Initialization

Take the WordCount job as an example. First look at the code snippet that submits the job:
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
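For context, here is a minimal sketch of the driver code that surrounds this snippet; it follows the shape of the standard WordCount example, and the usage message and argument handling are illustrative rather than taken from the original post:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // GenericOptionsParser consumes the generic Hadoop options and leaves
        // the application arguments, here the input and output paths.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        // ... the job configuration shown above goes here ...
      }
    }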
The new MapReduce API is used here. The job.waitForCompletion(true) call starts the job submission process: it calls Job.submit(), which in turn calls JobClient.submitJobInternal() to actually submit the job. JobClient.submitJobInternal() performs the following preparatory steps.

1. Get a job ID
    JobID jobId = jobSubmitClient.getNewJobId();
The job ID is obtained from the JobTracker through an RPC call; getNewJobId() is declared in the JobSubmissionProtocol interface, and jobSubmitClient is a proxy for that protocol:
    private JobSubmissionProtocol jobSubmitClient;
Hadoop's RPC mechanism is implemented with dynamic proxies: client code calls the server's methods through a proxy object provided by the RPC class. MapReduce defines a series of protocol interfaces for this RPC communication:

a. JobSubmissionProtocol
b. RefreshUserMappingsProtocol
c. RefreshAuthorizationPolicyProtocol
d. AdminOperationsProtocol
e. InterTrackerProtocol
f. TaskUmbilicalProtocol

The first four protocols are used by clients; the last two are used internally by MapReduce. The getNewJobId() method used here is declared in JobSubmissionProtocol (a sketch of how the client obtains such a proxy follows the interface excerpt):
    /**
     * Allocate a name for the job.
     * @return a unique job name for submitting jobs.
     * @throws IOException
     */
    public JobID getNewJobId() throws IOException;
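As noted above, jobSubmitClient is a dynamic proxy for this protocol. The following is only a simplified sketch of how such a proxy can be obtained with Hadoop 1.x RPC; the real JobClient additionally passes user and socket-factory information, and JobSubmissionProtocol is internal to the org.apache.hadoop.mapred package, so treat this as an illustration rather than the exact implementation:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.ipc.RPC;

    // Conceptual sketch: build a client-side proxy that forwards calls such as
    // getNewJobId() to the JobTracker over Hadoop RPC.
    static JobSubmissionProtocol createProxy(Configuration conf) throws IOException {
      InetSocketAddress addr = JobTracker.getAddress(conf);  // reads mapred.job.tracker
      return (JobSubmissionProtocol) RPC.getProxy(
          JobSubmissionProtocol.class, JobSubmissionProtocol.versionID, addr, conf);
    }

A call such as proxy.getNewJobId() then results in a round trip to the JobTracker, which allocates and returns the next JobID.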
Users use this protocol to submit jobs to the JobTracker, query job status, and so on.

2. Job file upload

JobClient uploads the files required by the job to the JobTracker's file system, usually HDFS, according to the job's configuration. The configuration is maintained by a JobConf object; in the new API the JobConf is part of the JobContext object, and the Job class inherits from JobContext. Before the files can be uploaded, the necessary directories have to be created on HDFS. The upload itself starts with this call in JobClient.submitJobInternal():
    copyAndConfigureFiles(jobCopy, submitJobDir);
After the submission replication factor (mapred.submit.replication, default 10) and related settings have been configured, the main code is as follows (some logging and exception handling is omitted for clarity):
    // Retrieve command line arguments placed into the JobConf
    // by GenericOptionsParser.
    String files = job.get("tmpfiles");
    String libjars = job.get("tmpjars");
    String archives = job.get("tmparchives");
First, the names and paths of the different types of files are read from the configuration; they are specified on the command line (via the Hadoop shell, e.g. the -files, -libjars and -archives options of GenericOptionsParser) when the job is submitted. files are ordinary files the job depends on, such as text files; libjars are third-party jar packages the application depends on; and archives are compressed archives that bundle multiple files used by the application.
    // create a number of filenames in the jobtracker's fs namespace
    FileSystem fs = submitJobDir.getFileSystem(job);
    submitJobDir = fs.makeQualified(submitJobDir);
    FsPermission mapredSysPerms = new FsPermission(JobSubmissionFiles.JOB_DIR_PERMISSION);
    FileSystem.mkdirs(fs, submitJobDir, mapredSysPerms);
    Path filesDir = JobSubmissionFiles.getJobDistCacheFiles(submitJobDir);
    Path archivesDir = JobSubmissionFiles.getJobDistCacheArchives(submitJobDir);
    Path libjarsDir = JobSubmissionFiles.getJobDistCacheLibjars(submitJobDir);
Next, a series of path names are created in the namespace of the JobTracker's file system (typically HDFS), covering the three file types described above. With these path names, the directories are created on HDFS and the files are copied into them with the following code:
    // Add all command line files/jars and archives:
    // first copy them to the JobTracker's filesystem.
    if (files != null) {
      FileSystem.mkdirs(fs, filesDir, mapredSysPerms);
      String[] fileArr = files.split(",");
      for (String tmpFile : fileArr) {
        URI tmpURI;
        tmpURI = new URI(tmpFile);
        Path tmp = new Path(tmpURI);
        Path newPath = copyRemoteFiles(fs, filesDir, tmp, job, replication);
        URI pathURI = getPathURI(newPath, tmpURI.getFragment());
        DistributedCache.addCacheFile(pathURI, job);
        DistributedCache.createSymlink(job);
      }
    }
    if (libjars != null) {
      FileSystem.mkdirs(fs, libjarsDir, mapredSysPerms);
      String[] libjarsArr = libjars.split(",");
      for (String tmpjars : libjarsArr) {
        Path tmp = new Path(tmpjars);
        Path newPath = copyRemoteFiles(fs, libjarsDir, tmp, job, replication);
        DistributedCache.addArchiveToClassPath(
            new Path(newPath.toUri().getPath()), job, fs);
      }
    }
    if (archives != null) {
      FileSystem.mkdirs(fs, archivesDir, mapredSysPerms);
      String[] archivesArr = archives.split(",");
      for (String tmpArchives : archivesArr) {
        URI tmpURI;
        tmpURI = new URI(tmpArchives);
        Path tmp = new Path(tmpURI);
        Path newPath = copyRemoteFiles(fs, archivesDir, tmp, job, replication);
        URI pathURI = getPathURI(newPath, tmpURI.getFragment());
        DistributedCache.addCacheArchive(pathURI, job);
        DistributedCache.createSymlink(job);
      }
    }
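As a reminder of where the tmpfiles, tmpjars and tmparchives properties read at the top of this block come from: GenericOptionsParser records the -files, -libjars and -archives command-line options under exactly these configuration keys on the client. A small sketch, with hypothetical file names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    // Suppose the job was launched as:
    //   hadoop jar wc.jar WordCount -files dict.txt -libjars lucene.jar in out
    Configuration conf = new Configuration();
    String[] rest = new GenericOptionsParser(conf, args).getRemainingArgs();
    System.out.println(conf.get("tmpfiles"));  // e.g. file:/home/user/dict.txt
    System.out.println(conf.get("tmpjars"));   // e.g. file:/home/user/lucene.jar
    // "rest" now contains only the application arguments: in, out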
Note that the upload (and later download) of MapReduce job files is handled by the DistributedCache facility, a data-distribution tool: the files specified by the user are distributed to each TaskTracker that runs the job's tasks (a sketch of how a task reads them back appears a little further below). The details of the tool are not covered here and are left for a future article. Finally, the jar file of the job itself is copied into HDFS:
    String originalJarPath = job.getJar();
    if (originalJarPath != null) {  // copy jar to JobTracker's fs
      // use jar name if job is not named
      if ("".equals(job.getJobName())) {
        job.setJobName(new Path(originalJarPath).getName());
      }
      Path submitJarFile = JobSubmissionFiles.getJobJar(submitJobDir);
      job.setJar(submitJarFile.toString());
      fs.copyFromLocalFile(new Path(originalJarPath), submitJarFile);
      fs.setReplication(submitJarFile, replication);
      fs.setPermission(submitJarFile,
          new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
    }
Note that each time a type of file is uploaded, its path is also recorded in the JobConf object. This is done by the following four lines:
    DistributedCache.addCacheFile(pathURI, job);
    DistributedCache.addArchiveToClassPath(new Path(newPath.toUri().getPath()), job, fs);
    DistributedCache.addCacheArchive(pathURI, job);
    job.setJar(submitJarFile.toString());
These four lines complete that bookkeeping. Incidentally, Hadoop's Path class abstracts paths in the file system on top of java.net.URI [3]. Java's File class and URL class each abstract only part of this; Path can be said to unify them.
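To close the loop on DistributedCache: on the TaskTracker side, a task can read the files registered above from its local cache. A minimal sketch using the new-API Mapper (the class name, key/value types and the idea of a dictionary file are illustrative, not from the original post):

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DictAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Path[] localFiles;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        // Local paths of the cached files after the TaskTracker has localized them.
        localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      }

      // map() would then open localFiles[i] as ordinary local files.
    }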
3. Generate the InputSplit files

JobClient calls InputFormat.getSplits() to generate the InputSplit information from the input files submitted by the user:
    // Create the splits for the job
    FileSystem fs = submitJobDir.getFileSystem(jobCopy);
    int maps = writeSplits(context, submitJobDir);
    jobCopy.setNumMapTasks(maps);
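For intuition about what writeSplits produces, here is a sketch of calling the new-API TextInputFormat.getSplits() directly; the input path is hypothetical and this is not code taken from JobClient:

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    Configuration conf = new Configuration();
    Job job = new Job(conf);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));  // hypothetical path
    // With default settings, roughly one split is produced per HDFS block of each file.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("number of map tasks = " + splits.size());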
In the JobClient snippet above, jobCopy is a JobConf object. The writeSplits method ultimately calls InputFormat.getSplits() to generate the split information and writes both the raw split data and the split meta-information to the corresponding files under the submit directory on HDFS. How splits are generated will be studied later and is not expanded on here. Finally, the JobConf corresponding to the job is written to HDFS as an XML configuration file:
    // Write job file to JobTracker's fs
    FSDataOutputStream out =
        FileSystem.create(fs, submitJobFile,
            new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
    try {
      jobCopy.writeXml(out);
    } finally {
      out.close();
    }
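The job.xml written here is what the TaskTracker later reads back to reconstruct the job's configuration when it localizes the job. A sketch of loading such a file into a JobConf (the local path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Load only the serialized job configuration, without the default resources.
    Configuration c = new Configuration(false);
    c.addResource(new Path("/local/taskTracker/jobcache/job_xxx/job.xml"));  // hypothetical local copy
    JobConf restored = new JobConf(c);
    System.out.println(restored.getJobName());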
At this point, the job file upload is officially completed.
Next, the job is submitted to the JobTracker. Please continue with the following article: MapReduce Scheduling and Execution Principles: Job Initialization.
