MapReduce Scheduling and Execution Principles: Series Articles

Transferred from: http://blog.csdn.net/jaytalent?viewmode=contents

Articles in the MapReduce Scheduling and Execution Principles series:

I. MapReduce Scheduling and Execution Principles: Job Submission

II. MapReduce Scheduling and Execution Principles: Job Initialization

III. MapReduce Scheduling and Execution Principles: Task Scheduling

IV. MapReduce Scheduling and Execution Principles: Task Scheduling (cont.)

The purpose of this article is to trace the entire life cycle of a MapReduce job in Hadoop after it is submitted to the framework, and to summarize it for later reference; corrections are welcome. The article does not address Hadoop's overall architecture design; please refer to the relevant books and literature if you are interested. Along the way, parts of the source code of interest are examined in order to strengthen the fundamentals.
Author: jaytalent
Start date: September 9, 2013

References:
[1] Hadoop Technology Insider: In-Depth Analysis of MapReduce Architecture Design and Implementation Principles, Dong Xicheng
[2] Hadoop 1.0.0 source code
[3] Hadoop Technology Insider: In-Depth Analysis of Hadoop Common and HDFS Architecture Design and Implementation Principles, Cai Bin, Chen Yiping

The life cycle of a MapReduce job is broadly divided into five stages [1]:
1. Job submission and initialization
2. Task scheduling and monitoring
3. Task runtime environment preparation
4. Task execution
5. Job completion

Because job submission is done on the client side while initialization is done in the JobTracker, this article covers only the former; initialization is left to the next article.

I. Job Submission and Initialization

Take the WordCount job as an example. First look at the code snippet that submits the job:
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
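For context, here is a minimal sketch of the driver code that surrounds this snippet; it follows the shape of the standard WordCount example, and the usage message and argument handling are illustrative rather than taken from the original post:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // GenericOptionsParser consumes the generic Hadoop options and leaves
        // the application arguments, here the input and output paths.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        // ... the job configuration shown above goes here ...
      }
    }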
The new MapReduce API is used here. The job.waitForCompletion(true) call starts the job submission process: it calls Job.submit(), which in turn calls JobClient.submitJobInternal() to actually submit the job. JobClient.submitJobInternal() performs the following preparatory steps.

1. Get a job ID
    JobID jobId = jobSubmitClient.getNewJobId();
The job ID is obtained from the JobTracker through an RPC call; getNewJobId() is declared in the JobSubmissionProtocol interface, and jobSubmitClient is a proxy for that protocol:
    private JobSubmissionProtocol jobSubmitClient;
Hadoop's RPC mechanism is implemented with dynamic proxies: client code calls the server's methods through a proxy object provided by the RPC class. MapReduce defines a series of protocol interfaces for this RPC communication:

a. JobSubmissionProtocol
b. RefreshUserMappingsProtocol
c. RefreshAuthorizationPolicyProtocol
d. AdminOperationsProtocol
e. InterTrackerProtocol
f. TaskUmbilicalProtocol

The first four protocols are used by clients; the last two are used internally by MapReduce. The getNewJobId() method used here is declared in JobSubmissionProtocol (a sketch of how the client obtains such a proxy follows the interface excerpt):
    /**
     * Allocate a name for the job.
     * @return a unique job name for submitting jobs.
     * @throws IOException
     */
    public JobID getNewJobId() throws IOException;
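As noted above, jobSubmitClient is a dynamic proxy for this protocol. The following is only a simplified sketch of how such a proxy can be obtained with Hadoop 1.x RPC; the real JobClient additionally passes user and socket-factory information, and JobSubmissionProtocol is internal to the org.apache.hadoop.mapred package, so treat this as an illustration rather than the exact implementation:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.ipc.RPC;

    // Conceptual sketch: build a client-side proxy that forwards calls such as
    // getNewJobId() to the JobTracker over Hadoop RPC.
    static JobSubmissionProtocol createProxy(Configuration conf) throws IOException {
      InetSocketAddress addr = JobTracker.getAddress(conf);  // reads mapred.job.tracker
      return (JobSubmissionProtocol) RPC.getProxy(
          JobSubmissionProtocol.class, JobSubmissionProtocol.versionID, addr, conf);
    }

A call such as proxy.getNewJobId() then results in a round trip to the JobTracker, which allocates and returns the next JobID.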
Users use this protocol to submit jobs to the JobTracker, query job status, and so on.

2. Job file upload

JobClient uploads the files required by the job to the JobTracker's file system, usually HDFS, according to the job's configuration. The configuration is maintained by a JobConf object; in the new API the JobConf is part of the JobContext object, and the Job class inherits from JobContext. Before the files can be uploaded, the necessary directories have to be created on HDFS. The upload itself starts with this call in JobClient.submitJobInternal():
    copyAndConfigureFiles(jobCopy, submitJobDir);
After the submission replication factor (mapred.submit.replication, default 10) and related settings have been configured, the main code is as follows (some logging and exception handling is omitted for clarity):
    // Retrieve command line arguments placed into the JobConf
    // by GenericOptionsParser.
    String files = job.get("tmpfiles");
    String libjars = job.get("tmpjars");
    String archives = job.get("tmparchives");
First, the names and paths of the different types of files are read from the configuration; they are specified on the command line (via the Hadoop shell, e.g. the -files, -libjars and -archives options of GenericOptionsParser) when the job is submitted. files are ordinary files the job depends on, such as text files; libjars are third-party jar packages the application depends on; and archives are compressed archives that bundle multiple files used by the application.
    // create a number of filenames in the jobtracker's fs namespace
    FileSystem fs = submitJobDir.getFileSystem(job);
    submitJobDir = fs.makeQualified(submitJobDir);
    FsPermission mapredSysPerms = new FsPermission(JobSubmissionFiles.JOB_DIR_PERMISSION);
    FileSystem.mkdirs(fs, submitJobDir, mapredSysPerms);
    Path filesDir = JobSubmissionFiles.getJobDistCacheFiles(submitJobDir);
    Path archivesDir = JobSubmissionFiles.getJobDistCacheArchives(submitJobDir);
    Path libjarsDir = JobSubmissionFiles.getJobDistCacheLibjars(submitJobDir);
Next, a series of path names are created in the namespace of the JobTracker's file system (typically HDFS), covering the three file types described above. With these path names, the directories are created on HDFS and the files are copied into them with the following code:
    // Add all command line files/jars and archives:
    // first copy them to the JobTracker's filesystem.
    if (files != null) {
      FileSystem.mkdirs(fs, filesDir, mapredSysPerms);
      String[] fileArr = files.split(",");
      for (String tmpFile : fileArr) {
        URI tmpURI;
        tmpURI = new URI(tmpFile);
        Path tmp = new Path(tmpURI);
        Path newPath = copyRemoteFiles(fs, filesDir, tmp, job, replication);
        URI pathURI = getPathURI(newPath, tmpURI.getFragment());
        DistributedCache.addCacheFile(pathURI, job);
        DistributedCache.createSymlink(job);
      }
    }
    if (libjars != null) {
      FileSystem.mkdirs(fs, libjarsDir, mapredSysPerms);
      String[] libjarsArr = libjars.split(",");
      for (String tmpjars : libjarsArr) {
        Path tmp = new Path(tmpjars);
        Path newPath = copyRemoteFiles(fs, libjarsDir, tmp, job, replication);
        DistributedCache.addArchiveToClassPath(
            new Path(newPath.toUri().getPath()), job, fs);
      }
    }
    if (archives != null) {
      FileSystem.mkdirs(fs, archivesDir, mapredSysPerms);
      String[] archivesArr = archives.split(",");
      for (String tmpArchives : archivesArr) {
        URI tmpURI;
        tmpURI = new URI(tmpArchives);
        Path tmp = new Path(tmpURI);
        Path newPath = copyRemoteFiles(fs, archivesDir, tmp, job, replication);
        URI pathURI = getPathURI(newPath, tmpURI.getFragment());
        DistributedCache.addCacheArchive(pathURI, job);
        DistributedCache.createSymlink(job);
      }
    }
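As a reminder of where the tmpfiles, tmpjars and tmparchives properties read at the top of this block come from: GenericOptionsParser records the -files, -libjars and -archives command-line options under exactly these configuration keys on the client. A small sketch, with hypothetical file names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    // Suppose the job was launched as:
    //   hadoop jar wc.jar WordCount -files dict.txt -libjars lucene.jar in out
    Configuration conf = new Configuration();
    String[] rest = new GenericOptionsParser(conf, args).getRemainingArgs();
    System.out.println(conf.get("tmpfiles"));  // e.g. file:/home/user/dict.txt
    System.out.println(conf.get("tmpjars"));   // e.g. file:/home/user/lucene.jar
    // "rest" now contains only the application arguments: in, out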
Note that the upload (and later download) of MapReduce job files is handled by the DistributedCache facility, a data-distribution tool: the files specified by the user are distributed to each TaskTracker that runs the job's tasks (a sketch of how a task reads them back appears a little further below). The details of the tool are not covered here and are left for a future article. Finally, the jar file of the job itself is copied into HDFS:
    String originalJarPath = job.getJar();
    if (originalJarPath != null) {  // copy jar to JobTracker's fs
      // use jar name if job is not named
      if ("".equals(job.getJobName())) {
        job.setJobName(new Path(originalJarPath).getName());
      }
      Path submitJarFile = JobSubmissionFiles.getJobJar(submitJobDir);
      job.setJar(submitJarFile.toString());
      fs.copyFromLocalFile(new Path(originalJarPath), submitJarFile);
      fs.setReplication(submitJarFile, replication);
      fs.setPermission(submitJarFile,
          new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
    }
Note that each time a type of file is uploaded, its path is also recorded in the JobConf object. This is done by the following four lines:
    DistributedCache.addCacheFile(pathURI, job);
    DistributedCache.addArchiveToClassPath(new Path(newPath.toUri().getPath()), job, fs);
    DistributedCache.addCacheArchive(pathURI, job);
    job.setJar(submitJarFile.toString());
These four lines complete that bookkeeping. Incidentally, Hadoop's Path class abstracts paths in the file system on top of java.net.URI [3]. Java's File class and URL class each abstract only part of this; Path can be said to unify them.
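To close the loop on DistributedCache: on the TaskTracker side, a task can read the files registered above from its local cache. A minimal sketch using the new-API Mapper (the class name, key/value types and the idea of a dictionary file are illustrative, not from the original post):

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DictAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Path[] localFiles;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        // Local paths of the cached files after the TaskTracker has localized them.
        localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      }

      // map() would then open localFiles[i] as ordinary local files.
    }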
3. Generate the InputSplit files

JobClient calls InputFormat.getSplits() to generate the InputSplit information from the input files submitted by the user:
    // Create the splits for the job
    FileSystem fs = submitJobDir.getFileSystem(jobCopy);
    int maps = writeSplits(context, submitJobDir);
    jobCopy.setNumMapTasks(maps);
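For intuition about what writeSplits produces, here is a sketch of calling the new-API TextInputFormat.getSplits() directly; the input path is hypothetical and this is not code taken from JobClient:

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    Configuration conf = new Configuration();
    Job job = new Job(conf);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));  // hypothetical path
    // With default settings, roughly one split is produced per HDFS block of each file.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("number of map tasks = " + splits.size());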
In the JobClient snippet above, jobCopy is a JobConf object. The writeSplits method ultimately calls InputFormat.getSplits() to generate the split information and writes both the raw split data and the split meta-information to the corresponding files under the submit directory on HDFS. How splits are generated will be studied later and is not expanded on here. Finally, the JobConf corresponding to the job is written to HDFS as an XML configuration file:
    // Write job file to JobTracker's fs
    FSDataOutputStream out =
        FileSystem.create(fs, submitJobFile,
            new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
    try {
      jobCopy.writeXml(out);
    } finally {
      out.close();
    }
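The job.xml written here is what the TaskTracker later reads back to reconstruct the job's configuration when it localizes the job. A sketch of loading such a file into a JobConf (the local path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Load only the serialized job configuration, without the default resources.
    Configuration c = new Configuration(false);
    c.addResource(new Path("/local/taskTracker/jobcache/job_xxx/job.xml"));  // hypothetical local copy
    JobConf restored = new JobConf(c);
    System.out.println(restored.getJobName());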
At this point, the job file upload is officially completed.
Next, the job is submitted to the JobTracker. Please continue with the following article: MapReduce Scheduling and Execution Principles: Job Initialization.
