Hadoop job submission analysis (4)

Source: http://www.cnblogs.com/spork/archive/2010/04/21/1717552.html

The previous analysis was only a prelude to Hadoop job submission; the real job submission code lives in the main method of the MR program, which RunJar dynamically invokes at the end, as we saw in part (2). What we want to do now is go one step further than RunJar, so that a job can be submitted straight from code during development, just as the Hadoop Eclipse Plugin lets you Run on Hadoop any class containing a Mapper and a Reducer.

Almost every MR program contains job submission code along these lines. We will use the WordCount example:

    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);

The first step is to build a Configuration object and parse the command-line arguments. Then a Job object is constructed for the submission, and the job's jar, the Mapper, Combiner and Reducer classes, the output key and value classes, and the input and output paths are set. Finally, the job is submitted and we wait for it to finish. These are only the basic configuration parameters; many more are supported, which we will not cover here. See the API documentation for details.

Code walkthroughs usually start from the beginning, but our focus is what happens during the submission itself. So we ignore the effect of the earlier setters on the job and jump straight to the submission step; we can always come back to the earlier code when a question comes up.

When job.waitForCompletion is called, it internally calls submit to submit the job; if the argument is true it also prints the job's progress as it runs, otherwise it simply waits for the job to finish. Inside submit there is one more layer: the job's internal JobClient object submits the job through its submitJobInternal method, and that is where the real work starts. It first obtains a jobId via the jobSubmitClient object. The class behind jobSubmitClient is an implementation of JobSubmissionProtocol, of which there are currently two, JobTracker and LocalJobRunner, so jobSubmitClient must be one or the other. Hmm, that is suggestive: does it mean a job can either be submitted to the JobTracker or run locally? The way to settle it is to check which class instance jobSubmitClient is given at initialization, which we will look at in a moment. Reading further, you will find that submitJobInternal submits the job with jobSubmitClient.submitJob(jobId), and a glance at the submitJob implementations of JobTracker and LocalJobRunner confirms that this is indeed the case.
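For orientation, here is roughly what the two methods look like in the 0.20.x Job class; it is trimmed and paraphrased from memory, so read it as a sketch rather than an exact excerpt:

    // org.apache.hadoop.mapreduce.Job, trimmed
    public boolean waitForCompletion(boolean verbose)
            throws IOException, InterruptedException, ClassNotFoundException {
        if (state == JobState.DEFINE) {
            submit();                                    // hand the job off for submission
        }
        if (verbose) {
            jobClient.monitorAndPrintJob(conf, info);    // print progress while waiting
        } else {
            info.waitForCompletion();                    // just block until the job finishes
        }
        return isSuccessful();
    }

    public void submit() throws IOException, InterruptedException, ClassNotFoundException {
        ensureState(JobState.DEFINE);
        setUseNewAPI();
        info = jobClient.submitJobInternal(conf);        // the method we dig into below
        state = JobState.RUNNING;
    }

With that chain in mind, let's jump back and see how jobSubmitClient is initialized. In JobClient's init method we find the initialization statement: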

    String tracker = conf.get("mapred.job.tracker", "local");
    if ("local".equals(tracker)) {
        this.jobSubmitClient = new LocalJobRunner(conf);
    } else {
        this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
    }

Ha, so it depends on the mapred.job.tracker property in conf. If you do not set it, the default is "local" and jobSubmitClient is assigned a LocalJobRunner instance. During development we normally only reference the jars under lib and do not put the configuration files from the conf folder on the classpath, which explains why a plain Run As Java Application submits the job to the local runner instead of the Hadoop cluster. Does that mean we can Run on Hadoop simply by adding the conf folder to the classpath? It is too early to draw that conclusion; we will keep analyzing (if you add the conf folder and try a submission now, an obvious error will pop up and tell you what is still missing, but I will keep you in suspense for the moment).
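As an aside, a quick illustrative way to see that this property really is the switch is to set it by hand before creating the JobClient; the host names below are placeholders, not values from the original article:

    Configuration conf = new Configuration();
    // Placeholders: point these at your own NameNode and JobTracker.
    conf.set("fs.default.name", "hdfs://namenode-host:9000");
    conf.set("mapred.job.tracker", "jobtracker-host:9001");
    // With mapred.job.tracker set, JobClient.init builds an RPC proxy to the JobTracker;
    // left at the default "local", it falls back to LocalJobRunner.
    JobClient client = new JobClient(new JobConf(conf));

This is only to illustrate the switch; the rest of the article sticks with the conf-folder approach.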

After the jobId is obtained, it is appended to SystemDir to build the submitJobDir directory used for the submission. SystemDir comes from JobClient's getSystemDir method, and it matters a great deal when the fs object is constructed, because it determines the type of fs that is returned. The configureCommandLineOptions method uploads the third-party libraries and files the job depends on to fs, adds them to the classpath or creates symlinks for them, and sets a few parameters; we will not analyze it in detail here. We mainly care about two things. The first is this line:

    FileSystem fs = getFs();

It looks simple enough: in one line it gets a FileSystem instance. But it deserves a closer look, because Hadoop abstracts the file system, and the concrete type of the fs instance obtained here decides whether you end up operating on HDFS or on the local file system. Let's take a look.

    public synchronized FileSystem getFs() throws IOException {
        if (this.fs == null) {
            Path sysDir = getSystemDir();
            this.fs = sysDir.getFileSystem(getConf());
        }
        return fs;
    }

As you can see, fs comes from sysDir.getFileSystem. Let's keep digging; for brevity, only the key statements along the way are listed.

    FileSystem.get(this.toUri(), conf);

    CACHE.get(uri, conf);

    fs = createFileSystem(uri, conf);

    Class<?> clazz = conf.getClass("fs." + uri.getScheme() + ".impl", null);
    if (clazz == null) {
        throw new IOException("No FileSystem for scheme: " + uri.getScheme());
    }
    FileSystem fs = (FileSystem) ReflectionUtils.newInstance(clazz, conf);
    fs.initialize(uri, conf);
    return fs;

So it comes back to conf again; clearly we have to keep an eye on conf the whole way. Java reflection is used to instantiate the concrete class dynamically, and which class is chosen depends on uri.getScheme, where the uri is built from sysDir. The value of sysDir is ultimately decided by the jobSubmitClient instance: if jobSubmitClient is a JobTracker proxy the scheme is hdfs, and if it is a LocalJobRunner the scheme is file. In core-default.xml you can find the corresponding mappings:

    <property>
        <name>fs.file.impl</name>
        <value>org.apache.hadoop.fs.LocalFileSystem</value>
        <description>The FileSystem for file: uris.</description>
    </property>

    <property>
        <name>fs.hdfs.impl</name>
        <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
        <description>The FileSystem for hdfs: uris.</description>
    </property>
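If you want to confirm which implementation these mappings resolve to in your own environment, a small illustrative check is:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);   // resolves fs.default.name, then fs.<scheme>.impl
    System.out.println(fs.getClass().getName());
    // Prints org.apache.hadoop.fs.LocalFileSystem when no cluster configuration is on the
    // classpath, and org.apache.hadoop.hdfs.DistributedFileSystem when core-site.xml points
    // the default file system at HDFS.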

So by the time we reach the job submission code, a lot has already been decided when the Job instance is initialized, and it is decided by the configuration files in the conf folder. Configuration loads classes and resource files through the class loader of the current thread's context. Therefore, to Run on Hadoop, the first step is to put the conf folder on the search path of the class loader that Configuration uses, that is, the current thread's context class loader.
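A minimal sketch of that first step, assuming the cluster's conf folder (core-site.xml, hdfs-site.xml, mapred-site.xml) sits at ./conf relative to the working directory:

    // Put ./conf on the class path that Configuration searches, i.e. the current
    // thread's context class loader (the directory must exist so the URL ends with "/").
    URL confUrl = new File("conf").toURI().toURL();
    ClassLoader parent = Thread.currentThread().getContextClassLoader();
    Thread.currentThread().setContextClassLoader(new URLClassLoader(new URL[] { confUrl }, parent));

    // Configurations created after this point pick up the *-site.xml files from ./conf,
    // so mapred.job.tracker and fs.default.name now point at the cluster instead of the defaults.
    Configuration conf = new Configuration();

This needs java.io.File, java.net.URL and java.net.URLClassLoader; alternatively, you can simply add the conf folder to the run configuration's classpath in Eclipse.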

The second thing to note is:

    String originalJarPath = job.getJar();

    if (originalJarPath != null) {           // copy jar to JobTracker's fs
        // use jar name if job is not named.
        if ("".equals(job.getJobName())) {
            job.setJobName(new Path(originalJarPath).getName());
        }
        job.setJar(submitJarFile.toString());
        fs.copyFromLocalFile(new Path(originalJarPath), submitJarFile);
        fs.setReplication(submitJarFile, replication);
        fs.setPermission(submitJarFile, new FsPermission(JOB_FILE_PERMISSION));
    } else {
        LOG.warn("No job jar file set. User classes may not be found. " +
                 "See JobConf(Class) or JobConf#setJar(String).");
    }

When submitting a job to Hadoop, the client has to package the job into a jar and copy it to the submitJarFile path on fs. So if you want to Run on Hadoop, you must pack the job's class files into a jar before submitting. In Eclipse this is easy: assuming automatic compilation is enabled, we can add a bit of code at the start of the program that packages the class files in the bin folder into a jar, and then carry on with the usual submission steps.
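Here is a minimal sketch of such a packaging step. JarPacker and packClasses are hypothetical names, not part of Hadoop or of the original JobUtil; the sketch assumes Eclipse has compiled the classes into ./bin:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.URI;
    import java.nio.file.Files;
    import java.util.jar.JarEntry;
    import java.util.jar.JarOutputStream;

    // Hypothetical helper: packs a directory of compiled classes into a temporary jar.
    public final class JarPacker {

        public static String packClasses(File classDir) throws IOException {
            File jarFile = File.createTempFile("runonhadoop-", ".jar");
            jarFile.deleteOnExit();
            JarOutputStream jar = new JarOutputStream(new FileOutputStream(jarFile));
            try {
                addEntries(jar, classDir, classDir.toURI());
            } finally {
                jar.close();
            }
            return jarFile.getAbsolutePath();
        }

        private static void addEntries(JarOutputStream jar, File file, URI root) throws IOException {
            if (file.isDirectory()) {
                for (File child : file.listFiles()) {
                    addEntries(jar, child, root);
                }
            } else {
                // entry names are paths relative to the class directory, e.g. "WordCount.class"
                jar.putNextEntry(new JarEntry(root.relativize(file.toURI()).getPath()));
                Files.copy(file.toPath(), jar);
                jar.closeEntry();
            }
        }
    }

Before constructing the Job, point it at the freshly built jar, for example with conf.set("mapred.jar", JarPacker.packClasses(new File("bin"))); mapred.jar is the property that JobConf.setJar writes and that job.getJar() reads in the snippet above.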

After configureCommandLineOptions, submitJobInternal checks whether the output folder already exists and throws an exception if it does. It then splits the job's input, derives the number of map tasks from the number of splits, writes the job configuration file to submitJobFile, and finally calls jobSubmitClient.submitJob(jobId) to submit the job.
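Paraphrased, those last steps look roughly like this inside submitJobInternal (trimmed from memory of the 0.20.x JobClient; writeSplits stands in for the actual split-writing code, so treat this as a sketch rather than an exact excerpt):

    job.getOutputFormat().checkOutputSpecs(fs, job);   // throws if the output directory already exists
    int maps = writeSplits(job, submitSplitFile);      // split the input; one map task per split
    job.setNumMapTasks(maps);

    // persist the final job configuration next to the jar and the splits
    FSDataOutputStream out =
        FileSystem.create(fs, submitJobFile, new FsPermission(JOB_FILE_PERMISSION));
    job.writeXml(out);
    out.close();

    JobStatus status = jobSubmitClient.submitJob(jobId);   // JobTracker or LocalJobRunner takes over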

That more or less wraps up the analysis of Hadoop job submission. Some parts were long-winded and others ended rather abruptly, but the overall flow and the important points should now be clear, and that is really all there is to it. Next, we will add Run on Hadoop support on top of the earlier JobUtil; the main thing to add is a method that packages the jar.

To be continued...
