Hadoop jobs reference third-party jar files


When you write a MapReduce program in Eclipse that references third-party jar files, you can use the Eclipse Hadoop plug-in to submit it and run it on Hadoop directly, which is very convenient. However, the plug-in version must match the Eclipse version; otherwise the job always executes locally, and no job shows up in the web UI on port 50070.

If you instead package the program into a jar file and run it from the command line on the namenode, Eclipse is no longer there to configure the jar files for you, and you may hit java.lang.ClassNotFoundException. This can be discussed in two cases.

I. How is the hadoop command executed?

In fact, $HADOOP_HOME/bin/hadoop is a shell script. Take the following wordcount command as an example.

bin/hadoop jar wordcount.jar myorg.WordCount /usr/wordcount/input /usr/wordcount/output

The script parses the arguments, sets up the classpath, and finally executes the following command:

exec java -classpath $CLASSPATH org.apache.hadoop.util.RunJar $@

Here:

  • $CLASSPATH: includes ${HADOOP_CONF_DIR}, the jar files under $HADOOP_HOME, and $HADOOP_CLASSPATH;
  • $@: all the script arguments, i.e. everything after jar in the command above;
  • RunJar: a relatively simple class. It unpacks the jar file into the "hadoop.tmp.dir" directory and then runs the class we specified, myorg.WordCount (a simplified sketch of the idea follows the list).
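
Below is a simplified, illustrative sketch of the idea behind RunJar, not the actual Hadoop source: it loads the job jar through a URLClassLoader and invokes the requested main class reflectively (the unpacking into "hadoop.tmp.dir" mentioned above is omitted here).

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;

// Simplified sketch of the idea behind org.apache.hadoop.util.RunJar (not the real source).
public class RunJarSketch {
    public static void main(String[] args) throws Exception {
        String jarPath = args[0];        // e.g. wordcount.jar
        String mainClassName = args[1];  // e.g. myorg.WordCount
        String[] jobArgs = Arrays.copyOfRange(args, 2, args.length);

        // The real RunJar first unpacks the jar into "hadoop.tmp.dir" so that classes
        // and jars bundled under lib/ also become visible; that part is omitted here.
        URL[] classpath = { new File(jarPath).toURI().toURL() };
        URLClassLoader loader = new URLClassLoader(classpath, RunJarSketch.class.getClassLoader());
        Thread.currentThread().setContextClassLoader(loader);

        Class<?> mainClass = Class.forName(mainClassName, true, loader);
        Method mainMethod = mainClass.getMethod("main", String[].class);
        mainMethod.invoke(null, (Object) jobArgs);  // e.g. calls myorg.WordCount.main(jobArgs)
    }
}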

P.S. For a complete analysis of the hadoop script, see <Hadoop job submission analysis>.

After RunJar starts WordCount, execution enters our own program: we configure the mapper, the reducer, and the output path, and finally submit the job to the JobTracker by calling job.waitForCompletion(true).
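
For reference, a minimal version of such a wordcount program (essentially the canonical Hadoop example; exact API details vary with the Hadoop version) might look like this:

package myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);           // lets Hadoop locate the packaged jar
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /usr/wordcount/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /usr/wordcount/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait for the job
    }
}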

Everything up to this point runs locally on the client. If a ClassNotFoundException occurs during this phase, you can set $HADOOP_CLASSPATH in your own wrapper script to include the required third-party jar files, for example by adding something like export HADOOP_CLASSPATH=/path/to/third-party.jar:$HADOOP_CLASSPATH before invoking bin/hadoop, and then run the hadoop command again. This is case 1.

 

II. How do the JobTracker and TaskTracker obtain third-party jar files?

Sometimes, even after a job has been submitted successfully, a ClassNotFoundException is still thrown inside the map or reduce function. This is because the map and reduce tasks may execute on other machines, and those machines do not have the jar files. MapReduce jobs are executed by the JobTracker and the TaskTrackers, so how do they obtain the third-party jar files? This is case 2.

First, analyze the MapReduce job submission process, step by step.

Steps 1 and 2: submit the job through the Job class, obtain a job ID, and decide from the conf whether the job goes to the LocalJobRunner or to the JobTracker.
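
A rough sketch of that decision, assuming the classic "mapred.job.tracker" property from JobTracker-era (Hadoop 1.x style) configurations:

import org.apache.hadoop.conf.Configuration;

// Sketch of the step-2 decision: if mapred.job.tracker is unset or "local",
// the job goes to the LocalJobRunner; otherwise it is submitted to the
// JobTracker at the configured address.
public class SubmitTargetSketch {
    static boolean runsLocally(Configuration conf) {
        return "local".equals(conf.get("mapred.job.tracker", "local"));
    }
}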

Step 3: copy job resources.

The client uploads the resources the job needs to HDFS, such as the job splits and jar files. JobClient handles jar files in its configureCommandLineOptions method, where the relevant parameters are read from the job configuration:

files = job.get("tmpfiles");        // corresponds to the -files parameter
libjars = job.get("tmpjars");       // corresponds to -libjars
archives = job.get("tmparchives");  // corresponds to -archives

If jar files are configured, they are added to the distributed cache (DistributedCache). Take -libjars as an example:

if (libjars != null) {
    FileSystem.mkdirs(fs, libjarsDir, mapredSysPerms);
    String[] libjarsArr = libjars.split(",");
    for (String tmpjars : libjarsArr) {
        Path tmp = new Path(tmpjars);
        Path newPath = copyRemoteFiles(fs, libjarsDir, tmp, job, replication);
        DistributedCache.addArchiveToClassPath(newPath, job);
    }
}

In addition, a MapReduce program should always call job.setJarByClass to specify the class to run, so that Hadoop can locate the jar file containing that class, i.e. the packaged job jar, and upload it to HDFS. The JobClient thus completes the resource copying, and these resources can then be used by the JobTracker and the TaskTrackers.

Steps 4-10: the JobClient submits the job and the job is executed (the work done by the JobTracker and TaskTracker is not expanded here; see <Map-Reduce Process Analysis>).

 

III. Summary

To let a MapReduce program reference third-party jar files, use one of the following methods:
  1. Pass the jar files as command-line parameters, for example with -libjars;
  2. Set them directly in the conf, for example conf.set("tmpjars", *.jar), with the jar files separated by commas (a sketch of methods 2 and 3 follows this list);
  3. Use the distributed cache, for example DistributedCache.addArchiveToClassPath(path, job); the path here must be an HDFS path, i.e. upload the jar file to HDFS first and then add that path to the distributed cache;
  4. Package the third-party jar files together with your own program into a single jar file; the program picks up the whole file via job.getJar() and transfers it to HDFS (very bulky);
  5. Add the jar files to the $HADOOP_HOME/lib directory of every machine (not recommended).
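
The following is a rough illustration of methods 2 and 3 under the older JobTracker-era API used in this article; the jar paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Sketch of methods 2 and 3 from the list above; jar paths are placeholders.
public class ThirdPartyJarsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Method 2: set "tmpjars" directly (comma-separated); this is the same
        // configuration key that -libjars fills in.
        conf.set("tmpjars", "/path/to/first.jar,/path/to/second.jar");

        // Method 3: upload the jar to HDFS first, then add its HDFS path to the
        // distributed cache so it ends up on the task classpath.
        DistributedCache.addArchiveToClassPath(new Path("/lib/third-party.jar"), conf);

        Job job = new Job(conf, "my job");
        // ... set mapper, reducer, input/output paths as usual, then submit ...
        // job.waitForCompletion(true);
    }
}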

P.S. If you use method 1 or 2 above, pay attention to the Configuration object: use the getConf() function instead of creating a new Configuration yourself, as shown in the sketch below.
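
As an illustration of this point, a driver can extend Configured and implement Tool, so that the options parsed by ToolRunner/GenericOptionsParser (such as -libjars) end up in the Configuration returned by getConf(); class and path names below are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver sketch: ToolRunner parses generic options such as -libjars and puts the
// result into the Configuration that getConf() returns, so do not create a fresh
// Configuration inside run().
public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // NOT "new Configuration()", or -libjars is lost
        Job job = new Job(conf, "my job");
        job.setJarByClass(MyDriver.class);
        // ... set mapper, reducer, key/value classes here ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}

Such a driver would then be invoked along the lines of: bin/hadoop jar myjob.jar MyDriver -libjars /path/to/third-party.jar <input> <output>.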

