Hadoop job submission analysis (1)

Last Update:2018-12-07 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Http://www.cnblogs.com/spork/archive/2010/04/07/1706162.html

   Bin/Hadoop jar xxx.Jar mainclass ARGs
......

This command has already called NN. After you write a project or modify the project, you must create a jar package, then, use the above command to submit it to the hadoop cluster for running. This is extremely cumbersome in the development phase.ProgramThe employee is "lazy". Since it is troublesome, it must be necessary to reduce unnecessary keyboard percussion and extend the keyboard life. For example, some people write shell scripts to automatically compile, package, and submit them to hadoop. But it is still a little troublesome. The most convenient method is to use hadoop eclipse plugin to browse and manage HDFS and automatically create the Mr program template file. The best thing is to run on hadoop directly, however, the version cannot keep up with the master version of hadoop. The current Mr template is 0.19. Another software called hadoop studio seems to be quite powerful, but I have never tried it. I will not comment on it here. So how do they use the above command to submit jobs? Don't know? It doesn't matter. If you are open-source, you can directly look at the source code analysis. This is the biggest advantage of open-source software.

First, we will analyze the shell script bin/hadoop to see what the script has done internally and how to submit hadoop jobs.

Because it is a Java program, this script will eventually call Java to run, so the most important thing about this script is to add some pre-parameters, such as classpath. Therefore, we directly jump to the last line of the script to see if it has added those parameters, then analyze them one by one (this article ignores the analysis of environment parameter loading, Java search, and cygwin processing in the script ).

   code highlighting produced by actipro codehighlighter (freeware) 
 http://www.CodeHighlighter.com/
 --> #    RUN   it 
 exec  "   $ Java  "    $   java_heap_max  $   hadoop_opts-classpath  "   $ classpath  "    $   class  "  $ @  "

From the above command, we can see that this script has finally added the following important parameters: java_heap_max, hadoop_opts, classpath, class. The following is an analysis (this article is based on cloudera hadoo P 0.20.1 + 152 analysis ).

The first is java_heap_max, which is relatively simple and mainly involvesCodeAs follows:

  Java_heap_max  =  -Xmx1000m
  #  Check envvars which might override default ARGs
  If  [  " $ Hadoop_heapsize  "  !  =     ""  ]  ;     Then  
  #  Echo     "  Run with heapsize $ hadoop_heapsize  "  
Java_heap_max =  "  -Xmx  ""  $ Hadoop_heapsize  ""  M  "  
  #  Echo     $  Java_heap_max
Fi

First, assign the default value-xmx 1000 m And then check if hadoop_heapsize is set and exported in the hadoop-env.sh, if so, overwrite it to get the final java_heap_max.

The next step is to analyze classpath, which is one of the key points of this script. This section adds the corresponding dependent libraries and configuration files to the classpath.

  #  First, use the hadoop configuration file directory to initialize classpath.
Classpath  =  "  $ {Hadoop_conf_dir}  "  
......
  # For the hadoop release, add the hadoop core jar package and webapps to classpath.
  If  [-D  "  $ Hadoop_home/webapps  "  ]  ;     Then  
Classpath  = $  {Classpath }:  $  Hadoop_home
Fi
  For F In  $  Hadoop_home  /  Hadoop-*-core  .  Jar  ;     Do  
Classpath  = $  {Classpath }:  $  F  ;  
Done
  # Add the jar package in libs
  For  F In  $  Hadoop_home  /  Lib  /  *  .  Jar  ;     Do  
Classpath  = $  {Classpath }:  $ F  ;  
Done
  For  F In  $  Hadoop_home  /  Lib  /  JSP-  2.1  /  *  .  Jar  ;    Do  
Classpath  = $  {Classpath }:  $  F  ;  
Done
  #  The following tool_path is added to classpath only when the command is "ARCHIVE ".
  For  F In  $  Hadoop_home  /  Hadoop-*-Tools  . Jar  ;     Do  
Tool_path  = $  {Tool_path }:  $  F  ;  
Done
  For  F In  $  Hadoop_home  /  Build  / Hadoop-*-Tools  .  Jar  ;     Do  
Tool_path  = $  {Tool_path }:  $  F  ;  
Done
  #  Finally, add your custom hadoop classpath.
  If  [ "  $ Hadoop_classpath  "  !  =     ""  ]  ;     Then  
Classpath  = $  {Classpath }:  $  {Hadoop_classpath}
Fi

The above analysis is only a part. Due to the long code, the classpath for the developer section is not listed.

The following is the focus and entity of this script: Class Analysis. The shell script sets the class and hadoop_opts according to the command parameters you enter. The class points to the class which is the entity that actually executes your command.

  #  Figure out which class  Run  
  If  [  "  $ Command  "     =     " Namenode  "  ]  ;     Then  
Class  =  'Org  .  Apache  .  Hadoop  .  HDFS  .  Server  . Namenode  .  Namenode'
Hadoop_opts  =  "  $ Hadoop_opts $ hadoop_namenode_opts  "  
......
Elif [  "  $ Command  "     =     "  FS  " ]  ;     Then  
Class  =  Org  .  Apache  .  Hadoop  .  FS  .  Fsshell
Hadoop_opts  =  " $ Hadoop_opts $ hadoop_client_opts  "  
......
Elif [  "  $ Command  "     =     "  Jar  "  ]  ;     Then  
Class  = Org  .  Apache  .  Hadoop  .  Util  .  Runjar
......
Elif [  "  $ Command  "     =     "  Archive  " ]  ;     Then  
Class  =  Org  .  Apache  .  Hadoop  .  Tools  .  Hadooparchives
Classpath  = $  {Classpath }:  $ {Tool_path}
Hadoop_opts  =  "  $ Hadoop_opts $ hadoop_client_opts  "  
......
  Else  
Class  = $  Command  
Fi

Here we are concerned about the corresponding class Org. apache. hadoop. util. runjar, this class and so on, we continue to analyze, this is our next intersection to the final goal.

The script also sets hadoop. log. dir, hadoop. log. file, and other hadoop_opts. Then, use the exec command to submit the task with the preceding parameters.

Through the above analysis, we know that if you want to replace this script, then, you must add the libraries and configuration file directories that hadoop depends on to the classpath at least (java_heap_max and hadoop_opts are not required), and then call Org. apache. hadoop. util. runjar class to submit jar to hadoop.

PS:Not familiar with bash shell can look at this http://learn.akae.cn/media/ch31s05.html first

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hadoop job submission analysis (1)

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support